Kerberos, DNS and AAAA records

Tue May 26 14:14:54 EDT 2009

On Fri, May 22, 2009 at 12:58 AM, Ken Raeburn <raeburn at mit.edu> wrote:
>
> The simple version is, we fire off a UDP message to one KDC, and after one
> second if we haven't heard back we assume the KDC is probably unreachable or
> offline and send a message to the next KDC, and so on.  However, in case the
> KDC is just being slow, we keep listening for responses even after we've
> moved on to the next KDC, unless we get back some kind of "port unreachable"
> indication.  In case we're having connectivity or packet-loss problems, we
> make a total of three passes through the list, with a little delay in
> between passes, resending UDP messages each time through.  For TCP, it's a
> little different -- we tell the kernel to start a connection in non-blocking
> mode, and if it connects, we start sending data, but the "quit" condition
> for getting out of the loop is successfully sending all of the data and
> getting a complete response back.  In the passes after the first, we don't
> do anything new for TCP, just keep trying to send and receive data.
>
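A minimal sketch of the first pass described above, built on simplifying assumptions of my own. KDC i in list order is sent its UDP request at i seconds (the 1 s per-server wait). Replies are never lost. Later passes and TCP are ignored. The earliest reply wins, because the client keeps listening to every KDC it has already contacted. The function name first_reply and its rtts input are hypothetical. The sketch suggests one way a distant KDC can end up answering: if the nearest KDC takes longer than a second to reply, the next KDC on the list has already been queried and may get its answer back first.

```python
from typing import Optional, Sequence, Tuple


def first_reply(rtts: Sequence[Optional[float]],
                per_server_wait: float = 1.0) -> Optional[Tuple[int, float]]:
    """Return (kdc_index, arrival_time) of the first reply, or None.

    rtts[i] is the reply delay (seconds) of the i-th KDC in list order,
    or None if that KDC never answers. KDC i is sent its request at
    i * per_server_wait, unless a reply has already arrived by then.
    Only the first pass is modelled; retransmissions are ignored.
    """
    best: Optional[Tuple[int, float]] = None
    for i, rtt in enumerate(rtts):
        send_time = i * per_server_wait
        # A reply already in hand means we never move on to this KDC.
        if best is not None and best[1] <= send_time:
            break
        if rtt is None:
            continue  # silent KDC: we just wait out its 1 s slot
        arrival = send_time + rtt
        # Keep listening to earlier KDCs: the earliest arrival wins.
        if best is None or arrival < best[1]:
            best = (i, arrival)
    return best
```

For example, first_reply([1.5, 0.2]) returns (1, 1.2). The first KDC's reply would only arrive at 1.5 s, but the second KDC, queried at 1.0 s, answers at 1.2 s.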
> Once the name resolution is done, the worst-case timeout for the overall
> operation (if there are no responses including no host-unreachable errors
> back) should be according to the comment in src/lib/krb5/os/sendto_kdc.c:
>
>  * Per UDP server, 1s per pass.
>  * Per TCP server, 1s.
>  * Backoff delay, 2**(P+1) - 2, where P is total number of passes.
>  *
>  * Total = 2**(P+1) + U*P + T - 2.
>  *
>  * If P=3, Total = 3*U + T + 14.
>
> Of course, if getting the DNS data takes us a long time, that may dominate,
> and we haven't done much to improve that, though I've thought about some of
> these issues before.
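The arithmetic in that comment can be checked with a short sketch. The comment only gives the total backoff, 2**(P+1) - 2. Splitting it into waits of 2, 4, 8, ... seconds after passes 1, 2, 3, ... is my assumption. So is charging the 1 s per TCP server only on the first pass, which follows the remark that later passes do nothing new for TCP. The function name worst_case_timeout is hypothetical.

```python
def worst_case_timeout(udp_servers: int, tcp_servers: int, passes: int = 3) -> int:
    """Worst-case seconds (after DNS) per the sendto_kdc.c comment, built pass by pass."""
    total = 0
    for p in range(1, passes + 1):
        total += udp_servers      # 1 s per UDP server on every pass
        if p == 1:
            total += tcp_servers  # 1 s per TCP server, first pass only (assumed)
        total += 2 ** p           # assumed backoff after pass p: 2, 4, 8, ...
    return total


def closed_form(udp_servers: int, tcp_servers: int, passes: int = 3) -> int:
    """Total = 2**(P+1) + U*P + T - 2, as written in the comment."""
    return 2 ** (passes + 1) + udp_servers * passes + tcp_servers - 2
```

With two UDP KDCs and no TCP, three passes give 2*3 + 0 + 14 = 20 seconds in the worst case, before any time spent on DNS.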

My curiosity was piqued by this, so I did some more extensive testing
of how the best KDC candidate is chosen from a list.  I placed the
nearest KDC at the top of the list, but for some strange reason a KDC
located quite some distance away (all right, on another continent) is
the one chosen.  I ran the usual network diagnostics, such as
traceroute and ping, to determine round-trip times, and the nearest KDC
really is the quickest to respond to these types of tests.  But I can't
generalize this to Kerberos services, and I don't know the inner
workings well; Kerberos may have its own way of choosing the
appropriate KDC that isn't based solely on round-trip time or the
number of hops.  So I'm still not quite sure what's happening here.
FWIW, I'm using the Kerberos that ships natively with Solaris 10.

The slowness gets even worse when trying to change a password, despite
explicitly specifying admin_server and setting kpasswd_protocol to
SET_CHANGE to take care of the non-SEAMlessness of Windows/AD.

I can't seem to get around the slow logins, etc.  I would be glad to
try any ideas here.

> It's a somewhat clunky mechanism that hasn't really been tuned using real
> network data, but most of the time it seems to do okay.  We have had
> complaints that we should be able to impose an overall total timeout.  And
> occasionally doing the serialized DNS queries before any of the connection
> attempts is a problem at some sites.  Like I said earlier, we can look at
> integrating the DNS lookups and the contact loops so each host address is
> looked up when we first want to contact it, but only if the extra complexity
> is really going to help.  Doing DNS queries asynchronously does not seem to
> be something we can do portably at the moment.
>
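A minimal sketch of the "look each host up only when we first want to contact it" idea. It is written in Python rather than the library's C and uses the standard socket.getaddrinfo, which returns both IPv4 and IPv6 addresses (A and AAAA records) as the system resolver provides them. The names lazy_addresses and system_resolver, and the injectable resolve parameter, are hypothetical. The point is that a generator defers each lookup until the contact loop asks for the next address. A slow lookup for a KDC far down the list is then never paid for when an earlier KDC answers.

```python
import socket
from typing import Callable, Iterator, List, Sequence, Tuple

Resolver = Callable[[str], List[str]]


def system_resolver(host: str) -> List[str]:
    """Resolve host to its IPv4/IPv6 addresses for UDP port 88 (Kerberos)."""
    infos = socket.getaddrinfo(host, 88, type=socket.SOCK_DGRAM)
    return [info[4][0] for info in infos]


def lazy_addresses(hosts: Sequence[str],
                   resolve: Resolver = system_resolver) -> Iterator[Tuple[str, str]]:
    """Yield (host, address) pairs, resolving each host only when reached."""
    for host in hosts:
        try:
            addrs = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue     # unresolvable name: skip to the next KDC
        for addr in addrs:
            yield host, addr
```

A contact loop would call next() on the generator each time it gives up waiting and moves on to another KDC.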
> We don't want to actually fire off all the queries at the same time, because
> usually a nearby KDC *will* respond quickly.  Sending off all the queries at
> once will make all the KDCs do the same work each time, even though only one
> response is needed, and will eliminate any load-sharing benefit.
>
>
> If SRV records are used, hosts listed with equal priority are used in random
> order, as per the spec, since the Kerberos library has no additional
> information for sorting them by proximity.  We also have no hooks at the
> moment for figuring out and recording how responsive any given KDC is, to
> optimize later queries.  (A patch to allow some simple optimizations might
> be acceptable to MIT.  Some possible heuristics: Scan the local network
> interfaces and put anything on a directly-attached network ahead of anything
> further away.  Check an entry in the config file for network blocks listed
> in priority order, e.g., "18.0.0.0/8 2001:4830:2446::/48", so the local site
> can be described.  Or, you can try using the service-location plugin
> interface in the library to provide code to order things however you like,
> and maybe experiment with some heuristics without having to recompile the
> krb5 libraries.)
>
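The config-file heuristic suggested above (network blocks listed in priority order) might look like the following sketch. This is not an existing krb5.conf option or the service-location plugin API; order_by_netblocks is a hypothetical helper built on Python's standard ipaddress module. Addresses are stable-sorted by the index of the first listed block that contains them. Addresses in no listed block keep their original relative order at the end.

```python
import ipaddress
from typing import List, Sequence


def order_by_netblocks(addresses: Sequence[str], preferred: str) -> List[str]:
    """Stable-sort KDC addresses by the first matching network block.

    preferred is a whitespace-separated list such as
    "18.0.0.0/8 2001:4830:2446::/48", most preferred first.
    """
    blocks = [ipaddress.ip_network(b) for b in preferred.split()]

    def rank(addr: str) -> int:
        ip = ipaddress.ip_address(addr)
        for i, net in enumerate(blocks):
            # Membership is simply False when IP versions differ.
            if ip in net:
                return i
        return len(blocks)  # unmatched addresses go last

    # sorted() is stable, so equal ranks keep their incoming order.
    return sorted(addresses, key=rank)
```

For example, order_by_netblocks(["192.0.2.10", "2001:4830:2446::5", "18.7.22.69"], "18.0.0.0/8 2001:4830:2446::/48") returns ["18.7.22.69", "2001:4830:2446::5", "192.0.2.10"].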
> You could set up the config files differently at each location, putting the
> nearby KDCs at the top of the list, and maybe only listing some of the
> others.  You could define a DNS name that maps to multiple addresses for
> several KDCs and list that as a lower-priority KDC to use as a fallback;
> that'll reduce the number of DNS queries needed, but if you've got local
> name servers at each site, caching should make the name lookups reasonably
> efficient.  You could also play games with anycast addresses to find the
> nearest KDC out of a set (either all your KDCs, or broken into two or three
> subsets if there are a lot), though that's probably serious overkill.
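For the per-location config files, a small generator can keep each site's [realms] entry consistent: the nearby KDCs first, then one shared fallback name that resolves to several remote KDCs. This is a sketch. realm_stanza and all host names are hypothetical, and it relies on the behaviour described above, where the library contacts the kdc entries in the order listed.

```python
from typing import Sequence


def realm_stanza(realm: str, local_kdcs: Sequence[str],
                 fallback: str, admin_server: str) -> str:
    """Build a krb5.conf [realms] entry: local KDCs first, shared fallback last."""
    lines = ["[realms]", f"    {realm} = {{"]
    # Nearby KDCs are listed first so they are tried first.
    lines += [f"        kdc = {host}" for host in local_kdcs]
    # One DNS name with A/AAAA records for several remote KDCs.
    lines.append(f"        kdc = {fallback}")
    lines.append(f"        admin_server = {admin_server}")
    lines.append("    }")
    return "\n".join(lines) + "\n"
```

Each site would call it with its own local_kdcs list while sharing the same fallback name, so only the top of the list differs between locations.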