Seeking KDC Priority/Weight Clarification/Recommendation

Mark T. Valites mvalites at buffalo.edu
Thu Dec 18 10:02:53 EST 2008


On Mon, 15 Dec 2008, Ken Raeburn wrote:

> On Dec 15, 2008, at 15:04, Mark T. Valites wrote:
>> We recently saw a hardware failure on one of our (non-master) KDCs that 
>> brought the box completely offline. Even though the KDCs are 
>> redundant/highly available on their own, the downstream LDAP & 
>> shibboleth/cosign servers all immediately saw issues because of this, 
>> exposing the weak link in the chain.
>
> What sort of issues?

Unfortunately, we were in a rush to restore service and didn't get the 
opportunity to investigate in depth. The most detailed issue summary I can 
give is just "users weren't able to log in". I don't have detailed 
information on the behavior of each level of our authentication stack, 
but I suspect we were lucky enough that all the upstream Kerberos 
consumers were hitting only the downed KDC. We also unfortunately don't 
currently have the resources to dedicate to a proper post-mortem 
investigation.

> The MIT code will pick one KDC address to try contacting, and if it 
> doesn't answer within a second, it will try the next one.  (Both the 
> config-file and DNS approaches as you described them would result in a 
> list of four addresses.  The library code should randomize the order of 
> all SRV records returned with weights all zero.  However, if the config 
> file version is used, the addresses will be tried in the order returned 
> by the getaddrinfo() function, and the hostnames listed in the file, if 
> more than one, are assumed to be in priority order so they're not 
> randomized.)
>
> So if you're getting random or rotating ordering of address records 
> returned, then with one server (address) of four unreachable, one 
> quarter of the time you should see a delay of a second.  If 
> getaddrinfo() or your DNS cache is being clever and trying to give you 
> an order optimized for proximity or some such, you may see delays more 
> often or less often, but the delay should still be no more than a 
> second.  If it is, you could try monitoring the network traffic with 
> tcpdump and see what it's doing in terms of trying to reach the various 
> KDCs.
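Ken's description of the selection order can be sketched roughly as follows. This is a hypothetical simplification, not the MIT library's actual code, and the host names are made up; in particular, the real library only shuffles when all SRV weights are zero and otherwise does proper weighted selection.

```python
import random

def order_kdcs(srv_records):
    # srv_records: list of (priority, weight, host) tuples, as a DNS SRV
    # lookup for _kerberos._udp.<realm> might return them.
    # Lower priority values are tried first; within one priority level
    # the entries are shuffled (a simplification: the MIT code shuffles
    # only when the weights are all zero).
    by_priority = {}
    for prio, weight, host in srv_records:
        by_priority.setdefault(prio, []).append(host)
    ordered = []
    for prio in sorted(by_priority):
        group = by_priority[prio]
        random.shuffle(group)
        ordered.extend(group)
    return ordered

# Four equal-priority, zero-weight records: any of the 24 orderings
# is possible, so over time the load (and the 1-second retry penalty
# for a dead KDC) spreads across all four hosts.
records = [
    (0, 0, "kdc1.ourdomain.edu"),
    (0, 0, "kdc2.ourdomain.edu"),
    (0, 0, "kdc3.ourdomain.edu"),
    (0, 0, "kdc4.ourdomain.edu"),
]
print(order_kdcs(records))
```

With one of the four hosts dead, the client eats the one-second timeout only on the runs where the shuffle happens to put that host first, which matches the "one quarter of the time" estimate above.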

In looking at this more, I wonder if having both a default_realm in the 
[libdefaults] section and a round-robin kdc record explicitly defined in 
the [realms] section could be problematic - one of our Kerberos clients 
doesn't have any kdc entry in its realms section and saw no issues during 
the hardware failure.

I suspect this:

[libdefaults]
         default_realm = ourrealm.ourdomain.edu

[realms]
         ourrealm.ourdomain.edu = {
                 kdc = kerberos.ourdomain.edu
                 admin_server = kadminserver.ourdomain.edu
         }

Should really be this:

[libdefaults]
         default_realm = ourrealm.ourdomain.edu

[realms]
         ourrealm.ourdomain.edu = {
                 admin_server = kadminserver.ourdomain.edu
         }

Could that make a difference?
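For reference, with no kdc entry in [realms] the library would fall back to a DNS SRV lookup. A hypothetical zone fragment with four equal-priority, zero-weight records (names made up here) would look something like:

```
; _service._proto.realm          TTL  class SRV prio weight port target
_kerberos._udp.ourrealm.ourdomain.edu. 300 IN SRV 0   0      88   kdc1.ourdomain.edu.
_kerberos._udp.ourrealm.ourdomain.edu. 300 IN SRV 0   0      88   kdc2.ourdomain.edu.
_kerberos._udp.ourrealm.ourdomain.edu. 300 IN SRV 0   0      88   kdc3.ourdomain.edu.
_kerberos._udp.ourrealm.ourdomain.edu. 300 IN SRV 0   0      88   kdc4.ourdomain.edu.
```

With all weights zero the library randomizes the order, whereas hostnames listed in the config file are treated as being in priority order, which seems relevant to the question above.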


-- 
Mark T. Valites
Senior Systems Administrator
Enterprise Infrastructure Services
University at Buffalo
