Seeking KDC Priority/Weight Clarification/Recommendation
Mark T. Valites
mvalites at buffalo.edu
Thu Dec 18 10:02:53 EST 2008
On Mon, 15 Dec 2008, Ken Raeburn wrote:
> On Dec 15, 2008, at 15:04, Mark T. Valites wrote:
>> We recently saw a hardware failure on one of our (non-master) KDCs that
>> brought the box completely offline. Even though they are
>> redundant/highly available on their own, the downstream ldap &
>> shibboleth/cosign servers all immediately saw issues because of this,
>> exposing the weak link in the chain.
>
> What sort of issues?
Unfortunately, we were in a rush to restore service & didn't get the
opportunity to investigate in depth. The most detailed issue summary I can
give is just "users weren't able to log in". I don't have detailed
information on the behavior of each level of our authentication stack,
but suspect we were lucky enough that all the upstream kerb consumers
were hitting only the downed KDC. We also unfortunately don't currently
have the resources to dedicate to a proper post-mortem investigation.
> The MIT code will pick one KDC address to try contacting, and if it
> doesn't answer within a second, it will try the next one. (Both the
> config-file and DNS approaches as you described them would result in a
> list of four addresses. The library code should randomize the order of
> all SRV records returned with weights all zero. However, if the config
> file version is used, the addresses will be tried in the order returned
> by the getaddrinfo() function, and the hostnames listed in the file, if
> more than one, are assumed to be in priority order so they're not
> randomized.)
>
> So if you're getting random or rotating ordering of address records
> returned, then with one server (address) of four unreachable, one
> quarter of the time you should see a delay of a second. If
> getaddrinfo() or your DNS cache is being clever and trying to give you
> an order optimized for proximity or some such, you may see delays more
> often or less often, but the delay should still be no more than a
> second. If it is, you could try monitoring the network traffic with
> tcpdump and see what it's doing in terms of trying to reach the various
> KDCs.
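For my own clarity, I take it that description applies to SRV records
along these lines (priority and weight both zero; the kdc1-kdc4 names here
are just placeholders, not our actual hosts):

_kerberos._udp.dce.buffalo.edu. 300 IN SRV 0 0 88 kdc1.ourdomain.edu.
_kerberos._udp.dce.buffalo.edu. 300 IN SRV 0 0 88 kdc2.ourdomain.edu.
_kerberos._udp.dce.buffalo.edu. 300 IN SRV 0 0 88 kdc3.ourdomain.edu.
_kerberos._udp.dce.buffalo.edu. 300 IN SRV 0 0 88 kdc4.ourdomain.edu.

We'll also try capturing traffic the next time we can reproduce this -
something like "tcpdump -n port 88" on an affected client should show
which KDC addresses are actually being tried.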
In looking at this more, I wonder if having both the default_realm in the
libdefaults section & a round-robin kdc record explicitly defined in the
realms section could be problematic - one of our kerb clients doesn't have
any kdc entry in its realms section & saw no issues during the hardware
failure.
I suspect this:
[libdefaults]
    default_realm = ourrealm.ourdomain.edu

[realms]
    dce.buffalo.edu = {
        kdc = kerberos.ourdomain.edu
        admin_server = kadminserver.ourdomain.edu
    }
should really be this:
[libdefaults]
    default_realm = ourrealm.ourdomain.edu

[realms]
    dce.buffalo.edu = {
        admin_server = kadminserver.ourdomain.edu
    }
Could that make a difference?
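My assumption - and I haven't actually compared this against the client
that had no trouble - is that with no kdc line the library would fall back
to locating KDCs via DNS SRV records, provided dns_lookup_kdc isn't
disabled in [libdefaults]. If so, the records it would consult should be
checkable with something like:

dig +short SRV _kerberos._udp.dce.buffalo.edu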
--
Mark T. Valites
Senior Systems Administrator
Enterprise Infrastructure Services
University at Buffalo