Seeking KDC Priority/Weight Clarification/Recommendation
Ken Raeburn
raeburn at MIT.EDU
Thu Dec 18 13:56:15 EST 2008
On Dec 18, 2008, at 10:02, Mark T. Valites wrote:
> Unfortunately, we were in a rush to restore service & didn't get the
> opportunity to investigate in depth. The most detailed issue summary
> I can give is just "users weren't able to log in". I do not have
> detailed information on the behaviors of each level of our
> authentication stack, but suspect that we were lucky enough that all
> the upstream kerb consumers were hitting the downed kdc, only. We also
> unfortunately don't currently have the resources to dedicate to an
> appropriate post-mortem investigation.
If you've got a particular client machine you know was exhibiting the
problem consistently, you may be able to simulate it by adding a
firewall or routing table entry just for that machine or subnet, to
prevent it from getting packets back from the problematic KDC.
Depending on the nature of the setup and the problem you were seeing,
you might either need the packets to just disappear without a
(visible) response, or you might need a "host unreachable" answer to
come back.
As I indicated before, if it's just one server offline, the effect
ought to be no worse than a one-second delay per exchange (and only
1/N of the exchanges if you're using equal-weighted SRV records or the
addresses for the name in the config file really are returned in
round-robin fashion).
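To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the client picks its first KDC uniformly at random among N equally weighted servers, that exactly one of them is down, and that the client moves on to a working KDC after the one-second wait mentioned above. The function name and its parameters are made up for illustration.

```python
def expected_extra_delay(n_kdcs, timeout=1.0):
    """Average added delay per exchange when exactly one of n_kdcs is down.

    Assumes (for illustration only) that each exchange tries one KDC first,
    chosen uniformly at random, and falls back to a working KDC after
    `timeout` seconds if the first choice does not answer.
    """
    if n_kdcs < 2:
        raise ValueError("need at least one KDC besides the one that is down")
    # Only 1/N of the exchanges hit the dead KDC first; each of those pays
    # one timeout before a working KDC answers.
    return timeout / n_kdcs


for n in (2, 3, 4):
    extra = expected_extra_delay(n)
    print(f"{n} KDCs: about {extra:.2f}s extra per exchange on average")
```

If the delays actually observed are far larger than this, that points at something other than address ordering.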
If it was worse than that -- and "can't log in" sounds worse -- it
sounds like there may be issues with the other KDCs as well, like not
having the KDC processes actually running, or firewall rules
accidentally blocking their traffic (incoming or outgoing), or
something like that, so that they couldn't pick up the work when the
main KDC went offline.
You might also want to experiment with setting a config file to list
names for individual KDCs one at a time instead of the shared name
with multiple addresses, just to verify that you can get answers back
from them.
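One way to set that experiment up is sketched below: it writes a separate minimal config file for each KDC so that each one can be tested in isolation. The realm and host names passed in are placeholders, the file naming is arbitrary, and a real client config may need more settings than the two shown here.

```python
import os


def single_kdc_config(realm, kdc_host):
    """Return krb5.conf text that points `realm` at exactly one KDC."""
    return (
        "[libdefaults]\n"
        f"    default_realm = {realm}\n"
        "\n"
        "[realms]\n"
        f"    {realm} = {{\n"
        f"        kdc = {kdc_host}\n"
        "    }\n"
    )


def write_configs(realm, kdc_hosts, directory="."):
    """Write one config file per KDC and return the paths written."""
    paths = []
    for host in kdc_hosts:
        # Hypothetical naming scheme: krb5-<hostname>.conf
        path = os.path.join(directory, f"krb5-{host}.conf")
        with open(path, "w") as f:
            f.write(single_kdc_config(realm, host))
        paths.append(path)
    return paths
```

With the MIT client, setting the KRB5_CONFIG environment variable to one of these files before running kinit should direct the request at that one KDC only, so a KDC that never answers stands out immediately.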
> In looking at this more, I wonder if having both the default_realm
> in the libdefault section & a round robin kdc record explicitly
> defined in the realms section could be problematic - one of our kerb
> clients doesn't have any kdc entry in their realms section & saw no
> issues during the hardware failure.
If it really is round-robin, it's probably okay, but I wouldn't assume
that multiple A records are handed back in a round-robin fashion
without testing it. (And make sure you're testing what getaddrinfo
gets back on a machine that may do local caching of DNS data -- if the
machine reuses data from the cache in the same order each time, it
doesn't matter if the upstream DNS server would have changed the
address order on the next query.)
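A quick way to check this on the client itself is sketched below. It calls getaddrinfo repeatedly and tallies the orderings that come back; the function name and the optional resolver argument (there only so the logic can be exercised without DNS) are my own inventions, and the host name in the usage comment is a placeholder.

```python
import socket
from collections import Counter


def address_orders(host, port=88, tries=20, resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly and count each address ordering seen."""
    counts = Counter()
    for _ in range(tries):
        infos = resolver(host, port, socket.AF_INET, socket.SOCK_DGRAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # for AF_INET, sockaddr is (address, port).
        order = tuple(info[4][0] for info in infos)
        counts[order] += 1
    return counts


# Example use on a client machine (placeholder host name):
#   for order, n in address_orders("kerberos.example.edu").most_common():
#       print(n, " ".join(order))
```

If only one ordering ever shows up, the client is effectively not getting round-robin behavior, whatever the upstream DNS server does.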
For that matter, you could test it out by running tcpdump (or similar
tools) on a client and watching where the packets go as you make
multiple requests, without needing to simulate a KDC being down. Does
the client always go to the same KDC address, or does it randomly
select between them? That should be easy enough to test quickly,
before you have another problem with the main KDC machine.
If you don't mind doing a build of the MIT 1.x release -- whatever
version is in use on the client -- or fetching and building Red Hat's
sources, we've got a test program that prints out the address list
that would be used for the KDCs. After building and installing, go
into lib/krb5/os and run "make t_locate_kdc". Then you can run that
program with the realm name, and it'll print an address list, with a
bunch of debug information:
$ lib/krb5/os/t_locate_kdc ATHENA.MIT.EDU
in module_locate_server
ran off end of plugin list
module_locate_server returns -1765328135
looking in krb5.conf for realm ATHENA.MIT.EDU entry kdc; ports 88,750
config file lookup failed: Profile relation not found
sending DNS SRV query for _kerberos._udp.ATHENA.MIT.EDU.
walking answer list:
port=88 host=KERBEROS-2.MIT.EDU.
adding hostname KERBEROS-2.MIT.EDU., ports 88,0, family 0, socktype 2
setting element 0
count is now 1:
port=88 host=KERBEROS.MIT.EDU.
adding hostname KERBEROS.MIT.EDU., ports 88,0, family 0, socktype 2
setting element 1
count is now 2:
port=88 host=KERBEROS-1.MIT.EDU.
adding hostname KERBEROS-1.MIT.EDU., ports 88,0, family 0, socktype 2
setting element 2
count is now 3:
[end]
sending DNS SRV query for _kerberos._tcp.ATHENA.MIT.EDU.
krb5int_locate_server found 3 addresses
3 addresses:
0: address 18.7.7.77 dgram port 88
1: address 18.7.21.144 dgram port 88
2: address 18.7.21.119 dgram port 88
$
Unfortunately the debugging hooks aren't available in the production
build.
> I suspect this:
>
> [libdefaults]
> default_realm = ourrealm.ourdomain.edu
>
> [realms]
> dce.buffalo.edu = {
> kdc = kerberos.ourdomain.edu
> admin_server = kadminserver.ourdomain.edu
> }
>
> Should really be this:
>
> [libdefaults]
> default_realm = ourrealm.ourdomain.edu
>
> [realms]
> dce.buffalo.edu = {
> admin_server = kadminserver.ourdomain.edu
> }
>
> Could that make a difference?
It could, but the way you've described it, I would think both versions
would work.
And like I said, if you're seeing more than a one-second delay there's
probably more going wrong than just ordering of addresses.
Ken