Seeking KDC Priority/Weight Clarification/Recommendation

Ken Raeburn raeburn at MIT.EDU
Thu Dec 18 13:56:15 EST 2008


On Dec 18, 2008, at 10:02, Mark T. Valites wrote:
> Unfortunately, we were in a rush to restore service & didn't get the
> opportunity to investigate in depth. The most detailed issue summary
> I can give is just "users weren't able to log in". I do not have
> detailed information on the behaviors of each level of our
> authentication stack, but suspect that we were lucky enough that all
> the upstream kerb consumers were hitting the downed KDC only. We also
> unfortunately don't currently have the resources to dedicate to an
> appropriate post-mortem investigation.

If you've got a particular client machine you know was exhibiting the
problem consistently, you may be able to simulate it by adding a
firewall or routing table entry just for that machine or subnet, to
prevent it from getting packets back from the problematic KDC.
Depending on the nature of the setup and the problem you were seeing,
you might need the packets to just disappear without a (visible)
response, or you might need a "host unreachable" answer to come back.
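
On a Linux client, for instance, iptables can produce either behavior
(192.0.2.10 here is just a placeholder -- substitute the problematic
KDC's real address), run as root:

    # make packets to that KDC silently disappear:
    iptables -A OUTPUT -d 192.0.2.10 -j DROP
    # or instead, get a "host unreachable" answer back:
    iptables -A OUTPUT -d 192.0.2.10 -j REJECT --reject-with icmp-host-unreachable
    # and remove whichever rule you added when you're done, e.g.:
    iptables -D OUTPUT -d 192.0.2.10 -j DROP

Other platforms have equivalent packet-filter or null-route commands.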

As I indicated before, if it's just one server offline, the effect
ought to be no worse than a one-second delay per exchange (and only
1/N of the exchanges, if you're using equal-weighted SRV records or
the addresses for the name in the config file really are returned in
round-robin fashion).
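
You can check what your SRV records actually advertise with dig
(output below is illustrative); the fields in the +short form are
priority, weight, port, and target, and equal priorities and weights
across the records are what give you the 1/N spread:

    $ dig +short SRV _kerberos._udp.ATHENA.MIT.EDU
    0 1 88 kerberos-2.mit.edu.
    0 1 88 kerberos.mit.edu.
    0 1 88 kerberos-1.mit.edu.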

If it was worse than that -- and "can't log in" sounds worse -- there
may be issues with the other KDCs as well: the KDC processes not
actually running, firewall rules accidentally blocking their traffic
(incoming or outgoing), or something like that, so that they couldn't
pick up the work when the main KDC went offline.
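
A couple of quick sanity checks on each KDC host can rule out the
obvious cases (assuming MIT krb5 on a Unix-like system; adjust the
commands to your platform):

    # is the KDC process actually running?
    $ ps ax | grep '[k]rb5kdc'
    # and is something listening on the Kerberos port?
    $ netstat -an | grep -w 88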

You might also want to experiment with a config file that lists the
individual KDCs by name, one at a time, instead of the shared name
with multiple addresses, just to verify that you can get answers back
from each of them.
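
With MIT krb5 you can do that without touching the real config by
pointing the KRB5_CONFIG environment variable at a scratch file --
something like this (host names here are made up), repeated once per
KDC:

    [libdefaults]
            default_realm = ourrealm.ourdomain.edu

    [realms]
            ourrealm.ourdomain.edu = {
                    kdc = kdc1.ourdomain.edu
            }

    $ KRB5_CONFIG=/tmp/krb5-test.conf kinit someuser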

> In looking at this more, I wonder if having both the default_realm in
> the libdefaults section & a round-robin kdc record explicitly defined
> in the realms section could be problematic - one of our kerb clients
> doesn't have any kdc entry in their realms section & saw no issues
> during the hardware failure.

If it really is round-robin, it's probably okay, but I wouldn't assume  
that multiple A records are handed back in a round-robin fashion  
without testing it.  (And make sure you're testing what getaddrinfo  
gets back on a machine that may do local caching of DNS data -- if the  
machine reuses data from the cache in the same order each time, it  
doesn't matter if the upstream DNS server would have changed the  
address order on the next query.)
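
On a glibc-based client, "getent ahosts" goes through the same
getaddrinfo path, so a quick loop shows whether the order really
rotates on that machine (the host name is a placeholder):

    $ for i in 1 2 3 4 5; do getent ahosts kerberos.ourdomain.edu | head -1; done

If the first address printed never changes, you're not getting
round-robin behavior there, whatever the upstream server does.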

For that matter, you could test it out by running tcpdump (or similar
tools) and watching what happens as you make multiple requests to your
KDCs, without needing to simulate a KDC being down.  Does it always go
to the same KDC address, or does it randomly select between them?
That should be easy enough to test quickly, before you have another
problem with the main KDC machine.
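
For example (the interface name is an assumption -- "tcpdump -D" will
list what's available on your client):

    $ tcpdump -n -i eth0 port 88

Then run kinit a few times in another window and watch which
destination addresses the requests go out to.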

If you don't mind doing a build of the MIT 1.x release -- whatever  
version is in use on the client -- or fetching and building Red Hat's  
sources, we've got a test program that prints out the address list  
that would be used for the KDCs.  After building and installing, go  
into lib/krb5/os and run "make t_locate_kdc".  Then you can run that  
program with the realm name, and it'll print an address list, with a  
bunch of debug information:

$ lib/krb5/os/t_locate_kdc ATHENA.MIT.EDU
in module_locate_server
ran off end of plugin list
module_locate_server returns -1765328135
looking in krb5.conf for realm ATHENA.MIT.EDU entry kdc; ports 88,750
config file lookup failed: Profile relation not found
sending DNS SRV query for _kerberos._udp.ATHENA.MIT.EDU.
walking answer list:
	port=88 host=KERBEROS-2.MIT.EDU.
adding hostname KERBEROS-2.MIT.EDU., ports 88,0, family 0, socktype 2
setting element 0
	count is now 1:
	port=88 host=KERBEROS.MIT.EDU.
adding hostname KERBEROS.MIT.EDU., ports 88,0, family 0, socktype 2
setting element 1
	count is now 2:
	port=88 host=KERBEROS-1.MIT.EDU.
adding hostname KERBEROS-1.MIT.EDU., ports 88,0, family 0, socktype 2
setting element 2
	count is now 3:
[end]
sending DNS SRV query for _kerberos._tcp.ATHENA.MIT.EDU.
krb5int_locate_server found 3 addresses
3 addresses:
  0: address 18.7.7.77	dgram	port 88
  1: address 18.7.21.144	dgram	port 88
  2: address 18.7.21.119	dgram	port 88
$

Unfortunately the debugging hooks aren't available in the production  
build.

> I suspect this:
>
> [libdefaults]
>         default_realm = ourrealm.ourdomain.edu
>
> [realms]
>         dce.buffalo.edu = {
>                 kdc = kerberos.ourdomain.edu
>                 admin_server = kadminserver.ourdomain.edu
>         }
>
> Should really be this:
>
> [libdefaults]
>         default_realm = ourrealm.ourdomain.edu
>
> [realms]
>         dce.buffalo.edu = {
>                 admin_server = kadminserver.ourdomain.edu
>         }
>
> Could that make a difference?

It could, but the way you've described it, I would think both versions  
would work.

And like I said, if you're seeing more than a one-second delay,
there's probably more going wrong than just the ordering of addresses.

Ken


