KDC query client performance
ghudson at MIT.EDU
Sun Feb 13 16:52:24 EST 2011
We've been looking into some cases where MIT krb5 imposes unreasonable
performance penalties on scenarios where krb5 doesn't even wind up
getting used. For instance, in one scenario, turning on ssh's
GSSAPIKeyExchange feature caused 96 DNS requests and 12 KDC requests
to conclude that there was no krb5 support for a target host on a
local network, for a delay of about four seconds.
As a first step, I've restructured the locate/sendto code so that we
don't resolve hostnames until we need them. (I haven't yet extended
the KDC location module to be able to take advantage of this support.)
Some other steps we'd like to consider:
1. Turn off the realm walk on the client by default. This is the
logic where the client assumes that (a) cross-realm key sharing is
most likely to be arranged along the domain hierarchy of realms, and
(b) the local KDC is only smart enough to return a cross-tgt for the
realm we ask for, not for an intermediate realm. The second
assumption is no longer likely to be true; for quite a long time now,
KDCs have been smart enough to perform the realm walk internally and
respond with a TGT referral. The downside of the realm walk is that
we commonly make three or more KDC queries to determine that a guessed
target realm doesn't exist within the local realm's federation.
It would actually be nice to eliminate this support entirely, as it's
a big source of complexity in the TGS request code. But a more
conservative first step is to turn it off by default and allow it to
be turned back on via configuration.
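To make the cost concrete, the hierarchical walk can be sketched as
follows. This is an illustration of the realm-walk idea only, not the
MIT krb5 implementation; the function name and realm names are made up.

```python
def realm_walk(client_realm, target_realm):
    """Return the chain of realms a client guessing along the domain
    hierarchy would traverse: up from client_realm to the common
    domain suffix, then down toward target_realm.  Each hop is a
    potential cross-realm TGS request."""
    c = client_realm.split(".")
    t = target_realm.split(".")
    # Find the longest common suffix of domain components.
    common = 0
    while common < min(len(c), len(t)) and c[-1 - common] == t[-1 - common]:
        common += 1
    path = []
    # Walk up from the client realm to the common ancestor...
    for i in range(1, len(c) - common + 1):
        path.append(".".join(c[i:]))
    # ...then down toward the target realm.
    for i in range(len(t) - common - 1, -1, -1):
        path.append(".".join(t[i:]))
    return path
```

For example, from A.EXAMPLE.COM to B.C.EXAMPLE.COM this yields
EXAMPLE.COM, C.EXAMPLE.COM, B.C.EXAMPLE.COM: three guessed hops, each
of which may cost a KDC query even when the target realm doesn't exist.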
2. Speed up the client retry loop, so that it doesn't take as long
to time out when you're behind a firewall which black-holes port 88.
Currently we wait one second per UDP address per pass (and per TCP
address on the first pass), and also wait 2s/4s/8s/16s (or 30s in
total) at the end of each pass.
In order to be nice to KDC load, I think it's still prudent to wait
one second per server address on the first pass. After that we're
mostly trying to be nice to the network, and networks have gotten much
faster. So I think once we reach the end of the first pass, we ought
to speed everything up by a factor of ten--that is, wait only 100ms
between UDP queries on the second and later passes, and wait
200ms/400ms/800ms/1600ms at the end of passes.
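The proposed schedule can be sketched as below. The pass structure and
numbers come from the description above; the function name and shape
are illustrative assumptions, not actual krb5 identifiers.

```python
def delay_schedule(num_passes):
    """Return (per_address_delay, end_of_pass_delay) in seconds for
    each pass: 1s per address on the first pass to be nice to KDC
    load, then everything sped up by a factor of ten."""
    sched = []
    for p in range(num_passes):
        per_addr = 1.0 if p == 0 else 0.1          # 1s, then 100ms
        # End-of-pass backoff: 2s/4s/8s/16s divided by ten, i.e.
        # 200ms/400ms/800ms/1600ms, capped at the last value.
        end_wait = (2 ** (p + 1)) / 10.0 if p < 4 else 1.6
        sched.append((per_addr, end_wait))
    return sched
```

Under this sketch the end-of-pass waits over four passes total about
3s, versus 30s with the current 2s/4s/8s/16s schedule.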
3. Eliminate the second default UDP port (750) when parsing profile
kdc entries. When a KDC is inaccessible, this causes extra delays,
and also extra DNS requests due to the way the code is structured. We
have always restricted the second default port to UDP over IPv4,
likely because it was intended as a krb4 transition measure.
Unfortunately, this change is likely to break a handful of deployments
which happen to serve KDC requests only on port 750 and win because
they only need it to work over IPv4 UDP (and don't have any Heimdal
clients, or configure their Heimdal clients to use port 750
explicitly). I'm not sure whether avoiding breakage in those
environments is worth the extra delays in the more common cases.
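To illustrate why the second default port doubles the UDP work when a
KDC is unreachable, here is a simplified sketch of how a profile kdc
entry might expand into server addresses. The parsing rules and names
here are assumptions for illustration, not the exact MIT krb5 code.

```python
def expand_kdc_entry(entry, include_krb4_port=True):
    """Expand a 'kdc = host[:port]' profile value into
    (host, port, transport) tuples.  With an explicit port, only that
    port is used; with no port, the primary port 88 applies to UDP and
    TCP, and the legacy port 750 (if enabled) to IPv4 UDP only."""
    if ":" in entry:
        host, port = entry.rsplit(":", 1)
        return [(host, int(port), "udp"), (host, int(port), "tcp")]
    servers = [(entry, 88, "udp"), (entry, 88, "tcp")]
    if include_krb4_port:
        servers.append((entry, 750, "udp4"))  # krb4-era fallback
    return servers
```

Dropping the fallback (include_krb4_port=False here) leaves only the
port 88 addresses; sites that serve KDC traffic only on port 750 would
then need to configure the port explicitly.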