hostname lookup performance in libkrb5, and threads
Ken Raeburn
raeburn at MIT.EDU
Wed May 16 03:08:12 EDT 2007
I've been thinking about this a bit more. (Short recap: If you have
a lot of KDC hosts listed, or a slow or lossy net connection between
you and the DNS server, the name->address resolution for the list of
KDC hosts can take a long time, and we wait for it to complete before
trying to contact any KDC. We also sometimes look up hostnames after
getting the response.)
I see three approaches to tackling this problem, which are not
mutually exclusive. I could use some feedback, especially on the
threading issues.
* Once we get an address for any KDC, start trying to contact it, and
as long as we don't get responses back, try other addresses as they
become available to us. Integrating the lookup and send/receive
loops like this doesn't save us any time in the worst case, but in
the best case, we may only look up one or two KDC names in DNS before
getting our response back.
The basic loop design would be along the lines of: In the first pass
through trying to contact all KDCs, keep a list of names, and a list
of already-acquired addresses. If we haven't received a response
from a KDC we've sent to already, and we have more addresses left
that we haven't sent to yet, send to the next address, and wait for a
response. If we haven't received a response, and we're at the end of
the address list, look up the next name, and add the addresses to the
list; then, without waiting, go back to checking for a response and
sending to the next address. The passes after the first would run
just as they do now.
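
To make the first-pass loop concrete, here is a toy model of its control flow in C. It is only a sketch: the struct and function names are made up for illustration, the "lookup" and "send" steps are simulated from a table instead of calling getaddrinfo or touching sockets, and a reply is assumed to arrive (or not) during the wait right after the send.

```c
#include <stddef.h>

#define MAX_ADDRS 64

/* Hypothetical stand-in for one KDC name: how many addresses its lookup
 * would yield, and which of them (if any) would answer a request. */
struct kdc_name {
    const char *name;
    int n_addrs;     /* addresses the lookup returns */
    int answering;   /* index of the address that replies, or -1 for none */
};

struct pass_result {
    int lookups;     /* name lookups performed */
    int sends;       /* addresses we sent a request to */
    int got_reply;   /* 1 if some KDC answered during this pass */
};

/* Simulated first pass: send to acquired addresses one at a time, and only
 * look up another name when the address list has been exhausted. */
struct pass_result first_pass(const struct kdc_name *names, size_t n_names)
{
    struct pass_result r = {0, 0, 0};
    int will_reply[MAX_ADDRS];   /* the list of already-acquired addresses */
    size_t n_addrs = 0, next_addr = 0, next_name = 0;

    for (;;) {
        if (next_addr < n_addrs) {
            /* Send to the next address and wait for a response. */
            r.sends++;
            if (will_reply[next_addr++]) {
                r.got_reply = 1;
                return r;
            }
        } else if (next_name < n_names) {
            /* End of the address list: look up the next name, append its
             * addresses, and go around again without waiting. */
            const struct kdc_name *k = &names[next_name++];
            r.lookups++;
            for (int i = 0; i < k->n_addrs && n_addrs < MAX_ADDRS; i++)
                will_reply[n_addrs++] = (i == k->answering);
        } else {
            /* Nothing answered; the later passes would run as they do now. */
            return r;
        }
    }
}
```

With three names where the first KDC is down and the second answers, this model performs two lookups and never resolves the third name, which is the best-case saving described above.
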
* Run multiple name lookups in parallel. While the GNU C library has
had getaddrinfo_a for a while (and it's a popular enough platform
that it may be worth supporting this solution specifically), there
isn't a widely available solution of that sort elsewhere. (I've run across a
couple of projects working on async hostname lookup client code, but
the code is generally under the GPL, so we can't drop copies into the
MIT libraries. Some projects do async DNS queries only, which
bypasses /etc/hosts and anything else that might be configured
locally, so they would only be helpful for DNS SRV and TXT queries.)
But I'm wondering now whether, on platforms where the standard
libraries include thread support even if you don't ask for it (and
with Mac OS X and Solaris 10 and I think Windows in this category,
it's another significant subset of platforms that may be worth
supporting specially), perhaps it's possible to create worker threads
that are near enough to invisible that the main application won't
notice, and call getaddrinfo in multiple worker threads. There are a
few issues I've thought about so far:
- Process exit. If the main application is multithreaded and all
of its threads call pthread_exit, the process is supposed to go
away. That means the worker threads can't hang around for a long
time waiting for something to do; they get launched, do some stuff,
and go away. We can't rely on krb5_context objects being destroyed
before pthread_exit gets called. Long delays looking up DNS names
could cause the worker thread to wait around for a while after the
main thread has already gotten a response from the KDC. There are
three approaches I see here: (1) Block until all lookup threads exit
before returning, incurring the longest of the outstanding lookup
delays. (2) Call pthread_cancel, and hope that getaddrinfo is
implemented in a way that cancels cleanly and promptly. (3) Leave it
running, detached, and suffer with the occasional process that
finishes up its Kerberos work quickly but still takes time to exit.
- Library unloading. If the MIT libraries get unloaded, we
shouldn't have any threads running around executing code from those
libraries. The library fini function can either cancel the threads,
or call pthread_join on them (not compatible with detaching above)
and wait for them to exit. We can rely on function calls in the
library not being in progress, but probably shouldn't depend on all
krb5_context objects having been destroyed. (If they haven't been,
leaking memory is okay, because the application is being sloppy, but
crashing is probably not okay.)
- Signal handling. We can create a thread that won't accept most
signals, by creating a signal mask using sigfillset, clearing some
like SIGSEGV from the mask, using pthread_sigmask to block those
signals, calling pthread_create to create a new thread that starts
with all those signals blocked, then using pthread_sigmask in the
parent thread to restore the old mask. Then if random signals come
in, they should always get handled by the main application thread(s)
instead of accidentally being re-routed to our worker threads.
- Access to errno? A supposedly single-threaded application may
access errno as a global variable; thread-aware code must use
function calls or thread-local storage. I'm assuming at the moment,
without testing, that the variable version of errno will still work
from the main thread. But it's certainly conceivable that the global
variable could be something other than the errno location for the
main thread, and errno-setting code in the system libraries could be
aware of whether multiple threads were running, and stop updating the
global variable once a second thread exists. This needs a little
more research.
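
For the signal-handling item above, the sequence might look roughly like the following C sketch. It uses pthread_sigmask, the POSIX call for changing a thread's signal mask. The helper names start_quiet_worker and probe_mask are hypothetical, and the choice of which fault signals to leave unblocked (SIGSEGV, SIGBUS, SIGFPE, SIGILL) is an assumption, not a settled list.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>

/* Hypothetical helper: start fn(arg) in a new thread that has nearly all
 * signals blocked, then restore the caller's original signal mask.
 * Returns 0 on success or an error number from the pthread calls. */
int start_quiet_worker(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    sigset_t block, saved;
    int err;

    sigfillset(&block);
    /* Assumption: leave synchronous fault signals deliverable everywhere. */
    sigdelset(&block, SIGSEGV);
    sigdelset(&block, SIGBUS);
    sigdelset(&block, SIGFPE);
    sigdelset(&block, SIGILL);

    /* Block in the parent first; the new thread inherits this mask. */
    err = pthread_sigmask(SIG_BLOCK, &block, &saved);
    if (err != 0)
        return err;

    err = pthread_create(tid, NULL, fn, arg);

    /* Restore the parent's old mask whether or not the create worked. */
    pthread_sigmask(SIG_SETMASK, &saved, NULL);
    return err;
}

/* Demonstration worker: records whether SIGINT and SIGSEGV are blocked
 * in the calling thread. arg points to an array of two ints. */
void *probe_mask(void *arg)
{
    int *out = arg;
    sigset_t cur;

    pthread_sigmask(SIG_SETMASK, NULL, &cur);   /* query only, no change */
    out[0] = sigismember(&cur, SIGINT);
    out[1] = sigismember(&cur, SIGSEGV);
    return NULL;
}
```

Signals that arrive during the short window while the parent has them blocked stay pending and are delivered once the old mask is restored, or go to another application thread that has them unblocked.
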
If we make the "interface" change that we may keep worker threads
around while krb5_context objects are around, perhaps the worker
threads can be longer-lived, but that does have new consequences at
least for programs unloading our library, or exiting via
pthread_exit, without first cleaning up. If multiple credentials are
to be acquired, we may go through the name resolution process more
than once, so keeping the worker threads around a little while may help.
Obviously, the specifics of the issues above have to do with the
POSIX thread interface. I'm not familiar enough with the Windows one
yet to know if the same problems still apply. And this still leaves
us with fully-synchronous name lookups on systems that don't pull
thread support into every process, and don't have getaddrinfo_a.
Are there other issues preventing us from using threads under the
covers like this?
As for the actual work done, we'd look up the hostnames in parallel,
and as addresses come in, add them to the address list described
above. (The "stop and look up another hostname, then check for
received responses" part of the above algorithm would be replaced by
a delay until a new address became available, or data was received on
one of the open sockets.) For SRV records with different priorities,
we could look up lower-priority hostnames only after higher-priority
hostname lookups were completed (successfully or not), or we could go
ahead and look them up, and just not pass on the addresses found
until the higher-priority lookups were done.
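
A rough C sketch of the parallel-lookup part, assuming POSIX threads: one short-lived worker per hostname calls getaddrinfo and queues its result, and the caller blocks until a new address list shows up or no lookups remain. All the names here (lookup_set, lookup_start, lookup_next, lookup_finish) are hypothetical. The sketch takes the "wait for all threads" route at cleanup, ignores SRV priorities, and waits on a condition variable only; a real version would also have to watch the open sockets.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_JOBS 16

struct lookup_set;

struct lookup_job {
    struct lookup_set *set;
    const char *host, *port;
    pthread_t tid;
};

struct lookup_set {
    pthread_mutex_t lock;
    pthread_cond_t changed;
    struct addrinfo *ready[MAX_JOBS];   /* results, in completion order */
    size_t n_ready, n_taken, outstanding, n_jobs;
    struct lookup_job jobs[MAX_JOBS];
};

static void *lookup_worker(void *arg)
{
    struct lookup_job *job = arg;
    struct lookup_set *set = job->set;
    struct addrinfo hints, *ai = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_DGRAM;
    if (getaddrinfo(job->host, job->port, &hints, &ai) != 0)
        ai = NULL;                      /* failed lookups queue nothing */

    pthread_mutex_lock(&set->lock);
    if (ai != NULL)
        set->ready[set->n_ready++] = ai;
    set->outstanding--;
    pthread_cond_broadcast(&set->changed);
    pthread_mutex_unlock(&set->lock);
    return NULL;
}

/* Start one lookup thread per host; returns how many were started. */
size_t lookup_start(struct lookup_set *set, const char *const *hosts,
                    size_t n, const char *port)
{
    memset(set, 0, sizeof(*set));
    pthread_mutex_init(&set->lock, NULL);
    pthread_cond_init(&set->changed, NULL);
    if (n > MAX_JOBS)
        n = MAX_JOBS;

    /* Workers only need the lock when they finish, so holding it here
     * does not serialize the lookups themselves. */
    pthread_mutex_lock(&set->lock);
    for (size_t i = 0; i < n; i++) {
        struct lookup_job *job = &set->jobs[set->n_jobs];
        job->set = set;
        job->host = hosts[i];
        job->port = port;
        if (pthread_create(&job->tid, NULL, lookup_worker, job) == 0) {
            set->n_jobs++;
            set->outstanding++;
        }
    }
    pthread_mutex_unlock(&set->lock);
    return set->n_jobs;
}

/* Block until another result is available or all lookups are done.
 * Returns a list the caller must freeaddrinfo, or NULL when finished. */
struct addrinfo *lookup_next(struct lookup_set *set)
{
    struct addrinfo *ai = NULL;

    pthread_mutex_lock(&set->lock);
    while (set->n_taken == set->n_ready && set->outstanding > 0)
        pthread_cond_wait(&set->changed, &set->lock);
    if (set->n_taken < set->n_ready)
        ai = set->ready[set->n_taken++];
    pthread_mutex_unlock(&set->lock);
    return ai;
}

/* Wait for every worker to exit, then free results nobody consumed. */
void lookup_finish(struct lookup_set *set)
{
    for (size_t i = 0; i < set->n_jobs; i++)
        pthread_join(set->jobs[i].tid, NULL);
    while (set->n_taken < set->n_ready)
        freeaddrinfo(set->ready[set->n_taken++]);
    pthread_cond_destroy(&set->changed);
    pthread_mutex_destroy(&set->lock);
}
```

The caller would loop on lookup_next, appending each returned list to the address list from the first approach, and call lookup_finish before returning so that no worker is still running library code.
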
* When we're contacting a KDC, sometimes the caller wants to know
if it's one of the "master" KDCs. For that, we look up the master
KDCs in DNS or in the config file, look up their addresses, and then
check the list for the KDC from which we got a response. To be
thorough, we should handle a KDC hostname that isn't in the master-
KDC list, but has an address(+port) that is mapped to by one of the
master-KDC names as well. So *if* the caller wants that information,
once we try contacting a non-master KDC, we could start querying for
the master KDC addresses as well (if there are names we haven't
already looked up). I'm still thinking about this one....
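
For the address(+port) comparison, something like this C sketch would do. The function names are hypothetical; it assumes the master-KDC addresses are held as a getaddrinfo result list, handles only IPv4 and IPv6, and ignores IPv6 scope IDs.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Return 1 if two socket addresses share family, address, and port. */
static int same_endpoint(const struct sockaddr *a, const struct sockaddr *b)
{
    if (a->sa_family != b->sa_family)
        return 0;
    if (a->sa_family == AF_INET) {
        const struct sockaddr_in *x = (const void *)a;
        const struct sockaddr_in *y = (const void *)b;
        return x->sin_port == y->sin_port &&
               x->sin_addr.s_addr == y->sin_addr.s_addr;
    }
    if (a->sa_family == AF_INET6) {
        const struct sockaddr_in6 *x = (const void *)a;
        const struct sockaddr_in6 *y = (const void *)b;
        return x->sin6_port == y->sin6_port &&
               memcmp(&x->sin6_addr, &y->sin6_addr,
                      sizeof(x->sin6_addr)) == 0;
    }
    return 0;   /* unknown address family: treat as no match */
}

/* Return 1 if the KDC that replied matches any master-KDC address. */
int is_master_endpoint(const struct sockaddr *replied,
                       const struct addrinfo *masters)
{
    for (; masters != NULL; masters = masters->ai_next)
        if (same_endpoint(replied, masters->ai_addr))
            return 1;
    return 0;
}
```

Because the check works on addresses, not names, it also catches a KDC hostname that is missing from the master list but resolves to the same address and port as one of the master names.
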
Ken