thread safety requirements in MIT krb5 libraries

Thu Dec 18 15:30:25 EST 2003

On Thursday, December 18, 2003 00:41:27 -0500 Ken Raeburn <raeburn at mit.edu> 
wrote:

> Callbacks:
>
> I don't think the ability to register callback functions for thread
> system operations will be needed.  It appears that for all the platforms
> we care about, there's a standard thread system (or more than one,
> mapping to the same basic OS interface so as to be interoperable).   We
> were concerned that people might want to be able to use, for example, the
> GNU Pth library instead of native, preemptive pthreads, but I haven't
> heard many people expressing interest.

At some point, AFS is going to gain a Kerberos dependency.  When that 
happens, being tied to a single threading model will be mildly annoying, 
since it won't be ours.  But it's still the right approach, IMHO, and I 
imagine we'll make do.

> Thread-safe objects:
>
> We've been assuming for a long time that a krb5_context would be used in
> only one thread at a time, for performance reasons and to reduce the
> implementation work on our part.  I don't think we've talked about the
> krb5_auth_context objects; I've kind of assumed they'd always be used in
> conjunction with the krb5_context under which they were created.

> There's been discussion about being able to use certain other objects in
> multiple threads, and to take objects created in one thread and context
> and use them in another thread and context serially.  Specifically,
> replay caches, credentials caches, and key tables would be good to have
> locked so they can be shared across threads (especially the replay cache,
> for a multithreaded server).

For some of these objects, I'd think it would be reasonable to declare them 
non-reentrant, and make locking the application's problem -- as long as 
they're not tied to a particular krb5_context, and don't magically get 
destroyed along with the context used to create them.  I suspect 
krb5_auth_context might also fall into this category.

> File locking:
>
> We could maintain a global (per-process) list of files held open with
> locking capability, and check the 'fstat' data for files in this cache
> before locking another file.

Actually, you can do one better.  Before opening a file, stat it by name, 
and see if it matches one of the things in the cache.  If so, you can block 
on actually opening it until the matching cache entry is free.  It's not 
perfect -- if you decide there is _not_ a matching cache entry, then 
there's a race which could result in opening something that turns out to be 
in the cache after all (if someone substitutes the file out from under 
you).  But this isn't really any worse than the situation without the 
pre-open check.
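
A minimal sketch of that pre-open check, assuming POSIX threads and a
small fixed-size table.  The names claim_file, release_file and
OPEN_FILE_SLOTS are hypothetical, not part of any krb5 API.  A file is
identified by its st_dev/st_ino pair, and a thread that finds a matching
entry waits on a condition variable until the entry is released.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

#define OPEN_FILE_SLOTS 16      /* hypothetical fixed-size table */

struct open_file {
    bool  in_use;
    dev_t dev;                  /* st_dev + st_ino identify a file */
    ino_t ino;
};

static struct open_file open_files[OPEN_FILE_SLOTS];
static pthread_mutex_t open_files_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t open_files_freed = PTHREAD_COND_INITIALIZER;

/* Return the slot holding (dev, ino), or -1.  Caller holds the lock. */
static int find_slot(dev_t dev, ino_t ino)
{
    for (int i = 0; i < OPEN_FILE_SLOTS; i++)
        if (open_files[i].in_use && open_files[i].dev == dev &&
            open_files[i].ino == ino)
            return i;
    return -1;
}

/*
 * Stat PATH by name, wait until no cache entry matches it, then claim a
 * slot for it.  Returns the slot index, or -1 if stat fails or the table
 * is full.  The caller would open the file next and fstat the descriptor
 * to confirm it still matches what was claimed.
 */
int claim_file(const char *path)
{
    struct stat st;
    int slot = -1;

    if (stat(path, &st) != 0)
        return -1;
    pthread_mutex_lock(&open_files_lock);
    while (find_slot(st.st_dev, st.st_ino) >= 0)
        pthread_cond_wait(&open_files_freed, &open_files_lock);
    for (int i = 0; i < OPEN_FILE_SLOTS && slot < 0; i++)
        if (!open_files[i].in_use)
            slot = i;
    if (slot >= 0) {
        open_files[slot].in_use = true;
        open_files[slot].dev = st.st_dev;
        open_files[slot].ino = st.st_ino;
    }
    pthread_mutex_unlock(&open_files_lock);
    return slot;
}

/* Give the slot back and wake any thread waiting for that file. */
void release_file(int slot)
{
    pthread_mutex_lock(&open_files_lock);
    open_files[slot].in_use = false;
    pthread_cond_broadcast(&open_files_freed);
    pthread_mutex_unlock(&open_files_lock);
}
```

The window between stat and open is the race described above; comparing
the fstat result of the opened descriptor against the claimed entry
detects a substitution but does not prevent it.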

> Sam suggested we could consider an application restriction that all
> references to a file must use the same name, i.e., "/var/tmp/foobar" and
> "/tmp/foobar" would be handled as separate files, even if "/tmp" is a
> symlink to "/var/tmp".  It would simplify things greatly, and in most
> cases wouldn't be a problem, but it still makes me uncomfortable.

Me too.  Pathnames often come from users, not application developers.
Users will expect correct behaviour when they use two names that refer to
the same file.  If the library doesn't detect that, what they'll see
instead is a hard-to-reproduce, non-deterministic failure in which a file
sometimes gets corrupted.  That will generate support load for application
developers and for krbdev.

> Using
> the pathname would, however, be a good first cut -- i.e., we shouldn't
> need to use fstat to know that two replay caches opened with the same
> absolute pathname will be the same, given that the replay cache is
> supposed to refresh from the file if it changes.

Probably true.  There's an interesting corner case, though -- if the files 
do turn out to be different, which data do you store in the stat cache?  If 
you keep the old data, then you may be able to deal with someone moving the 
replay cache out from under you and replacing it with a new one (is that 
expected?), but you'll never notice if some other pathname is the same as 
the new file -- unless it happened to be the same as the old file (e.g. a 
file in /var/tmp with /tmp a symlink there).

> The filename checks would be a good idea in any case, though.  If two or
> ten threads open the same replay cache, it's silly to have multiple
> copies of all of the data, and have each thread reload it every time
> another thread changes it.

Absolutely -- if multiple threads open the same replay cache, they should 
get references to the same data structure, with locking inside the library. 
The same is probably true of keytabs and ccaches, though with those the 
update rate is low enough that it might not be worth the effort.
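
One way this sharing could look, as a sketch only: a process-wide
registry keyed by name hands out reference-counted handles, and each
handle carries its own mutex for the cached data.  The struct and
function names here are hypothetical and are not the krb5 rcache
interface; matching is by exact name, so it inherits the symlink caveat
discussed earlier.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct shared_rcache {
    struct shared_rcache *next;
    char *name;
    int refcount;               /* protected by registry_lock */
    pthread_mutex_t lock;       /* would protect the replay data itself */
};

static struct shared_rcache *registry;
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return the existing handle for NAME, or create one.  NULL on failure. */
struct shared_rcache *shared_rcache_open(const char *name)
{
    struct shared_rcache *rc;
    size_t len = strlen(name) + 1;

    pthread_mutex_lock(&registry_lock);
    for (rc = registry; rc != NULL; rc = rc->next)
        if (strcmp(rc->name, name) == 0)
            break;
    if (rc == NULL && (rc = calloc(1, sizeof(*rc))) != NULL) {
        rc->name = malloc(len);
        if (rc->name == NULL) {
            free(rc);
            rc = NULL;
        } else {
            memcpy(rc->name, name, len);
            pthread_mutex_init(&rc->lock, NULL);
            rc->next = registry;
            registry = rc;
        }
    }
    if (rc != NULL)
        rc->refcount++;
    pthread_mutex_unlock(&registry_lock);
    return rc;
}

/* Drop one reference; the last close unlinks and frees the handle. */
void shared_rcache_close(struct shared_rcache *rc)
{
    struct shared_rcache **p;

    pthread_mutex_lock(&registry_lock);
    if (--rc->refcount == 0) {
        for (p = &registry; *p != rc; p = &(*p)->next)
            ;
        *p = rc->next;
        pthread_mutex_destroy(&rc->lock);
        free(rc->name);
        free(rc);
    }
    pthread_mutex_unlock(&registry_lock);
}
```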

> Dynamic loading:
>
> I believe it's going to be a requirement that we be able to load the
> Kerberos or GSSAPI library dynamically, do some stuff with it, and unload
> it, and repeat the cycle, without resource leaks, at least for a properly
> written program.  So any thread-specific storage or globally-used heap
> storage we keep but hide from the application needs to be freed up when
> the library is unloaded.  That shouldn't be hard, but the internal APIs
> we use for per-thread storage might need a little adjusting from the
> POSIX versions to support this better.

Hrm.  POSIX allows an implementation-defined limit on the number of 
thread-specific storage keys that can be created.  I don't know what the 
lower bound is on this limit, but it's probably best to strictly limit the 
number of these that might be created by Kerberos.
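
For reference, POSIX requires PTHREAD_KEYS_MAX to be at least
_POSIX_THREAD_KEYS_MAX, which is 128, and the actual limit can be queried
with sysconf(_SC_THREAD_KEYS_MAX).  A sketch of one way to stay well
under it: the library creates a single key and hangs one struct off it,
adding fields to the struct instead of creating more keys.  The struct
layout and function name are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/* Everything the library wants per thread lives in this one struct. */
struct lib_thread_data {
    int  last_error;
    char errbuf[128];
};

static pthread_key_t lib_key;
static pthread_once_t lib_key_once = PTHREAD_ONCE_INIT;
static int lib_key_status = -1;

static void make_key(void)
{
    /* free() runs as the destructor when a thread exits. */
    lib_key_status = pthread_key_create(&lib_key, free);
}

/* Return this thread's data block, allocating it on first use. */
struct lib_thread_data *lib_get_thread_data(void)
{
    struct lib_thread_data *td;

    pthread_once(&lib_key_once, make_key);
    if (lib_key_status != 0)
        return NULL;
    td = pthread_getspecific(lib_key);
    if (td == NULL) {
        td = calloc(1, sizeof(*td));
        if (td != NULL && pthread_setspecific(lib_key, td) != 0) {
            free(td);
            td = NULL;
        }
    }
    return td;
}
```

Note that pthread_key_delete does not invoke destructors, so a library
that is unloaded while threads are still running has to track and free
the per-thread blocks itself.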

> Cancellation:
>
> It would probably be difficult, if not impossible, to make the library
> code be async cancel safe.  However, making it safe for synchronous
> cancellation may be doable.  How much do threaded programs actually use
> pthread_cancel or the Windows equivalent?

I'm not sure how hard it might be in practice -- mostly you need to make 
sure you release any locks you're holding, and don't leave the data 
structures they're protecting in a broken state.

However...  Solaris documentation indicates that the default cancellation
type is deferred, and that it is appropriate to expect applications and
libraries not to be async cancel safe.
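
With deferred cancellation, releasing locks comes down to registering a
cleanup handler around any region that holds a lock and may reach a
cancellation point.  A minimal sketch with a hypothetical table and lock:

```c
#include <pthread.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int table_entries;

/* Cleanup handler: runs if the thread is cancelled inside the region. */
static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock(arg);
}

/* Add an entry and return the new count. */
int table_add_entry(void)
{
    int n;

    pthread_mutex_lock(&table_lock);
    pthread_cleanup_push(unlock_on_cancel, &table_lock);
    /*
     * Work that may hit a cancellation point (read, write, ...) goes
     * here.  Keep the protected data consistent before each such call.
     */
    table_entries++;
    n = table_entries;
    pthread_cleanup_pop(1);     /* nonzero: run the handler now, unlocking */
    return n;
}
```

pthread_cleanup_push and pthread_cleanup_pop must appear as a matched
pair in the same lexical scope, which is why the return sits after the
pop.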

> There's another kind of cancellation that might be desirable too.  If
> thread 1 manages a GUI with a cancel button, and thread 2 is waiting for
> packets from the nameserver or KDC, or is running a long calculation to
> generate a key, thread 1 may want to tell thread 2 to stop what it's
> doing.  Cancelling the entire thread is one way of doing that, but might
> we want to be able to cancel just the current operation, and propagate
> that fact up to the caller?  This is probably a question for people
> working on long-running Mac and Windows GUI programs; UNIX guys like me
> just hit control-C in our terminal windows. :-)
>
> This latter form may just involve having a flag someplace (krb5 context?)
> that will tell the Kerberos library to simply return a special error code
> from whatever it's currently doing, as quickly as possible.  But how do
> we get the flag set?  Do multiple threads come into it again?

You put a flag in the krb5_context, protect it with a mutex, and provide an
API call to set it.  This API call has the unique property that it can
safely be called on a krb5_context that is in use by another thread.  And,
of course, you check the flag at appropriate times.
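
A sketch of that flag, using a stand-in struct instead of the real
krb5_context.  Every name here, including the error code, is
hypothetical.  Clearing the flag when it is observed is an assumption
made so that the context can be reused afterwards.

```c
#include <pthread.h>
#include <stdbool.h>

#define MY_ERR_CANCELLED 1      /* hypothetical "operation cancelled" code */

struct my_context {
    pthread_mutex_t cancel_lock;
    bool cancel_requested;      /* protected by cancel_lock */
};

void my_context_init(struct my_context *ctx)
{
    pthread_mutex_init(&ctx->cancel_lock, NULL);
    ctx->cancel_requested = false;
}

/* The one call that is safe on a context in use by another thread. */
void my_request_cancel(struct my_context *ctx)
{
    pthread_mutex_lock(&ctx->cancel_lock);
    ctx->cancel_requested = true;
    pthread_mutex_unlock(&ctx->cancel_lock);
}

/* Read and clear the flag. */
static bool my_check_cancel(struct my_context *ctx)
{
    bool cancelled;

    pthread_mutex_lock(&ctx->cancel_lock);
    cancelled = ctx->cancel_requested;
    ctx->cancel_requested = false;
    pthread_mutex_unlock(&ctx->cancel_lock);
    return cancelled;
}

/* A long-running operation that polls the flag between steps. */
int my_long_operation(struct my_context *ctx, int steps)
{
    for (int i = 0; i < steps; i++) {
        if (my_check_cancel(ctx))
            return MY_ERR_CANCELLED;
        /* ... one unit of work, e.g. one retransmit-and-wait cycle ... */
    }
    return 0;
}
```

A thread blocked in a long network wait would only see the flag when the
wait returns, so the polling interval bounds how quickly the operation
can stop.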

> Feature testing:
>
> How does an application know if the Kerberos library it's using is
> thread-safe?  We can do all we want in the Kerberos library, but if we've
> only got gethostbyname available for address lookups, for example, the
> resulting program can't be thread-safe.  And I know at least one
> getaddrinfo implementation that's not thread-safe.

You lock around it, if necessary.  You might also document which API
routines call library routines that may not be thread-safe, so that the app
can lock around them too.
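
A sketch of locking around gethostbyname, with a hypothetical wrapper
name.  The result is copied out while the lock is still held, because
gethostbyname returns a pointer to static storage that the next call may
overwrite.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

static pthread_mutex_t resolver_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Look up NAME and store its first IPv4 address in *OUT.
 * Returns 0 on success, -1 if the lookup fails or yields no IPv4 address.
 */
int locked_lookup_ipv4(const char *name, struct in_addr *out)
{
    struct hostent *he;
    int ret = -1;

    pthread_mutex_lock(&resolver_lock);
    he = gethostbyname(name);
    if (he != NULL && he->h_addrtype == AF_INET &&
        he->h_length == (int)sizeof(*out) && he->h_addr_list[0] != NULL) {
        memcpy(out, he->h_addr_list[0], sizeof(*out));
        ret = 0;
    }
    pthread_mutex_unlock(&resolver_lock);
    return ret;
}
```

The lock only helps if every gethostbyname caller in the process goes
through it, application code included, which is the reason for
documenting which API routines are affected.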

> Should we provide a way to describe which objects can be used
> simultaneously from multiple threads and which cannot, in case we add
> mutex protection to additional objects in the future?

No.  Document the API, and leave it at that.  Runtime checks add a level of 
complexity both for Kerberos and the application, and I don't think they'll 
be heavily used.

-- Jeff