thread safety requirements in MIT krb5 libraries

Thu Dec 18 00:41:27 EST 2003

The proposal for thread safety changes to the MIT Kerberos library
almost two years ago had certain assumptions built in, some of which
no longer seem relevant, or at least seem far less important than we
had guessed they might be.

I'd like to get some discussion on some of them.

Previous discussions, for reference:
   http://diswww.mit.edu:8008/menelaus.mit.edu/krb5dev/6761
   http://mailman.mit.edu/pipermail/krbdev/2003-August/001838.html
and other messages from those threads.

Callbacks:

I don't think the ability to register callback functions for thread
system operations will be needed.  It appears that for all the
platforms we care about, there's a standard thread system (or more than
one, mapping to the same basic OS interface so as to be interoperable).
We were concerned that people might want to be able to use, for
example, the GNU Pth library instead of a native, preemptive pthreads
implementation, but I haven't heard many people expressing interest.

Not supporting callback registration makes some things easier, like
letting us use the system mutex type instead of always calling an
allocation function that returns a pointer.  In particular, it also
lets us use statically initialized mutexes or pthread_once_t (or
equivalent) objects.
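
To make the static-initialization point concrete, here is a minimal
sketch assuming POSIX threads.  The names (g_lock, lib_bump_counter,
and so on) are invented for illustration and are not part of any krb5
API.  Nothing has to be allocated or registered before first use.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical library-global state; names are illustrative only. */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static int g_init_count = 0;
static int g_counter = 0;

static void lib_init(void)
{
    /* Runs once per process, however many threads race to get here. */
    g_init_count++;
}

/* Returns the new counter value, or -1 if a pthread call fails. */
int lib_bump_counter(void)
{
    int value;

    if (pthread_once(&g_once, lib_init) != 0)
        return -1;
    assert(g_init_count == 1);
    if (pthread_mutex_lock(&g_lock) != 0)
        return -1;
    value = ++g_counter;
    pthread_mutex_unlock(&g_lock);
    return value;
}

int lib_init_count(void)
{
    return g_init_count;
}
```

With callback registration, neither initializer could be used, since
the mutex type would have to be an opaque pointer filled in at run time.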

I brought this up in August, but it didn't generate much discussion.

Single-threaded programs:

Obviously, a single-threaded program must continue to work.  I believe 
creating threads in a program written to be single-threaded may 
complicate the signal handling semantics quite a bit, so the library 
can't create threads of its own.  To the best of my knowledge, though, 
compiling against system threading headers and simply not ever calling 
the thread creation functions should be fully compatible with a 
single-threaded application.

It would be nice if we could avoid having to link against the system 
pthread library, when it is a separate library, if the program isn't 
going to create threads.  Many systems have support for weak references 
(where &foo is a null pointer rather than a link-time error if foo 
isn't linked in), which would let us make the pthread library optional.
If that doesn't work on some system, however, I don't think it's a
disaster if we start requiring the pthread library.  In shared-library
builds, it would happen through library dependencies; in static-library 
builds, if the application builder is using the "krb5-config" script 
we've been trying to encourage people to use (with what degree of 
success, I don't know), then it should still be automatic.
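
A rough sketch of the weak-reference check, assuming a toolchain that
accepts "#pragma weak" (GCC and Clang do; others may not).  The helper
names are hypothetical.  On systems where the pthread functions live in
the C library itself, the check simply always succeeds.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Assumption: the compiler honors #pragma weak, so these symbols resolve
   to a null address when no library providing them is linked in. */
#pragma weak pthread_mutex_lock
#pragma weak pthread_mutex_unlock

/* Hypothetical helper: nonzero if the thread functions are present. */
int threads_available(void)
{
    return &pthread_mutex_lock != NULL && &pthread_mutex_unlock != NULL;
}

/* Lock only when the thread library is actually there; a program that
   never linked it cannot have a second thread to contend with. */
int maybe_lock(pthread_mutex_t *m)
{
    assert(m != NULL);
    return threads_available() ? pthread_mutex_lock(m) : 0;
}

int maybe_unlock(pthread_mutex_t *m)
{
    assert(m != NULL);
    return threads_available() ? pthread_mutex_unlock(m) : 0;
}
```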

Shim layer:

I think we still want a set of macros and functions to provide an API 
similar to a subset of the POSIX API, rather than always using the 
POSIX API directly and emulating it on platforms like Windows.

Using a new API requires us to specify precisely what we require of the 
underlying thread system, so that porting to a new non-POSIX system is 
more straightforward.  It also allows us to easily implement checks for 
weak references on the platforms that use them, without having to 
repeatedly code such conditionalized checks throughout the libraries.
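
Purely as an illustration of the shape such a layer could take (this is
not a proposed API, and every shim_* name below is made up), a thin
wrapper might map one small set of operations onto either POSIX mutexes
or Windows critical sections:

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION shim_mutex;
int shim_mutex_init(shim_mutex *m)    { InitializeCriticalSection(m); return 0; }
int shim_mutex_lock(shim_mutex *m)    { EnterCriticalSection(m); return 0; }
int shim_mutex_unlock(shim_mutex *m)  { LeaveCriticalSection(m); return 0; }
int shim_mutex_destroy(shim_mutex *m) { DeleteCriticalSection(m); return 0; }
#else
#include <pthread.h>
typedef pthread_mutex_t shim_mutex;
int shim_mutex_init(shim_mutex *m)    { return pthread_mutex_init(m, NULL); }
int shim_mutex_lock(shim_mutex *m)    { return pthread_mutex_lock(m); }
int shim_mutex_unlock(shim_mutex *m)  { return pthread_mutex_unlock(m); }
int shim_mutex_destroy(shim_mutex *m) { return pthread_mutex_destroy(m); }
#endif

/* Library code is then written once, against the shim only.
   Returns the incremented value, or -1 if the lock could not be taken. */
int shim_demo_increment(shim_mutex *m, int *counter)
{
    int value;

    assert(m != NULL && counter != NULL);
    if (shim_mutex_lock(m) != 0)
        return -1;
    value = ++*counter;
    shim_mutex_unlock(m);
    return value;
}
```

Any weak-reference test would live inside the POSIX branch of the
wrappers, so callers never see it.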

Also, Sam has convinced me that an auxiliary library we link against 
may not be a totally evil thing, especially if it's automatically 
pulled in via shared library dependencies or "krb5-config --libs".  
This would be a good place to put any thread support functions we find 
we need.  (And also, probably, a replacement for getaddrinfo, for 
platforms where it doesn't exist, like the old IRIX version MIT is 
using, or where it's too broken for us to use, like Mac OS X.  And 
other stuff like that.)  This avoids any need to compile stuff into 
each library, with library-dependent prefixes to avoid name collisions.

I'm not suggesting any specific API at this point; still thinking about 
that.

Thread-safe objects:

We've been assuming for a long time that a krb5_context would be used 
in only one thread at a time, for performance reasons and to reduce the 
implementation work on our part.  I don't think we've talked about the 
krb5_auth_context objects; I've kind of assumed they'd always be used 
in conjunction with the krb5_context under which they were created.

There's been discussion about being able to use certain other objects 
in multiple threads, and to take objects created in one thread and 
context and use them in another thread and context serially.  
Specifically, replay caches, credentials caches, and key tables would 
be good to have locked so they can be shared across threads (especially 
the replay cache, for a multithreaded server).  Principal names and 
other small data we'd like to be able to "move" from one context to 
another, so they shouldn't share any data with the krb5_context, or if
they do, we document the heck out of which functions return references
to context data and which functions create independent copies.

This also means that, unlike in the scheme proposed two years ago, an 
arbitrary number of locks may be needed, so we need the ability to 
create and destroy them dynamically, rather than request a fixed number 
at initialization time.
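
For example, a shareable object could carry its own lock, created and
destroyed along with it.  This is only a sketch assuming POSIX threads;
"struct shared_cache" is a hypothetical stand-in for something like a
replay cache, not an existing krb5 type.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shareable object such as a replay cache. */
struct shared_cache {
    pthread_mutex_t lock;   /* one lock per object, created with it */
    int entries;
};

struct shared_cache *shared_cache_create(void)
{
    struct shared_cache *c = malloc(sizeof(*c));

    if (c == NULL)
        return NULL;
    if (pthread_mutex_init(&c->lock, NULL) != 0) {
        free(c);
        return NULL;
    }
    c->entries = 0;
    return c;
}

/* Safe to call from several threads on the same object.
   Returns the new entry count, or -1 if locking fails. */
int shared_cache_add(struct shared_cache *c)
{
    int n;

    assert(c != NULL);
    if (pthread_mutex_lock(&c->lock) != 0)
        return -1;
    n = ++c->entries;
    pthread_mutex_unlock(&c->lock);
    return n;
}

void shared_cache_destroy(struct shared_cache *c)
{
    if (c == NULL)
        return;
    pthread_mutex_destroy(&c->lock);
    free(c);
}
```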

File locking:

It looks like the standard UNIX/POSIX file locking techniques (fcntl
record locks) have some interesting drawbacks for multiple threads in
contention over a file.  Basically, the locks are per file and not per
file descriptor, and per process and not per thread.  Furthermore,
closing any file descriptor opened on a given file releases the
process's locks on that file, so opening a file, doing an fstat to see
if it's the same as an already opened file, and closing it, will
release any locks held on the opened file in another thread of the same
process.
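
The close problem can be demonstrated with a small test, sketched here
under the assumption of a POSIX system with fcntl locking on a local
file system.  The function names are invented.  A forked child is used
as the observer, because lock requests from the same process never
conflict with each other.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask, from a child process, whether another process holds a lock on
   `path`.  Returns 1 if locked, 0 if not, -1 on error. */
static int locked_by_other_process(const char *path)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl;
        int fd = open(path, O_RDWR);

        if (fd < 0)
            _exit(2);
        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;     /* l_start = l_len = 0: whole file */
        if (fcntl(fd, F_GETLK, &fl) != 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/* Returns 1 if closing a second descriptor dropped the lock taken via
   the first one, 0 if the lock survived, -1 on error. */
int close_drops_lock(const char *path)
{
    struct flock fl;
    int fd1, fd2, before, after;

    assert(path != NULL);
    fd1 = open(path, O_RDWR | O_CREAT, 0600);
    if (fd1 < 0)
        return -1;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd1, F_SETLK, &fl) != 0) {
        close(fd1);
        return -1;
    }
    before = locked_by_other_process(path);

    fd2 = open(path, O_RDONLY);     /* e.g. only to fstat() and compare */
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }
    close(fd2);                     /* this is what releases the lock */
    after = locked_by_other_process(path);

    close(fd1);
    unlink(path);
    if (before < 0 || after < 0)
        return -1;
    return before == 1 && after == 0;
}
```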

We could maintain a global (per-process) list of files held open with 
locking capability, and check the 'fstat' data for files in this cache 
before locking another file.  A file matching an already open file can 
be closed as soon as we know we don't have any locks held on that file 
via other file descriptors.  If a file is opened and closed without 
locking, and another thread has a lock on another file descriptor on 
the same file, we may have a problem...but perhaps we could just have 
the close operation block on the release of the lock?

Sam suggested we could consider an application restriction that all 
references to a file must use the same name, i.e., "/var/tmp/foobar" 
and "/tmp/foobar" would be handled as separate files, even if "/tmp" is 
a symlink to "/var/tmp".  It would simplify things greatly, and in most 
cases wouldn't be a problem, but it still makes me uncomfortable.  
Using the pathname would, however, be a good first cut -- i.e., we 
shouldn't need to use fstat to know that two replay caches opened with 
the same absolute pathname will be the same, given that the replay 
cache is supposed to refresh from the file if it changes.

The filename checks would be a good idea in any case, though.  If two 
or ten threads open the same replay cache, it's silly to have multiple 
copies of all of the data, and have each thread reload it every time 
another thread changes it.
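
A minimal sketch of the pathname-keyed first cut, assuming POSIX
threads and exact string comparison of names (so the symlink case above
is deliberately not handled).  "struct shared_file" and its functions
are hypothetical.  Because each name is opened once per process and
closed only with the last reference, no stray close can drop a lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical shared handle: one per distinct pathname. */
struct shared_file {
    char *path;
    int fd;
    int refs;
    struct shared_file *next;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct shared_file *table = NULL;

/* Return the shared handle for `path`, opening the file only if nobody
   has it open under exactly the same name.  NULL on failure. */
struct shared_file *shared_file_open(const char *path)
{
    struct shared_file *f;

    pthread_mutex_lock(&table_lock);
    for (f = table; f != NULL; f = f->next)
        if (strcmp(f->path, path) == 0)
            break;
    if (f != NULL) {
        f->refs++;
    } else if ((f = malloc(sizeof(*f))) != NULL) {
        f->path = malloc(strlen(path) + 1);
        f->fd = f->path ? open(path, O_RDWR | O_CREAT, 0600) : -1;
        if (f->fd < 0) {
            free(f->path);
            free(f);
            f = NULL;
        } else {
            strcpy(f->path, path);
            f->refs = 1;
            f->next = table;
            table = f;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return f;
}

/* Drop one reference; the descriptor is closed only with the last one. */
void shared_file_close(struct shared_file *f)
{
    struct shared_file **p;

    assert(f != NULL && f->refs > 0);
    pthread_mutex_lock(&table_lock);
    if (--f->refs == 0) {
        for (p = &table; *p != f; p = &(*p)->next)
            ;
        *p = f->next;
        close(f->fd);
        free(f->path);
        free(f);
    }
    pthread_mutex_unlock(&table_lock);
}
```

The in-memory copy of the cache data could hang off the same handle, so
that ten threads opening one replay cache share one copy.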

Dynamic loading:

I believe it's going to be a requirement that we be able to load the 
Kerberos or GSSAPI library dynamically, do some stuff with it, and 
unload it, and repeat the cycle, without resource leaks, at least for a 
properly written program.  So any thread-specific storage or 
globally-used heap storage we keep but hide from the application needs 
to be freed up when the library is unloaded.  That shouldn't be hard, 
but the internal APIs we use for per-thread storage might need a little 
adjusting from the POSIX versions to support this better.

What does "a properly written program" mean in this case?  Well, I 
assume the caller will probably have to free up any objects it's 
created through the library APIs before unloading the library.  The 
unload-time cleanup should only need to deal with stuff we maintain 
under the covers.
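
As a sketch of the per-thread storage part, assuming POSIX thread
keys: the tsd_* names are hypothetical, and how tsd_init and tsd_fini
get called at load and unload time is platform-specific and left out.
Note that pthread_key_delete does not run destructors, which is one
reason the internal API may need to differ from the POSIX one.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t tsd_key;
static int tsd_key_valid = 0;

/* Hypothetical load-time hook.  Returns 0 on success. */
int tsd_init(void)
{
    if (tsd_key_valid)
        return 0;
    /* free() runs on each thread's value when that thread exits. */
    if (pthread_key_create(&tsd_key, free) != 0)
        return -1;
    tsd_key_valid = 1;
    return 0;
}

/* Per-thread scratch buffer, allocated lazily on first use. */
void *tsd_buffer(size_t size)
{
    void *p;

    assert(size > 0);
    if (!tsd_key_valid)
        return NULL;
    p = pthread_getspecific(tsd_key);
    if (p == NULL) {
        p = calloc(1, size);
        if (p != NULL && pthread_setspecific(tsd_key, p) != 0) {
            free(p);
            p = NULL;
        }
    }
    return p;
}

/* Hypothetical unload-time hook.  The calling thread's value is freed
   by hand; values owned by other threads that are still running would
   need separate tracking, which this sketch does not attempt. */
void tsd_fini(void)
{
    if (!tsd_key_valid)
        return;
    free(pthread_getspecific(tsd_key));
    pthread_key_delete(tsd_key);
    tsd_key_valid = 0;
}
```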

Cancellation:

It would probably be difficult, if not impossible, to make the library
code async-cancel-safe.  However, making it safe for synchronous
(deferred) cancellation may be doable.  How much do threaded programs
actually use pthread_cancel or the Windows equivalent?
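
For reference, this is roughly what being safe under deferred
cancellation involves, sketched with the POSIX cleanup-handler calls.
The blocking function is a made-up stand-in for a library call that
waits while holding an internal lock.

```c
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;

static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

/* Stand-in for a library call that blocks (say, waiting for a reply)
   while holding an internal lock. */
static void *blocking_operation(void *unused)
{
    struct timespec ts = { 0, 1000000 };    /* 1 ms */

    (void)unused;
    pthread_mutex_lock(&work_lock);
    pthread_cleanup_push(unlock_on_cancel, &work_lock);
    for (;;) {
        pthread_testcancel();       /* explicit cancellation point */
        nanosleep(&ts, NULL);       /* also a cancellation point */
    }
    pthread_cleanup_pop(1);         /* not reached; pairs with the push */
    return NULL;
}

/* Start the operation, cancel it, and check the lock afterwards.
   Returns 1 if the thread was cancelled and the lock is free again,
   0 if not, -1 if a pthread call failed. */
int cancel_releases_lock(void)
{
    pthread_t t;
    void *result;

    if (pthread_create(&t, NULL, blocking_operation, NULL) != 0)
        return -1;
    if (pthread_cancel(t) != 0)
        return -1;
    if (pthread_join(t, &result) != 0)
        return -1;
    if (result != PTHREAD_CANCELED)
        return 0;
    if (pthread_mutex_trylock(&work_lock) != 0)
        return 0;
    pthread_mutex_unlock(&work_lock);
    return 1;
}
```

Every place the library holds a lock or owns heap memory across a
cancellation point would need a handler like this.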

There's another kind of cancellation that might be desirable too.  If 
thread 1 manages a GUI with a cancel button, and thread 2 is waiting 
for packets from the nameserver or KDC, or is running a long 
calculation to generate a key, thread 1 may want to tell thread 2 to 
stop what it's doing.  Cancelling the entire thread is one way of doing 
that, but might we want to be able to cancel just the current 
operation, and propagate that fact up to the caller?  This is probably 
a question for people working on long-running Mac and Windows GUI 
programs; UNIX guys like me just hit control-C in our terminal windows. 
:-)

This latter form may just involve having a flag someplace (krb5 
context?) that will tell the Kerberos library to simply return a 
special error code from whatever it's currently doing, as quickly as 
possible.  But how do we get the flag set?  Do multiple threads come 
into it again?
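
One possible shape for the flag, sketched with a mutex-protected field.
All names here, including the error code, are hypothetical.  Note that
it has a second thread touching the flag, which is exactly the
"multiple threads" question just raised.

```c
#include <assert.h>
#include <pthread.h>

#define OP_CANCELLED 1      /* hypothetical "operation cancelled" code */

struct op_context {
    pthread_mutex_t lock;
    int cancel_requested;
};

int op_context_init(struct op_context *ctx)
{
    ctx->cancel_requested = 0;
    return pthread_mutex_init(&ctx->lock, NULL);
}

/* Called from another thread, e.g. the one handling a cancel button. */
void op_request_cancel(struct op_context *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    ctx->cancel_requested = 1;
    pthread_mutex_unlock(&ctx->lock);
}

static int op_cancel_pending(struct op_context *ctx)
{
    int pending;

    pthread_mutex_lock(&ctx->lock);
    pending = ctx->cancel_requested;
    pthread_mutex_unlock(&ctx->lock);
    return pending;
}

/* Stand-in for a long calculation: checks the flag between work units
   and gives up with the special code as soon as it is set. */
int long_operation(struct op_context *ctx, int units, int *units_done)
{
    assert(ctx != NULL && units_done != NULL);
    *units_done = 0;
    while (*units_done < units) {
        if (op_cancel_pending(ctx))
            return OP_CANCELLED;
        (*units_done)++;            /* one unit of work */
    }
    return 0;
}
```

A flag only helps where the code polls it; a thread blocked in a system
call waiting for the KDC would need something more to wake it up.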

Feature testing:

How does an application know if the Kerberos library it's using is 
thread-safe?  We can do all we want in the Kerberos library, but if 
we've only got gethostbyname available for address lookups, for 
example, the resulting program can't be thread-safe.  And I know of at
least one getaddrinfo implementation that's not thread-safe.

Should we provide a way to describe which objects can be used 
simultaneously from multiple threads and which cannot, in case we add 
mutex protection to additional objects in the future?

MIT applications:

Unless some good reason for it can be presented, none of MIT's Kerberos 
programs will use threads.  (KDC performance might be improved, for 
example, if one thread can decrypt an incoming message while another 
waits for disk blocks to be paged in from the database.  But I don't 
know that we need that sort of minor gain at this time, especially 
given the work required.)