exiting multithreaded processes, unloading libraries, and cleanup

Fri Nov 14 19:20:54 EST 2008

Our library maintains some internal state, like per-thread data,  
global linked lists of error tables, etc.  We can't simply rewrite our  
public APIs to get rid of it all, and we can't demand that people use  
non-standard APIs for stuff like GSSAPI to help us get around it.   
Because some applications may dynamically load and unload our  
libraries, we use the OS-provided library finalization hooks to free  
some things up at unload time (and I think we may miss some things but  
that's "just" a bug).  Unfortunately, the same hooks get used at  
unload and process-exit time.

We've had some bug reports of sporadic crashes in the Kerberos library
when a multithreaded process exits.  Generally the problem seems to be
that one thread calls exit (which causes library cleanup functions to
be invoked, which in the Kerberos libraries frees up storage, deletes
per-thread-data keys, etc), but another thread is still running code
in the Kerberos library (which needs the freed-up data).

It appears from http://msdn.microsoft.com/en-us/library/ms682583.aspx  
that on Windows we can distinguish unloading a library from process  
termination, so we can just skip the cleanup functions in the process  
termination case, and let the actual process termination free up  
resources.  That just leaves the UNIX builds and KfM.

We can't use atexit() to set a flag to disable the cleanup functions,  
because (1) we get no guarantee of the relative order of execution of  
atexit handlers and library finalization functions, and (2) the  
registered handler would have to be removed if the library is  
unloaded, and that can't be done portably.

It occurred to me that a simple reference-count mechanism may be most  
of what we need.  Not only would an "initialized library" (init  
function called, fini function not yet called) be a reference, but  
certain objects like krb5_context would as well.  So if one thread  
calls exit and thus invokes the library finalizer functions, but  
another thread is actively messing around with a krb5_context, the  
additional internal data in the libraries won't be freed up unless the  
context is destroyed before the exiting thread finishes the cleanup  
functions.  (If the process exits first, obviously all the process  
resources go away at that time, as do any still-running threads that  
might be using them.)
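
To make the idea concrete, here is a minimal sketch in C11, not code
from the tree.  Every name in it (lib_ref, lib_unref, lib_init,
lib_fini, sketch_context and friends) is made up for illustration, and
free_internal_state just sets a flag where the real cleanup would free
error tables, per-thread-data keys, and so on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of these exist in the krb5 tree. */

static atomic_int lib_refs = 0;          /* live references to internal state */
static atomic_bool state_freed = false;  /* set once the state has been torn down */

/* Stand-in for freeing error tables, per-thread-data keys, etc. */
static void free_internal_state(void)
{
    atomic_store(&state_freed, true);
}

static void lib_ref(void)
{
    atomic_fetch_add(&lib_refs, 1);
}

static void lib_unref(void)
{
    /* atomic_fetch_sub returns the previous value; 1 means this was the last reference. */
    int prev = atomic_fetch_sub(&lib_refs, 1);
    assert(prev > 0);
    if (prev == 1)
        free_internal_state();
}

/* Library init hook: the "initialized library" is itself a reference. */
static void lib_init(void)
{
    lib_ref();
}

/* Library fini hook: drops only the library's own reference. */
static void lib_fini(void)
{
    lib_unref();
}

/* Stand-in for an object like krb5_context that also holds a reference. */
struct sketch_context {
    int dummy;
};

static struct sketch_context *sketch_init_context(void)
{
    struct sketch_context *ctx = malloc(sizeof(*ctx));
    if (ctx != NULL) {
        ctx->dummy = 0;
        lib_ref();
    }
    return ctx;
}

static void sketch_free_context(struct sketch_context *ctx)
{
    if (ctx == NULL)
        return;
    free(ctx);
    lib_unref();
}
```

With this shape, calling lib_fini while a context is still live leaves
the internal state alone; the state is freed only when the last context
is destroyed.  The sketch does not deal with a thread that tries to
take a new reference after the count has already reached zero.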

Once the cleanup has been done, ideally only a small subset of
functions (like krb5_init_context) that might need some of this
internal state can be expected to be called.  Those interfaces can
check for the presence of this internal state, or just some flags we
set up for the purpose, and return errors if they're called
post-cleanup.  Unlike some random code somewhere in the middle of a
library function that's running when the cleanup functions zap the
internal state, the entry points of these functions won't be assuming
the existence of that internal state.
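
A rough sketch of such an entry-point check, again in C11 and again
with made-up names (lib_cleaned_up, sketch_lib_fini,
sketch_init_context, and the SKETCH_ERR_* codes are all hypothetical,
not existing krb5 identifiers or error codes):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical error codes for this sketch only. */
enum {
    SKETCH_OK = 0,
    SKETCH_ERR_NO_MEMORY = 1,
    SKETCH_ERR_POST_CLEANUP = 2
};

/* Hypothetical flag set by the finalizer once the internal state is gone. */
static atomic_bool lib_cleaned_up = false;

struct sketch_context {
    int dummy;
};

/* Stand-in for the library finalization function. */
static void sketch_lib_fini(void)
{
    /* ... internal state would be freed here ... */
    atomic_store(&lib_cleaned_up, true);
}

/* Entry point that needs internal state: check the flag first and
 * fail cleanly instead of touching state that may have been freed. */
static int sketch_init_context(struct sketch_context **out)
{
    assert(out != NULL);
    *out = NULL;

    if (atomic_load(&lib_cleaned_up))
        return SKETCH_ERR_POST_CLEANUP;

    struct sketch_context *ctx = malloc(sizeof(*ctx));
    if (ctx == NULL)
        return SKETCH_ERR_NO_MEMORY;
    ctx->dummy = 0;
    *out = ctx;
    return SKETCH_OK;
}
```

The flag check on its own still leaves a window between the check and
the first use of the state, so this is only meant to show where the
check sits, not a complete answer.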

Obviously there's some refinement to do, like breaking it down by
library in case a process unloads gssapi but still has krb5 loaded
through other dependencies.  We also need to figure out which other
object types should hold references, and which APIs should return an
error when invoked post-cleanup, so it's not a trivial project.

There may be some places in the support library where this isn't  
enough, but I think it'll take care of most of the problems.

Does anyone see a problem with this sort of approach?

Ken