exiting multithreaded processes, unloading libraries, and cleanup
raeburn at MIT.EDU
Fri Nov 14 19:20:54 EST 2008
Our library maintains some internal state, like per-thread data,
global linked lists of error tables, etc. We can't simply rewrite our
public APIs to get rid of it all, and we can't demand that people use
non-standard APIs for stuff like GSSAPI to help us get around it.
Because some applications may dynamically load and unload our
libraries, we use the OS-provided library finalization hooks to free
some things up at unload time (and I think we may miss some things but
that's "just" a bug). Unfortunately, the same hooks get used at
unload and process-exit time.
We've had some bug reports of sporadic crashes in the Kerberos library
when a multithreaded process exits. Generally the problem seems to be
that one thread calls exit (which causes library cleanup functions to
be invoked, which in the Kerberos libraries frees up storage, deletes
per-thread-data keys, etc.), but another thread is still running code
in the Kerberos library (which needs the freed-up data).
It appears from http://msdn.microsoft.com/en-us/library/ms682583.aspx
that on Windows we can distinguish unloading a library from process
termination, so we can just skip the cleanup functions in the process
termination case, and let the actual process termination free up
resources. That just leaves the UNIX builds and KfM (Kerberos for
Macintosh).
We can't use atexit() to set a flag to disable the cleanup functions,
because (1) we get no guarantee of the relative order of execution of
atexit handlers and library finalization functions, and (2) the
registered handler would have to be removed if the library is
unloaded, and that can't be done portably.
It occurred to me that a simple reference-count mechanism may be most
of what we need. Not only would an "initialized library" (init
function called, fini function not yet called) be a reference, but
certain objects like krb5_context would as well. So if one thread
calls exit and thus invokes the library finalizer functions, but
another thread is actively messing around with a krb5_context, the
additional internal data in the libraries won't be freed up unless the
context is destroyed before the exiting thread finishes the cleanup
functions. (If the process exits first, obviously all the process
resources go away at that time, as do any still-running threads that
might be using them.)
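
A minimal sketch of the reference-count idea described above, assuming
a POSIX threads build. All names (lib_init, lib_fini, ctx_acquire,
ctx_release, lib_state_present) are hypothetical and are not existing
krb5 interfaces; lib_state merely stands in for the real internal data.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* All names here are hypothetical; this is not existing krb5 code. */
static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
static int lib_refs = 0;        /* initialized library + live contexts */
static void *lib_state = NULL;  /* stands in for error tables, keys, etc. */

/* Drop one reference; whoever drops the last one frees the state. */
static void lib_unref(void)
{
    void *to_free = NULL;

    pthread_mutex_lock(&lib_lock);
    assert(lib_refs > 0);
    if (--lib_refs == 0) {
        to_free = lib_state;
        lib_state = NULL;
    }
    pthread_mutex_unlock(&lib_lock);
    free(to_free);
}

/* Library initializer: the loaded library counts as one reference. */
int lib_init(void)
{
    int ret = 0;

    pthread_mutex_lock(&lib_lock);
    lib_state = malloc(64);
    if (lib_state != NULL)
        lib_refs = 1;
    else
        ret = -1;
    pthread_mutex_unlock(&lib_lock);
    return ret;
}

/* Library finalizer (unload or exit): only drops the library's reference. */
void lib_fini(void) { lib_unref(); }

/* A context-like object holds a reference for as long as it lives. */
void ctx_acquire(void)
{
    pthread_mutex_lock(&lib_lock);
    lib_refs++;
    pthread_mutex_unlock(&lib_lock);
}

void ctx_release(void) { lib_unref(); }

/* For inspection only: is the internal state still allocated? */
bool lib_state_present(void)
{
    bool present;

    pthread_mutex_lock(&lib_lock);
    present = (lib_state != NULL);
    pthread_mutex_unlock(&lib_lock);
    return present;
}
```

With this shape, a finalizer that runs while another thread still holds
a context leaves the state alone; the state is freed only when that
context is released, or never if the process exits first.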
Once the cleanup has been done, ideally only a small subset of
functions (like krb5_init_context) that need some of this internal
state can still be expected to be called. Those interfaces can check
for the presence of this internal state, or just some flags we set up
for the purpose, and return errors if they're called post-cleanup.
Unlike some random code somewhere in the middle of a library
function that's running when the cleanup functions zap the internal
state, the entry points of these functions won't be assuming the
existence of that internal state.
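
A rough sketch of such an entry-point check, again assuming POSIX
threads. The names (demo_init_context, demo_free_context, demo_fini)
and the error code DEMO_ERR_FINALIZED are made up for illustration; a
real implementation would return a code from one of the library's own
error tables. The check and the reference increment happen under the
same lock, so a new context can't slip in while the state is going away.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical error code, chosen only for this sketch. */
#define DEMO_ERR_FINALIZED (-1)

struct demo_context { int placeholder; };

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_refs = 1;            /* the loaded library's own reference */
static bool demo_finalized = false;  /* set once the last reference is gone */

static void demo_unref(void)
{
    pthread_mutex_lock(&demo_lock);
    assert(demo_refs > 0);
    if (--demo_refs == 0)
        demo_finalized = true;       /* internal state would be freed here */
    pthread_mutex_unlock(&demo_lock);
}

/* Library finalizer: drops only the library's own reference. */
void demo_fini(void) { demo_unref(); }

/* Entry point in the style of krb5_init_context: check before use. */
int demo_init_context(struct demo_context **out)
{
    struct demo_context *ctx;

    *out = NULL;
    ctx = calloc(1, sizeof(*ctx));
    if (ctx == NULL)
        return ENOMEM;

    pthread_mutex_lock(&demo_lock);
    if (demo_finalized) {
        /* Post-cleanup call: fail cleanly instead of touching freed state. */
        pthread_mutex_unlock(&demo_lock);
        free(ctx);
        return DEMO_ERR_FINALIZED;
    }
    demo_refs++;                     /* the new context is a reference */
    pthread_mutex_unlock(&demo_lock);

    *out = ctx;
    return 0;
}

void demo_free_context(struct demo_context *ctx)
{
    if (ctx == NULL)
        return;
    free(ctx);
    demo_unref();
}
```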
Obviously there's some refinement to do, like breaking it down by
library in case a process unloads gssapi but still has krb5 loaded
through other dependencies. And the other object types (ref counts)
and APIs (error on post-cleanup invocation) need to be figured out, so
it's not a trivial project.
There may be some places in the support library where this isn't
enough, but I think it'll take care of most of the problems.
Does anyone see a problem with this sort of approach?