[krbdev.mit.edu #5962] unchecked calls to k5_mutex_lock() interact poorly with finalizers

Thu May 15 19:15:36 EDT 2008

We need to ensure that all calls to k5_mutex_lock() check the return
values, to avoid assertion failures when the lock attempt fails and
k5_mutex_unlock() then does its consistency checks.  This can happen if
finalizers destroy the mutex before the call to k5_mutex_lock().  It is
also not safe to operate on shared data if k5_mutex_lock() fails, so we
should do something reasonable in that situation anyway.

From: Alexandra Ellwood <lxs at MIT.EDU>
Date: Thu, 15 May 2008 17:15:12 -0400

I think I have figured it out.  Sadly it's hard to fix completely, but
there is a relatively easy fix which should reduce the number of these
reports.

So the key that tipped me off was looking at the full crash log.
Obviously there's our poor crashed thread which is calling assert
because it doesn't actually own the lock it just locked.  Or, to be
precise, pthread_equal((&(&g_mechListLock)->os)->owner,
pthread_self()) is returning false:

Thread 16 Crashed:
0   libSystem.B.dylib             	0x92e8e17a
usleep$NOCANCEL$UNIX2003  + 0
1   libSystem.B.dylib             	0x92eaf9bb abort + 85
2   libSystem.B.dylib             	0x92ea1593 __assert_rtn + 101
3   edu.mit.Kerberos              	0x96bdb4d4 gssint_get_mechanism + 323
4   edu.mit.Kerberos              	0x96b60508 gss_init_sec_context + 133
5   sncgss.dyld                   	0x0592b680
sapgss_init_sec_context  + 779
6   libJPlatin.jnilib             	0x12eed293 SncPEstablishContext + 321
7   libJPlatin.jnilib             	0x12ef06aa SncPFrameIn + 7632
8   libJPlatin.jnilib             	0x12ee426b SncProcessInput + 921
9   libJPlatin.jnilib             	0x12edcd30 TmRead + 2321
10  libJPlatin.jnilib             	0x12e5a742
AgCom::Read(unsigned  char*, int, int*) + 154
11  libJPlatin.jnilib             	0x12e5a890
AgCom::CsRead(unsigned  char*, int, int*, int) + 146
12  libJPlatin.jnilib             	0x12d9e73e
AgComAdaptor::read(unsigned char**, int*) + 196
13  libJPlatin.jnilib             	0x12d9b674
Java_com_sap_platin_r3_protocol_ni_JniAgComAdaptor_jniRead + 58
14  ???                           	0x035859b1 0 + 56121777
15  ???                           	0x0357fa54 0 + 56097364
16  ???                           	0x0357faaa 0 + 56097450
17  ???                           	0x0357d227 0 + 56087079
18  libjvm.dylib                  	0x004d263a 0x3b9000 + 1152570
19  libjvm.dylib                  	0x00416763 0x3b9000 + 382819
20  libjvm.dylib                  	0x00416655 0x3b9000 + 382549
21  libjvm.dylib                  	0x00416595 0x3b9000 + 382357
22  libjvm.dylib                  	0x004164b8 0x3b9000 + 382136
23  libjvm.dylib                  	0x0067dcc5 JVM_RaiseSignal + 441189
24  libSystem.B.dylib             	0x92e3753b _pthread_body + 27

However, scrolling up a bit we find this thread:

Thread 3:
0   libSystem.B.dylib             	0x92dba9e6 mach_msg_trap + 10
1   libSystem.B.dylib             	0x92dc21dc mach_msg + 72
2   libSystem.B.dylib             	0x92dc1fe0 vproc_mig_look_up2 + 198
3   libSystem.B.dylib             	0x92dc1ea0 bootstrap_look_up2 + 73
4   libSystem.B.dylib             	0x92dc1e55 bootstrap_look_up + 55
5   edu.mit.Kerberos              	0x96b6ffd4
kipc_client_lookup_server + 327
6   edu.mit.Kerberos              	0x96b693ed cci_os_ipc + 264
7   edu.mit.Kerberos              	0x96b691ba _cci_ipc_send + 221
8   edu.mit.Kerberos              	0x96b692bf
cci_ipc_send_no_launch +  39
9   edu.mit.Kerberos              	0x96b65b39 cci_context_sync + 181
10  edu.mit.Kerberos              	0x96b66bce ccapi_context_release + 33
11  edu.mit.Kerberos              	0x96b62a07 krb5_stdcc_shutdown + 30
12  edu.mit.Kerberos              	0x96b4eb51 krb5int_lib_fini + 267
13  dyld                          	0x8fe12fc3
ImageLoaderMachO::doTermination(ImageLoader::LinkContext const&) + 227
14  dyld                          	0x8fe04f7b
dyld::runTerminators(void*) + 155
15  libSystem.B.dylib             	0x92de353c __cxa_finalize + 241
16  libSystem.B.dylib             	0x92de3430 exit + 33
17  libjvm.dylib                  	0x00473ba3 0x3b9000 + 764835
18  libjvm.dylib                  	0x00473b02 0x3b9000 + 764674
19  libjvm.dylib                  	0x004739c8 0x3b9000 + 764360
20  libjvm.dylib                  	0x0041603d 0x3b9000 + 380989
21  libjvm.dylib                  	0x00415d57 0x3b9000 + 380247
22  libjvm.dylib                  	0x0067dcc5 JVM_RaiseSignal + 441189
23  libSystem.B.dylib             	0x92e3753b _pthread_body + 27

Uh oh.  SAPGUI is in the middle of exiting and is running the library
finalizer for the krb5 library.  Unfortunately this probably isn't the
only library finalizer it has run.  Chances are good that this thread
has already run the finalizer for the GSS library...

After talking to Ken Raeburn, here's what we think is happening:

Thread 16 enters gssint_get_mechanism() and calls
gssint_initialize_library().  This function verifies the library is
loaded and g_mechListLock is valid.

Meanwhile the application is trying to quit due to some error or user
action.  Thread 3 calls exit() which calls dyld::runTerminators().
This runs the GSS library finalizer which calls
pthread_mutex_destroy() on g_mechListLock, invalidating the lock which
thread 16 is about to use.  Thread 3 then proceeds on to the krb5
library finalizer which blocks on a Mach-IPC call to the CCacheServer.

Meanwhile, thread 16 calls calls k5_mutex_lock() which returns an
error because g_mechListLock has been destroyed by thread 3.
Unfortunately, gssint_get_mechanism() discards the error from
k5_mutex_lock() and continues.  When it calls k5_mutex_unlock() the
owner is not set (because g_mechListLock was not locked) and thus we
get an assertion failure.

Now this particular problem can be easily fixed by making sure our
code checks the return value of k5_mutex_lock() everywhere it is
called and does not call k5_mutex_unlock() if k5_mutex_lock() fails.
After making this change there are probably still similar race
conditions which need to be analyzed.  I want to look at all of the
crash logs to make sure there are not other common scenarios which
cause similar race conditions.

Note that just ignoring the already-destroyed mutex isn't really safe
because we could end up garbling memory and potentially spewing it out
on the network before the application manages to exit.  So just
removing the assert() calls isn't really a good idea.  Application
crash on exit is bad, but better than a security hole.  :-)

--lxs

Alexandra Ellwood <lxs at mit.edu>
MIT Kerberos Consortium
<http://mit.edu/lxs/www>