mechglue registration of gss_buffer_t pointers

Tom Yu tlyu at MIT.EDU
Fri Nov 2 20:13:22 EDT 2007


>>>>> "nico" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

nico> On Thu, Nov 01, 2007 at 05:12:45PM -0400, Tom Yu wrote:
>> >>>>> "nico" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
nico> Scalability and overall performance relative to the wall clock time to
nico> do a per-message operation.  Which I agree should translate to "lock
nico> contention."
>> 
>> Do you have reason to believe that a hash table implementation using
>> per-object mutexes plus one whole-table mutex will experience
>> excessive lock contention?  I'm trying to determine if a finer-grained
>> locking scheme (e.g. per-bucket mutexes) is justified.

nico> Without any implementation of buffer registration to play with, or data
nico> from one, it's hard to tell.

nico> I'm not sure what all the problems, if any, might turn out to be.  Cache
nico> contention might be a problem too.

nico> I suspect that cache and lock contention could be significant problems.
nico> Don't forget that I'm still interested in a GSS pseudo-mech that
nico> negotiates channel binding and which has non-crypto per-msg tokens.  For
nico> such a mechanism the cost of non-existent crypto will not dominate the
nico> cost of buffer registration.

How fast do you expect the per-msg GSS calls to be in such a case?

nico> Using various atomic operations you might (should) be able to have a
nico> mostly-lockless implementation.  That might not (likely would not) be as
nico> portable, so you might have multiple implementations of buffer
nico> registration, or perhaps vendors will substitute their own as needed.

Are you sure that the added complexity and portability difficulties of
a lock-free implementation are worth the potential performance gains?
Certainly we could modularize the buffer registration interface to
allow vendor-specific optimizations.  To reduce cache contention, we
could use pool allocation, perhaps even thread-specific pools if we
care about lock contention around node allocation.
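
As for modularizing: we could hide the registration strategy behind a
small ops table, so that a vendor could drop in, say, an atomics-based
implementation without touching the rest of mechglue.  A rough sketch
(all names here are hypothetical, not from any actual source):

    /* Pluggable buffer-registration interface; a vendor could point
     * this at a lock-free implementation without changing callers. */
    struct bufreg_ops {
        int  (*init)(void **tab);
        int  (*reg)(void *tab, void *ptr, int mech_id);
        int  (*lookup)(void *tab, void *ptr, int *mech_id);
        int  (*unreg)(void *tab, void *ptr);
        void (*destroy)(void *tab);
    };

    /* Portable default implementation; vendors substitute their own. */
    extern const struct bufreg_ops *bufreg_ops;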

To put things in perspective, we already use a suboptimal registration
strategy today.  Our krb5 mechanism uses a singly-linked list
registration facility for many allocated GSS-API objects (context
handles, cred handles, etc.) and keeps a single lock held while
traversing the list for lookups and deletions.  Do you find that this
bottleneck creates problems for your use cases in Solaris NFS?
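
For reference, that current scheme amounts to something like the
following (an illustrative sketch, not the actual source):

    #include <pthread.h>

    struct g_node {
        struct g_node *next;
        void *handle;           /* registered context or cred handle */
    };

    static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct g_node *g_head;

    /* The single lock is held across the entire O(n) traversal. */
    static int
    g_validate(void *handle)
    {
        struct g_node *n;
        int found = 0;

        pthread_mutex_lock(&g_lock);
        for (n = g_head; n != NULL; n = n->next) {
            if (n->handle == handle) {
                found = 1;
                break;
            }
        }
        pthread_mutex_unlock(&g_lock);
        return found;
    }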

>> Basically I'm considering:
>> 
>> * hash chain node holds the registered pointer value and a mutex
>> controlling freeing of that pointer
>> 
>> * whole-table mutex controls access to the table, including lookups,
>> linking, and unlinking

nico> That seems like a recipe for lock contention.

Is that such a problem if the whole-table lock isn't held across calls
to the specific mechanism's release_buffer()?
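
Concretely, the data structures I have in mind look something like
this (a rough sketch; all names are illustrative):

    #include <pthread.h>

    #define BUFREG_NBUCKETS 256

    struct bufreg_node {
        struct bufreg_node *next;   /* hash chain link */
        void *ptr;                  /* registered pointer value */
        int mech_id;                /* owning specific mechanism */
        pthread_mutex_t free_lock;  /* controls freeing of this pointer */
    };

    struct bufreg_table {
        pthread_mutex_t lock;       /* whole-table mutex: lookups,
                                     * linking, unlinking */
        struct bufreg_node *buckets[BUFREG_NBUCKETS];
    };

    /* Cheap pointer hash; real code might mix the bits more
     * carefully. */
    static unsigned int
    bufreg_hash(void *ptr)
    {
        return (unsigned int)(((unsigned long)ptr >> 4) % BUFREG_NBUCKETS);
    }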

nico> A per-thread table would avoid this, assuming that the same threads that
nico> create a token will release it, with a penalty for releasing buffers in
nico> threads other than the ones where they were allocated.

I think a per-thread table isn't suitable: if there are
pseudo-mechanisms that can return a buffer previously allocated by
another specific mechanism, the registration code needs to avoid
making duplicate entries, and a per-thread table can't see
registrations made from other threads.

nico> I'm not sure if Solaris' NFS client and server release GSS buffers in
nico> the same threads where they are created.

I think we should not preclude buffers being released in a different
thread from the one in which they were allocated.

>> * mechglue registers a pointer by doing
>> 
>> ** acquire lock on whole table
>> ** verify that pointer is not already in table
>> ** allocate new node
>> ** store pointer and specific mech ID
>> ** link node
>> ** release lock on whole table

nico> In a multi-threaded program with many live buffers (think NFS client/
nico> server) this sounds like a recipe for both lock contention and cache
nico> contention.

I think the code path that would keep the whole-table lock held will
execute fairly quickly and will not call out to specific mechanism
functions.  The above steps would occur only after mechglue regains
control from a specific mechanism procedure which has allocated
memory.

I suppose that if the chains get too long, you eat up critical
section time doing an O(n) traversal of the chain, but if that
actually becomes a problem, we can investigate things like per-bucket
locks.
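
In code, the registration path quoted above might look like this
(continuing the earlier sketch; note that nothing under the lock
calls into a specific mechanism):

    #include <errno.h>
    #include <stdlib.h>

    static int
    bufreg_register(struct bufreg_table *t, void *ptr, int mech_id)
    {
        unsigned int b = bufreg_hash(ptr);
        struct bufreg_node *n;

        pthread_mutex_lock(&t->lock);      /* acquire whole-table lock */
        for (n = t->buckets[b]; n != NULL; n = n->next) {
            if (n->ptr == ptr) {           /* already registered */
                pthread_mutex_unlock(&t->lock);
                return EEXIST;
            }
        }
        n = malloc(sizeof(*n));            /* allocate new node */
        if (n == NULL) {
            pthread_mutex_unlock(&t->lock);
            return ENOMEM;
        }
        n->ptr = ptr;                      /* store pointer and mech ID */
        n->mech_id = mech_id;
        pthread_mutex_init(&n->free_lock, NULL);
        n->next = t->buckets[b];           /* link node */
        t->buckets[b] = n;
        pthread_mutex_unlock(&t->lock);    /* release whole-table lock */
        return 0;
    }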

nico> Also, if the application releases buffers more or less in the same order as
nico> they are allocated and soon after allocating them, then you could use a
nico> circular log with O(1) addition, O(N) deletion (because of linear
nico> searches) that most of the time completes in O(1).  OTOH, such a design
nico> would probably perform very poorly for apps that use replay caches and
nico> retransmission.

I think this design also fails to prevent duplicate registrations as I
described above, unless I'm missing something here.
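
For reference, I read the circular-log idea as roughly the following
(my own sketch; locking and overflow handling omitted).  Note the
absence of a duplicate check on addition, which is exactly my concern
above:

    #define LOG_SIZE 1024

    struct bufreg_log {
        void *slots[LOG_SIZE];
        unsigned int head;      /* next slot to write */
        unsigned int tail;      /* oldest live entry */
    };

    /* O(1) addition. */
    static void
    log_add(struct bufreg_log *l, void *ptr)
    {
        l->slots[l->head] = ptr;
        l->head = (l->head + 1) % LOG_SIZE;
    }

    /* O(n) deletion that usually finishes in O(1) when buffers are
     * released in roughly the order they were allocated. */
    static int
    log_remove(struct bufreg_log *l, void *ptr)
    {
        unsigned int i;

        for (i = l->tail; i != l->head; i = (i + 1) % LOG_SIZE) {
            if (l->slots[i] == ptr) {
                l->slots[i] = NULL;
                while (l->tail != l->head && l->slots[l->tail] == NULL)
                    l->tail = (l->tail + 1) % LOG_SIZE;
                return 0;
            }
        }
        return -1;              /* not found */
    }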

>> Whether or not to hold the whole-table lock across the call to the
>> specific mech's gss_release_buffer() depends on how much risk there is
>> that a mech will have a slow release_buffer() implementation.

nico> I think the call to the mech-specific gss_release_buffer() should be the
nico> last step, that the buffer should already be de-registered and the table
nico> unlocked.

I guess this would be safe as long as we know that no other thread
will attempt to register the same pointer between the time the
mechglue release_buffer() unregisters it and the time the specific
mechanism's release_buffer() gets called.  Since that could probably
happen only as the result of an application programming error, we may
not need to worry about it.
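
In the sketch from earlier, that ordering would look something like
this, with mech_release_buffer() standing in for the dispatch to the
specific mechanism's release_buffer():

    /* Hypothetical dispatch into the owning mechanism. */
    extern int mech_release_buffer(int mech_id, void *ptr);

    static int
    bufreg_release(struct bufreg_table *t, void *ptr)
    {
        unsigned int b = bufreg_hash(ptr);
        struct bufreg_node **np, *n = NULL;
        int mech_id;

        pthread_mutex_lock(&t->lock);
        for (np = &t->buckets[b]; *np != NULL; np = &(*np)->next) {
            if ((*np)->ptr == ptr) {
                n = *np;
                *np = n->next;             /* unlink while locked */
                break;
            }
        }
        pthread_mutex_unlock(&t->lock);    /* table unlocked first */
        if (n == NULL)
            return EINVAL;                 /* never registered */
        mech_id = n->mech_id;
        pthread_mutex_destroy(&n->free_lock);
        free(n);
        /* Only now, as the last step, call into the mechanism. */
        return mech_release_buffer(mech_id, ptr);
    }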

nico> Do we actually have consensus that support for GSS-API mechglue plugins
nico> written in HLLs but with a C SPI is a requirement?  I think it would be
nico> very good if we could meet such a requirement if the trade-offs are
nico> acceptable.  Assuming that we really must have this then we now need:

I think we will find that there is a more general requirement to
allow a specific mechanism to use its own allocator.  I understand
that each Win32 DLL can have its own memory allocation arena
(depending on how it links the C runtime), and that freeing memory
allocated in one arena from another can cause crashes.

nico>  - characterizations of existing applications' GSS per-msg token
nico>    function calls and corresponding calls to gss_release_buffer()

Getting some idea of your worst-case scenarios, e.g. Solaris kernel
NFS, would help: the typical number of threads (and CPUs) calling
into GSS, the number of outstanding buffers, and the ordering of
release_buffer() operations with respect to the GSS calls that
produced the buffers (LIFO, FIFO, some random order...).

nico>  - an implementation of buffer registration

nico>  - a test/profile suite that models the characterized GSS apps

nico> In a worst case scenario we can establish a simple convention that for
nico> Solaris kernel GSS-API mechanism plugins you must use a single allocator
nico> and the framework will release buffers accordingly, while in user-land
nico> we can adopt whatever you guys decide to do.  But since the Solaris
nico> implementation of the Kerberos V GSS mechanism shares source code for
nico> user- and kernel-land that would mean either adding #ifdefs that MIT
nico> would have to accept or keeping the source forked (but we'd like to
nico> avoid this).

In that case, you might want to compile the kernel GSS code so that
mechglue does no buffer registration at all and just calls out to the
generic deallocator.  I wasn't planning to implement buffer
registration in a way that would preclude that possibility.
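
Something along these lines, perhaps, with _KERNEL and gssint_free()
as stand-ins for whatever symbols the real build would use:

    #ifdef _KERNEL
    /* Kernel: single allocator by convention, so compile out the
     * registration table entirely. */
    #define bufreg_register(t, ptr, mech)  (0)
    #define bufreg_release(t, ptr)         ((void)(t), gssint_free(ptr), 0)
    #else
    /* User-land: the full registration table sketched earlier. */
    #endif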

---Tom


