mechglue registration of gss_buffer_t pointers

Nicolas Williams Nicolas.Williams at sun.com
Thu Nov 1 17:56:06 EDT 2007


This is a fairly long reply; sorry.

On Thu, Nov 01, 2007 at 05:12:45PM -0400, Tom Yu wrote:
> >>>>> "nico" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
> nico> Scalability and overall performance relative to the wall clock time to
> nico> do a per-message operation.  Which I agree should translate to "lock
> nico> contention."
> 
> Do you have reason to believe that a hash table implementation using
> per-object mutexes plus one whole-table mutex will experience
> excessive lock contention?  I'm trying to determine if a finer-grained
> locking scheme (e.g. per-bucket mutexes) is justified.

Without any implementation of buffer registration to play with, or data
from one, it's hard to tell.

I'm not sure what all the problems, if any, might turn out to be.  Cache
contention might be a problem too.

I suspect that cache and lock contention could be significant problems.
Don't forget that I'm still interested in a GSS pseudo-mech that
negotiates channel binding and which has non-crypto per-msg tokens.  For
such a mechanism the cost of non-existent crypto will not dominate the
cost of buffer registration.

Using various atomic operations you might (should) be able to have a
mostly-lockless implementation.  That might not (likely would not) be as
portable, so you might have multiple implementations of buffer
registration, or perhaps vendors will substitute their own as needed.
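Something like this rough sketch, using C11 atomics (all names here are invented; a real table would need resizing, full probe-chain duplicate checks, and tombstones for safe deletion, none of which this shows):

```c
/* Sketch of a mostly-lockless registration table using C11 atomics.
 * Registration claims an empty slot with compare-and-swap, so the
 * fast path takes no mutex.  Duplicate detection is best-effort:
 * a real table would search the full probe chain before claiming. */
#include <stdatomic.h>
#include <stddef.h>

#define REG_SLOTS 1024  /* must be a power of two */

static _Atomic(void *) reg_table[REG_SLOTS];

static size_t reg_hash(void *p)
{
    return ((size_t)p >> 4) & (REG_SLOTS - 1);
}

/* Returns 0 on success, -1 if the table is full or p already present. */
int reg_register(void *p)
{
    size_t i, h = reg_hash(p);

    for (i = 0; i < REG_SLOTS; i++) {
        size_t slot = (h + i) & (REG_SLOTS - 1);
        void *expected = NULL;

        if (atomic_load(&reg_table[slot]) == p)
            return -1;  /* already registered */
        if (atomic_compare_exchange_strong(&reg_table[slot], &expected, p))
            return 0;   /* claimed an empty slot, no lock taken */
    }
    return -1;
}

/* Returns 0 if p was found and removed, -1 otherwise. */
int reg_unregister(void *p)
{
    size_t i, h = reg_hash(p);

    for (i = 0; i < REG_SLOTS; i++) {
        size_t slot = (h + i) & (REG_SLOTS - 1);

        if (atomic_load(&reg_table[slot]) == p) {
            atomic_store(&reg_table[slot], NULL);
            return 0;
        }
    }
    return -1;
}
```

The point is just that the common case touches one cache line and takes no lock; stdatomic.h is exactly the portability problem mentioned above.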

> Basically I'm considering:
> 
> * hash chain node holds the registered pointer value and a mutex
>   controlling freeing of that pointer
> 
> * whole-table mutex controls access to the table, including lookups,
>   linking, and unlinking

That seems like a recipe for lock contention.

A per-thread table would avoid this, assuming that the thread that
creates a token is usually the one that releases it; releasing a buffer
from a thread other than the one where it was allocated would carry a
penalty.
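A per-thread table needs no locking at all on the fast path; something like this sketch, using C11 thread-local storage (names invented, and the slower cross-thread release path is not shown):

```c
/* Sketch of a per-thread registration list.  No locks are needed
 * because each thread only touches its own list; the cost is that a
 * buffer must be released by the thread that registered it (or via a
 * slower cross-thread path, not shown here). */
#include <stdlib.h>

struct reg_node {
    void *ptr;
    struct reg_node *next;
};

static _Thread_local struct reg_node *thread_regs;

int reg_register_local(void *p)
{
    struct reg_node *n = malloc(sizeof(*n));

    if (n == NULL)
        return -1;
    n->ptr = p;
    n->next = thread_regs;   /* push onto this thread's list */
    thread_regs = n;
    return 0;
}

/* Returns 0 if p was registered by this thread, -1 otherwise. */
int reg_unregister_local(void *p)
{
    struct reg_node **np;

    for (np = &thread_regs; *np != NULL; np = &(*np)->next) {
        if ((*np)->ptr == p) {
            struct reg_node *dead = *np;

            *np = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;  /* registered by another thread, or not at all */
}
```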

I'm not sure if Solaris' NFS client and server release GSS buffers in
the same threads where they are created.  I should find out.  But other
implementations (e.g., CITI's implementation for Linux) might differ
anyways, which in turn might create a need for multiple implementations
(much like Solaris ships multiple malloc()/free() implementations that
can be linked to directly, LD_PRELOADed, ...).

In user-land I expect most GSS applications either use very few per-msg
tokens (e.g., SSHv2 implementations) or are typically single-threaded
(e.g., SASL apps).

> * mechglue registers a pointer by doing
> 
>   ** acquire lock on whole table
>   ** verify that pointer is not already in table
>   ** allocate new node
>   ** store pointer and specific mech ID
>   ** link node
>   ** release lock on whole table

In a multi-threaded program with many live buffers (think NFS client/
server) this sounds like a recipe for both lock contention and cache
contention.
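To make the contention point concrete, here is a minimal sketch of the registration scheme as quoted (function and type names invented): every registration from every thread queues up on the one pthread_mutex_lock() call.

```c
/* Sketch of the quoted registration steps: one whole-table mutex
 * serializes every lookup and link.  In an NFS-like workload every
 * per-message operation in every thread contends on table_lock. */
#include <pthread.h>
#include <stdlib.h>

struct reg_entry {
    void *ptr;
    int mech_id;               /* which mech allocated the buffer */
    struct reg_entry *next;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct reg_entry *table_head;

/* Returns 0 on success, -1 if p is already registered or on ENOMEM. */
int mechglue_register(void *p, int mech_id)
{
    struct reg_entry *e;

    pthread_mutex_lock(&table_lock);       /* the serialization point */
    for (e = table_head; e != NULL; e = e->next) {
        if (e->ptr == p) {                 /* already in table */
            pthread_mutex_unlock(&table_lock);
            return -1;
        }
    }
    e = malloc(sizeof(*e));
    if (e == NULL) {
        pthread_mutex_unlock(&table_lock);
        return -1;
    }
    e->ptr = p;
    e->mech_id = mech_id;
    e->next = table_head;                  /* link node */
    table_head = e;
    pthread_mutex_unlock(&table_lock);
    return 0;
}
```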

> * mechglue gss_release_buffer() does
> 
>   ** acquire lock on whole table
>   ** acquire lock on the node
>   ** release lock on whole table
>   ** call specific mech's gss_release_buffer()
>   ** acquire lock on whole table
>   ** unlink node
>   ** unlock and free node
>   ** release lock on whole table

See above.

Also, if the application releases buffers more or less in the same order
as they are allocated, and soon after allocating them, then you could
use a circular log with O(1) addition and O(N) deletion (because of
linear searches) that most of the time completes in O(1).  OTOH, such a
design would probably perform very poorly for apps that use replay
caches and retransmission.
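The circular log might look like this sketch (names and the fixed capacity are invented; locking is omitted):

```c
/* Sketch of the circular-log idea: O(1) append at the head, linear
 * scan from the tail for removal.  If buffers are released in roughly
 * allocation order, the match is found at or near the tail and the
 * removal completes in O(1) most of the time. */
#include <stddef.h>

#define LOG_SLOTS 256

static void *log_ring[LOG_SLOTS];
static size_t log_tail;            /* index of oldest live entry */
static size_t log_head;            /* index of next free slot */

/* O(1) append; fails if the ring is full. */
int log_register(void *p)
{
    if (log_head - log_tail == LOG_SLOTS)
        return -1;
    log_ring[log_head++ % LOG_SLOTS] = p;
    return 0;
}

/* O(N) worst case; O(1) when releases follow allocation order. */
int log_unregister(void *p)
{
    size_t i;

    for (i = log_tail; i != log_head; i++) {
        if (log_ring[i % LOG_SLOTS] == p) {
            log_ring[i % LOG_SLOTS] = NULL;
            /* Advance the tail past already-released entries. */
            while (log_tail != log_head &&
                   log_ring[log_tail % LOG_SLOTS] == NULL)
                log_tail++;
            return 0;
        }
    }
    return -1;
}
```

An app doing replay caching or retransmission holds buffers out of order, which is exactly where the linear scan degenerates.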

I.e., there are many optimizations you could make if you have a suitable
characterization of the application's buffer allocation/release
patterns.

But the main application where the performance of buffer registration
seems likely to matter, for me, is NFS.

> Whether or not to hold the whole-table lock across the call to the
> specific mech's gss_release_buffer() depends on how much risk there is
> that a mech will have a slow release_buffer() implementation.

I think the call to the mech-specific gss_release_buffer() should be the
last step; the buffer should already be de-registered and the table
unlocked by then.
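I.e., something like this sketch (the toy table and the fake mech dispatch are invented placeholders):

```c
/* Sketch of the release ordering suggested above: unlink the buffer
 * and drop the table lock first, then call into the mech last, so a
 * slow mech gss_release_buffer() cannot hold up other threads. */
#include <pthread.h>

static pthread_mutex_t rel_lock = PTHREAD_MUTEX_INITIALIZER;
static void *rel_table[16];          /* toy registration table */
static int mech_release_calls;       /* stands in for the mech SPI */

static void fake_mech_release_buffer(void *p)
{
    (void)p;
    mech_release_calls++;            /* placeholder for real work */
}

/* Returns 0 if p was registered; the mech call happens last, with no
 * locks held. */
int mechglue_release(void *p)
{
    int i, found = -1;

    pthread_mutex_lock(&rel_lock);
    for (i = 0; i < 16; i++) {
        if (rel_table[i] == p) {
            rel_table[i] = NULL;     /* de-register while locked */
            found = i;
            break;
        }
    }
    pthread_mutex_unlock(&rel_lock); /* table consistent again */

    if (found < 0)
        return -1;
    fake_mech_release_buffer(p);     /* last step: no locks held */
    return 0;
}
```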

Do we actually have consensus that support for GSS-API mechglue plugins
written in HLLs but with a C SPI is a requirement?  I think it would be
very good if we could meet such a requirement if the trade-offs are
acceptable.  Assuming that we really must have this then we now need:

 - characterizations of existing applications' GSS per-msg token
   function calls and corresponding calls to gss_release_buffer()

 - an implementation of buffer registration

 - a test/profile suite that models the characterized GSS apps

In a worst case scenario we can establish a simple convention that for
Solaris kernel GSS-API mechanism plugins you must use a single allocator
and the framework will release buffers accordingly, while in user-land
we can adopt whatever you guys decide to do.  But since the Solaris
implementation of the Kerberos V GSS mechanism shares source code for
user- and kernel-land that would mean either adding #ifdefs that MIT
would have to accept or keeping the source forked (but we'd like to
avoid this).

Nico
-- 



More information about the krbdev mailing list