Proposed new krb5 FILE ccache protocol

Mon Jan 27 17:50:36 EST 2014

Below is a description of a new FILE ccache that is backwards
interoperable and compatible with the current one, improves read
performance and probably write performance as well, recovers from
corruption, and is concurrency-safe without using POSIX file locking.
This design is Viktor's and mine.

The new ccache would consist of two components in the filesystem:

 - a "main file" with the same format as the current FILE ccache,
   containing only the header, start TGT, and cc configs;

 - an ancillary directory (mkdtemp()'ed, named from the main file) with
   "hash buckets" which are basically FILE ccaches containing creds that
   hash into them.

The ccache type would still be "FILE", since the new format is
backwards interoperable and compatible with the current one.

All writes to any one file are to be renameat(2)-into-place writes (or
rename(2), if renameat(2) is not available).
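
To make the rename-into-place idea concrete, here is a rough sketch in
C.  The helper name write_into_place() and the ".XXXXXX" temp-file
suffix are made up for illustration; it uses plain rename(2), and a real
implementation would presumably prefer renameat(2) where available.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace `path` with `len` bytes from `buf` so that
 * readers see either the old contents or the new contents, never a mix.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int
write_into_place(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    const unsigned char *p = buf;
    int fd, n;

    /* The temp file lives next to `path` so rename(2) stays on one fs. */
    n = snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    fd = mkstemp(tmp);
    if (fd == -1)
        return -1;

    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w == -1) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        len -= (size_t)w;
    }
    if (close(fd) == -1) {
        fd = -1;
        goto fail;
    }
    fd = -1;
    if (rename(tmp, path) == -1)
        goto fail;
    return 0;

fail:
    {
        int saved = errno;      /* keep the original error for the caller */
        if (fd != -1)
            close(fd);
        unlink(tmp);
        errno = saved;
    }
    return -1;
}
```

Two writers racing through this helper each rename a complete file; the
loser's contents are simply replaced by the winner's.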

Ancillary directories should be named krb5ccd-XXXXXX (mkdtemp()
requires the template to end in six X's).

Files in the ancillary directory should be named as follows:

    krb5cctmp-<generation-number>-<bucket-number>

Hashing is needed because we cannot use unescaped principal names as
components of filenames: they can contain characters such as '/', and
they might be too long.  Hashing into a bounded number of buckets also
keeps the amount of copying down and speeds up lookups.  The bucket
number will be between 0 and 63 (say); credentials that are neither cc
configs nor start TGTs will be hashed into a bucket number.
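
As a rough illustration of the bucket computation and naming, here is a
sketch in C.  The choice of FNV-1a is my assumption (the proposal does
not name a hash function), as are the helper names cred_bucket() and
bucket_name(), and the simplification that principal names and realms
have already been turned into C strings.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBUCKETS 64     /* assumption: "between 0 and 63 (say)" */

/* Fold `len` bytes into a running 32-bit FNV-1a hash. */
static uint32_t
fnv1a(uint32_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0) {
        h ^= *p++;
        h *= 16777619u;         /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hypothetical: map the hashed fields to a bucket in [0, NBUCKETS). */
static unsigned int
cred_bucket(const char *cname, const char *crealm, const char *sname,
            const char *srealm, int32_t enctype)
{
    uint32_t h = 2166136261u;   /* FNV-1a 32-bit offset basis */
    uint32_t e = (uint32_t)enctype;
    unsigned char ebytes[4];

    /* Hash each terminating NUL too, so ("ab","c") != ("a","bc"). */
    h = fnv1a(h, cname, strlen(cname) + 1);
    h = fnv1a(h, crealm, strlen(crealm) + 1);
    h = fnv1a(h, sname, strlen(sname) + 1);
    h = fnv1a(h, srealm, strlen(srealm) + 1);

    /* Fixed byte order so the result does not depend on endianness. */
    ebytes[0] = (unsigned char)(e >> 24);
    ebytes[1] = (unsigned char)(e >> 16);
    ebytes[2] = (unsigned char)(e >> 8);
    ebytes[3] = (unsigned char)e;
    h = fnv1a(h, ebytes, sizeof(ebytes));

    return h % NBUCKETS;
}

/* Format "krb5cctmp-<generation>-<bucket>"; -1 if it does not fit. */
static int
bucket_name(char *buf, size_t buflen, unsigned long generation,
            unsigned int bucket)
{
    int n = snprintf(buf, buflen, "krb5cctmp-%lu-%u", generation, bucket);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Whatever hash is chosen has to be identical across all implementations
sharing a ccache, since readers and writers must agree on the bucket.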

The prefix is necessary to prevent a security vulnerability on systems
that lack unlinkat(2).  (A malicious user could create a symlink with
the same name as the old ancillary directory to get the loser of two
racing kdestroys to unlink files elsewhere.  This is avoided by using
unlinkat(2) where available.  Where unlinkat(2) is not available, the
generated prefix makes the names to be unlinked useless to the
attacker.)

The generation number is there to ensure that running kinit results in
old tickets disappearing atomically, even if the removal process is
interrupted.  The generation number is incremented on every kinit.
Racing to do this is fine: either way old tickets are removed, or at
least appear to be.

Writes will generally be delayed until krb5_cc_close() time;
krb5_cc_{initialize, gen_new, new_unique, store_cred, destroy, ...}()
will queue up tasks in memory.  But when storing a credential that is
neither a cc config nor a start TGT, krb5_cc_store_cred() should act
immediately, because some apps hold ccaches open for long periods of
time (perhaps this decision can be based on whether KRB5_TC_OPENCLOSE
is set).

(krb5_cc_last_change_time() may need to stat all buckets;
alternatively, writers can touch the main file with utimes(2).)

All the logic can be in a single source file in the library; no changes
to kinit, kvno, or kdestroy should be necessary.  This should be a
drop-in replacement for src/lib/krb5/ccache/cc_file.c.

The protocols for initializing, writing to, and destroying the new
ccache type are described below.

 - Search for a credential:

   a) read the main file; if the credential is found there, return it,
      else get the ancillary directory name,
   b) hash the credential to a bucket,
   c) search that bucket.

   In all cases treat errors when reading a credential (e.g., lengths
   that are too long) as EOF.  I.e., treat corruption as EOF, not as any
   sort of fatal error.

   Enumeration (klist) is very similar: every bucket is iterated.
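
   The "corruption is EOF" rule can be sketched as below.  Note that the
   framing here (a 4-byte big-endian length followed by that many bytes)
   is a stand-in I made up to keep the example short; it is NOT the real
   FILE ccache credential encoding.  MAX_ENTRY_LEN and the function name
   are likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRY_LEN (1u << 20)    /* hypothetical sanity limit */

/*
 * Walk length-prefixed entries in buf[0..len).  Stop quietly at the first
 * entry that is truncated or implausibly long, exactly as if the file had
 * ended there.  Returns the number of complete entries and stores in
 * *valid_len how many leading bytes they cover.
 */
static size_t
count_valid_entries(const unsigned char *buf, size_t len, size_t *valid_len)
{
    size_t off = 0, n = 0;

    while (len - off >= 4) {
        uint32_t elen = ((uint32_t)buf[off] << 24) |
                        ((uint32_t)buf[off + 1] << 16) |
                        ((uint32_t)buf[off + 2] << 8) |
                        (uint32_t)buf[off + 3];

        /* Too long, or runs past the end: treat as EOF, not an error. */
        if (elen > MAX_ENTRY_LEN || elen > len - off - 4)
            break;
        off += 4 + (size_t)elen;
        n++;
    }
    *valid_len = off;
    return n;
}
```

   A reader built this way never fails on a partially written tail; it
   just sees fewer entries.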

 - Write a new start TGT or cc config:

   a) read the main file to copy all other cc configs,
   b) mk[o]stemp() a new main file,
   c) write the new contents to the main file,
   d) rename(2) the new main file into place.

 - Write a new credential that is not a cc config or start TGT:

   a) read the main file to find the ancillary directory name,
   b) hash the credential to a bucket,
   c) mk[o]stemp() a new bucket in the ancillary directory,
   d) write the new cred to the new bucket,
   e) open the old bucket and copy up to N bytes (or creds) from the old
      bucket to the new if the old one exists,
   f) renameat(2) the new bucket into place.

   Note that the copy at step (e) can be a straight copy using write(2)
   of the old bucket mmap()ed in.  There's no need to iterate over
   bucket entries to write whole entries.  Partial tail entries are
   corrupt, but that's OK since we ignore tail corruption at read-time.
   This means that the copy step can be faster than iterative read and
   copy.  Indeed, this copy could even be done using aio_write(), with
   the renameat(2) done immediately after starting the write -- since
   incompletely-written entries will be treated as EOF, no harm results.

   Things to be hashed: cname, crealm, sname, srealm, session key
   enctype.
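
   Steps (c) through (f) might look roughly like the following.  This is
   a simplified sketch: it copies with a read(2)/write(2) loop rather
   than mmap() or aio_write(), it uses rename(2) on full paths rather
   than renameat(2), and the names bucket_store(), write_all() and the
   "newbucket-XXXXXX" temp template are my inventions.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Write all `len` bytes, retrying on EINTR and short writes. */
static int
write_all(int fd, const unsigned char *p, size_t len)
{
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        len -= (size_t)w;
    }
    return 0;
}

/*
 * Hypothetical: replace bucket dir/name with the new credential followed
 * by at most `max_copy` bytes of the old bucket (if there is one).
 */
static int
bucket_store(const char *dir, const char *name, const void *cred,
             size_t credlen, size_t max_copy)
{
    char tmp[4096], dst[4096];
    unsigned char buf[8192];
    int newfd, oldfd, n, ret = -1;

    n = snprintf(dst, sizeof(dst), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(dst))
        return -1;
    n = snprintf(tmp, sizeof(tmp), "%s/newbucket-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    newfd = mkstemp(tmp);
    if (newfd == -1)
        return -1;

    /* The newest credential goes first. */
    if (write_all(newfd, cred, credlen) == -1)
        goto out;

    /* Raw byte copy; a partial tail entry is fine (readers see EOF). */
    oldfd = open(dst, O_RDONLY);
    if (oldfd != -1) {
        while (max_copy > 0) {
            size_t want = max_copy < sizeof(buf) ? max_copy : sizeof(buf);
            ssize_t r = read(oldfd, buf, want);
            if (r == -1 && errno == EINTR)
                continue;
            if (r <= 0)
                break;          /* EOF or error: keep what we have */
            if (write_all(newfd, buf, (size_t)r) == -1) {
                close(oldfd);
                goto out;
            }
            max_copy -= (size_t)r;
        }
        close(oldfd);
    }

    if (close(newfd) == -1) {
        newfd = -1;
        goto out;
    }
    newfd = -1;
    if (rename(tmp, dst) == 0)
        ret = 0;
out:
    if (newfd != -1)
        close(newfd);
    if (ret == -1)
        unlink(tmp);
    return ret;
}
```

   Because new entries are prepended and only a bounded amount of the
   old bucket is carried over, the oldest entries fall off the end.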

 - Initialization (first time):

   a) if the ccache exists and contains the ancillary directory name and
      start TGT, then re-initialize (see below), else continue,
   b) mkdtemp() the ancillary directory;
   c) generate a prefix for filenames in the ancillary directory;
   d) mk[o]stemp() the new main file including cc configs for various
      things *including* the name of the ancillary directory and the
      prefix for filenames in it;
   e) rename(2) the main file into place.

   Note that multiple initializations can race, and some may leave
   empty temp directories lying around.
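
   Step (b) could be as small as the sketch below.  How the directory is
   "named from the main file" is not pinned down above; here I simply
   assume it is created in the same directory as the main file, and the
   function name is made up.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical: create an ancillary directory next to `main_path`.
 * On success the chosen path is left in `out` and 0 is returned.
 */
static int
make_ancillary_dir(const char *main_path, char *out, size_t outlen)
{
    const char *slash = strrchr(main_path, '/');
    /* Length of the directory part, including the trailing '/'. */
    int dirlen = slash != NULL ? (int)(slash - main_path) + 1 : 0;
    int n;

    n = snprintf(out, outlen, "%.*skrb5ccd-XXXXXX", dirlen, main_path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;

    /* mkdtemp() fills in the X's and creates the directory mode 0700. */
    return mkdtemp(out) == NULL ? -1 : 0;
}
```

   The resulting path is what would then be recorded as a cc config in
   the new main file in step (d).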

 - Re-initialization (e.g., kinit -R):

   a) read the existing main file to get the ancillary directory name
      and current generation number,
   b) set up a new main file with the same ancillary directory and an
      incremented generation number,
   c) rename(2) the new main file into place,
   d) clean up the ancillary directory by iterating over the previous
      generation's bucket file names and removing them with unlinkat(2)
      (or unlink(2) where unlinkat(2) is not available), ignoring
      errors.

   There's a race condition here that can leave old buckets around, but
   the tickets stored there should be non-renewable, non-start TGTs, and
   should expire eventually.

   Interruption in the middle of unlinking old buckets can also leave
   them lying around.
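
   The cleanup in step (d) might be sketched as follows, assuming
   unlinkat(2) is available, 64 buckets, and the file naming given
   earlier.  The function name remove_generation() is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NBUCKETS 64     /* assumption: buckets are numbered 0..63 */

/*
 * Hypothetical: remove every bucket file of one generation.  Returns -1
 * only if the ancillary directory cannot be opened; errors from
 * unlinkat(2) (e.g. ENOENT for a bucket that never existed, or one a
 * racing process already removed) are deliberately ignored.
 */
static int
remove_generation(const char *ancdir, unsigned long generation)
{
    char name[64];
    unsigned int b;
    int dfd;

    dfd = open(ancdir, O_RDONLY | O_DIRECTORY);
    if (dfd == -1)
        return -1;

    for (b = 0; b < NBUCKETS; b++) {
        snprintf(name, sizeof(name), "krb5cctmp-%lu-%u", generation, b);
        /* Relative to dfd, so a later rename or symlink swap of the
         * directory's path name cannot redirect the unlink. */
        (void)unlinkat(dfd, name, 0);
    }
    close(dfd);
    return 0;
}
```

   Since the names are computed rather than read from the directory, no
   readdir() is needed and files of the current generation are never
   touched.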

 - Destroy:

   a) read the ancillary directory name and bucket prefix from the main
      file,
   b) unlink(2) the main file,
   c) unlinkat(2) all the files in the ancillary directory by iterating
      over the possible bucket names (ignoring errors from unlinkat(2)),
   d) rmdir() the ancillary directory.

   Interruption can leave old buckets lying around, but the tickets
   stored there should be non-renewable, non-start TGTs, and should
   expire eventually.

BENEFITS:

 - ABSOLUTELY NO POSIX FILE LOCKING.

 - Only depends on filesystem synchronization primitives that are
   concurrency-safe and also thread-safe.

 - Speeds up ccache lookups through hashing, by putting an upper bound
   on ccache and bucket size, by greatly reducing contention, and by
   putting most-recently-acquired tickets closest to the front in each
   bucket.

 - Speeds up writes by reducing contention, both by using hashing and by
   pushing all contention to renameat(2) (which greatly reduces the
   amount of time for which locks are held [in the filesystem]).

   Writes are made slower by the need to copy buckets, but this can be
   made asynchronous, and anyways, the amount to copy can be tuned.

 - Self-cleaning.  No need to kinit just to remove old tickets, though
   kinit (and kinit -R) will still have that effect (purposefully, to
   avoid surprises).

 - Backwards compatible:

    - old FILE ccache implementations may still corrupt the main file,
      but new implementations will recover automatically as corruption
      will not affect the main file entries written by new
      implementations;

    - old FILE ccache implementations will not find cached non-start
      TGTs written by new ones, but that's OK.

ISSUES:

 - kdestroys may be interrupted, leaving buckets with what should be
   non-renewable, non-start TGTs lying around.  This is acceptable as
   the credentials in question will expire soon enough.

   If this is not acceptable then it can be addressed simply by
   super-encrypting ticket session keys with a key stored only in the
   main file.  Or even by encrypting every bucket entry with a key
   stored only in the main file.

 - Racing initial ccache initializations can result in orphaned
   ancillary directories.  See above.

   Naming ancillary directories with a recognizable prefix allows for
   periodic cleanup of orphaned ancillary directories.

 - Racing krb5_cc_set_config() with kinit / kinit -R can result in
   losing the new start TGT.  In practice this won't happen as all calls
   to krb5_cc_set_config() are in the context of initialization.

We can make the worst case nothing more than leaked storage.  Malicious
users cannot cause others to leak storage.

Nico
--