LMDB KDB module design notes

Mon Apr 9 10:45:07 EDT 2018

I have been considering how MIT krb5 might implement an LMDB KDB
module.

LMDB operations take place within read or write transactions.  Read
transactions do not block write transactions; instead, read transactions
delay the reclamation of pages obsoleted by write transactions.  This is
attractive for a KDB, as it means "kdb5_util dump" can take a snapshot
of the database without blocking password changes or administrative
operations.  (The DB2 module allows this with the "unlockiter" DB
option, but that option carries a noticeable performance penalty,
produces a dump which is not an exact snapshot, and is probably open to
rare edge cases where an admin deletes a principal entry just as the
iteration reaches it.)

"kdb5_util load" is our one transactional write operation.  It calls
krb5_db_create() with the "temporary" DB option, puts principal and
policy entries, and then calls krb5_db_promote() to make the new KDB
visible.  The DB2 module handles this by creating side databases and
lockfiles with a "~" extension, and then renaming them into place.  For
this to work, each kdb_db2 operation needs to close and reopen the
database.

The three lockout fields of principal entries (last_success,
last_failed, and fail_auth_count) add further complexity.  These
fields are updated by the KDC by default, and are not replicated in an
iprop setup.  iprop loads include the "merge_nra" DB option when
creating the side database, indicating that existing principal entries
should retain their current lockout attribute values.

Here is my general design framework, taking the above into
consideration:

* We use two MDB environments, setting the MDB_NOSUBDIR flag so that
  each environment is a pair of files instead of a subdirectory:

  - A primary environment (suffix ".mdb") containing a "policy" database
    holding policy entries and a "principal" database holding principal
    entries minus lockout fields.

  - A secondary environment (suffix ".lockout.mdb") containing a
    "lockout" database holding principal lockout fields.

  The KDC only needs to write to the lockout environment, and can open
  the primary environment read-only.

  The lockout environment is never emptied, never iterated over, and
  uses only short-lived transactions, so the KDC is never blocked more
  than briefly.

* For creations with the "temporary" DB option, instead of creating a
  side database, we open or create the usual environment files, begin a
  write transaction on the primary environment for the lifetime of the
  database context, and open and drop the principal and policy databases
  within that transaction.  put_principal and put_policy operations use
  the database context write transaction instead of creating short-lived
  ones.  When the database is promoted, we commit the write transaction
  and the load becomes visible.

  To maintain the low-contention nature of the lockout environment, we
  compromise on the transactionality of load operations for the lockout
  fields.  We do not empty the lockout database on a load and we write
  entries to it as put_principal operations occur during the load.
  Therefore:

  - updates to the lockout fields become visible immediately (for
    existing principal entries), instead of at the end of the load.

  - updates to the lockout fields remain visible (for existing principal
    entries) if the load operation is aborted.

  - since we don't empty the lockout database, we leave garbage entries
    behind for old principals which have disappeared from the dump file
    we loaded.

  I don't anticipate any of those behaviors being noticeable in
  practice.  We could provide a tool to remove the garbage entries in
  the lockout database if it becomes an issue for anyone.

* For iprop loads, we set a context flag if we see the "merge_nra" DB
  option at creation time.  If the context flag is set, put_principal
  operations check for existing entries in the lockout database before
  writing, and do nothing if an entry is already there.

* To iterate over principals or policies, we create a read transaction
  in the primary MDB environment for the lifetime of the cursor.  By
  default, LMDB only allows one transaction per environment per thread.
  This would break "kdb5_util update_princ_encryption", which does
  put_principal operations during iteration.  Therefore, we must specify
  the MDB_NOTLS flag in the primary environment.

  The MDB_NOTLS flag carries a performance penalty for the creation of
  read transactions.  To mitigate this penalty, we can save a read
  transaction handle in the DB context for get operations, using
  mdb_txn_reset() and mdb_txn_renew() between operations.

* The existing in-tree KDB modules allow simultaneous access to the same
  DB context by multiple threads, even though the KDC and kadmind are
  single-threaded and we don't allow krb5_context objects to be used by
  multiple threads simultaneously.  For the LMDB module, we will need to
  either synchronize the use of transaction handles, or document that it
  isn't thread-safe and will need mutexes added if it needs to be
  thread-safe in the future.

* LMDB files are capped at the memory map size, which is 10MB by
  default.  Heimdal exposes this as a configuration option and we should
  probably do the same; we might also want a larger default like 128MB.
  We will have to consider how to apply any default map size to the
  lockout environment as well as the primary environment.

* LMDB also has a configurable maximum number of readers.  The default
  of 126 is probably adequate for most deployments, but we again
  probably want a configuration option in case it needs to be raised.

* By default LMDB calls fsync() or fdatasync() for each committed write
  transaction.  This probably overshadows the performance benefits of
  LMDB versus DB2, in exchange for improved durability.  I think we will
  want to always set the MDB_NOSYNC flag for the lockout environment,
  and might need to add an option to set it for the primary environment.