LMDB KDB module design notes

Thu Apr 12 08:03:27 EDT 2018

On Mon, 2018-04-09 at 10:45 -0400, Greg Hudson wrote:
> I have been considering how MIT krb5 might implement an LMDB KDB
> module.
> 
> LMDB operations take place within read or write transactions.  Read
> transactions do not block write transactions; instead, read transactions
> delay the reclamation of pages obsoleted by write transactions.  This is
> attractive for a KDB, as it means "kdb5_util dump" can take a snapshot
> of the database without blocking password changes or administrative
> operations.  (The DB2 module allows this with the "unlockiter" DB
> option, but that option carries a noticeable performance penalty, causes
> kdb5_util dump to write something which isn't exactly a snapshot, and is
> probably open to rare edge cases where an admin deletes a principal
> entry right as it's being iterated through.)
> 
> "kdb5_util load" is our one transactional write operation.  It calls
> krb5_db_create() with the "temporary" DB option, puts principal and
> policy entries, and then calls krb5_db_promote() to make the new KDB
> visible.  The DB2 module handles this by creating side databases and
> lockfiles with a "~" extension, and then renaming them into place.  For
> this to work, each kdb_db2 operation needs to close and reopen the
> database.
> 
> The three lockout fields of principal entries (last_success,
> last_failed, and fail_auth_count) add additional complexity.  These
> fields are updated by the KDC by default, and are not replicated in an
> iprop setup.  iprop loads include the "merge_nra" DB option when
> creating the side database, indicating that existing principal entries
> should retain their current lockout attribute values.
> 
> Here is my general design framework, taking the above into
> consideration:
> 
> * We use two MDB environments, setting the MDB_NOSUBDIR flag so that
>   each environment is a pair of files instead of a subdirectory:
> 
>   - A primary environment (suffix ".mdb") containing a "policy" database
>     holding policy entries and a "principal" database holding principal
>     entries minus lockout fields.
> 
>   - A secondary environment (suffix ".lockout.mdb") containing a
>     "lockout" database holding principal lockout fields.
> 
>   The KDC only needs to write to the lockout environment, and can open
>   the primary environment read-only.
> 
>   The lockout environment is never emptied, never iterated over, and
>   uses only short-lived transactions, so the KDC is never blocked more
>   than briefly.
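
To illustrate how small the records in the lockout database could be,
here is a minimal sketch in C.  The fixed 12-byte little-endian layout
and the names lockout_rec, lockout_encode and lockout_decode are
assumptions made for this example; the proposal above does not specify
an encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of the three lockout fields. */
struct lockout_rec {
    uint32_t last_success;
    uint32_t last_failed;
    uint32_t fail_auth_count;
};

/* Assumed value size: three 32-bit little-endian integers. */
#define LOCKOUT_REC_LEN 12

static void put_le32(uint32_t v, uint8_t *p)
{
    p[0] = v & 0xff;
    p[1] = (v >> 8) & 0xff;
    p[2] = (v >> 16) & 0xff;
    p[3] = (v >> 24) & 0xff;
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Serialize the lockout fields into a fixed-size value. */
void lockout_encode(const struct lockout_rec *rec,
                    uint8_t out[LOCKOUT_REC_LEN])
{
    assert(rec != NULL && out != NULL);
    put_le32(rec->last_success, out);
    put_le32(rec->last_failed, out + 4);
    put_le32(rec->fail_auth_count, out + 8);
}

/* Parse a stored value; return 0 on success, -1 on a malformed record. */
int lockout_decode(const uint8_t *buf, size_t len, struct lockout_rec *rec)
{
    if (buf == NULL || rec == NULL || len != LOCKOUT_REC_LEN)
        return -1;
    rec->last_success = get_le32(buf);
    rec->last_failed = get_le32(buf + 4);
    rec->fail_auth_count = get_le32(buf + 8);
    return 0;
}
```

In this sketch the key would be the principal name and the value these
12 bytes, so each KDC update rewrites one small record.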

I am not a fan of setups that use multiple files for databases,
especially when transactions need to span more than one of them.
What is the underlying reason for doing this in the new design, instead
of using a single database file holding all the data?

> * For creations with the "temporary" DB option, instead of creating a
>   side database, we open or create the usual environment files, begin a
>   write transaction on the primary environment for the lifetime of the
>   database context, and open and drop the principal and policy databases
>   within that transaction.  put_principal and put_policy operations use
>   the database context write transaction instead of creating short-lived
>   ones.  When the database is promoted, we commit the write transaction
>   and the load becomes visible.
> 
>   To maintain the low-contention nature of the lockout environment, we
>   compromise on the transactionality of load operations for the lockout
>   fields.  We do not empty the lockout database on a load and we write
>   entries to it as put_principal operations occur during the load.
>   Therefore:
> 
>   - updates to the lockout fields become visible immediately (for
>     existing principal entries), instead of at the end of the load.
> 
>   - updates to the lockout fields remain visible (for existing principal
>     entries) if the load operation is aborted.
> 
>   - since we don't empty the lockout database, we leave garbage entries
>     behind for old principals which have disappeared from the dump file
>     we loaded.
> 
>   I don't anticipate any of those behaviors being noticeable in
>   practice.  We could provide a tool to remove the garbage entries in
>   the lockout database if it becomes an issue for anyone.
> 
> * For iprop loads, we set a context flag if we see the "merge_nra" DB
>   option at creation time.  If the context flag is set, put_principal
>   operations check for existing entries in the lockout database before
>   writing, and do nothing if an entry is already there.
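
The "merge_nra" rule described above can be modeled without LMDB at
all.  The following is a toy sketch: the in-memory table stands in for
the lockout database, and the names db_context, lockout_find and
put_lockout are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 16
#define MAX_NAME 64

struct lockout {
    uint32_t last_success, last_failed, fail_auth_count;
};

/* Toy stand-in for the lockout database: a small array keyed by name. */
struct lockout_db {
    struct {
        char name[MAX_NAME];
        struct lockout val;
    } ent[MAX_ENTRIES];
    size_t count;
};

/* Hypothetical DB context carrying the flag set at creation time. */
struct db_context {
    struct lockout_db lockout;
    bool merge_nra;
};

/* Return the stored lockout value for name, or NULL if there is none. */
struct lockout *lockout_find(struct lockout_db *db, const char *name)
{
    for (size_t i = 0; i < db->count; i++) {
        if (strcmp(db->ent[i].name, name) == 0)
            return &db->ent[i].val;
    }
    return NULL;
}

/* Lockout half of a put_principal operation.  With merge_nra set, an
 * existing entry is left untouched; otherwise it is overwritten.
 * Returns 0 on success, -1 if the toy table cannot hold the entry. */
int put_lockout(struct db_context *ctx, const char *name,
                const struct lockout *val)
{
    struct lockout *cur = lockout_find(&ctx->lockout, name);

    if (cur != NULL) {
        if (!ctx->merge_nra)
            *cur = *val;
        return 0;
    }
    if (ctx->lockout.count >= MAX_ENTRIES || strlen(name) >= MAX_NAME)
        return -1;
    strcpy(ctx->lockout.ent[ctx->lockout.count].name, name);
    ctx->lockout.ent[ctx->lockout.count].val = *val;
    ctx->lockout.count++;
    return 0;
}
```

With LMDB itself, the lookup-then-skip step could probably be folded
into the write by passing MDB_NOOVERWRITE to mdb_put() and treating
MDB_KEYEXIST as success.
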
> 
> * To iterate over principals or policies, we create a read transaction
>   in the primary MDB environment for the lifetime of the cursor.  By
>   default, LMDB only allows one transaction per environment per thread.
>   This would break "kdb5_util update_princ_encryption", which does
>   put_principal operations during iteration.  Therefore, we must specify
>   the MDB_NOTLS flag in the primary environment.
> 
>   The MDB_NOTLS flag carries a performance penalty for the creation of
>   read transactions.  To mitigate this penalty, we can save a read
>   transaction handle in the DB context for get operations, using
>   mdb_txn_reset() and mdb_txn_renew() between operations.
> 
> * The existing in-tree KDB modules allow simultaneous access to the same
>   DB context by multiple threads, even though the KDC and kadmind are
>   single-threaded and we don't allow krb5_context objects to be used by
>   multiple threads simultaneously.  For the LMDB module, we will need to
>   either synchronize the use of transaction handles, or document that it
>   isn't thread-safe and will need mutexes added if it needs to be
>   thread-safe in the future.
> 
> * LMDB files are capped at the memory map size, which is 10MB by
>   default.  Heimdal exposes this as a configuration option and we should
>   probably do the same; we might also want a larger default like 128MB.
>   We will have to consider how to apply any default map size to the
>   lockout environment as well as the primary environment.
> 
> * LMDB also has a configurable maximum number of readers.  The default
>   of 126 is probably adequate for most deployments, but we again
>   probably want a configuration option in case it needs to be raised.
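
For the two tunables just mentioned, resolving configuration into
environment limits might look like the sketch below.  The defaults
follow the numbers suggested above (128MB, 126 readers); the idea of
giving the map size in megabytes, and the names env_limits and
resolve_limits, are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_MAPSIZE_MB 128
#define DEFAULT_MAX_READERS 126

/* Hypothetical resolved limits for one LMDB environment. */
struct env_limits {
    size_t mapsize;             /* bytes */
    unsigned int max_readers;
};

/* Parse a positive decimal integer no larger than max.
 * Return 0 on success, -1 on any malformed or out-of-range input. */
static int parse_positive(const char *s, unsigned long max,
                          unsigned long *out)
{
    char *end;
    unsigned long v;

    if (s == NULL || *s == '\0' || *s == '-')
        return -1;
    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v == 0 || v > max)
        return -1;
    *out = v;
    return 0;
}

/* Fill in limits from optional config strings; NULL means "use the
 * default".  Return 0 on success, -1 if a supplied value is invalid. */
int resolve_limits(const char *mapsize_mb, const char *max_readers,
                   struct env_limits *out)
{
    unsigned long mb = DEFAULT_MAPSIZE_MB, readers = DEFAULT_MAX_READERS;
    const unsigned long mb_max = SIZE_MAX / (1024 * 1024);

    if (mapsize_mb != NULL && parse_positive(mapsize_mb, mb_max, &mb) != 0)
        return -1;
    if (max_readers != NULL &&
        parse_positive(max_readers, UINT_MAX, &readers) != 0)
        return -1;
    out->mapsize = (size_t)mb * 1024 * 1024;
    out->max_readers = (unsigned int)readers;
    return 0;
}
```

The resolved values would then be handed to mdb_env_set_mapsize() and
mdb_env_set_maxreaders() before mdb_env_open().  Whether the lockout
environment should share these values or get its own is the open
question raised above; this sketch does not settle it.
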
> 
> * By default LMDB calls fsync() or fdatasync() for each committed write
>   transaction.  This probably overshadows the performance benefits of
>   LMDB versus DB2, in exchange for improved durability.  I think we will
>   want to always set the MDB_NOSYNC flag for the lockout environment,
>   and might need to add an option to set it for the primary environment.

-- 
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc