LMDB KDB module design notes
Nathaniel McCallum
npmccallum at redhat.com
Mon Apr 9 16:40:43 EDT 2018
This seems reasonable. I'm glad to see MIT considering LMDB (my
experiences with it are positive).
On Mon, Apr 9, 2018 at 10:45 AM, Greg Hudson <ghudson at mit.edu> wrote:
> I have been considering how MIT krb5 might implement an LMDB KDB
> module.
>
> LMDB operations take place within read or write transactions. Read
> transactions do not block write transactions; instead, read transactions
> delay the reclamation of pages obsoleted by write transactions. This is
> attractive for a KDB, as it means "kdb5_util dump" can take a snapshot
> of the database without blocking password changes or administrative
> operations. (The DB2 module allows this with the "unlockiter" DB
> option, but that option carries a noticeable performance penalty, causes
> kdb5_util dump to write something which isn't exactly a snapshot, and is
> probably open to rare edge cases where an admin deletes a principal
> entry right as it's being iterated through.)
>
> "kdb5_util load" is our one transactional write operation. It calls
> krb5_db_create() with the "temporary" DB option, puts principal and
> policy entries, and then calls krb5_db_promote() to make the new KDB
> visible. The DB2 module handles this by creating side databases and
> lockfiles with a "~" extension, and then renaming them into place. For
> this to work, each kdb_db2 operation needs to close and reopen the
> database.
>
> The three lockout fields of principal entries (last_success,
> last_failed, and fail_auth_count) add further complexity. These
> fields are updated by the KDC by default, and are not replicated in an
> iprop setup. iprop loads include the "merge_nra" DB option when
> creating the side database, indicating that existing principal entries
> should retain their current lockout attribute values.
>
> Here is my general design framework, taking the above into
> consideration:
>
> * We use two MDB environments, setting the MDB_NOSUBDIR flag so that
> each environment is a pair of files instead of a subdirectory:
>
> - A primary environment (suffix ".mdb") containing a "policy" database
> holding policy entries and a "principal" database holding principal
> entries minus lockout fields.
>
> - A secondary environment (suffix ".lockout.mdb") containing a
> "lockout" database holding principal lockout fields.
>
> The KDC only needs to write to the lockout environment, and can open
> the primary environment read-only.
>
> The lockout environment is never emptied, never iterated over, and
> uses only short-lived transactions, so the KDC is never blocked more
> than briefly.
>
> * For creations with the "temporary" DB option, instead of creating a
> side database, we open or create the usual environment files, begin a
> write transaction on the primary environment for the lifetime of the
> database context, and open and drop the principal and policy databases
> within that transaction. put_principal and put_policy operations use
> the database context write transaction instead of creating short-lived
> ones. When the database is promoted, we commit the write transaction
> and the load becomes visible.
>
> To maintain the low-contention nature of the lockout environment, we
> compromise on the transactionality of load operations for the lockout
> fields. We do not empty the lockout database on a load and we write
> entries to it as put_principal operations occur during the load.
> Therefore:
>
> - updates to the lockout fields become visible immediately (for
> existing principal entries), instead of at the end of the load.
>
> - updates to the lockout fields remain visible (for existing principal
> entries) if the load operation is aborted.
>
> - since we don't empty the lockout database, we leave garbage entries
> behind for old principals which have disappeared from the dump file
> we loaded.
>
> I don't anticipate any of those behaviors being noticeable in
> practice. We could provide a tool to remove the garbage entries in
> the lockout database if it becomes an issue for anyone.
>
> * For iprop loads, we set a context flag if we see the "merge_nra" DB
> option at creation time. If the context flag is set, put_principal
> operations check for existing entries in the lockout database before
> writing, and do nothing if an entry is already there.
>
> * To iterate over principals or policies, we create a read transaction
> in the primary MDB environment for the lifetime of the cursor. By
> default, LMDB only allows one transaction per environment per thread.
> This would break "kdb5_util update_princ_encryption", which does
> put_principal operations during iteration. Therefore, we must specify
> the MDB_NOTLS flag in the primary environment.
>
> The MDB_NOTLS flag carries a performance penalty for the creation of
> read transactions. To mitigate this penalty, we can save a read
> transaction handle in the DB context for get operations, using
> mdb_txn_reset() and mdb_txn_renew() between operations.
>
> * The existing in-tree KDB modules allow simultaneous access to the same
> DB context by multiple threads, even though the KDC and kadmind are
> single-threaded and we don't allow krb5_context objects to be used by
> multiple threads simultaneously. For the LMDB module, we will need to
> either synchronize the use of transaction handles, or document that it
> isn't thread-safe and will need mutexes added if it needs to be
> thread-safe in the future.
>
> * LMDB files are capped at the memory map size, which is 10MB by
> default. Heimdal exposes this as a configuration option and we should
> probably do the same; we might also want a larger default like 128MB.
> We will have to consider how to apply any default map size to the
> lockout environment as well as the primary environment.
>
> * LMDB also has a configurable maximum number of readers. The default
> of 126 is probably adequate for most deployments, but we again
> probably want a configuration option in case it needs to be raised.
>
> * By default LMDB calls fsync() or fdatasync() for each committed write
> transaction. This probably overshadows the performance benefits of
> LMDB versus DB2, in exchange for improved durability. I think we will
> want to always set the MDB_NOSYNC flag for the lockout environment,
> and might need to add an option to set it for the primary environment.