[krbdev.mit.edu #8744] Issues when rolling the master key online
Greg Hudson via RT
rt-comment at KRBDEV-PROD-APP-1.mit.edu
Wed Oct 3 13:39:26 EDT 2018
I believe I know what went wrong on the master KDC in this scenario.
krb5_dbe_decrypt_key_data() in libkdb5 contains a mechanism for a
running process to continue working across a master key change, which
works as follows:
1. try to decrypt the key with our current master key list
2. (if that fails) try to reread the master key list using our newest
master key
3. try again to decrypt the key with the newly read master key list
Importantly, step 2 does not re-read the stash file; it relies on the
auxiliary data in the K/M principal entry, which contains copies of
the newest master key encrypted under each of the older ones.
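To make the ordering dependency concrete, here is a rough Python-style
sketch of that fallback.  Nothing below is the real libkdb5 API (the
real logic is C inside krb5_dbe_decrypt_key_data()); it is only a toy
model of the three steps:

    class IntegrityError(Exception):
        pass

    def decrypt_with(mkey, key_data):
        # Stand-in for real decryption: each key_data entry is
        # "encrypted" under exactly one master key in this model.
        if key_data['mkvno'] != mkey:
            raise IntegrityError('Decrypt integrity check failed')
        return key_data['key']

    def reread_mkey_list(newest_held_mkey, km_aux_data):
        # Step 2: rebuild the master key list from the K/M auxiliary
        # data (copies of the newest master key wrapped under each
        # older one), using the newest key this process already holds.
        # The stash file is never consulted.  If purge_mkeys has
        # already pruned the auxiliary data for our key, there is
        # nothing left to decrypt with.
        if newest_held_mkey not in km_aux_data:
            raise IntegrityError('no K/M auxiliary data for our key')
        return km_aux_data[newest_held_mkey]

    def decrypt_key_data(key_data, mkey_list, km_aux_data):
        # Step 1: try every master key currently held in memory.
        for mkey in mkey_list:
            try:
                return decrypt_with(mkey, key_data)
            except IntegrityError:
                pass
        # Step 2: refresh the list from the K/M auxiliary data; this
        # is where the failure in this ticket happens after
        # purge_mkeys.  (The toy assumes the list is newest-first.)
        mkey_list = reread_mkey_list(mkey_list[0], km_aux_data)
        # Step 3: retry with the refreshed list.
        for mkey in mkey_list:
            try:
                return decrypt_with(mkey, key_data)
            except IntegrityError:
                pass
        raise IntegrityError('Decrypt integrity check failed')

    # Failure case from this ticket: the process still holds only the
    # old key [1], the principal was re-encrypted under key 2, and the
    # auxiliary data was already purged, so step 2 has nothing to use:
    # decrypt_key_data({'mkvno': 2, 'key': 'K'}, [1], {})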
Therefore, if there is no KDC activity between
update_princ_encryption and purge_mkeys, step 2 will fail because the
KDC never got a chance to update its master key list before the K/M
auxiliary data was pruned. I can reproduce this symptom
("DECRYPT_CLIENT_KEY: ... Decrypt integrity check failed") by
changing t_mkey.py to do a purge_mkeys and kinit right after an
update_princ_encryption.
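For reference, the reproduction looks roughly like the following
k5test fragment (a sketch of the kind of change, not the exact diff to
t_mkey.py; the kdb5_util invocations and the 'newmaster' password are
illustrative):

    from k5test import *

    realm = K5Realm()

    # Add a second master key, stash it, and make it the active one.
    realm.run([kdb5_util, 'add_mkey', '-e', 'aes256-cts', '-s'],
              input='newmaster\nnewmaster\n')
    realm.run([kdb5_util, 'use_mkey', '2'])

    # Re-encrypt all principal keys under mkey 2, then purge mkey 1
    # immediately, with no intervening KDC or kadmind activity.
    realm.run([kdb5_util, 'update_princ_encryption', '-f', '-v'])
    realm.run([kdb5_util, 'purge_mkeys', '-f', '-v'])

    # The running KDC still holds only the old master key list, and
    # the K/M auxiliary data it would need to refresh that list is
    # gone, so this fails with "Decrypt integrity check failed".
    realm.kinit(realm.user_princ, password('user'), expected_code=1)

    success('purge_mkeys before KDC activity')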
The simplest operational workaround is to wait a while before purging
the old master key, and to make sure that the master KDC and kadmind
see some activity during that window.
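In the same k5test terms, the workaround amounts to forcing some
traffic between the two steps (again only a sketch; as I recall
run_kadmin exercises kadmind over the network, but any KDC request
and any kadmin RPC should do):

    realm.run([kdb5_util, 'update_princ_encryption', '-f', '-v'])

    # Give the running KDC and kadmind a reason to notice that their
    # in-memory master key lists no longer decrypt principal keys, so
    # they refresh from the K/M auxiliary data while it still exists.
    realm.kinit(realm.user_princ, password('user'))
    realm.run_kadmin(['getprinc', realm.user_princ])

    # Only now is it safe to prune the old master key.
    realm.run([kdb5_util, 'purge_mkeys', '-f', '-v'])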
At this time I am not sure what the best fix is.  We can document the
need for KDC/kadmind activity before purge_mkeys, but that's not
really satisfactory for a couple of reasons: it's hard to be sure
that the master KDC and kadmind have actually seen activity, and
there is no safety check.  We could perhaps make step 2 re-read the
stash file, but aside from possible implementation difficulties, it's
possible to operate a KDC without a stash file.  We could add a
signal handler to the KDC and kadmind which forces a re-read of the
master key list, but that's not very elegant either.
I think you may be right that there is another potential issue if
update_princ_encryption and purge_mkeys are propagated to a replica
KDC too quickly, particularly if that happens via full dump, but I
haven't worked out the details.  Ideally I would like to untangle any
problems there from this issue and address them in a separate ticket.
You implied that you observed a kpropd crash when the master KDC
became non-functional.  That would be a third potential bug.  Can
you confirm that the process actually stopped running, and perhaps
produce a core file with a backtrace?  (I can also try to reproduce
that failure myself.)