[krbdev.mit.edu #8744] Issues when rolling the master key online

Greg Hudson via RT rt-comment at KRBDEV-PROD-APP-1.mit.edu
Wed Oct 3 13:39:26 EDT 2018


I believe I know what went wrong on the master KDC in this scenario.  
krb5_dbe_decrypt_key_data() in libkdb5 contains a mechanism for a 
running process to continue working across a master key change, which 
is:

1. try to decrypt the key with our current master key list
2. (if that fails) try to reread the master key list using our newest 
master key
3. try again to decrypt the key with the newly read master key list

Importantly, step 2 does not re-read the stash file; it relies on the 
auxiliary data in the K/M principal entry which contains copies of 
the newest master key encrypted in the older ones.
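
To make the failure mode concrete, here is a small Python toy model of 
that fallback.  The real logic is C code in libkdb5 
(krb5_dbe_decrypt_key_data() and its helpers); every name and data 
structure below is invented purely for illustration, with "encryption" 
reduced to tagging a key with the master key kvno it was encrypted 
under:

class StaleMasterKeyError(Exception):
    pass

def try_decrypt(mkey_list, key_data):
    # Steps 1 and 3: a principal key can only be decrypted if the master
    # key version it was encrypted under is in our in-memory list.
    mkvno, value = key_data
    if mkvno not in mkey_list:
        raise StaleMasterKeyError('no master key with kvno %d' % mkvno)
    return value

def reread_mkey_list(km_aux_data, mkey_list):
    # Step 2: learn newer master keys from the K/M principal's auxiliary
    # data, which holds the newest master key wrapped in each older one.
    # Note that the stash file is never consulted here.
    refreshed = dict(mkey_list)
    for old_kvno, (new_kvno, new_mkey) in km_aux_data.items():
        if old_kvno in mkey_list:
            refreshed[new_kvno] = new_mkey
    if refreshed == mkey_list:
        raise StaleMasterKeyError('K/M auxiliary data yields no newer mkey')
    return refreshed

def dbe_decrypt_key_data(km_aux_data, mkey_list, key_data):
    try:
        return try_decrypt(mkey_list, key_data)                   # step 1
    except StaleMasterKeyError:
        mkey_list = reread_mkey_list(km_aux_data, mkey_list)      # step 2
        return try_decrypt(mkey_list, key_data)                   # step 3

# A KDC started before the master key roll only knows mkey kvno 1, but
# the user's key has been re-encrypted under mkey kvno 2.
kdc_mkeys = {1: 'mkey1'}
user_key = (2, 'user long-term key')

# Before purge_mkeys: the aux data lets step 2 recover kvno 2, so step 3
# succeeds even though step 1 failed.
print(dbe_decrypt_key_data({1: (2, 'mkey2')}, kdc_mkeys, user_key))

# After purge_mkeys: the aux entry is gone, so step 2 has nothing to
# work with and the request fails (the KDC logs a decrypt integrity
# error).
try:
    dbe_decrypt_key_data({}, kdc_mkeys, user_key)
except StaleMasterKeyError as err:
    print('decryption failed:', err)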

Therefore, if there is no KDC activity between 
update_princ_encryption and purge_mkeys, step 2 will fail because the 
KDC never got a chance to update its master key list before the K/M 
auxiliary data was pruned.  I can reproduce this symptom 
("DECRYPT_CLIENT_KEY: ... Decrypt integrity check failed") by 
changing t_mkey.py to do a purge_mkeys and kinit right after an 
update_princ_encryption.
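
For reference, here is a rough outline of that modification, written 
against the k5test helpers (K5Realm, realm.run, realm.kinit, password) 
that t_mkey.py uses; the exact arguments and surrounding setup are 
approximate, so treat this as a sketch of the command sequence rather 
than the actual test change:

from k5test import *

realm = K5Realm()

# Roll the master key while the already-running KDC keeps its original
# master key list in memory.
realm.run([kdb5_util, 'add_mkey', '-s'], input='newmkey\nnewmkey\n')
realm.run([kdb5_util, 'use_mkey', '2'])
realm.run([kdb5_util, 'update_princ_encryption', '-f'])

# Purging immediately prunes the K/M auxiliary data before the KDC ever
# refreshes its master key list.
realm.run([kdb5_util, 'purge_mkeys', '-f'])

# This kinit is expected to fail; the KDC log shows
# "DECRYPT_CLIENT_KEY: ... Decrypt integrity check failed".
realm.kinit(realm.user_princ, password('user'), expected_code=1)

success('purge_mkeys immediately after update_princ_encryption')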

The simplest operational workaround is to wait a while before purging 
the old master key, and to make sure that the master KDC and kadmind 
each handle some requests during that window so that they refresh 
their in-memory master key lists.

At this time I am not sure what the best fix is.  We can document the 
need for KDC/kadmind activity before purge_mkeys, but that's not 
really satisfactory for a couple of reasons (it's hard to be certain 
that kadmind and the master KDC have actually seen activity, and there 
is no safety check).  We could perhaps make step 2 reread the stash file, 
but aside from possible implementation difficulties, it's possible to 
operate a KDC without a stash file.  We could add a signal handler to 
the KDC and kadmind which causes a reread of the master key list, but 
that's not very elegant either.

I think you may be right that there is another potential issue if 
update_princ_encryption and purge_mkeys are propagated to a replica 
KDC too quickly, particularly if that happens via full dump, but I 
haven't worked out the details.  Ideally I would like to untangle any 
problems there from this issue and address them in a separate ticket.

You implied that you observed a kpropd crash when the master KDC 
became non-functional.  That would be a third potential bug.  Can 
you confirm that the process actually stopped running, and perhaps 
produce a core file with a backtrace?  (I can also try to reproduce 
that failure myself.)

