Ticket 5338: Race conditions in key rotation

Mon Jun 23 17:47:24 EDT 2008

On Jun 19, 2008, at 01:59, Roland Dowdeswell wrote:
> What happens when a slave gets a little corrupted? How about some
> level of DB corruption?  Truncated database?  Configuration errors?
> What if a misconfiguration causes the master to fail to propagate
> to the slave for some period of time?

If the slave KDC notices corruption it should either report a server  
error that causes the client to keep trying other KDCs, or it should  
just stay silent.  A little like you might get with DNS.

If a DNS server has old data, or quietly corrupted data, what do you  
do about it?
Do you implement all your applications or client libraries to check  
other DNS servers for the same domain if you get back no-such-host  
errors, just in case the data is out of date?  What if you get back an  
address, but it's the wrong one?  Should Firefox try multiple DNS  
servers, and try connecting to web servers at all the addresses  
returned (versus, say, looking for the first reachable address from  
the list provided by one DNS server)?

There are probably various things we could do to try to make the MIT
implementation more resistant to corruption in the slave database.  Maybe
have the master double-check the number of records in the dump file  
against some heuristic before transmitting via kprop, or have an extra  
record in the database (which would have to be kept in sync!) with a  
count of the number of principals.  Maybe the slave should go offline  
if the database hasn't been updated in some configurable amount of  
time.  Maybe we should provide scripts (anyone want to contribute?)  
that email the admins or sound other alarms if database propagation  
fails for longer than a certain amount of time.  Maybe slave update  
time should be made accessible via SNMP for query by nagios or other  
monitoring tools.
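
As a rough sketch of the "sound an alarm if propagation has stalled" idea:
the check below assumes some file on the slave has its modification time
bumped on every successful propagation.  That marker file, the function
names, and the threshold are all made up for illustration; nothing here is
an existing kprop or kpropd interface.

```python
import os
import time


def propagation_age_seconds(marker_path, now=None):
    """Seconds since marker_path was last modified.

    marker_path is a hypothetical file that gets touched whenever a
    propagation to this slave succeeds.
    """
    if now is None:
        now = time.time()
    return now - os.path.getmtime(marker_path)


def is_stale(marker_path, max_age_seconds, now=None):
    """True if the last successful propagation is older than the limit."""
    return propagation_age_seconds(marker_path, now) > max_age_seconds
```

A cron job could call is_stale() and mail the admins, or the same number
could be exported to whatever monitoring system the site already runs.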

If the slave database gets corrupted but the master is okay, ideally  
that's an issue that should be fixed in the next propagation.  If the  
corruption is in the master database, falling back to the master won't  
help.  Most of the times I've seen database corruption bugs, they've  
been in the random updates done on the master, not loading a database  
on a slave.  Though one or two have resulted in incorrect dump files  
for propagation that omit or duplicate entries, while the master can  
still access all the entries by name, at least for a little while  
longer.
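
A sanity check for that last failure mode might look something like this.
It is only a sketch: it assumes the principal names have already been
extracted from the dump file by some other means (no dump parsing is
shown), and the 10% shrink threshold is an arbitrary placeholder for
"some heuristic".

```python
from collections import Counter


def check_principal_list(names, previous_count, max_shrink=0.10):
    """Return a list of human-readable problems; empty means it looks sane.

    names          -- principal names taken from the new dump (assumed input)
    previous_count -- number of principals in the last good propagation
    max_shrink     -- hypothetical limit on how much the count may drop
    """
    problems = []

    # Duplicated entries suggest the dump walked the database incorrectly.
    dupes = sorted(n for n, c in Counter(names).items() if c > 1)
    if dupes:
        problems.append("duplicate entries: " + ", ".join(dupes))

    # A sudden large drop in the count suggests omitted entries.
    if previous_count > 0:
        shrink = (previous_count - len(names)) / previous_count
        if shrink > max_shrink:
            problems.append(
                f"principal count dropped from {previous_count} to {len(names)}"
            )

    return problems
```

The master could refuse to run kprop when the list comes back non-empty,
leaving the slaves with their previous, presumably good, copy.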

>  What if kpropd crashes?

Then any other slave should have accurate data; we don't need to jump  
to the master KDC, when we could distribute the load.

>  What
> if we put a slave into krb5.conf before we bring it live?

With the KDC process not running, the client will get no response (or  
a port-unreachable error), and will move on to another server in  
fairly short time.

If the KDC process is started up with an empty database, the KDC  
wouldn't be able to handle the client's AS requests, or decrypt the  
TGS requests, and we might have a problem.  Though I'm not sure where  
to draw the line between administrator screw-ups we should try to  
compensate for, and administrator screw-ups (or malicious actions)  
where we should just throw up our hands and say, "you need to fix this  
first".

>  What if
> a KDC from the wrong realm gets mis-set in our configuration?

Interesting... I'm not certain, but I can guess:  For an AS-REQ, it  
probably would get an error saying the client principal was unknown.   
For a TGS-REQ, the (wrong) KDC wouldn't have a database entry for the  
indicated (TGS) principal, and would probably return a KRB-ERROR  
saying so.

But if you're going to presuppose configuration errors, why assume  
that the listing for the master KDC will be correct?

Also, in this case, we could presumably succeed by trying *any* other  
KDC, i.e., ignore the error and continue with our existing heuristics  
(that can be tweaked in site-specific ways via plugins to order KDCs  
by proximity).

I'd certainly support changes to better detect certain configuration  
errors, or make it harder for them to cause problems, but I don't  
think the blanket response to anything amiss should be to double the  
network traffic and focus all the extra traffic on one server.

>   Etc.

What if the slave KDC doesn't have the latest update regarding the
authorization data a user should get for service X?  If service X
reports an authorization error, should we go back and talk to the  
master KDC just in case?

> Failures on slaves should not be considered to be permanent errors
> given that we know up front that there is a more canonical source
> of truth just a few lines away in the configuration file...

I'm sure there are any number of failures relating to an out-of-date
database that could result in problems for the client, some of which  
may be noticed in the Kerberos protocol exchange, and some may not be  
noticed until later.

All the problems relating to corruption, client or DNS configuration  
errors, etc., can be addressed by talking to another KDC; there's no  
need to use the master KDC specifically, and in some environments  
there may be a good reason not to when another slave is available.   
It's only when changes on the master haven't propagated that going  
directly to the master is the right answer.
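
To make the distinction concrete, here is a sketch of the client-side
policy I'm arguing for.  It is not the existing library code.  The
Outcome values, query_realm(), and the send callback are all
hypothetical, and which protocol errors would count as "maybe stale" is
exactly the open question in this ticket.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    NO_RESPONSE = "no_response"    # timeout or port unreachable
    SERVER_ERROR = "server_error"  # KDC reported an internal problem
    MAYBE_STALE = "maybe_stale"    # error an un-propagated change could explain


def query_realm(kdcs, master, send):
    """Try KDCs in order; return (kdc, outcome) for the final attempt.

    kdcs   -- KDC names in the order the existing heuristics chose
    master -- name of the master KDC
    send   -- hypothetical callable: send(kdc) -> Outcome
    """
    last = (None, Outcome.NO_RESPONSE)
    for kdc in kdcs:
        outcome = send(kdc)
        last = (kdc, outcome)
        if outcome is Outcome.SUCCESS:
            return last
        if outcome is Outcome.MAYBE_STALE:
            # Only this case justifies singling out the master.
            if kdc != master:
                return (master, send(master))
            return last
        # NO_RESPONSE / SERVER_ERROR: any other KDC will do, keep going.
    return last
```

Corruption, crashes, and misconfiguration all fall into the "keep going"
branch, so the extra load is spread across the slaves rather than
focused on one server.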

In the long term, I'd rather see the synchronization problem addressed  
well, so that the clients don't have to do any of this sort of thing.   
Incremental propagation support will help, though Sun's implementation  
that I'm looking at integrating uses a periodic polling model, not an  
instantaneous push model.  However, it should cut down the delay needed
for propagation, so you can say, e.g., "after a new service is  
registered, you may have to wait 30 seconds for the information to  
propagate before you can authenticate to it".  (Actually, in the long  
term, I'd like to see a multi-master type arrangement where we don't  
even have a distinguished "master KDC".  But failing that, immediate  
propagation of all changes seems like a very desirable substitute.)

In the short term ... yeah, maybe some "reasonable" failure cases  
don't have sufficient information, and we should consider retrying for  
now.  But if we do this, I really hope we consider it a temporary  
workaround we intend to dispense with, or at least disable by default,  
in a few releases.  And I'm still not convinced it should be the rule  
rather than the exception.

> Incremental Propagation
> -----------------------
>
> I've been considering just using rsync instead.  Obviously, one
> would need to lock the DB, copy it to a tmp file and then rsync it
> into its final location on the slaves.  It seems a lot simpler than
> maintaining our own code to do it and is probably optimal `enough'.

I've long suspected that for a large database with infrequent changes,  
it would be much more efficient than kprop.  Had rsync been around  
when kprop was written, it probably would've been the way to go.  If  
you try it, do let us know if you encounter any interesting issues.
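
For what it's worth, the lock/copy/rsync sequence you describe could be
sketched roughly as below.  Everything here is an assumption for
illustration: how the database actually gets locked is left to a
caller-supplied context manager, the paths and host names are
placeholders, and the only rsync option used is -a.

```python
import shutil
import subprocess


def snapshot(db_path, snapshot_path, lock):
    """Copy the database to a temporary file while holding the lock.

    lock is any context manager that excludes writers; what that means
    for a real KDC database is not addressed here.
    """
    with lock:
        shutil.copy2(db_path, snapshot_path)
    return snapshot_path


def rsync_command(snapshot_path, host, dest_path):
    """Build the rsync invocation that pushes the snapshot to one slave."""
    return ["rsync", "-a", snapshot_path, f"{host}:{dest_path}"]


def propagate(db_path, snapshot_path, lock, slaves, dest_path,
              run=subprocess.run):
    """Snapshot once, then rsync to each slave; return hosts that failed."""
    snapshot(db_path, snapshot_path, lock)
    failed = []
    for host in slaves:
        result = run(rsync_command(snapshot_path, host, dest_path))
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Taking the snapshot once keeps the lock held only for the local copy, not
for the network transfers.  Whether a running slave KDC copes with the
file changing underneath it is one of the "interesting issues" I'd want
to hear about.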

Ken