Ticket 5338: Race conditions in key rotation
raeburn at MIT.EDU
Mon Jun 23 17:47:24 EDT 2008
On Jun 19, 2008, at 01:59, Roland Dowdeswell wrote:
> What happens when a slave gets a little corrupted? How about some
> level of DB corruption? Truncated database? Configuration errors?
> What if a misconfiguration causes the master to fail to propagate
> to the slave for some period of time?
If the slave KDC notices corruption it should either report a server
error that causes the client to keep trying other KDCs, or it should
just stay silent. A little like you might get with DNS.
If a DNS server has old data, or quietly corrupted data, what do you
do about it?
Do you implement all your applications or client libraries to check
other DNS servers for the same domain if you get back no-such-host
errors, just in case the data is out of date? What if you get back an
address, but it's the wrong one? Should Firefox try multiple DNS
servers, and try connecting to web servers at all the addresses
returned (versus, say, looking for the first reachable address from
the list provided by one DNS server)?
There are probably various things we could do to try to make the MIT
implementation more resistant to corruption in the slave database. Maybe
have the master double-check the number of records in the dump file
against some heuristic before transmitting via kprop, or have an extra
record in the database (which would have to be kept in sync!) with a
count of the number of principals. Maybe the slave should go offline
if the database hasn't been updated in some configurable amount of
time. Maybe we should provide scripts (anyone want to contribute?)
that email the admins or sound other alarms if database propagation
fails for longer than a certain amount of time. Maybe slave update
time should be made accessible via SNMP for query by nagios or other
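Two of the checks suggested above (a record-count sanity check on the
master before propagation, and a staleness cutoff on the slave) could
look roughly like the following. This is only a sketch: the function
names, the 10% threshold, and the one-hour limit are made up for
illustration, and none of this is existing kprop or kpropd behavior.

```python
import time


def dump_count_plausible(dump_count, last_good_count, max_drop_fraction=0.10):
    """Return True if the dump's record count is not suspiciously low.

    Hypothetical heuristic: refuse to propagate a dump that has lost more
    than max_drop_fraction of the principals seen in the last good dump.
    """
    if last_good_count <= 0:
        # No baseline yet; only reject an empty dump.
        return dump_count > 0
    allowed_minimum = last_good_count * (1.0 - max_drop_fraction)
    return dump_count >= allowed_minimum


def slave_is_stale(last_update_time, max_age_seconds=3600, now=None):
    """Return True if the slave database has gone too long without an update.

    last_update_time and now are seconds since the epoch; max_age_seconds
    stands in for the "configurable amount of time".
    """
    if now is None:
        now = time.time()
    return (now - last_update_time) > max_age_seconds
```

A monitoring script could call slave_is_stale() with the modification
time of whatever file the site uses to mark a successful propagation,
and mail the admins (or take the slave offline) when it returns True.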
If the slave database gets corrupted but the master is okay, ideally
that's an issue that should be fixed in the next propagation. If the
corruption is in the master database, falling back to the master won't
help. Most of the times I've seen database corruption bugs, they've
been in the random updates done on the master, not loading a database
on a slave. Though one or two have resulted in incorrect dump files
for propagation that omit or duplicate entries, while the master can
still access all the entries by name, at least for a little while
> What if kpropd crashes?
Then any other slave should have accurate data; we don't need to jump
to the master KDC, when we could distribute the load.
> if we put a slave into krb5.conf before we bring it live?
With the KDC process not running, the client will get no response (or
a port-unreachable error), and will move on to another server in
fairly short time.
If the KDC process is started up with an empty database, the KDC
wouldn't be able to handle the client's AS requests, or decrypt the
TGS requests, and we might have a problem. Though I'm not sure where
to draw the line between administrator screw-ups we should try to
compensate for, and administrator screw-ups (or malicious actions)
where we should just throw up our hands and say, "you need to fix this
> What if
> a KDC from the wrong realm gets mis-set in our configuration?
Interesting... I'm not certain, but I can guess: For an AS-REQ, it
probably would get an error saying the client principal was unknown.
For a TGS-REQ, the (wrong) KDC wouldn't have a database entry for the
indicated (TGS) principal, and would probably return a KRB-ERROR
But if you're going to presuppose configuration errors, why assume
that the listing for the master KDC will be correct?
Also, in this case, we could presumably succeed by trying *any* other
KDC, i.e., ignore the error and continue with our existing heuristics
(that can be tweaked in site-specific ways via plugins to order KDCs
I'd certainly support changes to better detect certain configuration
errors, or make it harder for them to cause problems, but I don't
think the blanket response to anything amiss should be to double the
network traffic and focus all the extra traffic on one server.
What if the slave KDC doesn't have the latest update regarding the
authorization data a user should get for service X? If service X
reports an authorization error, should we go back and talk to the
master KDC just in case?
> Failures on slaves should not be considered to be permanent errors
> given that we know up front that there is a more canonical source
> of truth just a few lines away in the configuration file...
I'm sure there's any number of failures relating to an out-of-date
database that could result in problems for the client, some of which
may be noticed in the Kerberos protocol exchange, and some may not be
noticed until later.
All the problems relating to corruption, client or DNS configuration
errors, etc., can be addressed by talking to another KDC; there's no
need to use the master KDC specifically, and in some environments
there may be a good reason not to when another slave is available.
It's only when changes on the master haven't propagated that going
directly to the master is the right answer.
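To make the distinction concrete, here is a rough sketch of the retry
policy I'm arguing for. The function name and the failure labels
("maybe_stale" and so on) are invented for this example; this is not
how the current client library classifies errors.

```python
def kdcs_to_try(kdcs, master, failed_kdc, failure):
    """Return the KDCs worth trying after failed_kdc reported `failure`.

    `failure` is a hypothetical label. "maybe_stale" means the error could
    be explained by a change on the master that has not propagated yet.
    Anything else (no response, server error, wrong realm, corruption) is
    treated as a problem local to that one KDC.
    """
    if failure == "maybe_stale":
        # Only the master is guaranteed to have the newest data.
        return [master] if master != failed_kdc else []
    # Any other KDC will do. Keep the configured order and do not single
    # out the master, so the retries are spread across servers.
    return [k for k in kdcs if k != failed_kdc]
```

With kdcs = ["kdc1", "kdc2", "master"], a timeout from kdc1 yields
["kdc2", "master"], while a possibly-stale answer from kdc1 yields
just ["master"].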
In the long term, I'd rather see the synchronization problem addressed
well, so that the clients don't have to do any of this sort of thing.
Incremental propagation support will help, though Sun's implementation
that I'm looking at integrating uses a periodic polling model, not an
instantaneous push model. However, it should cut down the delay needed
for propagation, so you can say, e.g., "after a new service is
registered, you may have to wait 30 seconds for the information to
propagate before you can authenticate to it". (Actually, in the long
term, I'd like to see a multi-master type arrangement where we don't
even have a distinguished "master KDC". But failing that, immediate
propagation of all changes seems like a very desirable substitute.)
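For anyone unfamiliar with the polling model, the general idea can be
sketched as below. This is a toy illustration with invented class and
method names, not Sun's implementation or its wire protocol: the master
numbers each change, and each slave periodically asks for everything
after the last serial number it applied.

```python
class MasterLog:
    """Toy update log kept on the master (hypothetical structure)."""

    def __init__(self):
        self.entries = []  # list of (serial, principal, record)
        self.serial = 0

    def record_change(self, principal, record):
        # Every change gets the next serial number.
        self.serial += 1
        self.entries.append((self.serial, principal, record))

    def changes_since(self, serial):
        # Everything the caller has not seen yet, in order.
        return [e for e in self.entries if e[0] > serial]


class Slave:
    """Toy slave that applies changes it learns about by polling."""

    def __init__(self):
        self.db = {}
        self.last_serial = 0

    def poll(self, master):
        # Called once per polling interval; returns the number of
        # changes applied in this round.
        changes = master.changes_since(self.last_serial)
        for serial, principal, record in changes:
            self.db[principal] = record
            self.last_serial = serial
        return len(changes)
```

In a scheme like this the worst-case delay before a slave sees a change
is about one polling interval, which is where a figure like "30
seconds" would come from.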
In the short term ... yeah, maybe some "reasonable" failure cases
don't have sufficient information, and we should consider retrying for
now. But if we do this, I really hope we consider it a temporary
workaround we intend to dispense with, or at least disable by default,
in a few releases. And I'm still not convinced it should be the rule
rather than the exception.
> Incremental Propagation
> I've been considering just using rsync instead. Obviously, one
> would need to lock the DB, copy it to a tmp file and then rsync it
> into its final location on the slaves. It seems a lot simpler than
> maintaining our own code to do it and is probably optimal `enough'.
I've long suspected that for a large database with infrequent changes,
it would be much more efficient than kprop. Had rsync been around
when kprop was written, it probably would've been the way to go. If
you try it, do let us know if you encounter any interesting issues.
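For what it's worth, the lock / copy / rsync sequence described in the
quoted text might be sketched like this. Treat it as an illustration
under assumptions: the paths and function names are invented, and
flock() on a separate lock file merely stands in for whatever locking
the KDC database actually requires.

```python
import fcntl
import os
import shutil


def snapshot_db(db_path, lock_path, snapshot_path):
    """Copy db_path to snapshot_path while holding an exclusive lock.

    The copy goes to a temporary name first and is renamed into place,
    so a reader never sees a half-written snapshot.
    """
    tmp_path = snapshot_path + ".tmp"
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            shutil.copyfile(db_path, tmp_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    os.replace(tmp_path, snapshot_path)
    return snapshot_path


def rsync_command(snapshot_path, slave_host, remote_path):
    """Build the rsync invocation that sends the snapshot to one slave."""
    return ["rsync", "--archive", "--compress",
            snapshot_path, f"{slave_host}:{remote_path}"]
```

The command list can be handed to subprocess.run(..., check=True) once
per slave. The lock is held only for the local copy, not for the
network transfer.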