Ticket 5338: Race conditions in key rotation

Roland Dowdeswell elric at imrryr.org
Tue Jun 24 02:30:11 EDT 2008


On 1214257644 seconds since the Beginning of the UNIX epoch
Ken Raeburn wrote:
>
>On Jun 19, 2008, at 01:59, Roland Dowdeswell wrote:
>> What happens when a slave gets a little corrupted? How about some
>> level of DB corruption?  Truncated database?  Configuration errors?
>> What if a misconfiguration causes the master to fail to propagate
>> to the slave for some period of time?
>
>If the slave KDC notices corruption it should either report a server  
>error that causes the client to keep trying other KDCs, or it should  
>just stay silent.  A little like you might get with DNS.

Yes, in a perfect world.  I'm not convinced that's the world we're in.

>If a DNS server has old data, or quietly corrupted data, what do you  
>do about it?
>Do you implement all your applications or client libraries to check  
>other DNS servers for the same domain if you get back no-such-host  
>errors, just in case the data is out of date?  What if you get back an  
>address, but it's the wrong one?  Should Firefox try multiple DNS  
>servers, and try connecting to web servers at all the addresses  
>returned (versus, say, looking for the first reachable address from  
>the list provided by one DNS server)?

I do not think that the DNS example is a good mirror of how a
Kerberos infrastructure should work.  For one, the problem set
being addressed is quite different, and so are the general use
cases.

Given that in general, when you are requesting service tickets,
you have already not only successfully located the server but also
established a TCP connection to it, I think that you have a much
firmer idea that your request for a service ticket _should_ succeed.
If you are simply resolving an arbitrary DNS name based on user
input, you do not.

>There are probably various things we could do to try to make the MIT  
>implementation more resistant to corruption in slave database.  Maybe  
>have the master double-check the number of records in the dump file  
>against some heuristic before transmitting via kprop, or have an extra  
>record in the database (which would have to be kept in sync!) with a  
>count of the number of principals.  Maybe the slave should go offline  
>if the database hasn't been updated in some configurable amount of  
>time.  Maybe we should provide scripts (anyone want to contribute?)  
>that email the admins or sound other alarms if database propagation  
>fails for longer than a certain amount of time.  Maybe slave update  
>time should be made accessible via SNMP for query by nagios or other  
>monitoring tools.

There are various things that you could do.  I would not be terribly
excited about slaves programmatically deciding to go offline, however,
just because they haven't been updated in some amount of time.  In
the event of a network partition that could be disastrous.  I would
rather have them continue to serve stale data, detect the situation,
deal with it manually, and in the meantime have all of the clients
do something reasonable when that slave returns errors.

Again, my point is that I would trust the logic that we put into
the slaves less than a simple fail back to the master.  The failback
ensures that all we'll get is an insignificant extra network/KDC
load on the master while we have time to address the issue.

Or more to the point, how do I answer the question:

	What is a reasonable configuration for the time after which
	a slave will automatically go offline if it has not received
	an update?

I don't think that has a good answer.  It depends a lot on what
kind of disasters you expect.  It also has some unfortunate
dependencies on the disasters that you do not expect.  Those are
generally the ones that actually occur in practice.
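
To be clear, I am all for the alerting side of Ken's list.  Something
like the following rough sketch would do; it is only an illustration,
and the dump file path, staleness threshold, admin address and the
use of sendmail are assumptions about one particular deployment, not
anything that ships with MIT krb5:

    #!/usr/bin/env python
    # Hypothetical monitor: complain to the admins if the slave's
    # propagated database has not been refreshed recently.  The path,
    # threshold and address below are assumptions, not MIT defaults.
    import os, time, subprocess

    DUMP_FILE = "/var/krb5kdc/from_master"  # file kpropd refreshes
    MAX_AGE = 4 * 3600                      # alert after four hours
    ADMINS = "krb-admins@example.com"

    age = time.time() - os.stat(DUMP_FILE).st_mtime
    if age > MAX_AGE:
        msg = ("Subject: slave KDC database is stale\n\n"
               "%s last updated %d seconds ago (threshold %d s)\n"
               % (DUMP_FILE, int(age), MAX_AGE))
        p = subprocess.Popen(["/usr/sbin/sendmail", ADMINS],
                             stdin=subprocess.PIPE)
        p.communicate(msg.encode())

Run it from cron every few minutes; the point is that a human gets
woken up, not that the slave takes itself out of service.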

>If the slave database gets corrupted but the master is okay, ideally  
>that's an issue that should be fixed in the next propagation.  If the  
>corruption is in the master database, falling back to the master won't  
>help.  Most of the times I've seen database corruption bugs, they've  
>been in the random updates done on the master, not loading a database  
>on a slave.  Though one or two have resulted in incorrect dump files  
>for propagation that omit or duplicate entries, while the master can  
>still access all the entries by name, at least for a little while  
>longer.

Granted, the master may be more likely to have some corruption issues.

>>  What if kpropd crashes?
>
>Then any other slave should have accurate data; we don't need to jump  
>to the master KDC, when we could distribute the load.

Sure, but how exactly do we determine the difference between:

	1.  kpropd crashed, and

	2.  new data exists on the master?

Given that both have the same symptoms, I don't think that we can.
So the safe path---the path that causes the least overall (network
and KDC) load---is to fail back to the master.
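
As a rough sketch of the failback I mean (this is not the actual
KDC-location code in libkrb5; send_request(), the reply object and
the slave list are stand-ins):

    import random

    def get_service_ticket(slaves, master, request, send_request):
        # Ask a slave first, to spread the load.
        reply = send_request(random.choice(slaves), request)
        if reply.ok:
            return reply
        # The slave's "no" might mean stale data, a crashed kpropd,
        # or a genuinely bad request; we cannot tell which from here.
        # So before treating the error as permanent, ask the
        # canonical source.
        return send_request(master, request)

The extra traffic to the master only happens on the error path,
which is what keeps the additional load small (more on that below).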

>>  What
>> if we put a slave into krb5.conf before we bring it live?
>
>With the KDC process not running, the client will get no response (or  
>a port-unreachable error), and will move on to another server in  
>fairly short time.
>
>If the KDC process is started up with an empty database, the KDC  
>wouldn't be able to handle the client's AS requests, or decrypt the  
>TGS requests, and we might have a problem.  Though I'm not sure where  
>to draw the line between administrator screw-ups we should try to  
>compensate for, and administrator screw-ups (or malicious actions)  
>where we should just throw up our hands and say, "you need to fix this  
>first".
>
>>  What if
>> a KDC from the wrong realm gets mis-set in our configuration?
>
>Interesting... I'm not certain, but I can guess:  For an AS-REQ, it  
>probably would get an error saying the client principal was unknown.   
>For a TGS-REQ, the (wrong) KDC wouldn't have a database entry for the  
>indicated (TGS) principal, and would probably return a KRB-ERROR  
>saying so.
>
>But if you're going to presuppose configuration errors, why assume  
>that the listing for the master KDC will be correct?

That was just an example.  I'm not presupposing that the master
will be correct, but it would not be difficult to make the case
that the master is more likely to be correct at any given point
in time.

>Also, in this case, we could presumably succeed by trying *any* other  
>KDC, i.e., ignore the error and continue with our existing heuristics  
>(that can be tweaked in site-specific ways via plugins to order KDCs  
>by proximity).

Sure, but again, we already know that in general the master will
be correct.

>I'd certainly support changes to better detect certain configuration  
>errors, or make it harder for them to cause problems, but I don't  
>think the blanket response to anything amiss should be to double the  
>network traffic and focus all the extra traffic on one server.

I don't think that we're talking about an enormous amount of
additional load or traffic here.  So, let's set some expectations,
with data, on how much additional load we're actually talking
about.  It's the middle of the night, so I'm just going to look at
one of our KDCs rather than do a full analysis of the lot, and
I'll limit it to about five days.

The KDC processed 11941961 requests and returned 181659 errors.

(I'm ignoring PREAUTH_NEEDED and PREAUTH_FAILED, as the former aren't
errors and the latter already fail back to the master.)

Maybe this warrants additional study, as I didn't delve into the
actual causes of the errors or look at a long enough period of time.

But: 181659 / 11941961 * 100 = 1.5%

I think that I'm willing to accept a 1.5% increase on the load of
the master and network in exchange for having a more robust
environment.

Also, I was suggesting that we default to failing back if we have
not analysed the error and determined that it is permanent.  That
is, if we prove to ourselves that a particular error is permanent,
then we should treat it as such.  But the default for errors that
we have not analysed should not be to treat them as permanent.
Examples of permanent errors might be `ticket not forwardable' or
`ticket not renewable'.  Those do not depend on data in the slave's
database, so we can presume that if the slave successfully decrypted
the ticket and decided to return those errors then it is right.  A
lot of the other errors, however, are obviously things about which
the slave could simply be out of date: {server|client} not found,
expired passwords, decrypt integrity check failed, client locked
out, bad encryption type, ...
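
To make that default concrete, here is a rough sketch of the sort of
classification I have in mind.  It is not tied to the actual MIT
client internals; the error names follow the RFC 4120 KRB-ERROR
codes, and mapping `ticket not forwardable/renewable' onto
KDC_ERR_BADOPTION is my reading, not something established here:

    # The exact membership of this set is precisely the analysis I am
    # talking about; these two entries are only for illustration.
    PERMANENT_ERRORS = {
        "KDC_ERR_BADOPTION",   # e.g. ticket not forwardable/renewable
        "KDC_ERR_POLICY",
    }

    def should_fail_back(error_code, kdc_is_master):
        """Retry against the master unless we have analysed the error
        and shown that asking a more current database cannot help."""
        if kdc_is_master:
            return False       # nothing more authoritative to ask
        # Everything else ({server|client} not found, expired
        # password, integrity check failure, lockout, bad enctype,
        # ...) may simply mean the slave is out of date.
        return error_code not in PERMANENT_ERRORS

Errors that we have not bothered to analyse fall through to the
fail-back path by default, which is the whole point.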

>>   Etc.
>
>What if the slave KDC doesn't have the latest update regarding the  
>authorization data a user should get for service X.  If service X  
>reports an authorization error, should we go back and talk to the  
>master KDC just in case?

Different problem.  I'm talking about KDC errors at the moment.
There is always going to be a race here, because the request for
access to resources must necessarily occur at some point after the
ticket is obtained from the KDC and presented to the server.  Also,
Kerberos caches tickets, which exacerbates this issue.

>> Failures on slaves should not be considered to be permanent errors
>> given that we know up front that there is a more canonical source
>> of truth just a few lines away in the configuration file...
>
>I'm sure there's any number of failures relating to an out-of-date  
>database that could result in problems for the client, some of which  
>may be noticed in the Kerberos protocol exchange, and some may not be  
>noticed until later.
>
>All the problems relating to corruption, client or DNS configuration  
>errors, etc., can be addressed by talking to another KDC; there's no  
>need to use the master KDC specifically, and in some environments  
>there may be a good reason not to when another slave is available.   
>It's only when changes on the master haven't propagated that going  
>directly to the master is the right answer.

Sure.  I just think that, as the client, you generally can't
determine what the problem is.  The master is right there waiting
to give you the up-to-date notion of what the universe looks like,
though.  For a large class of errors you'll eventually have to get
there before you are convinced that you're done with your quest.
So, it might as well be next.

>In the long term, I'd rather see the synchronization problem addressed  
>well, so that the clients don't have to do any of this sort of thing.   
>Incremental propagation support will help, though Sun's implementation  
>that I'm looking at integrating uses a periodic polling model, not an  
>instantaneous push model.  However it should cut down the delay needed  
>for propagation, so you can say, e.g., "after a new service is  
>registered, you may have to wait 30 seconds for the information to  
>propagate before you can authenticate to it".  (Actually, in the long  
>term, I'd like to see a multi-master type arrangement where we don't  
>even have a distinguished "master KDC".  But failing that, immediate  
>propagation of all changes seems like a very desirable substitute.)

Incremental propagation does not solve race conditions.  It just
makes one runner a little faster; the underlying issue still
exists.

I'm not convinced that I would like to have a multi-master scheme;
it seems that it would add complexity for little additional value.

>In the short term ... yeah, maybe some "reasonable" failure cases  
>don't have sufficient information, and we should consider retrying for  
>now.  But if we do this, I really hope we consider it a temporary  
>workaround we intend to dispense with, or at least disable by default,  
>in a few releases.  And I'm still not convinced it should be the rule  
>rather than the exception.

--
    Roland Dowdeswell                      http://www.Imrryr.ORG/~elric/


