Ticket 5338: Race conditions in key rotation

Roland Dowdeswell elric at imrryr.org
Tue Jun 24 21:56:36 EDT 2008


On 1214346799 seconds since the Beginning of the UNIX epoch
Nicolas Williams wrote:
>On Tue, Jun 24, 2008 at 01:56:20PM -0400, Roland Dowdeswell wrote:
>> An example of a case which incremental propagation does not
>> mitigate is changing your TGS key if you round robin between KDCs
>> in a random order.  If you get kvno 7 from the first slave and then
>> present it to another slave which has only kvno 6 then you will
>> get a failure.  A lot of environments will use the TGT immediately
>> after it is obtained in order to get AFS tokens.  So, your window
>> is a few milliseconds.
>
>IMO the correct way to handle this is to first add a new key _disabled_,
>so that it is not yet used to encrypt new TGTs but it's still available
>for decrypting TGTs encrypted in that key, wait for replication, then
>mark the key enabled and replicate.
>
>This way when the first KDC (the master) starts using the new TGS key
>all the other KDCs will have it already and will be able to decrypt the
>new TGTs.
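
To make that sequencing concrete, here is a minimal Python sketch of
the two-phase rotation.  Everything here is a toy model: the KDC class,
replicate() and the key strings are illustrative stand-ins, not real
krb5 or kprop interfaces.

    class KDC:
        def __init__(self):
            self.keys = {6: "old-key"}   # kvno -> key material
            self.enabled_kvno = 6        # kvno used to encrypt new TGTs

    master = KDC()
    slaves = [KDC(), KDC()]

    def replicate(master, slaves):
        # Stand-in for propagating the key database to the slaves.
        for s in slaves:
            s.keys = dict(master.keys)
            s.enabled_kvno = master.enabled_kvno

    # Phase 1: add kvno 7 _disabled_.  It can decrypt, but kvno 6 still
    # encrypts new TGTs.  Push this state everywhere before continuing.
    master.keys[7] = "new-key"
    replicate(master, slaves)

    # Phase 2: only once every KDC holds kvno 7, start issuing with it.
    master.enabled_kvno = 7
    replicate(master, slaves)

    # No client can now hold a kvno 7 TGT that some slave cannot
    # decrypt, because the key arrived everywhere before being enabled.
    assert all(7 in s.keys for s in slaves)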

Sure, that was my first thought as well.  It is the obvious answer.
When I sat down and thought about it for quite a while, though,
that didn't seem as good as failing back to the master.  Perhaps
one could get a better system with a hybrid model...

And a rather important point is that I am specifically talking about
client behaviour here.  Not KDC behaviour.  Let's not make the mistake
of assuming that either:

	1.  the client library versions are in sync with
	    the KDC version, or

	2.  that this decision should be made by the KDC
	    infrastructure.

So, as a client, here is your flow:

	1.  you got a TGT with kvno 7 from a slave,

	2.  there was mutual auth involved---so you know that you
	    got it from a _real_ slave,

	3.  you present it to another slave and it does not work,

	4.  now, you _know_ that there exists a slave for which this
	    request would actually work,

	5.  and yet you decide to fail...

Given step 4, why exactly does step 5 make sense?  By this, I mean:  ``please
provide me with a justification as to why I should suffer a production
outage which might cost a substantial amount of money because a
client decided to fail when it can be demonstrated that it had or
should have had enough information to avert the problem''.
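
In code, the fallback I am arguing for looks roughly like the toy
model below.  Nothing in it is a real krb5 API: the KDCs are modelled
as bare sets of the kvnos they can decrypt, and use_tgt() stands in
for an actual TGS exchange.

    import random

    class KvnoMismatch(Exception):
        """This KDC does not (yet) hold the key that encrypted our TGT."""

    def use_tgt(tgt_kvno, kdc_kvnos):
        # Stands in for a TGS-REQ: fails iff the KDC lacks the TGT's key.
        if tgt_kvno not in kdc_kvnos:
            raise KvnoMismatch(tgt_kvno)
        return "service ticket"

    def use_tgt_with_fallback(tgt_kvno, slaves, master):
        # Try the slaves in random order, then fail back to the master,
        # which by definition saw the new key first.
        for kdc in random.sample(slaves, len(slaves)) + [master]:
            try:
                return use_tgt(tgt_kvno, kdc)
            except KvnoMismatch:
                continue   # step 4: some KDC can serve this; keep looking
        raise KvnoMismatch(tgt_kvno)

    # Mid-propagation: one slave still has only kvno 6, yet nothing fails.
    print(use_tgt_with_fallback(7, slaves=[{6}, {6, 7}], master={6, 7}))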

I do not think that this logic changes substantially if you attempt
to keep the KDCs in sync.  You might still have problems, and a
relatively inexpensive mechanism, either failing back to the master
or trying another slave, would improve the odds of success.  For
each step in the process, care should be taken to ensure that simple
measures to improve the odds are actually taken.
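
To put rough numbers on ``improve the odds'': if each slave were
independently stale with probability p during the propagation window
(an assumption, and the p below is an invented number), then each
extra KDC tried multiplies the failure probability by p, and failing
back to the master drives it to essentially zero, since the master
holds the new key by construction.

    # Illustrative arithmetic only; p = 0.5 is a made-up number.
    p = 0.5                      # chance a given slave lags the rotation
    for n in (1, 2, 3):
        print(f"try {n} slave(s): failure probability {p**n:.3f}")
    # Adding the master as the final fallback takes this to ~0.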

--
    Roland Dowdeswell                      http://www.Imrryr.ORG/~elric/


