rcache fsync() avoidance

Tue Sep 2 12:30:24 EDT 2014

On Tue, Sep 02, 2014 at 11:56:20AM -0400, Greg Hudson wrote:
> I think at this point we are prepared to just get rid of the fsync()
> calls in the MIT krb5 rcache implementation, and call that an
> implementation limitation.  My reasoning is:
> 
> * For most situations where replay caches help, they provide limited
> protection against active attacks anyway.  (Basically: if the protocol
> needs replay protection because it uses Kerberos for authentication
> only, an active attacker could modify the data stream or suppress the
> legitimate authentication to bypass the replay cache.  Replay caches
> only provide complete protection when the data stream is protected by
> the Kerberos authentication context, but without an acceptor subkey,
> such that an attacker could replay a complete session to cause an action
> to be executed twice.)

I would just... declare such protocols dangerous and not supported.
Full stop.

No more rsh/rlogin/telnet with authentication only.  Preferably no more
rsh/rlogin/telnet full stop.

Application protocols that could benefit from rcaches:

 - UDP loggers (non-mutual auth AP-REQ + KRB-SAFE / MIC)
 - UDP / SCTP apps generally

We could even deprecate non-mutual auth for Kerberos and use an rcache
only for PROT_READY tokens, or even document that PROT_READY token
replays are not detected until the first non-PROT_READY per-msg token is
processed by the server.  (PROT_READY token semantics are close enough
to that anyways.)

Then we can get rid of the rcache completely.

It's probably a bit too soon to go that far.  But we could discuss that
on the KITTEN WG list and see what happens.

> * The design you outline degrades into bad performance if either (1) the
> server has negative clock drift beyond the boot time estimate, or (2) a
> non-trivial fraction of clients have positive clock drift beyond the
> boot time estimate.  It can also cause spurious authentication failures
> shortly after boot, for clients with negative clock drift.

If you're not using NTP or the like then it's fair to expect problems!

In any case, we really need a multi-round-trip extension anyway, which
should be the longer-term answer to this concern.

> * The probability of bad performance behavior increases as the boot time
> estimate approaches zero.  At some point in the future we might start to
> see VMs with sub-second reboot times, at which point even a 1s positive
> client clock drift would force an fsync() and even a 1s negative client
> clock drift could cause a spurious authentication failure shortly after
> a reboot.

This is quite true.  If the estimate is 3s but the real time to boot is
.5s you have a window of vulnerability, but that's hardly worse than
just never doing fsync()!  :)
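
To make that concrete, here is a minimal Python sketch of the scheme as
it can be reconstructed from this thread.  It is not MIT krb5 code: the
function names, the exact comparisons, and the assumption that a freshly
booted server rejects every authenticator stamped at or before its own
boot time are all illustrative guesses.  Times are seconds on the
server's clock.

```python
# Hypothetical sketch of fsync() avoidance; none of these names exist
# in MIT krb5.  All times are seconds on the server's clock.

def needs_fsync(auth_time, now, boot_estimate):
    """Return True if the rcache entry must hit stable storage.

    Assumption: after a crash the server cannot be back up before
    now + boot_estimate, and once up it refuses older authenticators,
    so only entries stamped later than that must survive the crash.
    """
    return auth_time > now + boot_estimate


def rejected_after_boot(auth_time, boot_time):
    """Return True if a freshly booted server refuses the authenticator.

    Assumption: anything stamped at or before boot may have had its
    rcache entry lost, so it is rejected rather than risk a replay.
    """
    return auth_time <= boot_time


def vulnerable_window(boot_estimate, real_boot_time):
    """Seconds of authenticator timestamps that were never fsync()ed
    yet are accepted after a reboot that beat the estimate."""
    return max(0.0, boot_estimate - real_boot_time)
```

With a 3s estimate and a .5s real boot, vulnerable_window(3.0, 0.5)
is 2.5 seconds.  With a sub-second estimate, a client whose clock is
1s fast makes needs_fsync() true, and a client whose clock is 1s slow
is caught by rejected_after_boot() right after a reboot.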

Yes, a window of vulnerability a few seconds long would be enormous to
the right attacker, but really, these attacks never happen.  That's a
good reason to stop fsync()ing altogether.  Still, fsync()s might be
more relevant to other protocols, so at least documenting fsync()
avoidance (done; this thread can be it) might help someone else.

Nico
--