rcache fsync() avoidance

Nico Williams nico at cryptonector.com
Mon Sep 1 17:59:52 EDT 2014

One of the things that makes the traditional MIT rcache design painfully
slow is the use of fsync() in each operation that doesn't detect a
replay (the common case).

We all know that RFC 4120 recommends a 10-minute skew window: 5 minutes
into the past, 5 into the future.  But the skew window can be different,
and implementations tend to make it configurable.

fsync() avoidance is simply about allowing part of the skew window to be
dynamically determined for a few minutes after boot time.

The fsync() avoidance rules are:

 - fsync() when writing rcache entries for Authenticators with
   timestamps "far enough" into the future, but not otherwise;

 - at boot time, pick a time T_0 such that any Authenticators with a
   timestamp after T_0 would have triggered an fsync() if played before
   the boot event;

 - reject any Authenticators whose timestamps are before T_0.
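The three rules can be sketched in C as follows. This is a minimal sketch under stated assumptions, not MIT krb5 code. The function names are hypothetical. "Far enough into the future" is modeled as a hypothetical FSYNC_MARGIN of 3 seconds, chosen so that T_0 = T_boot + FSYNC_MARGIN still falls at or before the T_ready guesstimate given later in this post. The comments spell out why rule 1 makes rule 2 safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Assumption: "far enough into the future" means at least this many
 * seconds ahead of the local clock when the entry is written. */
#define FSYNC_MARGIN 3

/* Rule 1: on the write path, fsync() only if the Authenticator's
 * timestamp is at least FSYNC_MARGIN seconds ahead of `now`. */
static bool rcache_needs_fsync(time_t authn_ts, time_t now)
{
    return authn_ts - now >= FSYNC_MARGIN;
}

/* Rule 2: at boot, pick T_0 = T_boot + FSYNC_MARGIN.  An entry written
 * before the reboot (at some t <= T_crash <= T_boot) with a timestamp
 * >= T_0 was at least FSYNC_MARGIN seconds in the future when written,
 * so rule 1 made it durable.  T_0 must not be later than T_ready. */
static time_t rcache_pick_t0(time_t t_boot, time_t t_ready)
{
    time_t t0 = t_boot + FSYNC_MARGIN;

    assert(t0 <= t_ready);
    return t0;
}

/* Rule 3: reject any Authenticator timestamped before T_0; the normal
 * skew-window and replay checks still apply to everything else. */
static bool rcache_reject_before_t0(time_t authn_ts, time_t t0)
{
    return authn_ts < t0;
}
```

The margin is a trade-off. A larger FSYNC_MARGIN means fewer fsync() calls, but it pushes T_0 later, so more clients are rejected right after a reboot.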

T_0 must be between T_crash and T_ready, where T_crash is the time of
crash or shutdown, and T_ready is the time at which the system is ready
to service clients.

T_crash is often impossible to determine, so it's best to estimate it as
T_boot.  T_ready can be estimated as T_boot + 0.5 * (average boot time).
For many systems a decent guesstimate can be something like

  T_ready = T_boot + 3s  /* guesstimate */
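A small C sketch of the T_ready estimate and of the constraint on T_0 follows. The function names and the avg_boot_secs input are hypothetical, and how a platform measures its average boot time is left open. The sketch encodes T_ready = T_boot + 0.5 * (average boot time) and falls back to the 3s guesstimate when no average is known. It then checks that a candidate T_0 lies in [T_boot, T_ready], with T_boot standing in for T_crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* The post's fallback guesstimate: ready about 3 seconds after T_boot. */
#define T_READY_GUESS_SECS 3

/* Estimate T_ready as T_boot + 0.5 * average boot time.
 * avg_boot_secs is a hypothetical input; 0 means "unknown", in which
 * case the 3s guesstimate is used instead. */
static time_t estimate_t_ready(time_t t_boot, long avg_boot_secs)
{
    assert(avg_boot_secs >= 0);
    if (avg_boot_secs == 0)
        return t_boot + T_READY_GUESS_SECS;
    return t_boot + avg_boot_secs / 2;
}

/* T_0 must lie in [T_crash, T_ready]; T_crash is estimated as T_boot. */
static bool t0_is_valid(time_t t0, time_t t_boot, time_t t_ready)
{
    return t_boot <= t0 && t0 <= t_ready;
}
```

Substituting T_boot for T_crash errs on the safe side. The crash happened no later than the boot, so any T_0 >= T_boot is also >= T_crash.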

The key concept here is that the skew window need not be static, much
less fixed at +/-5m.  Here it starts close to [now, now + 5m] at boot
time and grows to [now - 5m, now + 5m] at a rate of 1s/s, so that by
T_boot + 5m the system is back to normal.
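To make the moving window concrete, here is a hedged C sketch that computes the effective acceptance window from the T_0 picked at boot. SKEW_SECS = 300 assumes the RFC 4120 recommendation. struct skew_window, effective_window() and in_window() are hypothetical names, not an existing rcache API. The lower edge is max(T_0, now - 5m), and the upper edge is always now + 5m.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define SKEW_SECS 300 /* assumed: RFC 4120's recommended 5 minutes */

struct skew_window {
    time_t lo; /* earliest acceptable Authenticator timestamp */
    time_t hi; /* latest acceptable Authenticator timestamp */
};

static struct skew_window effective_window(time_t now, time_t t0)
{
    struct skew_window w;

    /* Requests are only served from T_ready on, and T_0 <= T_ready. */
    assert(now >= t0);
    /* The lower edge stays pinned at T_0 until now - skew catches up,
     * so the window widens at 1s/s until now = T_0 + skew. */
    w.lo = (now - SKEW_SECS > t0) ? now - SKEW_SECS : t0;
    w.hi = now + SKEW_SECS;
    return w;
}

static bool in_window(time_t authn_ts, struct skew_window w)
{
    return authn_ts >= w.lo && authn_ts <= w.hi;
}
```

With T_0 close to T_boot, the window is roughly [now, now + 5m] at boot and reaches [now - 5m, now + 5m] about 5 minutes later.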

This is inspired by thinking about placing the rcache in tmpfs (where
fsync() is a no-op): the obvious thing to do at boot time would be to
reject any Authenticator timestamps less than T_boot + 5m, which is like
saying that at T_boot the skew window is [now + 5m, now + 5m] and grows
second by second to the normal [now - 5m, now + 5m].  Obviously that's
operationally unacceptable, which leads to placing the rcache on
stable storage.

Some clients that would otherwise have been accepted will be rejected
because of the reboot, but many, many fewer than in the rcache-on-tmpfs
case.
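The difference between the two boot-time policies can be sketched by comparing their lower bounds. SKEW_SECS again assumes a 5-minute skew, and the function names are hypothetical. On tmpfs the lower bound is pinned at T_boot + 5m. With fsync() avoidance it is pinned at T_0.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define SKEW_SECS 300 /* assumed: RFC 4120's recommended 5 minutes */

/* Earliest acceptable timestamp if the rcache lived on tmpfs: with no
 * durable history, everything before T_boot + skew must be rejected. */
static time_t lower_bound_tmpfs(time_t now, time_t t_boot)
{
    time_t lo = t_boot + SKEW_SECS;

    assert(now >= t_boot);
    return (now - SKEW_SECS > lo) ? now - SKEW_SECS : lo;
}

/* Earliest acceptable timestamp with fsync() avoidance:
 * max(T_0, now - skew). */
static time_t lower_bound_fsync_avoidance(time_t now, time_t t0)
{
    return (now - SKEW_SECS > t0) ? now - SKEW_SECS : t0;
}

/* Is authn_ts below the given lower bound, and so rejected? */
static bool rejected(time_t authn_ts, time_t lower_bound)
{
    return authn_ts < lower_bound;
}
```

Take T_0 = T_boot + 3s and a client with an accurate clock that connects one minute after boot. fsync() avoidance accepts it, while the tmpfs policy rejects it. Under fsync() avoidance only clients whose clocks lag behind T_0 are turned away. In this sketch the tmpfs lower bound does not rejoin now - 5m until T_boot + 10m.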


More information about the krbdev mailing list