GSS context refresh failure due to clock skew

Mon Oct 5 16:34:42 EDT 2015

> On Oct 5, 2015, at 4:02 PM, Greg Hudson <ghudson at MIT.EDU> wrote:
> 
> On 10/05/2015 03:35 PM, Adamson, Andy wrote:
>>> I think this case doesn't arise often because people don't often set
>>> maximum service ticket lifetimes to be shorter than maximum TGT
>>> lifetimes.  
>> 
>> Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
>> 
>> We see this issue all the time in NetApp QA as we run multiple-day heavy IO tests against a Kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.
> 
> If the issue is not caused by short-lifetime service principals,

I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.

I didn’t know that keeping service ticket lifetimes no shorter than TGT lifetimes was a requirement. Neither does NetApp QA, and I suspect neither do customers in general.

> then
> the test scenario you described isn't representative of the real
> scenario.  To reproduce the problem as it manifests in your IO tests,
> you will need to adjust the TGT lifetime down to ten minutes as well as
> the nfs/server lifetime.

Code was added to rpc.gssd, the NFS client agent that creates GSS contexts for NFS, to take into account the clock skew and get a new TGT before (now+clock skew). So if the service ticket lifetime is equal to or greater than the TGT lifetime, then all is well.
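
To make the arithmetic concrete, here is a minimal Python sketch of the failure mode. It is not rpc.gssd source: the function names, the timestamps, and the reading of "before (now+clock skew)" as "end time earlier than now + clock skew" are assumptions for illustration, and the server is assumed to compare the ticket end time against its own clock with no skew allowance. The 300-second value is the usual Kerberos default clock skew.

```python
# Hypothetical sketch, not rpc.gssd source. Times are seconds on a common scale.

CLOCK_SKEW = 300  # assumed allowance; 300 s is the usual Kerberos default


def expired(endtime, now):
    """Naive check: the ticket is dead once the local clock reaches endtime."""
    return now >= endtime


def expires_within_skew(endtime, now, skew=CLOCK_SKEW):
    """Skew-aware check: treat the ticket as expired if it ends before now + skew."""
    return endtime < now + skew


# The client clock reads 1000; the server clock runs 120 s ahead.
client_now = 1000
server_now = client_now + 120

tgt_end = 36000      # long-lived TGT
service_end = 1060   # short-lived service ticket, 60 s left by the client's clock

# The skew-aware check on the TGT sees no problem, so no new TGT is fetched.
tgt_needs_refresh = expires_within_skew(tgt_end, client_now)

# The client still considers the service ticket usable...
client_thinks_valid = not expired(service_end, client_now)
# ...but the server, 120 s ahead, already considers it expired.
server_thinks_expired = expired(service_end, server_now)

# The same skew-aware check applied to the service ticket would have caught it.
service_needs_refresh = expires_within_skew(service_end, client_now)
```

With these numbers the TGT check passes, the client believes the service ticket is valid, and the server has already rejected it. That is the gap that opens when the service ticket lifetime is shorter than the TGT lifetime.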

> 
>>> If the TGT itself has expired or is about to expire, some
>>> out-of-band agent needs to refresh the TGT somehow, and it doesn't
>>> matter all that much whether the failure comes from the client or the
>>> server.
>> 
>> I thought that having a keytab entry and a renewable TGT was enough.
> 
> I'm not sure why you would do both of these; if you're getting initial
> creds with a keytab, there is no need to muck around with ticket renewal.

I wouldn’t, but QA and customers do.

> 
> Anyway, gss_init_sec_context() never renews tickets, and only gets
> tickets from a keytab when a client keytab is configured (new in 1.11).
> When tickets are obtained using a client keytab, they are refreshed
> from the keytab when they are halfway to expiring,

refreshed by…?

> so this clock skew
> issue should not arise, so I don't think that feature is being used.
> 
> It is possible that the NFS client code has its own separate logic for
> obtaining new tickets using a keytab.  

When an NFS request requires a GSS context, the client kernel does an upcall to rpc.gssd in any of three cases: the context does not exist, the context is not valid, or the context is valid but the server replies to an RPC request that used it with an RPC error indicating that its side of the GSS context has a problem. rpc.gssd then decides whether a new service ticket is required in order to send an RPCSEC_GSS_INIT message to the server and create a new GSS context. The resultant GSS context is stored in the client kernel with a lifetime equal to that of the service ticket used to create it.
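
The upcall decision described above can be sketched roughly as follows. This is an illustrative Python model, not kernel or rpc.gssd code: `GssContext`, `needs_upcall`, and `new_context` are hypothetical names, and "not valid" is modeled simply as "past its expiry by the client's clock", which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GssContext:
    # Hypothetical model: the context lifetime is copied from the
    # end time of the service ticket used to create it.
    expiry: int


def new_context(service_ticket_end: int) -> GssContext:
    """Create a context whose lifetime equals the service ticket's."""
    return GssContext(expiry=service_ticket_end)


def needs_upcall(ctx: Optional[GssContext], now: int, server_rejected: bool) -> bool:
    """Return True when the kernel would have to ask for a new context."""
    if ctx is None:            # no context exists yet
        return True
    if now >= ctx.expiry:      # context no longer valid by the client's clock
        return True
    return server_rejected     # valid locally, but the server reported a problem


ctx = new_context(service_ticket_end=1060)
```

Note that the third case is the one that matters for clock skew: the client's own expiry check passes, and only the server's error reveals that its side of the context has already expired.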

If rpc.gssd calls the code that refreshes the tickets from the keytab ‘when they are halfway to expiring’, then that should mitigate the clock skew issue.
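
A small Python sketch of why a halfway refresh would help, under stated assumptions: the function names are hypothetical, the refresh is assumed to fire exactly at the midpoint of the ticket lifetime, and the refreshed credentials are assumed to result in a new service ticket and context as well. Whether rpc.gssd actually goes through that code path is the open question here.

```python
def refresh_time(starttime, endtime):
    """Midpoint of the ticket lifetime, where the refresh is assumed to fire."""
    return starttime + (endtime - starttime) // 2


def refresh_beats_server_expiry(starttime, endtime, server_ahead):
    """True if the midpoint refresh happens before the server sees expiry.

    server_ahead is how many seconds the server clock runs ahead of the
    client clock. The server sees the ticket expire when the client clock
    reads endtime - server_ahead.
    """
    return refresh_time(starttime, endtime) < endtime - server_ahead


# Ten-minute ticket, as in the test setup: starts at 0, ends at 600.
midpoint = refresh_time(0, 600)

# Server 120 s ahead: it sees expiry at client time 480, after the refresh.
ok_small_skew = refresh_beats_server_expiry(0, 600, server_ahead=120)

# Server 400 s ahead: it sees expiry at client time 200, before the refresh.
ok_large_skew = refresh_beats_server_expiry(0, 600, server_ahead=400)
```

In this model the refresh wins as long as the server is ahead by less than half the ticket lifetime, which for realistic lifetimes is far more than a typical clock skew.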

> If so, we need to understand how
> it works.  It's possible (though unlikely) that changing the behavior of
> gss_accept_sec_context() wouldn't be sufficient by itself.