credentials caching mechanism, ssh gssapi-with-mic

Tue Jul 1 12:34:42 EDT 2014

We use an internally developed job-dispatching system, which is
implicitly built on Kerberos.  Jobs are basically dispatched via “ssh
servername command”.  Furthermore, the jobs need to access NFSv4
shares mounted with the “sec=krb5p” option.  To facilitate this, the
ssh client and daemon need to be configured with “GSSAPIAuthentication
yes”, and the client additionally needs to be configured with
“GSSAPIDelegateCredentials yes”.

So, let’s say I log in to server “master”.  I run “klist” and see I have my TGT:

master$ klist
Ticket cache: FILE:/tmp/krb5cc_504_DzBEBk
Default principal: matt@REALM

Valid starting     Expires            Service principal
07/01/14 11:03:18  09/09/14 11:03:18  krbtgt/REALM@REALM
        renew until 08/10/14 11:03:18
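To keep track of which tickets a cache actually holds, a listing like
the one above can be parsed mechanically. This is a minimal Python
sketch, assuming the exact column layout shown here (two "MM/DD/YY
HH:MM:SS" timestamps followed by the service principal, plus an
indented "renew until" line); the function name parse_klist is
hypothetical and not part of any Kerberos tooling.

```python
from datetime import datetime

# Timestamp layout assumed from the klist listing (MM/DD/YY HH:MM:SS).
TIME_FMT = "%m/%d/%y %H:%M:%S"


def parse_klist(text):
    """Return (start, expires, principal) tuples for each ticket line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Ticket lines have two timestamps (four fields) plus a principal;
        # "renew until" lines and cache headers have fewer fields.
        if len(fields) < 5:
            continue
        try:
            start = datetime.strptime(" ".join(fields[0:2]), TIME_FMT)
            expires = datetime.strptime(" ".join(fields[2:4]), TIME_FMT)
        except ValueError:
            continue  # e.g. the "Valid starting  Expires ..." header
        # Everything after the two timestamps is the service principal.
        principal = " ".join(fields[4:])
        entries.append((start, expires, principal))
    return entries


SAMPLE = """\
Ticket cache: FILE:/tmp/krb5cc_504_DzBEBk
Default principal: matt@REALM

Valid starting     Expires            Service principal
07/01/14 11:03:18  09/09/14 11:03:18  krbtgt/REALM@REALM
        renew until 08/10/14 11:03:18
07/01/14 11:04:35  09/09/14 11:03:18  host/slave1.domain@REALM
        renew until 08/10/14 11:03:18
"""

for start, expires, principal in parse_klist(SAMPLE):
    print(principal, "expires", expires)
```

The parser deliberately skips anything that does not start with two
parseable timestamps, so header lines need no special handling.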

Now I ssh into one of the “slave” servers, then log back out, and
re-run “klist” from master.  I now additionally have a host ticket for
the “slave” server:

master$ klist
Ticket cache: FILE:/tmp/krb5cc_504_DzBEBk
Default principal: matt@REALM

Valid starting     Expires            Service principal
07/01/14 11:03:18  09/09/14 11:03:18  krbtgt/REALM@REALM
        renew until 08/10/14 11:03:18
07/01/14 11:04:35  09/09/14 11:03:18  host/slave1.domain@REALM
        renew until 08/10/14 11:03:18

Nothing unusual or surprising so far.  Now, let’s say that particular
slave server is rebuilt (OS wiped, re-installed, re-configured).  Note
that the rebuilding process involves re-generating the host keytab
file (“kadmin -q ‘ktadd host/slave1.domain’ ; kadmin -q ‘ktadd
nfs/slave1.domain’”).  As far as I can tell, re-creating the keytab
file causes the key version number (“KVNO”) to be incremented.
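The suspected failure mode can be modeled in a few lines. A service
ticket is encrypted in the service's long-term key and records that
key's KVNO; the acceptor (sshd via GSSAPI here) must find that exact
principal and KVNO in its keytab. The Python below is a toy model
only, not krb5 code: Ticket, accept, the dict-based keytab and the
KVNO numbers are all hypothetical, chosen just for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical stand-in for a cached service ticket."""
    principal: str
    kvno: int  # key version the ticket was encrypted under


def accept(ticket, keytab):
    """Toy acceptor: succeed only if the keytab holds the ticket's KVNO.

    keytab maps principal -> set of KVNOs (a keytab may hold several).
    """
    if ticket.principal not in keytab:
        return "principal not found in keytab"
    if ticket.kvno in keytab[ticket.principal]:
        return "ok"
    return "Key version number for principal in key table is incorrect"


# Illustrative values only: the cached ticket predates the rebuild.
cached = Ticket("host/slave1.domain@REALM", kvno=3)
old_keytab = {"host/slave1.domain@REALM": {3}}
new_keytab = {"host/slave1.domain@REALM": {4}}  # key re-randomized by ktadd

print(accept(cached, old_keytab))
print(accept(cached, new_keytab))
```

Because a freshly rebuilt host starts with a new keytab containing only
the new key, a ticket cached before the rebuild can no longer be
decrypted, even though it has not expired.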

Now, when I try to log in to “slave1” from “master” via ssh using
gssapi-with-mic, it fails.  Verbose sshd logging on “slave1” tells the
story:

Jul  1 09:49:21 slave1 sshd[31236]: debug1: Unspecified GSS failure.  Minor code may provide more information\nKey version number for principal in key table is incorrect\n
Jul  1 09:49:21 slave1 sshd[31236]: debug1: Got no client credentials

I believe the credentials cache on “master” still holds the old host
ticket for “slave1”, issued under the previous KVNO and now stale.

An easy workaround is of course to do a “kdestroy ; kinit” to clear
out that stale host entry.  However, the above is oversimplified;
users will have dozens of host entries for all their slave machines,
and we don’t want to clear out all of those cached credentials (only
the offending ones).  As far as I can tell, there is no way to
*selectively* remove cached tickets.  Or am I missing something?
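Whatever mechanism ends up doing the removal, the selection step is
simple. As a sketch only, assuming the cached service principals are
already available as strings and that the set of rebuilt hostnames is
known, the hypothetical helper split_stale partitions them into
entries to keep and entries to discard. It does not touch the
credentials cache itself, and "slave2.domain" below is a made-up
example host.

```python
def split_stale(principals, rebuilt_hosts):
    """Partition service principals into (keep, stale) lists.

    A principal is considered stale if its instance (the part after
    "service/") names a host in rebuilt_hosts; the TGT is always kept.
    """
    keep, stale = [], []
    for principal in principals:
        name = principal.split("@", 1)[0]       # drop the realm
        service, _, host = name.partition("/")  # "host", "slave1.domain"
        if service != "krbtgt" and host in rebuilt_hosts:
            stale.append(principal)
        else:
            keep.append(principal)
    return keep, stale


cache = [
    "krbtgt/REALM@REALM",
    "host/slave1.domain@REALM",
    "host/slave2.domain@REALM",  # hypothetical second slave
]
keep, stale = split_stale(cache, {"slave1.domain"})
print("keep:", keep)
print("stale:", stale)
```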

Also: I’m a little unclear on exactly how credentials caching works.
I get the impression that there is some kind of in-memory caching (at
the kernel level?) that doesn’t show up in klist.  For example, say
someone logs into a server (using ssh gssapi-with-mic), launches a
program that needs to access NFSv4 sec=krb5p shares, and then closes
that session.  The job keeps running for a while, but after some time
(it seems to be on the order of 8 to 10 hours), access to the NFSv4 share
fails.  But in this case, there is no /tmp/krb5cc* file for that
particular user… so clearly there is some kind of credentials caching
going on, but where is it?  And how long does it last?

Kerberos clients are running CentOS (RHEL) 5.7, Kerberos server is
running CentOS 6.5.  Kerberos version is “1.6.1-62” (part of CentOS
distribution, same Kerberos version on both OS versions).

Ultimately, I’m trying to get a better handle on exactly how
credentials caching works.  I anticipate more scenarios where the
subtleties of this mechanism will need to be known.

Thanks!
Matt