special ccache performance issue
Greg Hudson
ghudson at mit.edu
Mon May 13 11:07:38 EDT 2019
On 5/13/19 3:22 AM, Wang Jian wrote:
> When using ansible with kerberos for thousands of targets, there is a
> serious ccache performance issue.
Agreed.
> Using file ccache (DIR:)
> - from a cold ccache, running a simple script on servers is fast, at 500-700
> hosts/min with 2 or 4 concurrent ansible instances. But things change when
> the ccache has over 5000 host tickets. The speed drops to 10-30 hosts/min and
> sys CPU stays very high.
> - High file lock contention, which consumes nearly all CPU
One small improvement we know we could make is to stop locking file
ccaches for reads, since it's an append-only format. (We would have to
ignore truncated records at the end when reading, instead of erroring
out.) This would only help a little bit, since the real problem is
O(n^2) performance.
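
As an illustration only, a lock-free reader that tolerates a truncated
final record might look like the sketch below. The 4-byte length-prefixed
record layout and the names read_records and encode_record are
assumptions made for this example; this is not the actual krb5 file
ccache format or API.

```python
import io
import struct


def encode_record(payload):
    """Frame a payload as a 4-byte big-endian length followed by the bytes."""
    return struct.pack(">I", len(payload)) + payload


def read_records(stream):
    """Return every complete record, stopping quietly at a truncated tail.

    A concurrent writer may have appended only part of the last record,
    so a short read is treated as end-of-data rather than as an error.
    """
    records = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # clean EOF, or a partially written length prefix
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            break  # the writer was mid-append; ignore the partial record
        records.append(payload)
    return records


# Usage: two complete records followed by a partial third one.
data = encode_record(b"ticket-a") + encode_record(b"ticket-b")
partial = encode_record(b"ticket-c")[:6]
complete = read_records(io.BytesIO(data + partial))
```

Each read still scans the whole file, so a sketch like this removes lock
contention but leaves the per-lookup cost linear in the cache size.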
A more ambitious possibility is to write a config entry into the cache
which acts as a hash table for service tickets, while old
implementations will read past it. (On hash collision, simply overwrite
the old ticket.) For resource and complexity reasons I'm not sure that
could be implemented any time soon.
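
The overwrite-on-collision idea can be sketched in memory as follows.
This is only a model of the lookup behavior: the class name
TicketSlotTable, the slot count, and the use of SHA-256 to pick a slot
are assumptions for illustration, and nothing here reflects how such a
table would actually be encoded inside a ccache config entry.

```python
import hashlib


class TicketSlotTable:
    """Fixed-size table keyed by server principal name (hypothetical)."""

    def __init__(self, num_slots=1024):
        self.slots = [None] * num_slots

    def _index(self, server):
        # Use a stable hash so the slot is the same in every process.
        digest = hashlib.sha256(server.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.slots)

    def store(self, server, ticket):
        # On collision, simply overwrite whatever was in the slot.
        self.slots[self._index(server)] = (server, ticket)

    def lookup(self, server):
        entry = self.slots[self._index(server)]
        if entry is not None and entry[0] == server:
            return entry[1]
        return None  # never stored, or evicted by a colliding entry


# Usage: a one-slot table forces a collision between two services.
table = TicketSlotTable(num_slots=1)
table.store("host/a.example.com", b"ticket-a")
table.store("host/b.example.com", b"ticket-b")
```

Lookups and stores are constant-time, at the price that a colliding store
evicts an older ticket, which would then have to be requested again.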
> Using kernel keyring ccache
> - fast from start, but eventually, continuous failure, and high sys CPU
> - from klist -a, the output is empty now and then, which indicates that
> keyring has buckled under pressure
Users are only allowed to use a certain amount of keyring space, so
maybe you're running into this. (It has been argued that the keyring
ccache type should simply store an encryption key which is then used for
a file-based ccache, and that could be done with a new name. But then
you're back to needing a high-performance file-based ccache type.)
> Using Heimdal KCM
> - didn't try. Heimdal KCM uses a sequential algorithm and a single lock
SSSD also has a KCM server implementation, though I don't know much
about its performance characteristics. Regardless, the KCM client
(Heimdal and MIT krb5) iterates over the ccache for get_principal, so
each lookup is linear in the cache size and O(n) behavior overall is
impossible. Heimdal's KCM server has a
currently-unused get_principal operation which makes its own TGS request
when a credential isn't found (similar to the Microsoft LSA ccache on
Windows); I am undecided on whether that's desirable behavior.
> I know this is a special case, but perhaps it should be addressed.
It seems to be a rare case, but it has come up before.
On the user end, you can work around this by manually swapping out the
ccache for each request, either to a copy of the initial ccache
(sacrificing caching) or to a per-target-host ccache. At a minimum,
that's obviously a lot of work you shouldn't have to do, and it could be
impractical in some cases.
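
A minimal sketch of the per-target-host variant, assuming a FILE: ccache
that holds only the initial TGT and is not being written to while it is
copied. The function name per_host_ccache_env and the krb5cc_ file
naming are hypothetical; KRB5CCNAME is the standard environment variable
for selecting the ccache.

```python
import os
import shutil


def per_host_ccache_env(initial_ccache, host, cache_dir, base_env=None):
    """Return an environment whose KRB5CCNAME points at a per-host ccache.

    The per-host file is seeded from the initial ccache the first time a
    host is seen, so service tickets for one host never grow the cache
    used for another.
    """
    # Keep the file name safe even if the host string has odd characters.
    safe = "".join(c if c.isalnum() or c in "-." else "_" for c in host)
    path = os.path.join(cache_dir, "krb5cc_" + safe)
    if not os.path.exists(path):
        with open(initial_ccache, "rb") as src:
            # Create the copy readable and writable by the owner only.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
            with os.fdopen(fd, "wb") as dst:
                shutil.copyfileobj(src, dst)
    env = dict(os.environ if base_env is None else base_env)
    env["KRB5CCNAME"] = "FILE:" + path
    return env
```

The returned mapping could be passed as the env argument of
subprocess.run when launching the per-host command. Copying from the
initial ccache on every request instead of only once per host would give
the other variant, which sacrifices caching.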
We've talked about an environment variable which suppresses caching, but
it hasn't been implemented in either MIT krb5 or Heimdal.