Rate limiting Kerberos Requests
Jack Neely
jjneely at ncsu.edu
Thu Sep 27 11:17:45 EDT 2012
On Wed, Sep 26, 2012 at 04:04:25PM -0500, Nico Williams wrote:
> On Wed, Sep 26, 2012 at 1:25 PM, Jack Neely <jjneely at ncsu.edu> wrote:
> > After spending some quality time with my logs, I do about 1.3 million
> > Kerberos requests a day, or 960/min on average. The incident that took
> > out the Kerberos servers with an additional 600 hits/min (from the krb
> > logs) doesn't even make a spike on my graphs. My late morning usage is
> > higher.
>
> I'm not sure I understood correctly what the incident's symptoms were.
> If the symptom was non-responsiveness for a second then it's very
> likely (very, very, very likely) that the patches I mentioned earlier
> will solve your problem, and the events to correlate the incident to
> would be kadmind / kadmin.local / kdb5_util load / kpropd iprop events
> -- the longer these events the more likely that the kdc ends up
> sleeping for a second at a time.
This definitely seems to explain the lag in responses I've noticed
during a kprop operation. Usually I get a response in under a second,
but if I hit my KDC while it's receiving a kprop, it can take 4 or 5
seconds.
The above incident was a single misbehaving client suddenly making about
600 requests/minute for around 30 minutes. During this window no one
else could get a KDC response before their client timed out.
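
For scale, here is the arithmetic behind the figures quoted in this thread,
as a quick Python sketch. The variable names are mine; the numbers are the
ones from the messages above.

```python
# Figures quoted in this thread (variable names are illustrative).
MINUTES_PER_DAY = 24 * 60

baseline_per_min = 960        # average request rate from the logs
burst_per_min = 600           # extra load from the misbehaving client
burst_minutes = 30            # approximate length of the incident

# Daily volume implied by the per-minute average, and the reverse check
# starting from the "about 1.3 million a day" figure.
daily_from_average = baseline_per_min * MINUTES_PER_DAY
per_min_from_daily = round(1_300_000 / MINUTES_PER_DAY)

# Size of the burst, in total requests and relative to the baseline.
burst_total = burst_per_min * burst_minutes
burst_share = burst_per_min / baseline_per_min

print(f"daily volume at {baseline_per_min}/min: {daily_from_average:,}")
print(f"per-minute rate at 1.3M/day: {per_min_from_daily}")
print(f"burst total: {burst_total:,} requests")
print(f"burst relative to baseline: {burst_share:.1%}")
```

So the burst adds well under the baseline rate again, which fits with it
not standing out on the graphs.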
I've also noticed that the 1.6.1 version in RHEL 5 is leaking memory. I
think I've found my smoking gun here. Large memory consumption is
directly related to slower performance in my testing.
Thanks a bunch for the pointer to the patch!
Jack Neely
> The bug, if I'm right that it is the bug affecting you, is that
> between 1.5 and 1.10, inclusive, all versions of MIT krb5 used
> non-blocking file locking with a three-retry loop with a 1-second
> sleep each go-around. This is disastrous, really, but it only bites
> when something holds an exclusive lock on the KDB, which would be the
> daemons/tools listed above, and since the amount of time spent holding
> an exclusive lock on the KDB is generally (always, if you don't use
> the kadmin.local lock command) short, you might well be getting lucky
> 99.99% of the time and thus not observing any outages of a second or
> more on your KDCs.
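
To illustrate the pattern described above, here is a minimal Python sketch
of a non-blocking lock attempt with a fixed number of retries and a sleep
after each failure. It is not the krb5 code: the function name, the use of
flock(), and the injectable sleep parameter are my own assumptions, and it
only runs on Unix-like systems where the fcntl module is available.

```python
import fcntl
import tempfile
import time


def lock_shared_with_retries(fd, retries=3, delay=1.0, sleep=time.sleep):
    """Try a non-blocking shared lock, sleeping after each failed attempt.

    Returns True once the lock is held, False if every attempt failed.
    Hypothetical helper for illustration only.
    """
    for _ in range(retries):
        try:
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            # Someone holds the exclusive lock, so the reader stalls here
            # for a full `delay` even if the lock is freed a moment later.
            sleep(delay)
    return False


# Demo: a "writer" holds an exclusive lock while a "reader" tries to lock
# the same file through a separate open file description.
with tempfile.NamedTemporaryFile() as writer:
    fcntl.flock(writer.fileno(), fcntl.LOCK_EX)
    with open(writer.name, "rb") as reader:
        stalls = []  # record the requested sleeps instead of really sleeping
        got_lock = lock_shared_with_retries(reader.fileno(), sleep=stalls.append)
        print("while writer holds lock:", got_lock, "stalled for", sum(stalls), "s")

        # Once the writer lets go, the first attempt succeeds with no sleep.
        fcntl.flock(writer.fileno(), fcntl.LOCK_UN)
        stalls_after = []
        got_lock_after = lock_shared_with_retries(
            reader.fileno(), sleep=stalls_after.append
        )
        print("after writer unlocks:", got_lock_after, "stalled for", sum(stalls_after), "s")
```

The point of the sketch is that the reader's delay comes in whole-second
steps: any contention at all costs at least one full sleep, however briefly
the exclusive lock was actually held.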
>
> If multiple KDCs are affected at roughly the same time then I'd
> suspect iprop. What is the rate of write transactions on your master?
> Do the rates of KDC (read) vs. kadm5srv (write) transactions imply
> the rate of outages you're experiencing?
>
> Nico
> --
--
Jack Neely <jjneely at ncsu.edu>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89