Database locking during kprops, MIT 1.8

Dominic Hargreaves dominic.hargreaves at oucs.ox.ac.uk
Thu Oct 7 07:54:18 EDT 2010


We've just upgraded our master KDC from 1.4 to 1.8, and are observing
around 2% of all password attempts fail with "Cannot lock database"
returned to the user. I'd appreciate any thoughts on how to improve
this situation. A rather lengthy discussion follows...

We run a kdb5_util dump once every five minutes, for onward kprop
propagation to the slaves. The details of that dump and prop are very
similar to those recommended in the 1.8 installation guide, although
for compatibility with our 1.6 slaves we are using the '-r13' option
to kdb5_util (I'll describe how it differs later, but this doesn't
alter the main point).

I should point out that prior to our upgrade we had occasional
problems where database updates took longer than the iptables state
tracking timeout, which resulted in even worse problems (the update
succeeded but the kadmin/kpasswd client received an error).

The new behaviour is definitely preferable: a larger number of errors
occur, but the error messages actually match reality. Still, there's
room for improvement.

With 1.8, I can see that there is a fixed number of retries defined
in src/plugins/kdb/db2/kdb_db2.c (5, 1 second apart) which tallies
exactly with our logs (requests coming in 5 seconds or less prior to
the end of the dump proceed okay). This is incidentally the same
interval/number as in 1.4's krb5_db2_db_put_principal, so I'm not
sure why we saw the iptables timeouts based on this analysis. But I
digress...
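
For readers who haven't looked at that code, here is a minimal Python
sketch of a bounded retry loop of the kind described above. It is not
the MIT source: acquire_with_retries, busy_for and no_sleep are
hypothetical names, and the "lock" is a stand-in that simply fails a
fixed number of times, as it would while a dump holds the database.

```python
import time


def acquire_with_retries(try_lock, max_tries=5, delay=1.0, sleep=time.sleep):
    """Call try_lock() up to max_tries times, pausing after each failure.

    try_lock is a hypothetical non-blocking callable that returns True
    once the lock has been obtained. Returns False if every try fails.
    """
    for _ in range(max_tries):
        if try_lock():
            return True
        sleep(delay)  # wait before the next attempt
    return False


def busy_for(attempts):
    """Stand-in for a lock held by a dump: fails `attempts` times, then succeeds."""
    remaining = [attempts]

    def try_lock():
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return try_lock


def no_sleep(seconds):
    """Skip real delays so the demonstration runs instantly."""


print(acquire_with_retries(busy_for(3), sleep=no_sleep))                 # True
print(acquire_with_retries(busy_for(12), sleep=no_sleep))                # False
print(acquire_with_retries(busy_for(12), max_tries=15, sleep=no_sleep))  # True
```

The second call is the failing case: the lock stays busy for longer
than five tries can cover, so the caller gives up and would report an
error to the user.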

Since our database dumps currently take around 12 seconds, I estimate
that if we change that number of retries from 5 to 15 we'd almost
completely eliminate this problem without introducing unacceptable
delays... until our database grows again.
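
As a rough sanity check on those numbers, the expected failure rate
can be estimated with a few lines of Python. This is a back-of-envelope
model, not a measurement. It assumes the dump runs on every five-minute
cycle, holds the lock for its whole duration, and that update requests
arrive uniformly over time; failure_fraction and its parameters are
names made up for the illustration.

```python
def failure_fraction(dump_seconds, retries, retry_delay=1.0, dump_interval=300.0):
    """Estimate the fraction of update requests that exhaust their retries.

    A request can wait roughly retries * retry_delay seconds, so only
    requests arriving earlier than that before the end of the dump fail.
    """
    wait_budget = retries * retry_delay
    failing_window = max(0.0, dump_seconds - wait_budget)
    return failing_window / dump_interval


print(f"{failure_fraction(12, 5):.1%}")   # 2.3%
print(f"{failure_fraction(12, 15):.1%}")  # 0.0%
```

With a 12 second dump and 5 retries the failing window is 7 seconds
out of every 300, which is in line with the roughly 2% seen in
practice; with 15 retries the window closes until the dump takes
longer than 15 seconds.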

Anyhow, I wonder whether we're doing something particularly odd here;
we'd obviously like to reduce or completely eliminate users getting
this message, but recompiling to change that #define seems wrong.

We'd like to move to incremental propagation, ultimately, but this
would mean upgrading our slaves to 1.8, which isn't ideal for us at
the moment.

We have around 55,000 principals and a database size of around 150MB.

Oh, and one final note: another part of the reason this appears to
hit us more with 1.8 is that our dump-and-prop is done via a Makefile
which only dumps the database if the previous dumpfile is older than
the principal database (via a simple Makefile dependency). With 1.8,
it looks like (some?) getprinc requests also end up modifying the
principal database mtime, so the dump is triggered more often (log
correlation suggests that not all getprincs have this effect, and
there is a lag of several seconds; but that's the best idea I've
got). I can't immediately spot what in the code is doing this; any
ideas?
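
To make the dependency check concrete, the following Python sketch
mimics what a make-style rule decides: rebuild the target if it is
missing or older than its prerequisite. The real setup is a Makefile
rule, not this script; dump_is_stale and the file names are
hypothetical and chosen only for the demonstration.

```python
import os
import tempfile


def dump_is_stale(db_path, dump_path):
    """Return True if dump_path is missing or older than db_path."""
    if not os.path.exists(dump_path):
        return True
    return os.path.getmtime(dump_path) < os.path.getmtime(db_path)


with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "principal")   # hypothetical file names
    dump = os.path.join(tmp, "dumpfile")
    for path in (db, dump):
        open(path, "w").close()

    os.utime(db, (1000, 1000))
    os.utime(dump, (2000, 2000))
    print(dump_is_stale(db, dump))  # False: dump is newer than the database

    os.utime(db, (3000, 3000))      # anything that touches the database mtime
    print(dump_is_stale(db, dump))  # True: a new dump would be triggered
```

Under a rule like this, any operation that bumps the database mtime
causes a fresh dump on the next cycle, whether or not the principal
data changed.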

Thanks for reading!

-- 
Dominic Hargreaves, Systems Development and Support Team
Computing Services, University of Oxford


