Database locking during kprops, MIT 1.8
Jeremy Hunt
jeremyh at optimation.com.au
Thu Oct 7 18:47:50 EDT 2010
Hi Dominic,
I would recommend having a look at iprop. The update mechanism has less impact than yours.
It checks the log timestamps and, when it notices a difference between the two systems, propagates only the differences in the logs rather than the whole database. This should take a lot less than 12 seconds for you, so your timeout problem should disappear.
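For reference, most of the work is configuration rather than code; the sketch below shows the sort of thing involved (the realm name, values and paths are illustrative placeholders, not recommendations -- check the 1.8 documentation for the details):

    # On the master KDC, in kdc.conf (illustrative values only):
    #   [realms]
    #     EXAMPLE.COM = {
    #       iprop_enable = true
    #       iprop_master_ulogsize = 2500
    #       iprop_slave_poll = 2m
    #     }
    #
    # Each slave also sets iprop_enable = true, holds a key for its
    # kiprop/<slave-fqdn> principal in a keytab, and runs kpropd in
    # standalone mode so it can poll the master for updates:
    kpropd -S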
The provisos I see are:
1. profile changes do not appear to be logged and propagated via iprop.
2. occasionally iprop gets lost and decides to do a full propagation; in that scenario you will still get your timeouts, but it will happen a lot less frequently than it does now.
3. it is documented as losing connectivity occasionally and so may need to be restarted.
I am currently in the process of putting this into production and my testing has had pleasing results. I have only been able to force a full propagation by generating much heavier loads than we see in our environment. I have not seen 3 at all, and 1 is not really an issue for us; it is just something I noticed in testing.
You could probably mitigate 1 and 2 by running a full propagation independently at a quiet time, once a night, once a week, or whatever suits you.
You could mitigate 3 by monitoring the logs and restarting occasionally, probably with a script using the new utility kproplog.
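For example, something along these lines run from cron on each slave would cover 3 (the state file and restart method are just placeholders for whatever suits your environment):

    #!/bin/sh
    # If the slave's update log serial number has not advanced since the
    # last check, assume kpropd has lost its connection and restart it.
    # (A smarter check would also confirm the master's serial has moved,
    # so that quiet periods do not trigger pointless restarts.)
    STATE=/var/run/iprop-last-serial
    last=$(kproplog -h | awk '/Last serial/ {print $NF}')
    prev=$(cat "$STATE" 2>/dev/null)
    if [ -n "$prev" ] && [ "$last" = "$prev" ]; then
        pkill kpropd
        kpropd -S
    fi
    echo "$last" > "$STATE"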
Regards,
Jeremy
On 7/10/2010 10:54 PM, Dominic Hargreaves wrote:
>
> We've just upgraded our master KDC from 1.4 to 1.8, and are seeing
> around 2% of all password attempts fail with "Cannot lock database"
> returned to the user. I'd appreciate any thoughts on how to improve
> this situation. A rather lengthy discussion follows...
>
> Our setup runs a kdb5_util dump once every five minutes, for onward
> kprop propagation to the slaves. The detail of that dump
> and prop is very similar to that recommended in the 1.8 installation
> guide, although for compatibility with our 1.6 slaves we are using the
> '-r13' option to kdb5_util (I'll describe how it differs later, but this
> doesn't alter the main point).
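>
> For concreteness, each five-minute run boils down to something like
> the following; the paths and slave hostname here are illustrative,
> not our real values:
>
>     kdb5_util dump -r13 /var/krb5kdc/slave_datatrans
>     kprop -f /var/krb5kdc/slave_datatrans kdc-slave.example.com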
>
> I should point out that prior to our upgrade we occasionally had
> problems where database updates took longer than the iptables
> state-tracking timeout, which resulted in even worse behaviour (the
> update succeeded but the kadmin/kpasswd client received an error).
>
> The new behaviour is definitely preferable, in that although a larger
> number of errors occur, the error messages actually match reality. But
> there's still room for improvement.
>
> With 1.8, I can see that there is a fixed number of retries defined
> in src/plugins/kdb/db2/kdb_db2.c (5 retries, 1 second apart), which tallies
> exactly with our logs (requests coming in 5 seconds or less prior to
> the end of the dump proceed okay). This is incidentally the same
> interval/number as in 1.4's krb5_db2_db_put_principal, so I'm not
> sure why we saw the iptables timeouts based on this analysis. But I
> digress...
>
> Since our database dumps currently take around 12 seconds, I estimate
> that if we change that number of retries from 5 to 15 we'd almost
> completely eliminate this problem without introducing unacceptable
> delays... until our database grows again.
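>
> As a sanity check on any given run, the lock window can be measured
> directly (paths illustrative):
>
>     start=$(date +%s)
>     kdb5_util dump -r13 /var/krb5kdc/slave_datatrans
>     elapsed=$(( $(date +%s) - start ))
>     # Requests that arrived more than 5 seconds before the dump
>     # finished will have exhausted their 5 one-second retries.
>     [ "$elapsed" -gt 5 ] && echo "dump held the lock for ${elapsed}s" >&2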
>
> Anyhow, I wonder whether we're doing something particularly odd here;
> we'd obviously like to reduce or completely eliminate users getting
> this message, but recompiling to change that #define seems wrong.
>
> We'd like to move to incremental propagation, ultimately, but this
> would mean moving our slaves to 1.8 which isn't ideal for us at the
> moment.
>
> We have around 55,000 principals and a database size of around 150MB.
>
> Oh, and one final note: another part of the reason this appears to
> hit us more with 1.8 is that our dump-and-prop is done via a
> Makefile which only dumps the database if the previous dumpfile is
> older than the principal database (via a simple Makefile dependency).
> With 1.8, it looks like (some?) getprinc requests also end up modifying
> the principal database mtime (log correlation suggests that not all
> getprincs have this effect, and there is a lag of several seconds; but
> that's the best idea I've got). I can't spot immediately what in the
> code is doing this; any ideas?
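>
> For reference, the Makefile dependency amounts to the following check
> (placeholder paths), which is why a spurious mtime update on the
> principal database triggers an extra dump-and-prop cycle:
>
>     DB=/var/krb5kdc/principal
>     DUMP=/var/krb5kdc/slave_datatrans
>     # Only re-dump and re-prop when the principal database is newer
>     # than the last dumpfile.
>     if [ ! -e "$DUMP" ] || [ "$DB" -nt "$DUMP" ]; then
>         kdb5_util dump -r13 "$DUMP"
>     fi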
>
> Thanks for reading!
>
--
"The whole modern world has divided itself into Conservatives and Progressives. The business of Progressives is to go on making mistakes. The business of the Conservatives is to prevent the mistakes from being corrected." -- G. K. Chesterton