[krbdev.mit.edu #7547] RE: krb5-1.11 iprop bug

Mon Jan 14 14:05:08 EST 2013

It is a REAL bug.  However, I did not understand the circumstances
correctly.

If a full reload is required, the following sequence happens:

- The dump is transmitted from the master to the slave

- The ulog is initialized with the serial number information

- The database is loaded into temporary files with a ~ filename

Issue: if the database load does not complete, the ulog serial number has
already been set, so the slave can run with an older database yet report
that it is current.
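
To make the ordering problem concrete, here is a minimal Python sketch. It
is a toy model, not the kpropd code: the Slave class and the names
load_database, full_resync_buggy and full_resync_safe are hypothetical and
exist only for this illustration. The "safe" ordering is one assumed way to
avoid the problem, not a description of any actual patch.

```python
class LoadError(Exception):
    """Raised when the (simulated) database load fails."""


class Slave:
    def __init__(self, db_sno, ulog_sno):
        self.db_sno = db_sno      # serial number the database contents reflect
        self.ulog_sno = ulog_sno  # serial number the ulog header reports

    def load_database(self, dump_sno, fail=False):
        # Stands in for loading into temporary files and renaming on success.
        if fail:
            raise LoadError("load did not complete")
        self.db_sno = dump_sno

    def full_resync_buggy(self, dump_sno, fail=False):
        # Order described above: the ulog is initialized before the load.
        self.ulog_sno = dump_sno
        self.load_database(dump_sno, fail)

    def full_resync_safe(self, dump_sno, fail=False):
        # Assumed alternative: record the serial only after a successful load.
        self.load_database(dump_sno, fail)
        self.ulog_sno = dump_sno


def run(resync_name):
    """Run one failed full resync and return (db_sno, ulog_sno)."""
    slave = Slave(db_sno=10, ulog_sno=10)
    try:
        getattr(slave, resync_name)(dump_sno=42, fail=True)
    except LoadError:
        pass
    return slave.db_sno, slave.ulog_sno


print(run("full_resync_buggy"))  # (10, 42): old database, reports current
print(run("full_resync_safe"))   # (10, 10): old database, reports old serial
```

In the first case the slave holds database 10 but advertises serial 42, so
it would never ask the master for the updates it is missing.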

From: Richard Basch [mailto:basch at alum.mit.edu] 
Sent: Thursday, January 03, 2013 10:42 PM
To: 'krb5-bugs at mit.edu'
Subject: RE: krb5-1.11 iprop bug

I need to research this further.  From the code, this situation should not
have been possible unless the "time jumped" or something happened in
parallel, which should not be possible under normal circumstances.

I am beginning to suspect I lost track of the state: I was flipping states
but had forgotten to quiesce a full propagation that was in progress.

Ignore this bug report until I can find the issue and/or reproduce the
situation.

From: Richard Basch [mailto:basch at alum.mit.edu] 
Sent: Thursday, January 03, 2013 9:29 PM
To: 'krb5-bugs at mit.edu'
Subject: krb5-1.11 iprop bug

The new iprop dump / restore code has a significant bug (my patches could
not have contributed to this issue).

From what I can discern, when a FULL RESYNC is required, the admin server
will check if a dump already exists with the serial/timestamp in the ulog
(so far so good).  However, I think the check must be flawed. I upgraded
from an earlier version and still had slave_datatrans_* files from before
with older entries.  Furthermore, I had restarted the ulog (since the stock
code doesn't preserve the ulog, I have to assume I might have to update from
a slave to a master and that will force a re-init of the ulog).  So, in
essence, even the last slave_datatrans file might have had a sno/timestamp,
but it shouldn't match anything in the ulog.  A couple of updates come in,
and now the serial numbers are "in range".
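
As a rough illustration of how a stale dump can come to look usable, here
is a small Python sketch. It is a toy model and not the kadmind code: the
ulog is a plain dict of serial number to timestamp, the function names
dump_usable_range_only and dump_usable_exact are hypothetical, and it
assumes the check only tests whether the dump's serial number falls inside
the ulog's range and that serial numbers restart at 1 after a re-init.

```python
def dump_usable_range_only(dump_sno, ulog):
    # Assumed "lazy" check: is the dump's serial inside the ulog's range?
    if not ulog:
        return False
    return min(ulog) <= dump_sno <= max(ulog)


def dump_usable_exact(dump_sno, dump_time, ulog):
    # Stricter check: the ulog must hold that serial with the same timestamp.
    return dump_sno in ulog and ulog[dump_sno] == dump_time


# Dump left over from before the ulog was re-initialized: (sno, timestamp).
stale_sno, stale_time = 2, 1000

# Freshly re-initialized ulog: nothing matches yet.
ulog = {}
print(dump_usable_range_only(stale_sno, ulog))         # False

# A few updates arrive; serials restart at 1 with new timestamps.
ulog = {1: 5000, 2: 5001, 3: 5002}
print(dump_usable_range_only(stale_sno, ulog))         # True (wrongly)
print(dump_usable_exact(stale_sno, stale_time, ulog))  # False
```

Under these assumptions, the stale dump is rejected right after the re-init
but accepted once enough new updates bring its serial number back into
range, unless the timestamp is compared as well.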

Now, here's where things go completely awry. 

The slave got an old database copy, but the updates applied since were new.

I am not sure if it picked up the updates from the older
slave_datatrans_<hostname> files or if the problem was the reinit and the
sno/timestamp check not being sufficient, but the result was an old database
and the ulog being reported after the transfer was the CURRENT
sno/timestamp.

When I checked from_master on the load, it looked like the new db sno, so I
know the problem was with the dump/transfer (a section of code I did NOT
change with my patches).

I will try to delve into the problem further, but one should assume a slave
will need to be promoted to a master on occasion and other slaves will be
redirected to that master after having received updates from other sources,
so this is a data integrity bug.  (I'll send a patch if I figure out the
cause, but all indications are that it is not related to my prior patches
but somehow related to the new conditional dump code.  It certainly was a
very lazy sno check; at first glance I thought it would be ok, but perhaps
it really needs to be a proper ulog check.)