[krbdev.mit.edu #7588] RE: krb5-1.11 iprop bug

Sun Mar 3 23:31:43 EST 2013

The ulog serial number is set before a reload operation begins; if the
reload fails (or if the new database cannot be rendered active), the slave
ends up with the updated serial number but a stale database that does not
match the version recorded in the ulog.
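
To illustrate the ordering problem described above, here is a minimal sketch
of the safer sequence: load first, and record the new serial number only once
the load has succeeded. This is not the patch referenced in this ticket and it
does not use the real kdb_log structures; slave_state, load_fn, full_resync,
load_ok and load_fail are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the slave's state (not the real ulog header). */
struct slave_state {
    uint32_t ulog_sno;    /* serial number recorded in the update log */
    uint32_t db_version;  /* serial number the live database reflects */
};

/* Hypothetical loader: returns true only if the new database was loaded
 * and made active. */
typedef bool (*load_fn)(void *ctx);

/* Perform a full resync.  The serial number is committed last, so a failed
 * load leaves both fields at their old, mutually consistent values. */
bool
full_resync(struct slave_state *s, uint32_t new_sno, load_fn load, void *ctx)
{
    assert(s != NULL && load != NULL);

    if (!load(ctx))
        return false;      /* old database, old serial: still consistent */

    s->db_version = new_sno;
    s->ulog_sno = new_sno; /* only now does the ulog claim the new version */
    return true;
}

/* Trivial loaders used to exercise both paths. */
bool load_ok(void *ctx)   { (void)ctx; return true; }
bool load_fail(void *ctx) { (void)ctx; return false; }
```

With this ordering, calling full_resync() with load_fail leaves ulog_sno
unchanged, so the slave would request the full resync again on its next poll
instead of silently reporting a version it does not have.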

I am currently testing the attached patch, which should resolve the issue:

https://github.com/rbasch/krb5/commit/2ef5ae0607d1c317a936e439b4be7a6f5184dc2f

Currently, this situation can lead to a data consistency issue with
significant repercussions and, worse yet, the failure is largely silent when
it occurs.

From: Richard Basch [mailto:basch at alum.mit.edu] 
Sent: Thursday, January 03, 2013 10:42 PM
To: 'krb5-bugs at mit.edu'
Subject: RE: krb5-1.11 iprop bug

I need to research further. The code doesn't look like it should allow this
situation unless the "time jumped" or something happened in parallel, which
should not be possible under normal circumstances.

I am beginning to suspect I lost track of the state and was flipping states
but had forgotten to quiesce a full propagation that was in progress.

Ignore this bug report until I can find the issue and/or reproduce the
situation.

From: Richard Basch [mailto:basch at alum.mit.edu] 
Sent: Thursday, January 03, 2013 9:29 PM
To: 'krb5-bugs at mit.edu'
Subject: krb5-1.11 iprop bug

The new iprop dump / restore code has a significant bug (my patches could
not have contributed to this issue).

From what I can discern, when a FULL RESYNC is required, the admin server
will check if a dump already exists with the serial/timestamp in the ulog
(so far so good).  However, I think the check must be flawed. I upgraded
from an earlier version and still had slave_datatrans_* files from before
with older entries.  Furthermore, I had restarted the ulog (since the stock
code doesn't preserve the ulog, I have to assume I might have to update from
a slave to a master and that will force a re-init of the ulog).  So, in
essence, even the last slave_datatrans file might have had a sno/timestamp,
but it shouldn't match anything in the ulog. A couple of updates come in, and
now the serial numbers are "in range".
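
The difference between a serial number being "in range" and actually matching
the ulog can be sketched as follows. This is only an illustration under
assumptions: the message suggests, but does not confirm, that the existing
check amounts to a range comparison. The log_pos structure and the functions
pos_equal, dump_in_range and dump_matches_ulog are hypothetical and are not
taken from the krb5 sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (sno, timestamp) pair identifying a position in the ulog. */
struct log_pos {
    uint32_t sno;
    int64_t  seconds;
    int32_t  useconds;
};

/* Two positions are the same only if serial number AND timestamp agree. */
bool
pos_equal(const struct log_pos *a, const struct log_pos *b)
{
    return a->sno == b->sno && a->seconds == b->seconds &&
        a->useconds == b->useconds;
}

/* Weak check (assumed): the dump's sno merely falls inside the range of
 * serial numbers currently held in the ulog (entries sorted by sno). */
bool
dump_in_range(const struct log_pos *entries, size_t n,
              const struct log_pos *dump)
{
    assert(dump != NULL);
    if (n == 0)
        return false;
    return dump->sno >= entries[0].sno && dump->sno <= entries[n - 1].sno;
}

/* Stricter check: some ulog entry carries exactly the dump's sno and
 * timestamp.  A dump left over from before a ulog re-init fails this even
 * when its sno has come back "in range". */
bool
dump_matches_ulog(const struct log_pos *entries, size_t n,
                  const struct log_pos *dump)
{
    assert(dump != NULL);
    for (size_t i = 0; i < n; i++) {
        if (pos_equal(&entries[i], dump))
            return true;
    }
    return false;
}
```

For a ulog re-initialized to entries 1 through 3, a leftover dump stamped with
sno 2 and an older timestamp passes dump_in_range() but fails
dump_matches_ulog(), which is the case where the old dump should not be
reused.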

Now, here's where things go completely awry. 

The slave got an old database copy, but the updates applied since were new.

I am not sure if it picked up the updates from the older
slave_datatrans_<hostname> files or if the problem was the reinit and the
sno/timestamp check not being sufficient, but the result was an old database
and the ulog being reported after the transfer was the CURRENT
sno/timestamp.

When I checked from_master on the load, it looked like the new db sno, so I
know the problem was with the dump/transfer (a section of code I did NOT
change with my patches).

I will try to delve into the problem further, but one should assume a slave
will need to be promoted to a master on occasion and other slaves will be
redirected to that master after having received updates from other sources,
so this is a data integrity bug.  (I'll send a patch if I figure out the
cause, but all indications are that it is not related to my prior patches but
somehow related to the new conditional dump code; it certainly was a very
lazy sno check, though from first glance I thought it would be ok, but
perhaps it really needs to be a proper ulog check.)