weird incremental sync failure

Paul B. Henson henson at acm.org
Mon Oct 17 21:56:58 EDT 2011


We were doing an expiration run this past weekend and ran into a weird
replication failure. The job started at 3:30am, and for about 40 minutes
incremental update requests were refused due to the load:

Oct 15 03:31:32 halfy kadmind[2935]: Request: iprop_get_updates_1,
UPDATE_BUSY; Incoming SerialNo=76369; Outgoing SerialNo=N/A, success,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10
[...]
Oct 15 04:10:00 halfy kadmind[2935]: Request: iprop_get_updates_1,
UPDATE_BUSY; Incoming SerialNo=76369; Outgoing SerialNo=N/A, success,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10

Finally, the slave decided it wanted to do a full sync:

Oct 15 04:15:00 halfy kadmind[2935]: Request: iprop_get_updates_1,
UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=76369; Outgoing
SerialNo=N/A, success,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10

Oct 15 04:15:00 halfy kadmind[2935]: Request: iprop_full_resync_1,
spawned resync process 23469,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10
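
For anyone who hasn't dug into iprop, the slave side behaves roughly like
the following (a paraphrased C sketch of the poll loop, not the actual
kpropd code -- the function names and intervals here are made up for
illustration):

#include <stdio.h>
#include <unistd.h>

enum update_status { UPDATE_OK, UPDATE_BUSY, UPDATE_FULL_RESYNC_NEEDED };

/* Stub: ask the master for entries newer than our serial number. */
static enum update_status get_updates(unsigned long my_serial)
{
    static int calls;
    (void)my_serial;
    /* Mirror the log: a stretch of UPDATE_BUSY answers, then the
     * master says a full resync is needed. */
    return ++calls < 3 ? UPDATE_BUSY : UPDATE_FULL_RESYNC_NEEDED;
}

/* Stub: in real life this makes the master dump the database and
 * kprop it over; here it is just a placeholder. */
static void request_full_resync(void)
{
    puts("requesting full resync");
}

int main(void)
{
    unsigned long my_serial = 76369;  /* serial number stuck in the logs above */
    int done = 0;

    while (!done) {
        switch (get_updates(my_serial)) {
        case UPDATE_OK:
            /* apply the incremental entries and bump my_serial */
            done = 1;
            break;
        case UPDATE_BUSY:
            /* master is busy writing (our expiration run); wait for
             * the next poll and try again, as in the 03:31-04:10
             * log entries */
            sleep(1);   /* shortened from the real poll interval */
            break;
        case UPDATE_FULL_RESYNC_NEEDED:
            /* incrementals can't catch us up; fall back to a full dump */
            request_full_resync();
            done = 1;
            break;
        }
    }
    return 0;
}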

But then the resync child process (23469) *also* spawned a resync process
of its own:

Oct 15 04:17:34 halfy kadmind[23469]: Request: iprop_full_resync_1,
spawned resync process 23901,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10

which in turn spawned yet another child process:

Oct 15 04:25:17 halfy kadmind[23901]: Request: iprop_get_updates_1,
UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=76369; Outgoing
SerialNo=N/A, success,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10

Oct 15 04:25:17 halfy kadmind[23901]: Request: iprop_full_resync_1,
spawned resync process 24637,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10

which *finally* did the transfer:

Oct 15 04:27:51 halfy kadmind[24637]: Request: iprop_get_updates_1,
UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=76369; Outgoing
SerialNo=N/A, success,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10


There was a similar chain of processes for the other slave server,
repeating a number of times, so in the end there were something like 20
kadmind processes running, and it looks like they were all serving requests:

Oct 15 05:39:54 halfy kadmind[24637]: Request: iprop_get_updates_1,
UPDATE_OK; Incoming SerialNo=89261; Outgoing SerialNo=89263, success,
client=kiprop/petey.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.10
Oct 15 05:40:30 halfy kadmind[24293]: Request: iprop_get_updates_1,
UPDATE_OK; Incoming SerialNo=89261; Outgoing SerialNo=89263, success,
client=kiprop/loogie.unx.csupomona.edu at CSUPOMONA.EDU,
service=kiprop/kerberos-master.csupomona.edu at CSUPOMONA.EDU,
addr=134.71.247.11
Oct 15 05:54:31 halfy kadmind[25049]: Request: kadm5_chpass_principal,
dbgossner at CSUPOMONA.EDU, success, client=idmgmt at CSUPOMONA.EDU,
service=kadmin/admin at CSUPOMONA.EDU, addr=134.71.247.23


I think this failure had something to do with running out of file
descriptors:

Oct 15 03:30:02 halfy kadmind[2935]: dropping tcp fd 32 from
134.71.247.23.47346
Oct 15 03:30:02 halfy kadmind[2935]: dropping tcp fd 32 from
134.71.247.23.47346
Oct 15 04:20:09 halfy kadmind[23907]: too many connections
Oct 15 04:20:09 halfy kadmind[23907]: too many connections
Oct 15 04:20:09 halfy kadmind[23907]: dropping tcp fd 34 from
134.71.247.23.46232
Oct 15 04:20:09 halfy kadmind[23907]: dropping tcp fd 34 from
134.71.247.23.46232
Oct 15 04:22:43 halfy kadmind[24274]: too many connections
Oct 15 04:22:43 halfy kadmind[24274]: too many connections
Oct 15 04:22:43 halfy kadmind[24274]: dropping tcp fd 17 from
134.71.247.23.46777
Oct 15 04:22:43 halfy kadmind[24274]: dropping tcp fd 17 from
134.71.247.23.46777
Oct 15 04:25:17 halfy kadmind[23907]: too many connections
Oct 15 04:25:17 halfy kadmind[23907]: too many connections
Oct 15 04:25:17 halfy kadmind[23907]: dropping tcp fd 17 from
134.71.247.23.46777
Oct 15 04:25:17 halfy kadmind[23907]: dropping tcp fd 17 from
134.71.247.23.46777
Oct 15 04:27:51 halfy kadmind[24640]: too many connections
Oct 15 04:27:51 halfy kadmind[24640]: too many connections
Oct 15 04:27:51 halfy kadmind[24640]: dropping tcp fd 14 from
134.71.247.23.41273
Oct 15 04:27:51 halfy kadmind[24640]: dropping tcp fd 14 from
134.71.247.23.41273
Oct 15 04:29:40 halfy kadmind[24274]: too many connections
Oct 15 04:29:40 halfy kadmind[24274]: too many connections
Oct 15 04:29:40 halfy kadmind[24274]: dropping tcp fd 14 from
134.71.247.23.41273
Oct 15 04:29:40 halfy kadmind[24274]: dropping tcp fd 14 from
134.71.247.23.41273
Oct 15 04:30:25 halfy kadmind[24268]: too many connections
Oct 15 04:30:25 halfy kadmind[24268]: too many connections
Oct 15 04:30:25 halfy kadmind[24268]: dropping tcp fd 34 from
134.71.247.23.46232
Oct 15 04:30:25 halfy kadmind[24268]: dropping tcp fd 34 from

I ended up having to kill all of the kadmind processes and restart the
server. I don't think there is ever supposed to be more than one process
working at a time; I hope having all of them mucking around simultaneously
didn't screw up my database :(.
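
Here is my guess at the mechanism, sketched in C (illustrative only, not
the actual kadmind network code): if the forked resync child keeps the
inherited listening sockets open and goes back to dispatching requests
instead of just doing the dump and exiting, each subsequent request from
the slave can land on a child, which then forks another child, and so on:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

static int make_listener(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin = { .sin_family = AF_INET,
                               .sin_port = htons(0) };  /* any free port */
    bind(fd, (struct sockaddr *)&sin, sizeof(sin));
    listen(fd, 5);
    return fd;
}

int main(void)
{
    int listener = make_listener();

    /* A full resync request arrives, so the server forks a child to
     * dump and kprop the database. */
    pid_t child = fork();
    if (child == 0) {
        /* Suspected bug pattern: the child keeps the inherited
         * listening socket and goes back to accepting requests,
         * instead of close(listener) followed by only the resync.
         * The slave's next request can then land on this child,
         * which forks again -- matching the 2935 -> 23469 -> 23901
         * -> 24637 chain in the logs above. */
        /* close(listener);   <-- what a resync-only child should do */
        printf("child %d still has listener fd %d open\n",
               (int)getpid(), listener);
        _exit(0);
    }
    printf("parent %d forked resync child %d\n", (int)getpid(), (int)child);
    return 0;
}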

I've tracked down the root cause of the excessive connections to the kadmin
server, but the runaway resync processes seem like a separate bug? We're
running 1.9.1.

We did another batch job the following day, which showed a similar pattern
of refused incrementals followed by a full resync request, but it worked
fine. Since I'd just restarted everything, it wasn't hitting the too many
connections issue, which is why I think the first failure is related to
that...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


