Erratic behavior of full resync process
Leonard J. Peirce
leonard.peirce+kerberos at wmich.edu
Tue May 12 16:44:29 EDT 2015
For several years we have run Kerberos (built from source) on Solaris 9/10.
We often dump our production master, copy the file over to our test master,
and reload it with kdb5_util. In this environment everything has always
worked as expected. A full resync of our test slave is immediately initiated,
the database on the test master is dumped, kprop'd to the slave, and reloaded
on the slave. This take just a few minutes to complete.
We're in the process of moving our Kerberos environment from Solaris
to CentOS Linux. We're running two CentOS 6.6 Linux VMs with Kerberos
RPMs (the latest available from standard repositories) of
krb5-libs-1.10.3-37.el6_6.x86_64
krb5-workstation-1.10.3-37.el6_6.x86_64
krb5-server-1.10.3-37.el6_6.x86_64
For reference:
h.test.admin.private is the master (172.30.110.45)
r.test.admin.private is the slave (172.30.110.46)
And a few other items of note:
Both hosts are running NTP so the clocks are in sync
iptables is disabled at boot
MTUs on the network interfaces are default at 1500
After configuring things I initialize the master (from a dump of our production
master) using kdb5_util, start krb5kdc and kadmin on the master and then start
kpropd on the slave, expecting a full resync to start and the slave to be
initialized. But I'm seeing strange and erratic behaviour.
In syslog, I often see this from kadmind on the master:
starting
Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
Unspecified GSS failure. Minor code may provide more information
Clock skew too great
GSS-API error strings complete.
Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
Unspecified GSS failure. Minor code may provide more information
Key version number for principal in key table is incorrect
GSS-API error strings complete.
Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
Unspecified GSS failure. Minor code may provide more information
Clock skew too great
GSS-API error strings complete.
Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
Unspecified GSS failure. Minor code may provide more information
Key version number for principal in key table is incorrect
GSS-API error strings complete.
closing down fd 32
closing down fd 33
Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
Unspecified GSS failure. Minor code may provide more information
Clock skew too great
GSS-API error strings complete.
Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
Unspecified GSS failure. Minor code may provide more information
Key version number for principal in key table is incorrect
GSS-API error strings complete.
closing down fd 32
Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=0; Outgoing SerialNo=N/A, success,
client=kiprop/r.test.admin.private at WMICH.EDU, service=kiprop/h.test.admin.private at WMICH.EDU, addr=172.30.110.46
Request: iprop_full_resync_1, spawned resync process 2944, client=kiprop/r.test.admin.private at WMICH.EDU,
service=kiprop/h.test.admin.private at WMICH.EDU, addr=172.30.110.46
Sometimes I just see the Request: iprop_get_updates line, and no complaints
about authentication failures or clock skew.
On the slave I see syslog entries showing repeated problems with kpropd
connecting to the master:
/usr/sbin/kpropd: GSS-API (or Kerberos) error while initializing /usr/sbin/kpropd interface, retrying
I start kpropd with -d -S and use strace on it and I see that repeatedly
opens /dev/urandom and reads from it just before I see the above error.
When running kpropd in standalone mode and with debug enabled I see
syslog entry above and eventually I see
do_standalone: setting resync alarm to 10
waiting for a kprop connection
doit: setting resync alarm to 5s
Connection from h.test.admin.private
krb5_recvauth(6, kprop5_01, host/r.test.admin.private at WMICH.EDU, ...)
authenticated client: host/h.test.admin.private at WMICH.EDU (etype == DES cbc mode with CRC-32)
/usr/sbin/kpropd: Connection reset by peer while reading database block starting at offset 92340224
Full resync was unsuccessful
Unfortunately, the resync was not successful. Often (but not always), when
kprop -f starts on the master, the slave_datatrans file will *partially*
copy to the slave, often 60-90% of the data, before the connection hangs
and then times out. I have run strace on both the kprop and kpropd processes
while they are connected. The kprop on the master hangs during a write()
for several minutes and then eventually times out:
Process 3183 attached - interrupt to quit
writev(4, [{"\240\37\26+[\16\247\tC\21\6/\243\217\340\0231f\362\245\3\214$\246\227\231N\265\351\366\1\233"..., 22106}], 1) = -1
ETIMEDOUT (Connection timed out)
write(2, "kprop", 5) = 5
write(2, ": ", 2) = 2
write(2, "Connection timed out", 20) = 20
write(2, " ", 1) = 1
write(2, "while sending database block sta"..., 49) = 49
ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffadfee150) = -1 ENOTTY (Inappropriate ioctl for device)
write(2, "\n", 1) = 1
exit_group(1) = ?
Process 3183 detached
Occasionally, when left alone, the slave will be successfully sync'd.
Sometimes this happens in minutes, even hours, and, on a few occasions,
overnight. When successful, the syslog entries on both the master and
slave look nominal. Once sync'd, incremental propagation works fine.
In my debugging attempts, I tried starting kpropd with
kpropd -S -d -P NNN
and then attempt to run
kprop -f slave_datatrans -P NNN r.test.admin.private
on the master but kpropd on the slave doesn't appear to be listening
on port NNN. Am I misunderstanding something?
In syslog on the slave, on an aborted resync I sometimes see
May 6 17:18:50 r.test.admin.private kpropd[7078]: /usr/sbin/kpropd: Connection reset by peer while reading database block
starting at offset 0
May 6 17:18:50 r.test.admin.private kpropd[7032]: kpropd: Full resync, invalid return.
Thinking that perhaps this was a problem with Kerberos that is available in
the repositories for CentOS I decided to download the latest version (1.13.2)
and build it from source. It built cleanly and passed the tests but when I
tried it I noticed the same behaviour as the Kerberos from the RPMs.
Has anyone else seen anything like this?
TIA...
__
Leonard J. Peirce
Western Michigan University
Office of Information Technology
Kalamazoo, MI 49008
More information about the Kerberos
mailing list