Erratic behavior of full resync process
Kenneth MacDonald
Kenneth.MacDonald at ed.ac.uk
Thu May 14 11:39:43 EDT 2015
Could it be this bug?
https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
Cheers,
Kenny.
On Tue, 2015-05-12 at 16:44 -0400, Leonard J. Peirce wrote:
> For several years we have run Kerberos (built from source) on Solaris 9/10.
> We often dump our production master, copy the file over to our test master,
> and reload it with kdb5_util. In this environment everything has always
> worked as expected. A full resync of our test slave is immediately initiated,
> the database on the test master is dumped, kprop'd to the slave, and reloaded
> on the slave. This take just a few minutes to complete.
>
> We're in the process of moving our Kerberos environment from Solaris
> to CentOS Linux. We're running two CentOS 6.6 Linux VMs with Kerberos
> RPMs (the latest available from standard repositories) of
>
> krb5-libs-1.10.3-37.el6_6.x86_64
> krb5-workstation-1.10.3-37.el6_6.x86_64
> krb5-server-1.10.3-37.el6_6.x86_64
>
> For reference:
>
> h.test.admin.private is the master (172.30.110.45)
> r.test.admin.private is the slave (172.30.110.46)
>
> And a few other items of note:
>
> Both hosts are running NTP so the clocks are in sync
> iptables is disabled at boot
> MTUs on the network interfaces are default at 1500
>
> After configuring things I initialize the master (from a dump of our production
> master) using kdb5_util, start krb5kdc and kadmin on the master and then start
> kpropd on the slave, expecting a full resync to start and the slave to be
> initialized. But I'm seeing strange and erratic behaviour.
>
> In syslog, I often see this from kadmind on the master:
>
> starting
> Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
> Unspecified GSS failure. Minor code may provide more information
> Clock skew too great
> GSS-API error strings complete.
> Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
> Unspecified GSS failure. Minor code may provide more information
> Key version number for principal in key table is incorrect
> GSS-API error strings complete.
> Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
> Unspecified GSS failure. Minor code may provide more information
> Clock skew too great
> GSS-API error strings complete.
> Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
> Unspecified GSS failure. Minor code may provide more information
> Key version number for principal in key table is incorrect
> GSS-API error strings complete.
> closing down fd 32
> closing down fd 33
> Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
> Unspecified GSS failure. Minor code may provide more information
> Clock skew too great
> GSS-API error strings complete.
> Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
> Unspecified GSS failure. Minor code may provide more information
> Key version number for principal in key table is incorrect
> GSS-API error strings complete.
> closing down fd 32
> Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=0; Outgoing SerialNo=N/A, success,
> client=kiprop/r.test.admin.private at WMICH.EDU, service=kiprop/h.test.admin.private at WMICH.EDU, addr=172.30.110.46
> Request: iprop_full_resync_1, spawned resync process 2944, client=kiprop/r.test.admin.private at WMICH.EDU,
> service=kiprop/h.test.admin.private at WMICH.EDU, addr=172.30.110.46
>
> Sometimes I just see the Request: iprop_get_updates line, and no complaints
> about authentication failures or clock skew.
>
> On the slave I see syslog entries showing repeated problems with kpropd
> connecting to the master:
>
> /usr/sbin/kpropd: GSS-API (or Kerberos) error while initializing /usr/sbin/kpropd interface, retrying
>
> I start kpropd with -d -S and use strace on it and I see that repeatedly
> opens /dev/urandom and reads from it just before I see the above error.
>
> When running kpropd in standalone mode and with debug enabled I see
> syslog entry above and eventually I see
>
> do_standalone: setting resync alarm to 10
> waiting for a kprop connection
> doit: setting resync alarm to 5s
> Connection from h.test.admin.private
> krb5_recvauth(6, kprop5_01, host/r.test.admin.private at WMICH.EDU, ...)
> authenticated client: host/h.test.admin.private at WMICH.EDU (etype == DES cbc mode with CRC-32)
> /usr/sbin/kpropd: Connection reset by peer while reading database block starting at offset 92340224
> Full resync was unsuccessful
>
> Unfortunately, the resync was not successful. Often (but not always), when
> kprop -f starts on the master, the slave_datatrans file will *partially*
> copy to the slave, often 60-90% of the data, before the connection hangs
> and then times out. I have run strace on both the kprop and kpropd processes
> while they are connected. The kprop on the master hangs during a write()
> for several minutes and then eventually times out:
>
> Process 3183 attached - interrupt to quit
> writev(4, [{"\240\37\26+[\16\247\tC\21\6/\243\217\340\0231f\362\245\3\214$\246\227\231N\265\351\366\1\233"..., 22106}], 1) = -1
> ETIMEDOUT (Connection timed out)
> write(2, "kprop", 5) = 5
> write(2, ": ", 2) = 2
> write(2, "Connection timed out", 20) = 20
> write(2, " ", 1) = 1
> write(2, "while sending database block sta"..., 49) = 49
> ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffadfee150) = -1 ENOTTY (Inappropriate ioctl for device)
> write(2, "\n", 1) = 1
> exit_group(1) = ?
> Process 3183 detached
>
> Occasionally, when left alone, the slave will be successfully sync'd.
> Sometimes this happens in minutes, even hours, and, on a few occasions,
> overnight. When successful, the syslog entries on both the master and
> slave look nominal. Once sync'd, incremental propagation works fine.
>
> In my debugging attempts, I tried starting kpropd with
>
> kpropd -S -d -P NNN
>
> and then attempt to run
>
> kprop -f slave_datatrans -P NNN r.test.admin.private
>
> on the master but kpropd on the slave doesn't appear to be listening
> on port NNN. Am I misunderstanding something?
>
> In syslog on the slave, on an aborted resync I sometimes see
>
> May 6 17:18:50 r.test.admin.private kpropd[7078]: /usr/sbin/kpropd: Connection reset by peer while reading database block
> starting at offset 0
> May 6 17:18:50 r.test.admin.private kpropd[7032]: kpropd: Full resync, invalid return.
>
> Thinking that perhaps this was a problem with Kerberos that is available in
> the repositories for CentOS I decided to download the latest version (1.13.2)
> and build it from source. It built cleanly and passed the tests but when I
> tried it I noticed the same behaviour as the Kerberos from the RPMs.
>
> Has anyone else seen anything like this?
>
> TIA...
> __
> Leonard J. Peirce
> Western Michigan University
> Office of Information Technology
> Kalamazoo, MI 49008
> ________________________________________________
> Kerberos mailing list Kerberos at mit.edu
> https://mailman.mit.edu/mailman/listinfo/kerberos
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
More information about the Kerberos
mailing list