Erratic behavior of full resync process

Thu May 14 11:39:43 EDT 2015

Could it be this bug?

https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

Cheers,

Kenny.

On Tue, 2015-05-12 at 16:44 -0400, Leonard J. Peirce wrote:
> For several years we have run Kerberos (built from source) on Solaris 9/10.
> We often dump our production master, copy the file over to our test master,
> and reload it with kdb5_util.  In this environment everything has always
> worked as expected.  A full resync of our test slave is immediately initiated:
> the database on the test master is dumped, kprop'd to the slave, and reloaded
> on the slave.  This takes just a few minutes to complete.
> 
> We're in the process of moving our Kerberos environment from Solaris
> to CentOS Linux.  We're running two CentOS 6.6 Linux VMs with the following
> Kerberos RPMs (the latest available from the standard repositories):
> 
>     krb5-libs-1.10.3-37.el6_6.x86_64
>     krb5-workstation-1.10.3-37.el6_6.x86_64
>     krb5-server-1.10.3-37.el6_6.x86_64
> 
> For reference:
> 
>     h.test.admin.private is the master (172.30.110.45)
>     r.test.admin.private is the slave (172.30.110.46)
> 
> And a few other items of note:
> 
>     Both hosts are running NTP so the clocks are in sync
>     iptables is disabled at boot
>     MTUs on the network interfaces are default at 1500
> 
> After configuring things I initialize the master (from a dump of our production
> master) using kdb5_util, start krb5kdc and kadmin on the master and then start
> kpropd on the slave, expecting a full resync to start and the slave to be
> initialized.  But I'm seeing strange and erratic behaviour.
> 
> In syslog, I often see this from kadmind on the master:
> 
>     starting
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>         Unspecified GSS failure.  Minor code may provide more information
>         Clock skew too great
>         GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>         Unspecified GSS failure.  Minor code may provide more information
>         Key version number for principal in key table is incorrect
>         GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>         Unspecified GSS failure.  Minor code may provide more information
>         Clock skew too great
>         GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>         Unspecified GSS failure.  Minor code may provide more information
>         Key version number for principal in key table is incorrect
>         GSS-API error strings complete.
>     closing down fd 32
>     closing down fd 33
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>         Unspecified GSS failure.  Minor code may provide more information
>         Clock skew too great
>         GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>         Unspecified GSS failure.  Minor code may provide more information
>         Key version number for principal in key table is incorrect
>         GSS-API error strings complete.
>     closing down fd 32
>     Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=0; Outgoing SerialNo=N/A, success, client=kiprop/r.test.admin.private@WMICH.EDU, service=kiprop/h.test.admin.private@WMICH.EDU, addr=172.30.110.46
>     Request: iprop_full_resync_1, spawned resync process 2944, client=kiprop/r.test.admin.private@WMICH.EDU, service=kiprop/h.test.admin.private@WMICH.EDU, addr=172.30.110.46
> 
> Sometimes I just see the Request: iprop_get_updates line, and no complaints
> about authentication failures or clock skew.
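
To put numbers on "often" versus "sometimes", the failure reasons in the kadmind log can be tallied per run. A minimal sketch in Python, assuming the log lines look like the excerpt above with the syslog timestamp/host prefix already stripped; the function name tally_gss_failures is my own invention, not part of any Kerberos tool.

```python
from collections import Counter

def tally_gss_failures(lines):
    """Count the minor-status message of each 'Authentication attempt failed' block.

    Assumes the block layout seen in the excerpt: a header line, one
    major-status line, one or more minor-status lines, then a
    'GSS-API error strings complete.' terminator.
    """
    counts = Counter()
    in_block = False
    seen_major = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("Authentication attempt failed:"):
            in_block, seen_major = True, False
        elif in_block and line.startswith("GSS-API error strings complete"):
            in_block = False
        elif in_block:
            if not seen_major:
                seen_major = True      # first line in a block is the major status
            else:
                counts[line] += 1      # remaining lines are minor-status text
    return counts
```

Feeding it the lines of a saved log excerpt (for example `tally_gss_failures(open(path))`, where `path` is wherever you saved it) gives one count per distinct minor-status message, which makes it easier to compare a failing run against a clean one.
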
> 
> On the slave I see syslog entries showing repeated problems with kpropd
> connecting to the master:
> 
>     /usr/sbin/kpropd: GSS-API (or Kerberos) error while initializing /usr/sbin/kpropd interface, retrying
> 
> I start kpropd with -d -S and use strace on it, and I see that it repeatedly
> opens /dev/urandom and reads from it just before I see the above error.
> 
> When running kpropd in standalone mode and with debug enabled I see the
> syslog entry above and eventually I see
> 
>     do_standalone: setting resync alarm to 10
>     waiting for a kprop connection
>     doit: setting resync alarm to 5s
>     Connection from h.test.admin.private
>     krb5_recvauth(6, kprop5_01, host/r.test.admin.private@WMICH.EDU, ...)
>     authenticated client: host/h.test.admin.private@WMICH.EDU (etype == DES cbc mode with CRC-32)
>     /usr/sbin/kpropd: Connection reset by peer while reading database block starting at offset 92340224
>     Full resync was unsuccessful
> 
> Unfortunately, the resync was not successful.  Often (but not always), when
> kprop -f starts on the master, the slave_datatrans file will *partially*
> copy to the slave, often 60-90% of the data, before the connection hangs
> and then times out.  I have run strace on both the kprop and kpropd processes
> while they are connected.  The kprop on the master hangs during a write()
> for several minutes and then eventually times out:
> 
>     Process 3183 attached - interrupt to quit
>     writev(4, [{"\240\37\26+[\16\247\tC\21\6/\243\217\340\0231f\362\245\3\214$\246\227\231N\265\351\366\1\233"..., 22106}], 1) = -1 ETIMEDOUT (Connection timed out)
>     write(2, "kprop", 5)                    = 5
>     write(2, ": ", 2)                       = 2
>     write(2, "Connection timed out", 20)    = 20
>     write(2, " ", 1)                        = 1
>     write(2, "while sending database block sta"..., 49) = 49
>     ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffadfee150) = -1 ENOTTY (Inappropriate ioctl for device)
>     write(2, "\n", 1)                       = 1
>     exit_group(1)                           = ?
>     Process 3183 detached
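
Since the strace shows a plain write to the socket stalling, it may be worth checking whether a comparable amount of data gets across the same path over bare TCP, with no Kerberos involved. A rough sketch in Python under that assumption; receive_all and send_bytes are names made up for illustration, and the port and byte count are yours to pick. Run the receiver on the slave and the sender on the master.

```python
import socket
import threading

CHUNK = 64 * 1024

def receive_all(server_sock, result):
    """Accept one connection and count bytes until the peer closes it."""
    conn, _addr = server_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            total += len(data)
    result.append(total)

def send_bytes(host, port, nbytes, timeout=60.0):
    """Send nbytes of filler to host:port.

    Raises OSError (including socket timeouts) if the transfer stalls
    or the connection is reset part-way through.
    """
    payload = b"\0" * CHUNK
    sent = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while sent < nbytes:
            n = min(CHUNK, nbytes - sent)
            sock.sendall(payload[:n])
            sent += n
    return sent
```

If this also stalls somewhere past the halfway mark, the problem is below Kerberos; if it completes every time, that points back at kprop/kpropd or the hosts themselves.
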
> 
> Occasionally, when left alone, the slave will be successfully sync'd.
> Sometimes this happens in minutes, sometimes in hours, and, on a few occasions,
> overnight.  When successful, the syslog entries on both the master and
> slave look nominal.  Once sync'd, incremental propagation works fine.
> 
> In my debugging attempts, I tried starting kpropd with
> 
>     kpropd -S -d -P NNN
> 
> and then attempted to run
> 
>     kprop -f slave_datatrans -P NNN r.test.admin.private
> 
> on the master but kpropd on the slave doesn't appear to be listening
> on port NNN.  Am I misunderstanding something?
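
Whether anything is accepting connections on that port can be checked from the master without involving kprop at all. A small sketch in Python, assuming only that kpropd would accept a plain TCP connection on whatever port it is bound to; port_is_listening is a name I made up, and NNN stays whatever you passed to -P.

```python
import socket

def port_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts alike.
        return False
```

Calling `port_is_listening("r.test.admin.private", NNN)` from the master, with NNN replaced by the real number, tells you whether the listener is reachable from there; a False result does not distinguish "nothing bound to the port" from "blocked on the way", so it is only a first cut.
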
> 
> In syslog on the slave, on an aborted resync I sometimes see
> 
>     May  6 17:18:50 r.test.admin.private kpropd[7078]: /usr/sbin/kpropd: Connection reset by peer while reading database block starting at offset 0
>     May  6 17:18:50 r.test.admin.private kpropd[7032]: kpropd: Full resync, invalid return.
> 
> Thinking that perhaps this was a problem with the Kerberos packages available in
> the CentOS repositories, I decided to download the latest version (1.13.2)
> and build it from source.  It built cleanly and passed the tests, but when I
> tried it I noticed the same behaviour as with the Kerberos from the RPMs.
> 
> Has anyone else seen anything like this?
> 
> TIA...
> __
> Leonard J. Peirce
> Western Michigan University
> Office of Information Technology
> Kalamazoo, MI  49008
> 
