Erratic behavior of full resync process

Tue May 12 16:44:29 EDT 2015

For several years we have run Kerberos (built from source) on Solaris 9/10.
We often dump our production master, copy the file over to our test master,
and reload it with kdb5_util.  In this environment everything has always
worked as expected.  A full resync of our test slave is immediately initiated,
the database on the test master is dumped, kprop'd to the slave, and reloaded
on the slave.  This take just a few minutes to complete.

We're in the process of moving our Kerberos environment from Solaris
to CentOS Linux.  We're running two CentOS 6.6 Linux VMs with Kerberos
RPMs (the latest available from standard repositories) of

    krb5-libs-1.10.3-37.el6_6.x86_64
    krb5-workstation-1.10.3-37.el6_6.x86_64
    krb5-server-1.10.3-37.el6_6.x86_64

For reference:

    h.test.admin.private is the master (172.30.110.45)
    r.test.admin.private is the slave (172.30.110.46)

And a few other items of note:

    Both hosts are running NTP so the clocks are in sync
    iptables is disabled at boot
    MTUs on the network interfaces are default at 1500

After configuring things I initialize the master (from a dump of our production
master) using kdb5_util, start krb5kdc and kadmin on the master and then start
kpropd on the slave, expecting a full resync to start and the slave to be
initialized.  But I'm seeing strange and erratic behaviour.

In syslog, I often see this from kadmind on the master:

    starting
    Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
        Unspecified GSS failure.  Minor code may provide more information
        Clock skew too great
        GSS-API error strings complete.
    Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
        Unspecified GSS failure.  Minor code may provide more information
        Key version number for principal in key table is incorrect
        GSS-API error strings complete.
    Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
        Unspecified GSS failure.  Minor code may provide more information
        Clock skew too great
        GSS-API error strings complete.
    Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
        Unspecified GSS failure.  Minor code may provide more information
        Key version number for principal in key table is incorrect
        GSS-API error strings complete.
    closing down fd 32
    closing down fd 33
    Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
        Unspecified GSS failure.  Minor code may provide more information
        Clock skew too great
        GSS-API error strings complete.
    Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
        Unspecified GSS failure.  Minor code may provide more information
        Key version number for principal in key table is incorrect
        GSS-API error strings complete.
    closing down fd 32
    Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=0; Outgoing SerialNo=N/A, success, 
client=kiprop/r.test.admin.private at WMICH.EDU, service=kiprop/h.test.admin.private at WMICH.EDU, addr=172.30.110.46
    Request: iprop_full_resync_1, spawned resync process 2944, client=kiprop/r.test.admin.private at WMICH.EDU, 
service=kiprop/h.test.admin.private at WMICH.EDU, addr=172.30.110.46

Sometimes I just see the Request: iprop_get_updates line, and no complaints
about authentication failures or clock skew.

On the slave I see syslog entries showing repeated problems with kpropd
connecting to the master:

    /usr/sbin/kpropd: GSS-API (or Kerberos) error while initializing /usr/sbin/kpropd interface, retrying

I start kpropd with -d -S and use strace on it and I see that repeatedly
opens /dev/urandom and reads from it just before I see the above error.

When running kpropd in standalone mode and with debug enabled I see
syslog entry above and eventually I see

    do_standalone: setting resync alarm to 10
    waiting for a kprop connection
    doit: setting resync alarm to 5s
    Connection from h.test.admin.private
    krb5_recvauth(6, kprop5_01, host/r.test.admin.private at WMICH.EDU, ...)
    authenticated client: host/h.test.admin.private at WMICH.EDU (etype == DES cbc mode with CRC-32)
    /usr/sbin/kpropd: Connection reset by peer while reading database block starting at offset 92340224
    Full resync was unsuccessful

Unfortunately, the resync was not successful.  Often (but not always), when
kprop -f starts on the master, the slave_datatrans file will *partially*
copy to the slave, often 60-90% of the data, before the connection hangs
and then times out.  I have run strace on both the kprop and kpropd processes
while they are connected.  The kprop on the master hangs during a write()
for several minutes and then eventually times out:

    Process 3183 attached - interrupt to quit
    writev(4, [{"\240\37\26+[\16\247\tC\21\6/\243\217\340\0231f\362\245\3\214$\246\227\231N\265\351\366\1\233"..., 22106}], 1) = -1 
ETIMEDOUT (Connection timed out)
    write(2, "kprop", 5)                    = 5
    write(2, ": ", 2)                       = 2
    write(2, "Connection timed out", 20)    = 20
    write(2, " ", 1)                        = 1
    write(2, "while sending database block sta"..., 49) = 49
    ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffadfee150) = -1 ENOTTY (Inappropriate ioctl for device)
    write(2, "\n", 1)                       = 1
    exit_group(1)                           = ?
    Process 3183 detached

Occasionally, when left alone, the slave will be successfully sync'd.
Sometimes this happens in minutes, even hours, and, on a few occasions,
overnight.  When successful, the syslog entries on both the master and
slave look nominal.  Once sync'd, incremental propagation works fine.

In my debugging attempts, I tried starting kpropd with

    kpropd -S -d -P NNN

and then attempt to run

    kprop -f slave_datatrans -P NNN r.test.admin.private

on the master but kpropd on the slave doesn't appear to be listening
on port NNN.  Am I misunderstanding something?

In syslog on the slave, on an aborted resync I sometimes see

    May  6 17:18:50 r.test.admin.private kpropd[7078]: /usr/sbin/kpropd: Connection reset by peer while reading database block 
starting at offset 0
    May  6 17:18:50 r.test.admin.private kpropd[7032]: kpropd: Full resync, invalid return.

Thinking that perhaps this was a problem with Kerberos that is available in
the repositories for CentOS I decided to download the latest version (1.13.2)
and build it from source.  It built cleanly and passed the tests but when I
tried it I noticed the same behaviour as the Kerberos from the RPMs.

Has anyone else seen anything like this?

TIA...
__
Leonard J. Peirce
Western Michigan University
Office of Information Technology
Kalamazoo, MI  49008