Erratic behavior of full resync process

Leonard J. Peirce leonard.peirce+kerberos at wmich.edu
Wed Jun 17 09:26:40 EDT 2015


Greg Hudson wrote:
> /dev/random starvation explains the clock skew errors (kadmind isn't
> processing the kpropd authentication attempts until much later than they
> were sent) but doesn't really explain to me why your full dump
> connections are sometimes timing out.  kdb5_util load does not read from
> /dev/random as far as I can tell, and neither do the mk_priv/rd_priv
> calls used to protect the dump data in transport.

The problem of kprop timing out is unrelated to the problem I was having
with kadmind starting up.  This issue didn't pop up again until I went
back to my original VM configuration (with my original network
configuration).

The cause of kprop hanging was the MTU setting on our CentOS VMs.  In
the network segment where our VMs run we have always used jumbo frames,
with the MTU on Solaris set to 9000.  Running CentOS on physical
hardware I can set the MTU to 9000 and everything works great.  But
with the interfaces on the CentOS VMs (running on Microsoft Hyper-V)
configured with an MTU of 9000, kprop would hang a majority of the
time.  Hyper-V presents the interfaces to our VMs with an MTU of 9014,
but there is apparently enough overhead (VLAN tagging, going through
Hyper-V's virtual switch, vendor differences with respect to maximum
frame size vs. MTU, etc.) that the largest MTU that actually works in
the guests is 8972.  With that MTU kprop works fine.
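
In case it helps anyone chasing something similar, here is a rough
Python sketch of the kind of check I mean.  It is not something we
actually run, and the target hostname is a placeholder; it just
binary-searches the largest UDP datagram that can leave a Linux box
with the don't-fragment bit set, which is a quick way to see what MTU
an interface will really carry before blaming kprop:

    #!/usr/bin/env python3
    # Sketch only: find the largest UDP payload this host will send
    # with DF set, as a sanity check on the effective interface MTU.
    # The target host/port are placeholders; kprop itself uses TCP,
    # the probe just needs an address to aim at.
    import socket

    # Linux values from <linux/in.h>; Python may not export the names.
    IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
    IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

    TARGET = ("kdc-slave.example.edu", 754)   # placeholder slave host
    IP_UDP_OVERHEAD = 28                      # 20-byte IP + 8-byte UDP header

    def max_df_payload(lo=1000, hi=9000):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        best = 0
        while lo <= hi:
            mid = (lo + hi) // 2
            try:
                s.sendto(b"\0" * mid, TARGET)   # fails fast if > MTU
                best = mid
                lo = mid + 1
            except OSError:                     # EMSGSIZE from the kernel
                hi = mid - 1
        s.close()
        return best

    if __name__ == "__main__":
        payload = max_df_payload()
        print("largest DF payload: %d bytes (~MTU %d)"
              % (payload, payload + IP_UDP_OVERHEAD))

On an interface that really carries a 9000-byte MTU this should report
a payload of 8972 (9000 minus the 28 bytes of IP and UDP headers).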

Unrelated to this, I did notice something interesting.  After
reloading the database with kdb5_util, kadmind naturally forces a full
resync of our slave.  Immediately after that full resync, any update
causes our master to force *another* full resync of the slave.  Only
after this second full resync does incremental propagation take over.
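
For anyone who wants to dig into that, comparing the update log header
on the master and the slave right around the second resync should show
which serial numbers kadmind is reacting to.  A minimal sketch of what
I mean, assuming MIT's kproplog is on the path (its -h option prints
just the ulog header):

    #!/usr/bin/env python3
    # Sketch only: print the iprop update-log header (serial number
    # range and timestamps) so the master and slave can be compared.
    # Needs to run as a user that can read the update log.
    import subprocess
    import sys

    result = subprocess.run(["kproplog", "-h"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit("kproplog failed: %s" % result.stderr.strip())
    print(result.stdout, end="")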

- Leonard

