Logic behind lib/krb5/os/k5_sendto()

Thu Apr 18 14:25:51 EDT 2019

Hello Greg,

Thanks for your answer.  The comments in the source file say:

Restarted:
  Per UDP server, 1s
  Per TCP server, 1s

Does this mean that the TCP connection is also retried more than once?  You wrote that there is only a single attempt
to open a TCP connection.

My setup is:
• Services are running (SMTP, IMAP, WebDAV) where users authenticate over SASL.
• Users do not call kinit, and the domain of their hosts has no relation to the realm.
• Within the server, authentication towards the services runs over cyrus-sasl/saslauthd, which authenticates through PAM.
• The PAM stack first tries Kerberos using pam_krb5-2.4.12, and then LDAP.  These are two different databases, and each
user is in only one of them.
• In almost all cases the answer from saslauthd is that Kerberos does not know the server, so the authentication
continues with PAM/LDAP.

Sometimes saslauthd gets stuck.  Then several thousand service processes wait for saslauthd to reply, the load
average of the system rises, and of course everything becomes very slow.

All saslauthd processes are waiting for a reply in lib/krb5/os/sendto_kdc.c:k5_sendto().

I guess krb5kdc is slow in replying.  In the meantime I have disabled the logging, which should speed up the replies
from krb5kdc.

But I think that resending the queries to krb5kdc makes things worse in this case: krb5kdc then has to deal with even
more (repeated) queries, which slows everything down further when it is already slow, compared to a case where queries
are not retried.

What do you think?

Regards
  Дилян

On Tue, 2019-04-16 at 10:39 -0400, Greg Hudson wrote:
> On 4/15/19 5:48 PM, Дилян Палаузов wrote:
> > kinit x@example.org is called and in DNS the SRV records _kerberos._udp.example.org and _kerberos._tcp.example.org show
> > that k.example.org:88 is in charge.  But the KDC there is in fact in charge of EXAMPLE.ORG, example.org being a non-local
> > realm.
> > 
> > • If k5_sendto receives an answer from a KDC, that the realm is non-local, does it retry to the other KDCs, here asking
> > the same process over a different transport protocol?
> 
> If example.org issues a client referral (KDC_ERR_WRONG_REALM) to
> EXAMPLE.ORG, k5_sendto() will return the error response, and the
> higher-level logic will (if canonicalization is enabled) retry with
> EXAMPLE.ORG, which will contact the same KDC.
> 
> > When is the timeout (interval) increased from 1s to 2s?  If there is no answer within 1s, then the query is resent,
> > waiting for 2 more seconds?
> 
> If there is a single KDC with both UDP and TCP transports, the normal
> schedule is:
> 
> 1. Send UDP request, wait one second
> 2. Begin TCP connection, wait one second
> 3. Wait two seconds
> 4. Send UDP request, wait one second
> 5. Wait four seconds
> 6. Send UDP request, wait one second
> 7. Wait eight seconds
> 8. Send UDP request, wait one second
> 9. Wait sixteen seconds
> 10. Give up
> 
> If there are multiple KDCs, the steps to send UDP requests or begin TCP
> connections are iterated over each KDC, with a one-second wait after each.
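
For reference, the schedule quoted above can be tallied with a small Python sketch.  It only models the list in Greg's
mail for a request that never gets an answer; it is not the code from sendto_kdc.c.  The function name
sendto_schedule, its parameters, and the order in which the servers are passed in are assumptions made for this
illustration.

```python
def sendto_schedule(servers, resend_passes=3, step_wait=1, first_backoff=2):
    """Model the quoted schedule for a request that is never answered.

    servers is a list of (host, transport) pairs, transport being "udp" or
    "tcp", in the order they are contacted (an assumption of this sketch).
    Returns (events, total_seconds), where events is a list of
    (start_time_in_seconds, description) tuples.
    """
    events = []
    clock = 0
    backoff = first_backoff

    # Initial pass: contact every server once, waiting step_wait after each.
    for host, transport in servers:
        if transport == "udp":
            events.append((clock, f"send UDP request to {host}"))
        else:
            events.append((clock, f"begin TCP connection to {host}"))
        clock += step_wait
    events.append((clock, f"wait {backoff}s"))
    clock += backoff

    # Later passes: only UDP requests are sent again, and the wait at the
    # end of each pass doubles (4s, 8s, 16s with the defaults).
    for _ in range(resend_passes):
        backoff *= 2
        for host, transport in servers:
            if transport != "udp":
                continue  # the quoted schedule begins TCP only once
            events.append((clock, f"send UDP request to {host}"))
            clock += step_wait
        events.append((clock, f"wait {backoff}s"))
        clock += backoff

    events.append((clock, "give up"))
    return events, clock


if __name__ == "__main__":
    kdc = [("k.example.org", "udp"), ("k.example.org", "tcp")]
    events, total = sendto_schedule(kdc)
    for start, what in events:
        print(f"t={start:2d}s  {what}")
    print(f"total: {total}s")
```

Under these assumptions a single KDC reachable over UDP and TCP receives four UDP requests and one TCP connection
attempt before the client gives up, and a caller such as saslauthd can stay blocked for the sum of all waits in the
quoted list.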