Trace logging project

Nicolas Williams Nicolas.Williams at sun.com
Mon Sep 14 18:37:57 EDT 2009


On Mon, Sep 14, 2009 at 06:14:52PM -0400, Greg Hudson wrote:
> On Mon, 2009-09-14 at 17:00 -0400, Nicolas Williams wrote:
> > So what is the real goal?
> 
> We received two independent requests in the course of one week, from
> different angles.  I'll present them as use cases:
> 
> 1. You market an application which uses Kerberos as a core component.
> You are experiencing rare, inexplicable failures in customer
> deployments.  You want to be able to collect more information about
> failures in order to reproduce and debug them.

Indeed, but once you're done you want to modify the application so that
it recognizes the error condition and gives the user advice.  Otherwise
you keep getting the same calls over and over.  Updating documentation
helps, but not enough -- the UIs really need to be able to tell the user
useful information.  I know this because I've been there.  The Solaris
smbd and idmapd daemons have no access to Kerberos error data in LDAP
contexts because our libldap aliases all SASL errors as "local error"
(error 80), which has given rise to many calls.  But even if it didn't,
making sure that we convey all of the relevant errors and ancillary
data up the stack (krb5->gss->sasl gss plug-in->sasl->ldap->app) is
tricky.  Having trace data is a _lot_ better than nothing: these apps
could just point the user at the trace data, or at analysis of it
(assuming the trace output is stable enough and machine-parseable).  But
it'd be nicer if these apps could just say "synchronize your clock" or
"re-join your domain", and so on.

Without improved error reporting, tracing is a band-aid.  A very welcome
band-aid, no doubt (particularly if you don't have dynamic tracing
facilities).

> In use case #2, the client failure is typically coming from a program
> you didn't write, which may not have a rich interface for displaying
> errors.

In the case of interop problems, you may really need to see a capture of
all PDUs (including cleartext of any EncryptedData).  I've had to build
DTrace scripts to get at such cleartext because I had no other way to
obtain it.
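
With the key in hand (e.g., extracted from a keytab), the decryption
itself is the easy part.  A minimal sketch, assuming an MIT-derived
libkrb5 and with error handling abbreviated:

    #include <krb5.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Decrypt a captured EncryptedData blob offline, given the key and
     * the key usage it was encrypted under (e.g.,
     * KRB5_KEYUSAGE_AP_REQ_AUTH for an AP-REQ authenticator), and dump
     * the raw DER plaintext to stdout. */
    static krb5_error_code
    dump_cleartext(krb5_context ctx, const krb5_keyblock *key,
                   krb5_keyusage usage, const krb5_enc_data *in)
    {
        krb5_data out;
        krb5_error_code ret;

        out.length = in->ciphertext.length;  /* plaintext <= ciphertext */
        out.data = malloc(out.length);
        if (out.data == NULL)
            return ENOMEM;
        ret = krb5_c_decrypt(ctx, key, usage, NULL, in, &out);
        if (ret == 0)
            fwrite(out.data, 1, out.length, stdout);
        free(out.data);
        return ret;
    }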

In the case of interaction with other components on the same system
(e.g., NTP) I'd say what you really have is a variant of case #1.

> From elsewhere in the thread:
> > Context initialization _failures_ are interesting.
> 
> It seems like context initialization failures are about the simplest
> type of possible krb5 failure, and can easily be captured in the text of
> a simple error message.

True.

> Regardless, the point is likely moot; context initialization failures
> can likely be traced using the half-constructed context.

Yes.

What do you need the context for?  I'd say: to give the app direct
control over whether tracing is enabled and where the trace output
goes; those settings should be conveyed via the context argument.
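
Something along these lines, say (the names here are entirely
hypothetical -- no such calls exist today; the point is only that the
knob lives in the krb5_context):

    #include <krb5.h>
    #include <stdio.h>

    /* Hypothetical: the app installs a trace handler on its context;
     * everything downstream that shares the context inherits it. */
    typedef void (*krb5_trace_fn)(krb5_context ctx, const char *msg,
                                  void *cb_data);
    krb5_error_code krb5_set_trace_handler(krb5_context ctx,
                                           krb5_trace_fn fn,
                                           void *cb_data);

    static void
    trace_to_stderr(krb5_context ctx, const char *msg, void *cb_data)
    {
        fprintf(stderr, "krb5 trace: %s\n", msg);
    }

    /* After krb5_init_context(&ctx):
     *     krb5_set_trace_handler(ctx, trace_to_stderr, NULL);
     * and pass NULL for fn to turn tracing back off. */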

Nico
-- 


