Weird KDC behaviour with getprincs/kdb5_util (V5 1.2.2, Solaris 8)

Wed Mar 27 15:12:21 EST 2002

Hi....

Thanks for your quick and informative response.  Quite frankly, we've been
pretty stumped as to what happened and since we haven't written any Kerberos-
related code I wanted to see if anyone had seen this and, if possible, had
any ideas.

>It's not trivial to recover; you'll probably have to write code for
>operating on the database directly rather than through the so-called
>higher-level interfaces we use for most operations that understand the
>Kerberos database format.  We don't have any such code lying around at
>the moment.

Darn, that was going to be my first question:  Do you have anything that
would walk through *sequentially* and retrieve the records?  Or is there
anything that ships with the source that might have some code I can steal?
Just something that has the record (structure definition) and the key format?

See below for an idea that I have....

>Have you dumped and reloaded your master database any time recently?
>We switched to btree format a while back, but if you never dumped and
>reloaded, you may still be using hash format, which would not be good.
>You can tell the database type by the magic number in the first four
>bytes -- 0x053162 is btree, 0x061561 is hash.

Odd....the magic number says we are still running hash.
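
A minimal sketch of that magic-number check in C, for reference.  The two
constants are the ones quoted above.  The function name db_type_from_header
is made up for illustration, and the sketch assumes the magic number may sit
in either byte order depending on the machine that created the file, so it
tries both.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Magic numbers as quoted in the thread. */
#define BTREE_MAGIC 0x053162u
#define HASH_MAGIC  0x061561u

/*
 * Classify a database file from its first four bytes.
 * Returns "btree", "hash", or "unknown".  Both byte orders are tried
 * (assumption: the file may have been written on either kind of host).
 */
const char *db_type_from_header(const unsigned char hdr[4])
{
    uint32_t be, le;

    assert(hdr != NULL);
    be = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
         ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
    le = ((uint32_t)hdr[3] << 24) | ((uint32_t)hdr[2] << 16) |
         ((uint32_t)hdr[1] << 8)  |  (uint32_t)hdr[0];

    if (be == BTREE_MAGIC || le == BTREE_MAGIC)
        return "btree";
    if (be == HASH_MAGIC || le == HASH_MAGIC)
        return "hash";
    return "unknown";
}
```

Reading the first four bytes of the principal database file with fread() and
passing them in is enough to repeat the check on any copy of the database.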

When we upgraded to 1.2.2 we dumped and reloaded the principal database.
After we reloaded I did kadmin getprincs and looked at the list, even
running it against a list of known principals.  Everything looked fine.

More info....when we were running 1.2.1 last year there was a patch posted
for kadmin/dbutil/dump.c.  We had applied it and were running with it when
we dumped the database before the upgrade.  The patch went in clean and
everything rebuilt fine.  The patch is appended to the end of this message.

Even though everything looked ok with the patch (as far as I could tell),
is it possible that something was wrong and it corrupted the dump just enough
to a) look ok on initial reload but b) cause problems down the line?

>The more interesting part right now is how to get the data out, so
>that you can stuff it back into a database in a different format.
>Even though you can't walk through the database sequentially, there
>are still a couple ways you may be able to extract the data.

Since kadmin getprinc can retrieve principals that getprincs/kdb5_util
cannot, does that mean that kadmin getprinc is accessing the database
sequentially?  Again, see below for my idea...

>First, if you have a complete list of current principal names, write a
>little program to walk over that list, generate the correct database
>key for each name, and extract the data from the database through the
>db2 interface.  Then write it into another, freshly-created database.

I'm already thinking along these lines (see below).  Again, the entries
*do* exist.  We just need something to get them out and rewrite the
database.
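
A rough sketch of the "generate the correct database key" step in C.  It
rests on an assumption that should be checked against the kdb source in the
tree being used: that the db2 backend keys each record by the unparsed
principal name with its terminating NUL counted in the key length.  The
struct princ_key and the function make_princ_key are hypothetical stand-ins,
not part of krb5 or db2, and names containing an escaped '@' are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a db2 key (data bytes plus size). */
struct princ_key {
    char   data[256];
    size_t size;
};

/*
 * Build a lookup key for a principal name.  "@realm" is appended when the
 * name carries no realm.  The key size counts the trailing NUL (assumed
 * key format, see the note above).
 * Returns 0 on success, -1 if the name does not fit in the buffer.
 */
int make_princ_key(const char *name, const char *realm, struct princ_key *key)
{
    int n;

    assert(name != NULL && realm != NULL && key != NULL);
    if (strchr(name, '@') != NULL)
        n = snprintf(key->data, sizeof key->data, "%s", name);
    else
        n = snprintf(key->data, sizeof key->data, "%s@%s", name, realm);
    if (n < 0 || (size_t)n >= sizeof key->data)
        return -1;

    key->size = (size_t)n + 1;   /* +1 so the terminating NUL is in the key */
    return 0;
}
```

The resulting bytes and size would then be handed to the db2 get call as the
key, and any record that comes back written into the freshly created database
with the matching put call.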

>Or, second, if the above approach doesn't work, open the database file
>as a plain file, and simply scan through it, taking note of anything
>that looks like it might be a database record.  ASCII string for the
>principal name in the key with a limited range of characters.  For the
>database record, key data has reasonable key types and correct
>lengths, reasonable-looking flags set on the principals, etc.  Once
>you get names, maybe you can use the db2 interface to read out the
>data.  If not, you'll have to decipher the database format enough to
>locate the data yourself and pull it out.
>
> [...]
>
>> The really odd part is that the principals that don't show up are in the
>> database and continue to work fine.  Users can get tickets, use them for
>> rlogin/telnet/ftp, and change their passwords.  We can do getprinc for any
>> one of the missing entries and they show up just fine.  But running getprincs
>> to list the entire database or kdb5_util dump both fail to list them.
>
>Yes, this is consistent.  Random and sequential access often use very
>different code paths.

Again, since kadmin can retrieve and display the principal records, does
this mean that it uses sequential access (and that kdb5_util uses random)?
This might be handy for an idea that I have (see below)....

>There is a chance that it's just a bug with sequentially retrieving
>data from the database.  The only way that helps you, though, is that
>*if* you go and find the bug and fix it, then you don't still have the
>problem of extracting what data you can from a broken database.  The
>problem still has to be fixed for your slave KDCs to become useful
>again.

More background....

Our student username principals are in the form [a-z][0-9][a-z]+@REALM.  An
example would be something like s8peirce@WMICH.EDU.  Faculty/staff are
usually something like <lastname>@WMICH.EDU.  Mine, for example, is
peirce@WMICH.EDU.
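
Given name formats like these, the "scan it as a plain file" suggestion
quoted earlier could be sketched in C roughly as follows.  This is only an
illustration: find_principal and is_name_char are made-up names, the accepted
character set is a guess, and it assumes names are stored in the file as
NUL-terminated strings ending in the realm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters this sketch accepts in the part of the name before the realm. */
static int is_name_char(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '/' || c == '.' || c == '-' || c == '_';
}

/*
 * Search buf[from..len) for the next string that looks like name@REALM
 * followed by a NUL.  realm_suffix is e.g. "@WMICH.EDU".
 * Returns the offset of the first byte of the name and stores the string
 * length (without the NUL) in *name_len, or returns (size_t)-1 if no
 * further candidate is found.
 */
size_t find_principal(const unsigned char *buf, size_t len, size_t from,
                      const char *realm_suffix, size_t *name_len)
{
    size_t slen, k, s;

    assert(buf != NULL && realm_suffix != NULL && name_len != NULL);
    slen = strlen(realm_suffix) + 1;        /* compare the NUL as well */

    for (k = from; k + slen <= len; k++) {
        if (memcmp(buf + k, realm_suffix, slen) != 0)
            continue;
        /* Walk backwards over name characters to find the start. */
        s = k;
        while (s > from && is_name_char(buf[s - 1]))
            s--;
        if (s < k) {
            *name_len = (k - s) + slen - 1;
            return s;
        }
    }
    return (size_t)-1;
}
```

Calling it in a loop, restarting each time just past the previous match,
yields a list of candidate names.  Each candidate would still need to be
checked with getprinc, since a match may be a stale or deleted record.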

When we do a getprincs and compare it to the list of user principals that
we know are in the database we notice a strange pattern.  kadmin getprincs
outputs the principal names one per line in what appears to be alphabetical
order.  Everything looks ok up through the principal e9stock@WMICH.EDU.  It
then skips to principals that start with s8 and then continues, e.g.

   e9stern
   e9steure
   e9steven
   e9stock
   s8bruski
   s8bryant
   s8bryson

The number of principals skipped is quite large (> 20K).

Here's the strange part of the pattern....

When we look at an alphabetical list of the principals that *should* be
displayed, we notice that the three principals that follow e9stock@WMICH.EDU
and start with e9 are actually *missing* from the database.  We know from our
account creation logs that they were successfully added at one point, and
now they're gone.  Could the corruption be localized to just these three
accounts?  I'm tempted to try recreating these principals and dumping
again to see if things clear up.  Any chance it might work?

Idea #1
=======
We know the principals are in the database since getprinc can retrieve
them.  One at a time but there is still *some* hope that we can extract
them.  Would it be possible to:

   1) Dump what we can with kdb5_util dump, knowing that not all of
      the principals will be in the dump file.
   2) Since kadmin can retrieve one principal at a time, borrow code
      from it to retrieve the individual records.  If this isn't
      feasible, try to access the database directly with db2 calls
      to retrieve the records.
   3) Borrow code from kdb5_util to take a record and write it in the
      ASCII format used by kdb5_util dump.
   4) Combine 2 & 3 into one program and run it, retrieving the missing
      records using the method in step 2 and writing them with the method
      from step 3, appending them to the file created in step 1.
   5) Reload the database with kdb5_util load <dump file from steps 1 & 4>.
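
Before step 2 can run, something has to work out which principals are
absent from the step 1 dump.  A minimal sketch in C, assuming both lists
(the known principals and the names that made it into the dump) are already
loaded as arrays of strings.  The function name missing_principals is made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparison for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Store in out[] every name from known[] that does not appear in dumped[].
 * Both input arrays are sorted in place.  out must have room for n_known
 * entries.  Returns the number of names stored.
 */
size_t missing_principals(const char **known, size_t n_known,
                          const char **dumped, size_t n_dumped,
                          const char **out)
{
    size_t i = 0, j = 0, n_out = 0;

    assert(known != NULL && dumped != NULL && out != NULL);
    qsort(known, n_known, sizeof *known, cmp_str);
    qsort(dumped, n_dumped, sizeof *dumped, cmp_str);

    /* Walk both sorted lists in step, like a merge. */
    while (i < n_known) {
        int c = (j < n_dumped) ? strcmp(known[i], dumped[j]) : -1;

        if (c < 0)
            out[n_out++] = known[i++];  /* known, but not in the dump */
        else if (c > 0)
            j++;                        /* dumped, but not in the known list */
        else
            i++;                        /* present in both */
    }
    return n_out;
}
```

The names returned are the ones to feed to the per-record retrieval in
step 2.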

Sorry for being so verbose.  I'm just trying to make sure I don't miss
something....

Idea #2
=======
We still have the dump file and the principal database from before the
upgrade to 1.2.2 back in October.  Granted, there have been a lot of changes
since that time (new principals, principals deleted, passwords changed, etc.).
But if we get truly stuck, would it be possible to take as many principals
as we can from the old dump file, append them to the file created from step 1
in Idea 1, and reload?  We'd miss some and have to go through the pain of
re-adding the missing principals and getting new passwords to their owners,
but if we can get a working database again it might be worth it.

You're probably cringing right now, imagining that I might try any of my
ideas on our production database.  Naturally, all testing would be done
on a separate test KDC.

Thanks for any help you can provide.  Right now we're kind of at a loss about
how to proceed.  We have a test machine built and I'm looking at the code.
But not having really looked at the kadmin code (I've worked with the client
side a bit) I'm at a bit of a disadvantage.

--
Leonard J. Peirce                    Email:  leonard.peirce at wmich.edu
UNIX System Administrator
Western Michigan University
Office of Information Technology
Kalamazoo, MI  49008                 Phone:  (616) 387-5430




diff -uNr krb5-1.2.1-orig/src/kadmin/dbutil/dump.c krb5-1.2.1/src/kadmin/dbutil/dump.c

--- krb5-1.2.1-orig/src/kadmin/dbutil/dump.c	Thu Jun 29 22:27:28 2000
+++ krb5-1.2.1/src/kadmin/dbutil/dump.c	Fri Jul 21 12:56:21 2000
@@ -639,7 +639,7 @@
     char		*name;
     krb5_tl_data	*tlp;
     krb5_key_data	*kdata;
-    int			counter, skip, i, j;
+    int			counter, i, j;

     /* Initialize */
     arg = (struct dump_args *) ptr;
@@ -695,28 +695,15 @@
 	/*
 	 * Make sure that the tagged list is reasonably correct.
 	 */
-	counter = skip = 0;
-	for (tlp = entry->tl_data; tlp; tlp = tlp->tl_data_next) {
-	     /*
-	      * don't dump tl data types we know aren't understood by
-	      * earlier revisions [krb5-admin/89]
-	      */
-	     switch (tlp->tl_data_type) {
-	     case KRB5_TL_KADM_DATA:
-		  skip++;
-		  break;
-	     default:
-		  counter++;
-		  break;
-	     }
-	}
-
-	if (counter + skip == entry->n_tl_data) {
+        counter = 0;
+        for (tlp = entry->tl_data; tlp; tlp = tlp->tl_data_next)
+            counter++;
+        if (counter == entry->n_tl_data) {
 	    /* Pound out header */
 	    fprintf(arg->ofile, "%d\t%d\t%d\t%d\t%d\t%s\t",
 		    (int) entry->len,
 		    strlen(name),
-		    counter,
+                  (int) entry->n_tl_data,
 		    (int) entry->n_key_data,
 		    (int) entry->e_length,
 		    name);
@@ -731,9 +718,6 @@
 		    entry->fail_auth_count);
 	    /* Pound out tagged data. */
 	    for (tlp = entry->tl_data; tlp; tlp = tlp->tl_data_next) {
-		if (tlp->tl_data_type == KRB5_TL_KADM_DATA)
-		     continue; /* see above, [krb5-admin/89] */
-
 		fprintf(arg->ofile, "%d\t%d\t",
 			(int) tlp->tl_data_type,
 			(int) tlp->tl_data_length);
@@ -780,8 +764,7 @@
 	}
 	else {
 	    fprintf(stderr, sdump_tl_inc_err,
-		    arg->programname, name, counter+skip,
-		    (int) entry->n_tl_data);
+                    arg->programname, name, counter, (int) entry->n_tl_data);
 	    retval = EINVAL;
 	}
     }