malloc hang inside krb5_sendto_kdc

Wed Feb 4 06:49:50 EST 2004

In article <tsloesgc5bv.fsf at konishi-polis.mit.edu>,
Sam Hartman <hartmans at MIT.EDU> wrote:
>
>No, this is not a known bug, but we are concerned that others may be
>seeing it.  We don't really know what is going on.

I attached to another one and poked around some more. context doesn't
look right in the frame that calls malloc. It contains 0x1 there, whereas
when krb5_sendto_kdc is invoked it contains sane-looking values, e.g.:

(gdb) print *context 
$29 = {magic = -1760647388, in_tkt_ktypes = 0x0, in_tkt_ktype_count = 0,  
  tgs_ktypes = 0x0, tgs_ktype_count = 0, os_context = 0x8117bf8,  
  default_realm = 0x811ba28 "NCC.DTCC.EDU", 0x8116a60, db_context = 0x0,  
  ser_ctx_count = 0, ser_ctx = 0x0, clockskew = 300, kdc_req_sumtype = 7,  
  default_ap_req_sumtype = 7, default_safe_sumtype = 8,  
  kdc_default_options = 16, library_options = 0, profile_secure = 0,  
  fcc_default_format = 1283, scc_default_format = 1283, prompt_types = 0x0} 

If you look at the backtrace from my previous message (portion below),
frame #4 has a valid pointer in context, but in frame #3 both context
and realm are 0x1:

#3  0xb75ad622 in krb5_sendto_kdc (context=0x1, message=0x81214a8, realm=0x1,  
    reply=0xbfffb510, use_master=1) at sendto_kdc.c:97 
#4  0xb75961f3 in send_as_request (context=0x8117ba0, request=0xbfffb5d0,  
    time_now=0xbfffb510, ret_err_reply=0xbfffb594, ret_as_reply=0xbfffb598,  
    use_master=1) at get_in_tkt.c:117 

I don't see anything within that function that might alter context
unless it happens in krb5_locate_kdc() (code snippet at end of msg).

>Can I get you to try setting the environment variable LD_ASSUME_KERNEL
>to 2.4.1 in the process that displays the problem?  Does that make
>things better?

Willing to do whatever I can, but not sure how to accomplish the above.
The call chain is xinetd calling imap, which calls pam, which calls
kerberos. How would I set it in that scenario, and could it have a
negative impact on other things that xinetd invokes?  Unfortunately
this is a production box that handles tens of thousands of
authentications daily and only sees this hang two or three times a
day. I've been unsuccessful in reproducing it in a controlled
environment.
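
One possible way to limit the variable to just the imap service is a
small wrapper that xinetd starts in place of the real daemon. The sketch
below is only an illustration under stated assumptions: the helper names
are hypothetical, and the path of the real imap daemon would have to be
supplied by the caller. The variable is read by the dynamic loader when
a program starts, so it has to be in the environment before the exec.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: put LD_ASSUME_KERNEL into this process's
 * environment, overwriting any existing value. Returns 0 on success. */
int set_assume_kernel(const char *version)
{
    return setenv("LD_ASSUME_KERNEL", version, 1);
}

/* Hypothetical helper: set the variable, then replace this process with
 * the real daemon. real_daemon is an assumption; pass whatever binary
 * xinetd currently starts for imap. Returns -1 only if something failed. */
int exec_with_assume_kernel(const char *real_daemon, char *const argv[])
{
    if (set_assume_kernel("2.4.1") != 0)
        return -1;
    execv(real_daemon, argv);   /* does not return on success */
    return -1;                  /* reached only if execv failed */
}
```

Because the variable is set only inside the wrapper, it is inherited by
the imap daemon and anything that daemon starts, and the other services
xinetd invokes should not see it.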

I'm an old tech support manager with limited programming experience,
so please have patience with my ignorance. I'm pretty rusty. :(

krb5_error_code
krb5_sendto_kdc (context, message, realm, reply, use_master)
    krb5_context context;
    const krb5_data * message;
    const krb5_data * realm;
    krb5_data * reply;
    int use_master;
{
    register int timeout, host, i;
    struct sockaddr *addr;
    int naddr;
    int sent, nready;
    krb5_error_code retval;
    SOCKET *socklist;
    fd_set readable;
    struct timeval waitlen;
    int cc;

    /*
     * find KDC location(s) for realm
     */

    if (retval = krb5_locate_kdc (context, realm, &addr, &naddr, use_master))
        return retval;
    if (naddr == 0)
        return (use_master ? KRB5_KDC_UNREACH : KRB5_REALM_UNKNOWN);

    socklist = (SOCKET *)malloc(naddr * sizeof(SOCKET));
    if (socklist == NULL) {
        krb5_xfree(addr);
        return ENOMEM;
    }
    for (i = 0; i < naddr; i++)
        socklist[i] = INVALID_SOCKET;

    if (!(reply->data = malloc(krb5_max_dgram_size))) {
        krb5_xfree(addr);
        krb5_xfree(socklist);
        return ENOMEM;
    }
    reply->length = krb5_max_dgram_size;

The malloc call above is where it hangs. krb5_max_dgram_size has 4096
in it. If I force a return in gdb from malloc and then try to evaluate
a call to malloc(4096) it returns fine.  If context is invalid, could
part of the stack be getting overwritten? (The return addresses inside
the stack appear to be just fine, so it's not a total smash if so.)

Perhaps I could change the code before the malloc to watch for 0x1 in
context and halt the process for debugging at that point, before the
bad malloc call? Can a running process reach out to a gdb and attach
to it?!  (Or I could just send it into a cpu loop and then attach when
I see something running out of control).  As I said, ignorance perhaps! :)
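
A minimal sketch of that idea, assuming a check is patched in just
before the malloc call. All names here are hypothetical and not part of
the krb5 source. It treats any pointer into the first page (which covers
NULL and the 0x1 seen in frame #3) as suspect, logs the pid through
syslog (stderr would go to the client under xinetd), and then sleeps
instead of spinning so the stuck process does not eat CPU while it waits.

```c
#include <stdint.h>
#include <signal.h>
#include <syslog.h>
#include <unistd.h>

/* Set to 1 from inside gdb to let the process continue. */
volatile sig_atomic_t debugger_attached = 0;

/* Hypothetical check: nonzero if p cannot be a real object address.
 * The 4096 threshold is an assumption (first page is never mapped). */
int ptr_looks_bogus(const void *p)
{
    return (uintptr_t)p < 4096;
}

/* Log our pid and the reason, then wait until someone attaches gdb
 * and flips debugger_attached. */
void wait_for_debugger(const char *why)
{
    syslog(LOG_ERR, "pid %ld: %s; waiting for gdb to attach",
           (long)getpid(), why);
    while (!debugger_attached)
        sleep(1);
}
```

The call site would be something like
"if (ptr_looks_bogus(context)) wait_for_debugger("context looks bad");"
placed before the malloc. Once the syslog line shows up, attach with
"gdb -p <pid>", look around, and then "set var debugger_attached = 1"
followed by "continue" to let it go on.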

thx for the concern...

ps, I have an open ticket with Red Hat on this too, but it's not
getting far. They are suggesting we try 1.3.1 from Fedora Core to see
if it solves the problem, which I'll probably install on Saturday.

Also, this uses a Windows 2000 server for the KDC. It had done that for
over a year with no problems. This problem started when we migrated
the server from Red Hat 7.3 to Red Hat Enterprise Linux (RHEL) 3 over
the holidays.

-- 
Ken Weaverling (ken a.t weaverling.org) WHOIS: KJW  http://www.weaverling.org/
                     - - - - - - - - - - - - - - - - - -
Note: From address in posting is legit and may be replied to, but my reply may
be delayed since that address gets a lot of spam and I have to sort thru it :-(