Design for Multithreading MIT KDC

Rahul S. [EMail: srahul@novell.com]
31 March 2005

1. Need for a multithreaded KDC
-------------------------------

The MIT KDC server is single threaded; the requests it receives are processed sequentially. This is not a problem as long as the principal information is stored in a local database. When the KDC is integrated with a directory backend (eDirectory), the Kerberos principals will be stored in the directory, and for every request the KDC will have to look up the directory to find them. Directory lookups are more time consuming than local database lookups, so a single-threaded KDC would spend much of its time in principal lookups and its performance would suffer. To improve performance, the KDC has to process requests concurrently.

This document describes a design for multithreading the KDC so that it can service KDC requests concurrently. The design applies to version 1.3.5 of the MIT Kerberos distribution. Since the KDC part of the code has not changed much between versions 1.3.5 and 1.4, the design should also apply to version 1.4 with minor modifications.

2. Issues that need to be addressed
-----------------------------------

The following data structures have to be made thread-safe.

2.1. The active realm

The 'kdc_active_realm' variable points to the realm-specific data of the service principal for the request currently being processed. It is set just before an AS / TGS request is processed, and the members of the structure it points to are accessed in many places in the KDC code while the packet is processed. In a multithreaded scenario, each thread processing a request may need to refer to a different realm.

2.2. 'sstate'

The 'sstate' structure contains the set of file descriptors (FDs) that the KDC has to monitor for activity (availability of incoming / outgoing data).
FDs are added to this set when a new (TCP) request is to be processed and removed when a TCP request has been processed. TCP requests to the KDC may arrive in multiple TCP segments. In a multithreaded scenario, the following three cases have to be handled.

1. When a new TCP connection is established, the KDC must immediately start listening on that FD.
2. When a TCP connection is closed, the KDC must immediately stop listening on that FD.
3. When a thread is servicing a TCP FD, no other thread must start servicing that FD when the next TCP segment is received on it.

2.3. 'connections'

'connections' is a set of structures holding information about each active FD. In a multithreaded scenario, access to this set has to be controlled.

2.4. Replay cache

'kdc_rcache' is the replay cache, which stores the Kerberos authenticators of processed requests. When a new request arrives with a cached authenticator, the KDC has to reject it and return the KRB_AP_ERR_REPEAT error [1]. In a multithreaded scenario, the replay cache could be updated concurrently by two separate threads. This gives rise to the following two requirements.

1. The functions modifying the replay cache data structures have to be thread safe.
2. If two requests with the same authenticator are being processed concurrently by two separate threads, at most one of them should succeed.

2.5. Lookaside cache

'root_ptr' is the head of the list for the replay lookaside cache. This cache stores recently received (successfully authenticated) requests and their corresponding replies. It is looked up before a request is processed and updated afterwards; duplicates are not checked for while updating. In a multithreaded scenario, this cache has to be protected in the following two ways.

1. Access to the cache by multiple threads has to be controlled.
2. Two threads processing the same request concurrently must not both update the cache.

2.6.
Global variables

All global variables which are concurrently updated by multiple threads have to be protected using locks. The variables (excluding the ones mentioned above) in the KDC are

1. last_usec, last_os_random
2. kdc_current_rcname
3. hits, calls, max_hits_per_entry, num_entries
4. tcp_data_counter, max_tcp_data_connections

The variables in the libraries are protected as in version 1.4.

3. The design
-------------

3.1. Thread synchronization functions

The implementations of the following functions are backported from version 1.4.

1. Initialization / destruction of mutexes
2. Locking / unlocking of mutexes
3. Creation / destruction of thread-specific data
4. Access to thread-specific data

These functions are used in the libraries as in version 1.4. In the KDC code, the following change is made: a single mutex is used for making the entire KDC code thread-safe, so the mutex is redefined to be recursive. The threading library's API is used for this if it supports recursive mutexes; otherwise, a recursive mutex is implemented using a normal mutex and a counter.

3.2. Per-thread information

A new structure, 'thread_info', is defined to store per-thread information, and a new thread-specific data key, 'K5_KEY_KDC', is defined. This key is registered during startup and used by each service thread for accessing data from its thread-specific data area. The function 'k5_getspecific' (internally 'pthread_getspecific' on Linux), when called inside a thread, accepts the key as an argument and returns a pointer to the thread's thread-specific data (of type 'thread_info'). The definition of 'thread_info' is as follows.

struct thread_info {
    struct connection *current_conn;
    kdc_realm_t *current_realm;
};

The member 'current_conn' points to the connection structure over which the current request was received.
The member 'current_realm' points to the active realm; it is intended to replace 'kdc_active_realm'. Every thread can use its thread_info structure at any point in its execution to locate the connection it is handling and the corresponding active realm. The same could be accomplished by passing the connection and realm information as arguments to all functions that use them, but this would require changing the definitions of all such functions.

3.3. kdc_active_realm

'kdc_active_realm' is changed whenever a request is received and is referred to while the request is processed. If multiple threads are processing requests concurrently, each should be able to refer to one of the members of the 'kdc_realmlist[]' array as its active realm. The active realm corresponding to the request being serviced by a thread is pointed at by the 'current_realm' member of the thread's 'thread_info' structure. The global variable 'kdc_active_realm' is removed and a new function, 'get_active_realm()', is added. This function looks up the calling thread's thread-specific data and returns a pointer to the active realm.

Macros like 'kdc_context' refer to members of 'kdc_active_realm' and are used in many places in the code. In every function referring to these macros, a local variable called 'kdc_active_realm' is declared and initialized by calling 'get_active_realm()'.

3.4. Signal handlers

The signal handlers should be executed exactly once per signal received, in the context of the main thread; they must not be executed in the context of each thread separately. To ensure this, all signals that are not used by the multithreading library are blocked in all threads except the main thread.

3.5. Protecting variables

Access to non-thread-safe global variables in the KDC code is protected by a single recursive mutex (an alternative is to have a separate mutex for each variable).

3.6.
Thread creation and request handling

Currently, the function 'listen_and_process()' waits for activity on all incoming ports (using the 'select' system call). Whenever activity is detected on any incoming port, it calls service_conn() to service that port.

The 'connection' structure will be modified to include the following members.

1. u.udp.addr_s : Source address of a UDP request
2. u.udp.bufsiz : Size of the buffer holding a UDP packet
3. u.udp.buffer : Pointer to a UDP packet
4. u.udp.msglen : Length of the UDP packet

The use of these new members is explained below.

'listen_and_process()' will be modified to spawn n threads as soon as it is called. The entry point of each thread will be the 'service_thread()' function. A global queue of instances of the following structure will be created, with 'req_port_head' as the front of the queue.

struct req_port {
    struct req_port *next;
    struct connection *conns;
    char *prog;
    int sflags;
};

A monitor 'm' will control access to this queue. Whenever a new port is to be served, a new 'req_port' structure is created and added to the end of the queue after locking 'm'. The queue has a maximum length. If a request is to be added when the queue is full, the main thread (the one executing 'listen_and_process()') will block till another thread removes a request from the queue. If the queue is empty, all threads ready to process requests will wait till the main thread adds a request. This is implemented using 'm'.

If the KDC is built with the multithreading option, the main thread will implement the following algorithm.

1. Q = {Set of KDC requests awaiting service}
2. S := {Set of ports to poll for KDC requests. Initially, 88 (TCP and UDP)}
3. Create n service threads.
4. Poll S for activity using select()
5. If select() was interrupted by a signal, go to step 4
6. Detect activity on port p
7. If p is a UDP port, go to step 11
8. S := S - {p}
9. Create a new req_port structure (req) for port p.
10.
    Go to step 14
11. Read the UDP packet, say 'pkt'
12. Create a new 'connection' structure 'c'.
    c.u.udp.addr_s := Source address of UDP packet
    c.u.udp.buffer := pkt
    Also modify c.u.udp.msglen, c.u.udp.bufsiz
13. Create a new req_port structure (req) for port p using 'c' in place of the UDP connection structure.
14. Lock m
15. Q := Q + {req}
16. Unlock m
17. Go to step 4

The service thread :
------------------

The service_thread() function will implement the following algorithm.

1. Q = {Set of KDC requests awaiting service}
2. S = {Set of ports being polled for KDC requests by the main thread}
3. Lock m (m is a monitor which can be locked only when the queue is non-empty)
4. Remove a request from Q
5. Unlock m
6. If the request is over UDP, go to step 12
7. Read the request from the corresponding TCP port (say p)
8. Service the request (accept_tcp_connection() / process_tcp_connection())
9. S := S + {p}
10. Send a signal to the main thread (to interrupt the 'select' system call)
11. Go to step 3
12. Service the request (process_packet())
13. Go to step 3

The 'process_packet()' function is modified to process the request in the buffer pointed at by a member of the 'connection' structure passed to it, rather than read the packet from the UDP port.

In the above algorithm, the main thread reads UDP packets from the UDP FD but not TCP packets. This enables the main thread to go back to listening on the UDP socket. In the case of TCP requests, two threads must not end up processing the same TCP FD for two different TCP segments of the same request, so the FD is not 'select'ed by the main thread while it is being processed.

The reasons for making the main thread receive the requests over UDP are as follows.

1. If a service thread received the request, it would have to notify the main thread after receiving it, and the main thread would then include the UDP port in its list of 'select'ed FDs. This would cause an extra context switch.
2.
    The main thread could queue up more requests than there are threads when there is a burst of requests.

3.7. The replay cache

The replay cache is used for recognizing and discarding replayed KDC requests with the same authenticators. The replay cache functions are made thread safe by protecting the sensitive data structures with locks, as in version 1.4.

3.8. The lookaside cache

The existing lookaside cache does not prevent the same request from being processed concurrently by two threads (due to replay) if the second thread looks up the cache before the first one updates it. This problem is solved as follows.

1. Lock a mutex 'm'
2. Check the lookaside cache
3. If the request is not found in the cache, go to step 8
4. Unlock 'm'
5. If the 'reply_packet' member of the cache entry is NULL, return without any further processing (a concurrent thread is processing the request).
6. Use the 'reply_packet' member as the reply and do not re-process the request.
7. Return.
8. Create a 'krb5_kdc_replay_ent' with proper values set for all members of the structure except 'reply_packet'. Set reply_packet := NULL
9. Unlock 'm'
10. Process the request.
11. Lock 'm'
12. Set the 'reply_packet' member of the cache entry to the proper KDC reply.
13. Unlock 'm'
14. Return

3.9. gethostbyname()

The gethostbyname() function returns a pointer to static data structures that are overwritten by subsequent calls. Calls to this function have to be protected in one of the following two ways.

1. Protect the values returned by gethostbyname() using mutexes while they are being used in the calling function.
2. If gethostbyname_r() is available, use it instead of gethostbyname(). Though the GET_HOST_BY_NAME() macro, which refers to gethostbyname_r(), is available, the calling function has to ensure that the stack does not get corrupted by nested function calls between the invocation of GET_HOST_BY_NAME() and the access of the value returned by it.
4. Alternate Design Approaches Considered
-----------------------------------------

4.1 Alternate Approach 1

In this approach, at most one thread of the KDC server will be running at any point of time. When a thread makes a DB lookup for principal information, another thread can begin processing a new request. Before a new thread starts executing, the entire context of the previous thread (in terms of the values of global variables) is saved. The context is restored when the thread making the DB lookup begins to execute again after the lookup has completed.

This approach allows a slow DB backend like a directory to be used with the KDC. But the maximum rate at which requests can be processed by this approach is

    1 / ((time for processing request) - (DB lookup time))

Hosting the KDC server on multiprocessor systems would not improve the performance.

4.2 Alternate Approach 2

In this approach, a new process will be forked to service every new request; the new process will exit after processing its request. But forking a new process for each request is a costly operation compared to having an existing thread service the request.

4.3 Alternate Approach 3

In this approach, n threads will be forked when the KDC server starts. Each thread will listen for requests using the select system call. When a request arrives, one of the free threads receives and processes it. Synchronization of the threads here is more difficult than in the proposed approach.

5. References
-------------

[1] J. Kohl and C. Neuman, "The Kerberos Network Authentication Service (V5)", RFC 1510, September 1993