Multithreading the KDC server - a design

Mon Apr 4 13:32:29 EDT 2005

On Friday, April 01, 2005 8:10 AM Rahul Srinivas [srahul at novell.com] wrote:

> Hi,
>    I have come up with a design for multithreading the MIT KDC server.
> The design is attached. I have listed the issues that need to be
> addressed while converting the single threaded KDC into a  multithreaded
> one and have proposed solutions. Please comment.
>
> Regards,
>
> -Rahul S.

I forwarded the proposed design to one of the OpenVMS threads experts for analysis.  I pointed out that we'd prefer that the implementation avoid forking if possible, and sent some of my concerns about the proposal.  The following are his comments.  The majority of them are not OpenVMS-specific, so (with his permission) I'm sharing them with this mailing list.  Items in brackets are my additions for clarity, and I've made a few minor changes for consistency.

	Wayne Morrison
	Kerberos & CDSA/Secure Delivery Project Leader
	OpenVMS Engineering, HP

As a reference, the following sections on the [OpenVMS] Guide to [POSIX] Threads Library http://h71000.www7.hp.com/doc/73final/6493/6101PRO.HTML might be useful:

Section 1.5 Programming Issues for Multithreaded Programs
Section 3.1 Designing Code for Asynchronous Execution
Section 3.2 Memory Synchronization Between Threads
Section 3.3 Sharing Memory Between Threads
Section 3.7 Granularity Considerations
Section 3.9 Managing Dependencies Upon Other Libraries

I have a few comments on the design. I trust you understand I have no idea what most of the design is talking about -- the application-specific parts. Thus I cannot comment on most things that might or might not be missing, etc.

I agree that forking should be avoided on OpenVMS. Too expensive.

1) Section 2.6 begins with "All global variables which are concurrently UPDATED (emphasis added) by multiple threads...." If "update" means change or write, I will point out that locking may be required on simple reads of global variables as well. I assume that there are multiple such variables, and that some of those variables are co-dependent. Thus, any time a service thread goes to read some such inter-related variables, the variables' values need to be consistent. E.g.,. a queue and a count of elements in the queue had better agree. If would be bad for the queue to have two elements while the count was one. Threads reading one or both of these variables would need to hold the lock governing the queue/counter, before being able to trust the value of either the queue or the counter.

I'll also mention that for some kinds of update operations (e.g., incrementing a counter) it is often more efficient to do the update (the read/add/write instructions) using a compiler-provided atomic builtin, rather than use a mutex to protect the read, add, write (which in source code might falsely appear atomic as "counter++" or something). If such a counter is being incremented in isolation, using an atomic builtin is probably optimal. If the counter is incremented in concert with changes to other globals, then it would be more efficient to take out the lock, update all the related variables (including the counter with a normal counter++ expression), and release the lock.

2) Section 3.1 says that there will be a single mutex used for making the entire KDC code thread safe. This may be a HUGE mistake; this is the largest issue I see in the document. With only one lock, ALL threads will be SERIALIZED by that lock. If you have 20 threads and they all need this one lock quite frequently, then often only one thread will be running. The amount of actual concurrency (on a multiprocessor) may be quite low. If this one lock is not used much, then this may not be an issue. But the more it is used, the more single-threaded and overhead-prone the application will be!

I suggest giving SERIOUS thought to this one-lock-for-all design. It might make the entire threading exercise yield little benefit.

3) Section 3.1 -- why is the single mutex to be recursive? The fact that there is proposed to be one single mutex does not lead, in my mind, to the conclusion that this mutex be recursive. Forgive me for saying so, but often the perceived need for a recursive mutex is a byproduct of an incomplete design. There are times when they are a great choice, and this may be one of them -- but the document does not explain why. I think the choice of a recursive mutex needs explanation.

Related point -- the document says that if the threading services do not support recursive mutexes, the application will simulate one with a normal mutex and a counter. How? I'm curious; it may not be as easy as one thinks.

4) Section 3.4 -- I do not know how well the OpenVMS C RTL's [Run Time Library] emulation of signal handling works. I have heard a fair number of complaints about it over the years. I do not know how well per-thread signal masks work for example. The threads library on OpenVMS does not know anything about UNIX-style signals. It is all done in the C RTL.

5) Section 3.5 mentions that an alternative is to have separate mutexes for each global variable. In general this is very attractive to me! More locks (i.e., fine-grained locks) usually yields higher concurrency and higher throughput.

In the extreme, however, one would probably NOT want separate locks for EACH global variable. If two or more global variables' values need to be coordinated, then a common lock for that set of related globals would be appropriate. E.g., queue A and counter A could be governed by lock A, while queue B and counter B could be governed by lock B (assuming that the two queues did not need to be synchronized).

6) Section 3.6 -- The document doesn't say now many service threads will be created by the default thread. If the application assumes that it has exclusive access to available CPUs (which most application do assume), then it's good to have at least as many threads as there are CPUs. Depending on how much the service threads interfere with each other, you may even want to have more service threads than CPUs -- 20% more maybe, not 5X more. If you have lots more threads than can run concurrently, you pay time slicing overhead. If you have fewer threads than can run concurrently, you're not making full use of the CPUs.

On recent versions of OpenVMS (V7.3 and later) the threads library uses SYS$GETSYI and the SYI$_ACTIVE_CPU_MASK item code to basically determine which CPUs (and thus how many) are available. Your application might do something similar to get a rough count of usable CPUs, and base the decision of how many service threads to create on that count.

7) Also in Section 3.6 are two algorithm lists -- one for the default thread and one for a service thread. Step 8 in the default thread, and step 9 in the service thread, are changing the "S" set of ports. Are those the same set? There is no locking shown, so if a shared "S" is used by multiple service threads and the default thread, they will step on each other. In other words, the updates to "S" may need to be made atomic via a lock.

8) Still in Section 3.6, in the algorithm for the default thread -- does step 16 (unlock m) trigger any sort of notification for waiting service threads? Default thread step 15 added work to "Q", so I presume that step 16's unlock of m is intended to pro-actively notify any service threads that were waiting for something to appear in Q. Without active notification (e.g., use a pthread condition variable), the service threads may not wake up at all, or at least in a timely manner.
      -- Tom (Thomas.Dahl at hp.com)