Figure 11.3. Comparing scalability of Map implementations. [Plot of normalized throughput versus number of threads (1 to 64) for ConcurrentHashMap, ConcurrentSkipListMap, synchronizedHashMap, and synchronizedTreeMap.]
11.6 Reducing context switch overhead
Many tasks involve operations that may block; transitioning between the running and blocked states entails a context switch. One source of blocking in server applications is generating log messages in the course of processing requests; to illustrate how throughput can be improved by reducing context switches, we’ll analyze the scheduling behavior of two logging approaches.
Most logging frameworks are thin wrappers around println; when you have something to log, just write it out right then and there. Another approach was shown in LogWriter on page 152: the logging is performed in a dedicated background thread instead of by the requesting thread. From the developer's perspective, both approaches are roughly equivalent. But there may be a difference in performance, depending on the volume of logging activity, how many threads are doing logging, and other factors such as the cost of context switching.16
The service time for a logging operation includes whatever computation is associated with the I/O stream classes; if the I/O operation blocks, it also includes the duration for which the thread is blocked. The operating system will deschedule the blocked thread until the I/O completes, and probably a little longer. When the I/O completes, other threads are probably active and will be allowed to finish out their scheduling quanta, and threads may already be waiting ahead of us on the scheduling queue—further adding to service time. Alternatively, if multiple threads are logging simultaneously, there may be contention for the output stream lock, in which case the result is the same as with blocking I/O—the thread blocks waiting for the lock and gets switched out. Inline logging involves I/O and locking, which can lead to increased context switching and therefore increased service times.

16. Building a logger that moves the I/O to another thread may improve performance, but it also introduces a number of design complications, such as interruption (what happens if a thread blocked in a logging operation is interrupted?), service guarantees (does the logger guarantee that a successfully queued log message will be logged prior to service shutdown?), saturation policy (what happens when the producers log messages faster than the logger thread can handle them?), and service lifecycle (how do we shut down the logger, and how do we communicate the service state to producers?).
Increasing request service time is undesirable for several reasons. First, service time affects quality of service: longer service times mean someone is waiting longer for a result. But more significantly, longer service times in this case mean more lock contention. The “get in, get out” principle of Section 11.4.1 tells us that we should hold locks as briefly as possible, because the longer a lock is held, the more likely that lock will be contended. If a thread blocks waiting for I/O while holding a lock, another thread is more likely to want the lock while the first thread is holding it. Concurrent systems perform much better when most lock acquisitions are uncontended, because contended lock acquisition means more context switches. A coding style that encourages more context switches thus yields lower overall throughput.
Moving the I/O out of the request-processing thread is likely to shorten the mean service time for request processing. Threads calling log no longer block waiting for the output stream lock or for I/O to complete; they need only queue the message and can then return to their task. On the other hand, we’ve introduced the possibility of contention for the message queue, but the put operation is lighter-weight than the logging I/O (which might require system calls) and so is less likely to block in actual use (as long as the queue is not full). Because the request thread is now less likely to block, it is less likely to be context-switched out in the middle of a request. What we’ve done is turned a complicated and uncertain code path involving I/O and possible lock contention into a straight-line code path.
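The queue-and-drain structure described here might be sketched as follows. This is a minimal illustration, not the book's LogWriter: the class name, the Consumer sink, and the omitted shutdown and saturation handling are all assumptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of a background logger: request threads enqueue
// messages; a single dedicated thread performs the (potentially blocking) I/O.
public class BackgroundLogger {
    private final BlockingQueue<String> queue;

    public BackgroundLogger(int capacity, Consumer<String> sink) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        Thread writer = new Thread(() -> {
            try {
                // Only this thread touches the sink, so there is no
                // contention for the output stream among request threads.
                while (true)
                    sink.accept(queue.take());
            } catch (InterruptedException e) {
                // Shutdown and drain-on-exit handling elided from this sketch.
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    // Request threads only enqueue; put blocks only when the queue is full,
    // which is far less likely than blocking on I/O or the stream lock.
    public void log(String msg) throws InterruptedException {
        queue.put(msg);
    }
}
```

The bounded queue matters: it is what gives the design a saturation policy at all, since an unbounded queue would let producers run arbitrarily far ahead of the logger thread.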
To some extent, we are just moving the work around, moving the I/O to a thread where its cost isn’t perceived by the user (which may in itself be a win).
But by moving all the logging I/O to a single thread, we also eliminate the chance of contention for the output stream and thus eliminate a source of blocking. This improves overall throughput because fewer resources are consumed in scheduling, context switching, and lock management.
Moving the I/O from many request-processing threads to a single logger thread is similar to the difference between a bucket brigade and a collection of individuals fighting a fire. In the “hundred guys running around with buckets” approach, you have a greater chance of contention at the water source and at the fire (resulting in overall less water delivered to the fire), plus greater inefficiency because each worker is continuously switching modes (filling, running, dumping, running, etc.). In the bucket-brigade approach, the flow of water from the source to the burning building is constant, less energy is expended transporting the water to the fire, and each worker focuses on doing one job continuously. Just as interruptions are disruptive and productivity-reducing to humans, blocking and context switching are disruptive to threads.
Summary
Because one of the most common reasons to use threads is to exploit multiple processors, in discussing the performance of concurrent applications, we are usu- ally more concerned with throughput or scalability than we are with raw service time. Amdahl’s law tells us that the scalability of an application is driven by the proportion of code that must be executed serially. Since the primary source of serialization in Java programs is the exclusive resource lock, scalability can often be improved by spending less time holding locks, either by reducing lock granu- larity, reducing the duration for which locks are held, or replacing exclusive locks with nonexclusive or nonblocking alternatives.
Chapter 12
Testing Concurrent Programs
Concurrent programs employ similar design principles and patterns to sequential programs. The difference is that concurrent programs have a degree of nondeterminism that sequential programs do not, increasing the number of potential interactions and failure modes that must be planned for and analyzed.
Similarly, testing concurrent programs uses and extends ideas from testing sequential ones. The same techniques for testing correctness and performance in sequential programs can be applied to concurrent programs, but with concurrent programs the space of things that can go wrong is much larger. The major challenge in constructing tests for concurrent programs is that potential failures may be rare probabilistic occurrences rather than deterministic ones; tests that disclose such failures must be more extensive and run for longer than typical sequential tests.
Most tests of concurrent classes fall into one or both of the classic categories of safety and liveness. In Chapter 1, we defined safety as “nothing bad ever happens” and liveness as “something good eventually happens”.
Tests of safety, which verify that a class’s behavior conforms to its specification, usually take the form of testing invariants. For example, in a linked list implementation that caches the size of the list every time it is modified, one safety test would be to compare the cached count against the actual number of elements in the list. In a single-threaded program this is easy, since the list contents do not change while you are testing its properties. But in a concurrent program, such a test is likely to be fraught with races unless you can observe the count field and count the elements in a single atomic operation. This can be done by locking the list for exclusive access, employing some sort of “atomic snapshot” feature provided by the implementation, or by using “test points” provided by the implementation that let tests assert invariants or execute test code atomically.
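As an illustration of such a test point, a size-caching collection might expose a synchronized invariant check, so that the comparison of the cached count against the actual element count is atomic with respect to modification. The class and method names here are hypothetical, not from the book.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical size-caching list used to illustrate an atomic "test point".
public class CountingList<E> {
    private final List<E> elements = new ArrayList<>();
    private int cachedSize = 0;

    public synchronized void add(E e) {
        elements.add(e);
        cachedSize++;
    }

    // Test point: holds the same lock as add, so the cached count and the
    // actual element count are compared in a single atomic operation.
    public synchronized boolean sizeInvariantHolds() {
        return cachedSize == elements.size();
    }
}
```

Checking `cachedSize` and `elements.size()` from a test without this lock would race with concurrent `add` calls and could report a spurious invariant violation.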
In this book, we’ve used timing diagrams to depict “unlucky” interactions that could cause failures in incorrectly constructed classes; test programs attempt to search enough of the state space that such bad luck eventually occurs. Unfortunately, test code can introduce timing or synchronization artifacts that can mask bugs that might otherwise manifest themselves.1
1. Bugs that disappear when you add debugging or test code are playfully called Heisenbugs.
Liveness properties present their own testing challenges. Liveness tests include tests of progress and nonprogress, which are hard to quantify—how do you verify that a method is blocking and not merely running slowly? Similarly, how do you test that an algorithm does not deadlock? How long should you wait before you declare it to have failed?
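In practice, such tests usually fall back on a timed wait: pick a generous timeout and declare failure if it elapses. A minimal sketch of that pattern, with names and timeouts chosen purely for illustration:

```java
// Sketch of a timed liveness check: run a task in its own thread and
// report whether it completed within the timeout. A false result cannot
// distinguish true deadlock from mere slowness; the timeout is a judgment call.
public class TimedBlockTest {
    public static boolean finishesWithin(Runnable task, long timeoutMs)
            throws InterruptedException {
        Thread t = new Thread(task);
        t.start();
        t.join(timeoutMs);           // wait at most timeoutMs for completion
        boolean finished = !t.isAlive();
        t.interrupt();               // best-effort cleanup of a stuck thread
        return finished;
    }
}
```

The inherent weakness is exactly the one the text raises: a thread that is merely slow looks identical to one that is blocked, so the timeout must be long enough that a false failure is very unlikely.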
Related to liveness tests are performance tests. Performance can be measured in a number of ways, including:
Throughput: the rate at which a set of concurrent tasks is completed;
Responsiveness: the delay between a request for and completion of some action (also called latency); or
Scalability: the improvement in throughput (or lack thereof) as more resources (usually CPUs) are made available.
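A rough way to measure the first of these, throughput, is to time how long a thread pool takes to complete a fixed batch of tasks and report tasks per second. This is a sketch only; the class name, pool sizing, and the one-minute termination cap are assumptions, and a serious harness would also warm up the JVM and discard outlier runs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical throughput measurement: completed tasks per second
// for a fixed batch run on a fixed-size pool.
public class ThroughputMeter {
    public static double tasksPerSecond(int nThreads, int nTasks, Runnable task)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        long start = System.nanoTime();
        for (int i = 0; i < nTasks; i++)
            pool.execute(task);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES); // cap the wait (assumption)
        double seconds = (System.nanoTime() - start) / 1e9;
        return nTasks / seconds;
    }
}
```

Running the same batch with increasing `nThreads` then turns this throughput number into a crude scalability measurement as well.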