
Fixing concurrency defects in multicore design

There are compelling reasons to move to multicore processor designs, but doing so introduces the risk of concurrency defects in the software. These are easy to introduce — even apparently innocent code can harbor nasty multithreading bugs — and notoriously difficult to diagnose and eliminate when they occur. GrammaTech's Paul Anderson presents some techniques to mitigate the difficulties of finding and eliminating these bugs.

Having reached the limits of performance gains that can be realized from miniaturization and integration, microprocessor manufacturers are more and more frequently choosing multicore processors as the best approach to increased performance in computationally intensive embedded systems.

However, realizing the potential performance gains with multicores has a cost. Software written to be single-threaded for a single core processor will realize little or no performance benefit when executed on a multicore processor. To take advantage of the cores it must be rewritten or adapted to use multithreading. 

The key challenge is to keep the cores busy as much as possible, while ensuring that they coordinate access to shared resources properly. Unfortunately writing such code is much harder than writing single-threaded code. Concurrent code is vulnerable to entirely new kinds of programming defects, such as deadlocks or race conditions, and when these occur, they can manifest in ways that are difficult to diagnose. Traditional techniques for finding and eliminating concurrency errors may be ineffective.

A principal reason concurrency bugs are so difficult is the enormous number of ways in which the events in the threads can be interleaved when those threads execute. As the number of threads or instructions increases, the number of interleavings grows exponentially. If thread A executes M instructions and thread B executes N instructions, there are (M+N)!/(M!N!) possible interleavings of the two threads. For example, given two trivial threads with ten instructions each, there are 184,756 possible interleavings of those instructions. Even with very small programs it is clear that testing all combinations is next to impossible. Furthermore, even if it is feasible to identify a single interleaving that leads to a failure, it can be very difficult to set up a repeatable test case that uses that particular interleaving, because the scheduling of threads is effectively nondeterministic. Consequently, debugging concurrent programs can be very expensive and time consuming.

Insidious race conditions
Of all the concurrency defects that can occur, race conditions are probably the most pernicious. Race conditions have been responsible for numerous system failures over the years. The famous case of the Therac-25 radiation therapy machine featured a race condition that was responsible for the deaths of several patients [2]. Similarly, the 2003 Northeast blackout was exacerbated by a race condition that resulted in misleading information being communicated to the technicians [3].

There are several different kinds of race conditions. One of the most common and insidious forms — data races — is the class of race conditions involving concurrent access to memory locations. A data race occurs when there are two or more threads of execution that access a shared memory location, at least one thread is changing the data at that location, and there is no explicit mechanism for coordinating access. If a data race occurs it can leave the program in an inconsistent state.

An example of a data race is shown in Figure 1. A manufacturing assembly line has entry and exit sensors, and maintains a running count of the items currently on the line. This count is incremented every time an item enters the line, and decremented every time an item reaches the end of the line and exits. If an item enters the line at the same time that another item exits, the count should be incremented and then decremented (or vice versa) for a net change of zero. However, normal increment and decrement are not atomic operations: they are composed of a sequence of individual instructions that first load the value from memory, then modify it locally, and finally store it back in memory. If the updating transactions are processed in a multithreaded system without sufficient safeguards, a race condition can arise because the sensors read and write a shared piece of data: the count. The interleaving in Figure 1 results in an incorrect count of 69. There are also interleavings that can result in an incorrect count of 71, as well as a number that correctly result in a count of 70.

Figure 1: Race condition leads to incorrect count of items on an assembly line.

This example neatly illustrates one of the properties of data races that makes them hard to diagnose: the symptom of corruption may only be observable long after the data race has occurred. In this case the fact that the count is wrong may only be noticed when the variable has the incorrect value of -1 after all items have exited the assembly line.

A widely held belief is that some instances of data races are benign and can be tolerated. However, this is true only in rare circumstances. The C standard states unambiguously that compilers are allowed to assume that there are no data races, so optimizers can and do make transformations that are valid for single-threaded code, but which introduce bugs when there are apparently benign data races. These are subtle effects — even experienced programmers are regularly surprised by them.

For example, consider a data race involving a single variable that is read by one thread and written to by another, and where the thread that reads the value does not care whether it sees the value before or after it has been changed by the other thread. It is in fact possible that the reader gets a value that is neither! One way in which this might happen is if the hardware is not capable of changing the value in one atomic step, such as a 64-bit variable on a processor that only supports 32-bit memory operations. The read and write will each take two instructions, so if the instructions in each thread are badly interleaved then the reader might see 32 bits of the old value and 32 bits of the new value. There are other reasons why this might happen — the compiler is allowed to assume that there are no data races, so it is free to generate code that relies on that assumption, and such code may introduce defects when a data race is present.

There are many other cases where apparently benign data races can lead to serious defects due to compiler optimizations. Boehm gives an excellent explanation along with several compelling examples [1].

It is tempting for programmers to attempt to avoid data race problems by using the volatile keyword. However that construct was not designed for that purpose, and it is also one of the most commonly misunderstood aspects of the C language. Compiler support for volatile is spotty at best, so it is unwise to rely on using it. The only way to be confident that a data race is benign is to inspect the machine code that was generated by the compiler to be sure it is doing what the programmer intended. This is something that should be left to only the most trusted and competent experts.

Compiler optimizations and data races don't mix
Because compilers are allowed to assume that there are no data races in code, it is perfectly legal for them to make transformations to the code that allow seemingly innocent data races to cause unpleasant bugs.

The best-known example is the singleton pattern. This is a commonplace idiom where the program maintains a reference to a single underlying object, and a Boolean variable encodes whether it has been initialized. This pattern is also known as “lazy initialization”.

The following code is an example of the pattern.

if (!initialized) {
    object = create();
    initialized = true;
}
… object …

This code is adequate for a single-threaded program, but it is not thread-safe because it has a data race on the variable initialized. If it is called by two different threads, there is a risk that both threads will see !initialized at virtually the same time, and both will call create(), thus violating the singleton property.

To make this thread safe, the natural approach is to protect the entire if statement with a lock. However, acquiring and releasing locks is expensive, so programmers try to avoid the expense by using the “double-checked locking” idiom — a check outside the lock and another inside. The inner check is there to confirm that the first check still holds after the lock has been acquired.

if (!initialized) {
    lock(&mutex);
    if (!initialized) {
        object = create();
        initialized = true;
    }
    unlock(&mutex);
}
… object …

Superficially this looks like it will suffice, and indeed it will, as long as the statements are guaranteed to execute in that order. However, an optimizing compiler may generate code that swaps the order of object = create() and initialized = true, since there is no dependence between those two statements. In that case, if a second thread enters this code any time after the assignment to initialized, that thread will use the value of object before it has been initialized.

Consequently, to achieve high levels of assurance, the best practice is to find and remove all data races. Bug finding and remediation is a standard activity, but the unique properties of concurrency bugs mean that the usual techniques can be ineffective. 

Techniques for eliminating concurrency errors
Traditional dynamic testing is not well suited for finding many concurrency defects because of non-determinism. A program that passes a test hundreds of times may later fail in the same environment with exactly the same inputs because the bug can be exquisitely sensitive to timing. Engineers looking for high assurance must turn to other techniques if they are to be effective at eliminating concurrency defects.

Static analysis tools offer a means for finding such bugs. The key difference between testing and static analysis is that testing exercises a particular execution of a program for a given set of inputs, whereas static analysis finds properties that hold for all possible executions and all inputs. In practice, tools make approximations to achieve acceptable performance and precision, and so fall short of this ideal model. Nevertheless, they cover many more cases than would ever be possible with traditional testing.

Roughly speaking, static analysis tools work by creating a model of the program and by doing a symbolic execution of that model, looking for error conditions along the way. One approach finds data races by creating a map of which locks are held by which threads, and reasoning about the possible interleavings that could result in unsynchronized access to shared variables [5]. The approach is path-sensitive, meaning that it is capable of computing different properties for different paths of execution through the code. Where possible, the analysis can determine which paths are infeasible, and will eliminate consideration of those in the analysis. Deadlock and other concurrency defects (including lock mismanagement) are found using similar techniques.
Because concurrency errors are so subtle, it is important that the explanation given by the analysis is easily understandable by the programmer. With many concurrency bugs, the analysis must consider multiple paths executing simultaneously. When a data race is detected, the report shows a path through each of two threads, with important points of interest highlighted. An example report is shown in Figure 2 .

Figure 2:  A screenshot showing a report from a static-analysis tool on code containing a data race.

Custom concurrency constructs: a case study 
Standard defect detection techniques are most useful when programs use standard ways of doing concurrency. Most tools have built-in knowledge of the special properties of standard libraries such as the POSIX threads library, or of proprietary interfaces such as VxWorks. However, many systems use custom techniques for managing concurrency, so custom analyses are needed.

An important property of modern advanced static analysis tools is that they are extensible: they provide an API with abstractions that make it convenient to implement custom static-analysis algorithms. The API gives the programmer access to the underlying program model, including the program’s control-flow graph (the CFG), the whole-program call graph, as well as the symbol table and the abstract syntax tree. In addition, operations are provided through a visitor pattern interface that make it relatively easy for a user to author a new analysis by piggybacking on the traversals that the existing analyses are already doing.

For example, Rigdon [4] describes a medical device built on a platform that uses a custom preemptive multithreaded software interface. In this design a key constraint is that all data instances that could be accessed from multiple priority levels of threads must be protected with proper guard constructs. Prior to using static analysis, validating that this constraint was respected required a person-month of manual analysis. They turned to static analysis to reduce the cost. 

Using the static analysis API with the features described above [5], it was possible to write a plug-in that they now use on a regular basis to find violations of the key constraint automatically, thus eliminating most of what had been done manually. It was cost effective to do this because the static analysis tool provided all of the infrastructure necessary to make it easy to run the analysis, including integration with the build system, a reporting mechanism that stores warnings in a database, and a web-client based user interface that allows engineers to collaborate on investigating and responding to violations.

There are compelling reasons to move to multicore processor designs, but doing so introduces the risk of concurrency defects in the software. These are easy to introduce — even apparently innocent code can harbor nasty multithreading bugs — and notoriously difficult to diagnose and eliminate when they occur. Traditional testing techniques alone are inadequate to ensure high-quality software, mainly because of the high degree of non-determinism. Static analysis is one approach that can help because it can reason about all possible executions of the code. Tools can find defects such as data races and deadlocks in code that uses standard multithreading libraries, and can even be adapted to designs that use non-standard concurrency constructs.

1. Boehm, H.-J., "How to miscompile programs with 'benign' data races." In HotPar '11: Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism, 2011.
2. Leveson, N. G., "An investigation of the Therac-25 accidents." IEEE Computer, 1993. 26: pp. 18-41.
3. Poulsen, K., "Tracking the blackout bug."
4. Rigdon, G., "Static Analysis Considerations for Medical Device Firmware." In Embedded Systems Conference (ESC), Boston, September 2010.
5. CodeSonar Overview.

Paul Anderson  is VP of Engineering at GrammaTech. He received his B.Sc. from Kings College, University of London and his Ph.D. in computer science from City University London. Paul manages GrammaTech’s engineering team and is the architect of the company’s static analysis tools. Paul has worked in the software industry for 20 years, with most of his experience focused on developing static analysis, automated testing, and program transformation tools. He can be contacted at
