Trip over threads to trap multicore bugs

This Multicore Expo paper accompanies the class “ME820: Make your multicore program fail again” to be held on May 3, 2011 in San Jose, CA. Multicore Expo, May 2 to 4, will be held in conjunction with the Embedded Systems Conference Silicon Valley 2011.

What makes debugging of multiprocess and multithread applications so difficult?

The first thing that comes to mind for any concurrent programmer is the lack of reproducibility in program execution. The cause of this behavior is the preemptive scheduling employed by real-time operating systems.

Let us take a look at the effect of preemptive scheduling of tasks on process behavior.

When and how long each task runs and waits for the CPU is affected by a large number of unrelated asynchronous events. For example, pressing a key on the keyboard raises a hardware interrupt. The interrupt service routine, a high-priority task, can preempt the running thread at whatever machine instruction is being executed at the moment of the interrupt.

A single-threaded application runs sequentially, and the order of execution is fully defined by the program flow. When a task is preempted by the scheduler and placed into a waiting queue, the randomness of the event does not affect the behavior of the process. When the task regains access to the CPU, it continues from the instruction at which it was stopped.

For example, in run #1 in Figure 1 below, the imaginary sequence of instructions “ABCD” is executed atomically. In run #2 a context switch occurs after instruction “B” is executed. The task resumes later on, continuing with “C”. In run #3 the task is preempted twice.

Figure 1. Imaginary sequence of instructions

Regardless of the run/wait pattern in time, machine instructions are always executed in the same fixed order. Therefore, the progress of the task can be analyzed instruction by instruction, and this makes debugging of the code straightforward.

In contrast, the behavior and outcome of a concurrent process is strongly affected by preemptive scheduling, even when the process is running on a single-core system. Each thread in an application follows its unique and unpredictable run/wait pattern.

Let's modify the example above, assuming that instructions “AC” and “BD” are to be executed by two threads running concurrently. As shown in Figure 2 below, every time a program runs, the order of instruction execution within each thread remains the same, but the overall order of execution is unpredictable and is likely to change from run to run. Execution of the same program three times may result in three completely different sequences, for example, “ABDC”, “BACD”, and “BDAC”:

Figure 2. “AC” and “BD” executed by two threads running concurrently.

Considering the size and complexity of software applications, the number of possible permutations in execution order is enormous. Most permutations might have no effect on the process outcome. Yet some execution scenarios result in undesirable program behavior.

And this is why debugging of multithreaded and multiprocess applications is so difficult. Intermittent failures triggered by a particular execution schedule present a huge challenge. An intermittent failure may not be captured during tests, and the software may run successfully for years before a bug reveals itself. Worse, capturing such a failure does not help with debugging, because there is no mechanism to reproduce the execution schedule that led to it.

Today we have at our disposal various programming techniques for process and thread synchronization. Extensive use of these techniques, however, results in serialization of multithreaded code, which degrades performance. Further, these techniques alone do not make the software bug-proof.

To fix a bug, a programmer must be able to reproduce it. It is not uncommon for an engineer to re-run a multithreaded program under the debugger many times, hoping to trigger the buggy behavior. This is not an efficient method, especially when the probability of the buggy behavior is low. Unfortunately, there is not much choice: few tools are available to developers of multithreaded software, and today none of them fully addresses the major issue described above, the lack of process reproducibility.

Taking control of scheduling
In order to successfully debug a concurrent application, we have no choice but to take over the functions of the OS scheduler, thus removing the execution uncertainty described in the previous section. A novel OS scheduler simulator called maze will do that for us. As shown in Figure 3 below, this simulator creates a “layer” between the running application and the OS, taking full control over scheduling of all threads and processes in the application. Users are now able to find the exact sequence of instructions that preceded a failure, and to reproduce this sequence.

The simulator runs a program multiple times, each time with a unique thread run/wait pattern. A user can choose a “faulty” run and reproduce the program behavior for debugging. The difference between a process running “on its own,” and the process running under the simulator is in its scheduling with respect to other processes and threads within the same application.

When a process runs on its own, the schedule is affected by the states and priorities of all processes running on the machine. This schedule cannot be controlled by a user, recorded, or reproduced.

Figure 3. Maze simulator creates a “layer” between the running app and the OS.

On the other hand, as shown in Figure 4 below, when a process is controlled by the simulator, its schedule is unaffected by any processes unrelated to the application. The schedule is deterministic; it can be reproduced on request. If the process creates a child process or a thread, the simulator automatically takes control over the new process as well. All processes and threads of the application run, wait, or sleep by following directives from the simulator.

Figure 4. When a process is controlled by the simulator, its schedule is unaffected by any processes unrelated to the application

So what exactly happens to a process when it is run under the scheduler simulator? The process runs normally until it creates a second thread, or forks a child process. At this point the simulator takes over both processes.

If these processes spawn more threads or child processes, they also fall under control of the simulator. The simplified simulation step is as follows:

  1. Randomly select the next thread to run
  2. Randomly select the number of instructions to be executed in the thread
  3. Run the thread until the selected number of instructions are executed, or until the thread blocks
  4. Go back to (1)

The simulation continues until (a) all threads and processes but one run to completion or are terminated, or until (b) all threads and processes block.

In case (a), a single-threaded process may continue uncontrolled, since there is no longer any concurrency in the system. If it creates more threads, the simulator will take over the scheduling again.

Case (b) is of special interest to us. Threads may block for a variety of reasons. One simple example is a deadlock: two threads each waiting for a mutex held by the other. We will look at this example in more detail later on. Any number of processes and threads may be waiting for something and never run to completion. They may be waiting for a signal, for I/O, for an event that will never occur, or for a resource that will never become available, because no process in the system is able to proceed any further.

In real life the only way out of the situation is to terminate all blocking processes and start the application anew. Needless to say, this gives no insight into what actually happened in the system. Worse, this situation cannot be reliably reproduced. And of course when the application is run under a debugger or as a part of a regression suite, the parent application also blocks.

If the parent application is maze, things are different. Maze keeps the threads under tight control, and as soon as no thread or process is capable of running, it stops at once. Moreover, since the controlled processes are still alive, it can give the programmer a great deal of diagnostic information: it can tell why each thread is blocking, and print its stack and registers.

Making it reproducible
The random selection of a thread to run and of a number of instructions to execute is not truly “random”. The simulator uses a pseudo-random number generator, which makes each simulation run entirely deterministic.

The simulator can run an application multiple times, each time generating a unique thread and process scheduling scenario. This allows users to stress-test the application, and catch hard-to-reproduce race conditions, deadlocks, and segmentation faults. This mode of operation is called the TEST mode.

Another mode of the simulator operation is called the RECONSTRUCTION mode. In this mode the simulator runs the application once, reproducing the scheduling scenario of any run from the TEST mode session.

Let's take a look at the implementation of a scheduler simulator on an X86 Linux platform.

The command:

$ maze -r 1000 foo

starts a session in TEST mode, running a program called foo 1000 times. The subsequent command:

$ maze -R 791

starts the simulator in the RECONSTRUCTION mode. The simulator runs foo once, using the thread scheduling sequence of run #791 in the preceding TEST mode session, thus reproducing the program behavior, output, and computation results obtained during the run #791.

Example: detecting a deadlock
In Code Example 1, I look at a “classic” deadlock between two threads. Each thread locks and then unlocks two mutexes: a “blue” mutex and a “gray” mutex. The first thread acquires the mutexes in the order “blue, gray”, while the second thread acquires them in the order “gray, blue”.

Code Example 1. “Classic” deadlock between two threads

Consider the three possible sequences of the lock/unlock events in the two threads.

Depending on timing, the threads may run into a deadlock: the first thread has acquired the “blue” mutex and is waiting for the “gray” mutex, while the second thread has acquired the “gray” mutex and is waiting for the “blue” mutex.

Let's compile the code into a binary called deadlock and run it several times. If a deadlock occurs, the process blocks. Otherwise, it runs to completion. Now test the deadlock with the simulator, running it 100 times:

$ maze -r 100 deadlock

Most of the 100 runs complete successfully, yet as shown below, some of them reveal the deadlock:

ERROR: A deadlock was detected in the run # 59.
ERROR: A deadlock was detected in the run # 88.

Code Example 2. Deadlock

The process blocks in runs #59 and #88, but the simulator does not. As shown in Code Example 3 below, it prints out the diagnostic information and proceeds with the next run.

Code Example 3. Diagnostic information.

The above information is sufficient to diagnose the problem. Yet, if necessary, a run can be reproduced in the RECONSTRUCTION mode:

$ maze -R 59

This run will also result in a deadlock, thanks to the deterministic scheduling by the simulator.

Note that if we repeat the TEST mode simulation above again, the deadlocks will be detected again in the runs #59 and #88. The pseudo-random number generator always creates a reproducible sequence of numbers, leading to the reproducible simulation results. If we give the generator a different seed number, the simulator will produce different scheduling sequences. We will see the deadlocks detected in different runs too.

The behavior of a concurrent application is often unpredictable due to the non-deterministic nature of CPU sharing in a multitasking operating system. The operating system interleaves the execution of all existing processes. Process scheduling is performed by the OS, and a user program in general has no control over it.

A process schedule is affected by many different asynchronous events occurring in the system. As a result, the flow of a multithreaded application may vary from run to run. Because inevitable bugs in such an application manifest themselves intermittently and are hard to reproduce, they remain undetected for a long time.

I have presented a simulator of the OS scheduler that creates a “layer” between the running application and the OS, taking full control over scheduling of all threads and processes in the application. Running a user application repeatedly, the simulator generates various thread scheduling sequences, similar to those that occur in the real world. If a bug is detected during a run, the simulator can reproduce the buggy behavior of the application exactly.

To learn more about these techniques, attend Roni Simonian’s class at the Multicore Expo, titled “Make your multicore program fail again.”

Roni Simonian has over 13 years of software development experience in the electronic design automation and computer graphics industries. In 2008 she left EDA to found Ariadne LLC and develop Maze, a tool for testing and debugging parallel applications. Roni has an M.S. in computer science from SUNY at Stony Brook and a B.S. in physics and applied math from the Moscow Institute of Physics and Technology. For more information about Roni Simonian, go to her web site.
