Trip over threads to trap multicore bugs

Roni Simonian, Ariadne

April 07, 2011

Roni Simonian, AriadneApril 07, 2011

This Multicore Expo paper accompanies the class "ME820: Make your multicore program fail again" to be held on May 3, 2011 in San Jose, CA. Multicore Expo, May 2 to 4, will be held in conjunction with the Embedded Systems Conference Silicon Valley 2011.

This paper is from a class at the Multicore Expo 2011. Click here for more information about the conference.
What makes debugging of multiprocess and multithread applications so difficult?

The first thing that comes to mind of every concurrent programmer is the lack of program execution reproducibility. The reason for such program behavior is the preemptive scheduling employed by real-time operating systems.

Let us take a look at the effect of preemptive scheduling of tasks on process behavior.

When and how long each task runs and waits for the CPU is affected by a large number of unrelated asynchronous events. For example, the operating system response to pressing a key on a keyboard results in a hardware interrupt. The interrupt service routine—a high-priority task—can preempt the running thread at a machine instruction that is being executed at the moment of the interrupt.

A single-threaded application runs sequentially, and the order of execution is fully defined by the program flow. When a task is preempted by the scheduler and placed into a waiting queue, the randomness of the event does not affect the behavior of the process. When the task regains access to the CPU, it continues from the instruction at which it was stopped.

For example, in run #1 in Figure 1 below, the imaginary sequence of instructions "ABCD" is executed atomically. In run #2 a context switch occurs after instruction "B" is executed. The task resumes later on, continuing with "C". In run #3 the task is preempted twice.

Click on image to enlarge.

Figure 1. Imaginary sequence of instructions

Regardless of the run/wait pattern in time, machine instructions are always executed in the same fixed order. Therefore, the progress of the task can be analyzed instruction by instruction, and this makes debugging of the code straightforward.

In contrast, the behavior and outcome of a concurrent process is strongly affected by preemptive scheduling, even when the process is running on a single-core system. Each thread in an application follows its unique and unpredictable run/wait pattern.

Let's modify the example above, assuming that instructions "AC" and "BD" are to be executed by two threads running concurrently. As shown in Figure 2 below, every time a program runs, the order of instruction execution within each thread remains the same, but the overall order of execution is unpredictable and is likely to change from run to run. Execution of the same program three times may result in three completely different sequences, for example, "ABDC", "BACD", and "BDAC":

Click on image to enlarge.

Figure 2. "AC" and "BD"executed by two threads running concurrently.

Considering the size and complexity of software applications, the number of possible permutations in execution order is enormous. Most permutations might have no effect on the process outcome. Yet some execution scenarios result in undesirable program behavior.

And this is why debugging of multiprocessor applications is so difficult. Occurrence of intermittent failures triggered by a particular execution schedule presents a huge challenge. An intermittent failure may not be captured during tests and software may run successfully for years before a bug reveals itself. Worse, capturing such failure does not help with debugging because there is no mechanism to reproduce the execution schedule that led to it.

Today we have at our disposal various programming techniques for process and thread synchronization. Extensive use of these techniques, however, results in serialization of a multithreaded code which degrades performance. Further, these techniques alone do not make the software bug-proof.

To fix a bug a programmer must be able to reproduce it. It is not uncommon for an engineer to re-run a multithreaded program under the debugger multiple times, hoping to invoke the buggy behavior. This is not a very efficient method, especially when the probability of the buggy behavior is low. Unfortunately, there is not much choice: few tools are available to developers of multithreaded software, and today none of them fully addresses the major issue described above, the lack of process reproducibility.

< Previous
Page 1 of 3
Next >

Loading comments...