Trip over threads to trap multicore bugs

Roni Simonian, Ariadne

April 7, 2011


Taking control of scheduling
To debug a concurrent application successfully, we have no choice but to take over the functions of the OS scheduler, removing the execution uncertainty described in the previous section. A novel OS scheduler simulator called maze does that for us. As shown in Figure 3 below, the simulator creates a "layer" between the running application and the OS and takes full control over the scheduling of all threads and processes in the application. Users can then find the exact sequence of instructions that preceded a failure and reproduce that sequence.

The simulator runs a program multiple times, each time with a unique thread run/wait pattern. A user can choose a "faulty" run and reproduce the program behavior for debugging. The difference between a process running "on its own" and the same process running under the simulator lies in how it is scheduled with respect to other processes and threads within the same application.

When a process runs on its own, the schedule is affected by the states and priorities of all processes running on the machine. This schedule cannot be controlled by a user, recorded, or reproduced.



Figure 3. Maze simulator creates a "layer" between the running app and the OS.

On the other hand, as shown in Figure 4 below, when a process is controlled by the simulator, its schedule is unaffected by any processes unrelated to the application. The schedule is deterministic; it can be reproduced on request. If the process creates a child process or a thread, the simulator automatically takes control over the new process as well. All processes and threads of the application run, wait, or sleep by following directives from the simulator.
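One way to picture a deterministic, reproducible schedule is as a recorded list of (thread, instruction count) decisions that can be replayed at will. The following is a toy Python sketch under that assumption, not maze's actual mechanism: the generator-based "threads", the `replay` function, and the shared counter are hypothetical names invented for illustration, and each `yield` marks the end of one "instruction".

```python
from typing import Dict, Iterator, List, Tuple

Schedule = List[Tuple[str, int]]   # (thread name, instructions to run)

def make_threads(shared: dict) -> Dict[str, Iterator[None]]:
    """Create two toy threads that each increment a shared counter non-atomically."""
    def incrementer():
        tmp = shared["counter"]        # instruction 1: read the counter
        yield
        shared["counter"] = tmp + 1    # instruction 2: write it back
        yield
    return {"A": incrementer(), "B": incrementer()}

def replay(schedule: Schedule) -> int:
    """Run the toy threads in exactly the recorded order; return the final counter."""
    shared = {"counter": 0}
    threads = make_threads(shared)
    for name, count in schedule:
        for _ in range(count):
            next(threads[name], None)  # execute one instruction, if any remain
    return shared["counter"]

serial = [("A", 2), ("B", 2)]                 # A finishes before B starts
interleaved = [("A", 1), ("B", 2), ("A", 1)]  # B runs between A's read and write

print(replay(serial))       # 2
print(replay(interleaved))  # 1: A overwrites B's update
```

Replaying the interleaved schedule loses one update every time, which is the property that makes a "faulty" run usable for debugging.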



Figure 4. When a process is controlled by the simulator, its schedule is unaffected by any processes unrelated to the application.

So what exactly happens to a process when it is run under the scheduler simulator? The process runs normally until it creates a second thread or forks a child process. At this point the simulator takes control of both threads or processes.

If these spawn more threads or child processes, those also fall under the control of the simulator. The simplified simulation step is as follows:

  1. Randomly select the next thread to run
  2. Randomly select the number of instructions to be executed in the thread
  3. Run the thread until the selected number of instructions are executed, or until the thread blocks
  4. Go back to (1)

The simulation continues until (a) all threads and processes but one run to completion or are terminated, or (b) all threads and processes block.
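The simulation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how maze is implemented: "threads" are modeled as generators, each `yield` ends one "instruction", and the names `simulate`, `RAN`, `BLOCKED`, `incrementer`, and `run_once` are hypothetical. A seeded random generator stands in for the simulator's random choices, so the same seed gives the same schedule. For simplicity the sketch keeps scheduling the last remaining thread rather than releasing it, which is harmless because only one order is then possible.

```python
import random

RAN, BLOCKED = "ran", "blocked"   # status a toy thread reports after each step

def simulate(threads, seed, max_slice=3):
    """Run generator-based toy threads under a seeded random schedule."""
    rng = random.Random(seed)      # same seed -> same schedule
    live = dict(threads)           # name -> generator, threads not yet finished
    blocked = set()                # threads that blocked since the last progress
    trace = []                     # (thread name, instructions executed)
    while live:
        candidates = sorted(set(live) - blocked)
        if not candidates:
            return "all blocked", trace          # every live thread is blocked
        name = rng.choice(candidates)            # 1. pick the next thread
        budget = rng.randint(1, max_slice)       # 2. pick an instruction count
        executed = 0
        while executed < budget:                 # 3. run until budget or block
            try:
                status = next(live[name])
            except StopIteration:                # thread ran to completion
                del live[name]
                blocked.clear()
                break
            if status == BLOCKED:
                blocked.add(name)
                break
            executed += 1
            blocked.clear()                      # progress may unblock others
        trace.append((name, executed))           # 4. go back to step 1
    return "completed", trace

def incrementer(shared):
    """Toy thread: a non-atomic increment split into a read and a write."""
    tmp = shared["counter"]
    yield RAN
    shared["counter"] = tmp + 1
    yield RAN

def run_once(seed):
    shared = {"counter": 0}
    threads = {"A": incrementer(shared), "B": incrementer(shared)}
    _, trace = simulate(threads, seed)
    return shared["counter"], trace

# Different seeds explore different interleavings; a seed that yields 1
# instead of 2 identifies a "faulty" run that can be repeated exactly.
outcomes = {seed: run_once(seed)[0] for seed in range(20)}
print(outcomes)
```

Rerunning `run_once` with a seed that produced a lost update gives the same trace again, which is the reproducibility the simulator is after.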

In case (a), the remaining single-threaded process may continue uncontrolled, since there is no longer any concurrency in the system. If it creates more threads, the simulator takes over the scheduling again.

Case (b) is of special interest to us. Threads may block for a variety of reasons. One simple example is a deadlock: each of two threads is waiting for a mutex held by the other. We will look at this example in more detail later on. Any number of processes and threads may be waiting for something and never run to completion. They may be waiting for a signal, for I/O, for another event that will not occur, or for a resource that will not become available, because no process in the system is able to proceed any further.

In real life, the only way out of this situation is to terminate all blocked processes and start the application anew. Needless to say, this gives no insight into what actually happened in the system. Worse, the situation cannot be reliably reproduced. And of course, when the application is run under a debugger or as part of a regression suite, the parent application also blocks.

If the parent application is maze, things are different. Maze keeps the threads under tight control, and as soon as no thread or process is capable of running, it stops at once. Moreover, since the controlled processes are still alive, it can give the programmer a lot of diagnostic information. It can tell why each thread is blocked and print its stack and registers.
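The stop-and-diagnose behavior can be illustrated with the two-mutex deadlock mentioned earlier. The Python sketch below is a toy model under stated assumptions, not maze itself: threads are generators, mutexes are entries in a dictionary, and `worker` and `simulate_deadlock` are hypothetical names. The scheduler stops as soon as every live thread is blocked and reports what each one is waiting for.

```python
import random

def worker(name, first, second, locks):
    """Toy thread: take two mutexes in the given order, then release both."""
    for lock in (first, second):
        while locks[lock] is not None:
            yield ("blocked", lock)          # cannot proceed: mutex is held
        locks[lock] = name
        yield ("ran", f"locked {lock}")
    locks[first] = locks[second] = None
    yield ("ran", "released both")

def simulate_deadlock(seed, max_slice=2):
    """Return {} if both threads finish, else a report of who waits for what."""
    rng = random.Random(seed)
    locks = {"m1": None, "m2": None}         # mutex name -> owner (or None)
    live = {"T1": worker("T1", "m1", "m2", locks),
            "T2": worker("T2", "m2", "m1", locks)}   # opposite lock order
    waiting = {}                             # thread -> mutex it is blocked on
    while live:
        candidates = sorted(set(live) - set(waiting))
        if not candidates:                   # nobody can run: stop at once
            return {t: f"waiting for {m}, held by {locks[m]}"
                    for t, m in waiting.items()}
        name = rng.choice(candidates)
        for _ in range(rng.randint(1, max_slice)):
            try:
                status, detail = next(live[name])
            except StopIteration:            # thread ran to completion
                del live[name]
                waiting.clear()
                break
            if status == "blocked":
                waiting[name] = detail
                break
            waiting.clear()                  # progress may unblock others
    return {}

# Search a range of seeds for a schedule that deadlocks, then report it.
for seed in range(200):
    report = simulate_deadlock(seed)
    if report:
        print("deadlock with seed", seed, report)
        break
```

Because the threads are still "alive" when the scheduler stops, the report can name the mutex each thread is waiting for and its current owner. A real tool would add stack and register dumps, which this sketch does not attempt.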