Debugging parallel programs is generally acknowledged as a difficult
problem. Most of the literature on parallel programming focuses on the
constructive part of creating a parallel algorithm for a particular
problem and how to implement it, and mostly ignores issues of debugging
and troubleshooting [6].
Part of the problem could be that parallel programming has been on
the horizon for a very long time, always in line as the next big thing
but never actually arriving. It’s simply a case of “crying wolf.” And
despite that multiprocessing has also been on the horizon for a long
time, very little in terms of practical tool support for parallel
programming and debugging has been produced [2][7][8][9][12].
Debugging a software problem on a multiprocessor system involves
three main steps: provoking problems; reproducing problems that have
been provoked; and, diagnosing and resolving the reproduced problems.
What makes multitasking programming and debugging difficult is that
provoking and reproducing problems is much harder than with
single-threaded programs. In classic single-tasking programs, most bugs
are deterministic and caused by particular variations of input. Such
deterministic bugs will still occur in each task in a parallel system,
and will be solved in a traditional manner.
The parallel program adds a new category of bugs caused by the
interaction of multiple tasks, often depending on the precise timing of
their execution and communications. Such bugs are the most difficult
bugs to provoke and reproduce, not to mention understand and resolve.
They have a tendency to be Heisenbugs, meaning bugs that disappear when
you try to observe the program in order to discover them [7][8].
Discussed in Part 2 are techniques for debugging multiprocessor
programs in spite of this, some of which focus on provoking and
reproducing bugs and others on diagnosis of the problems, including
breakpoints on single tasks, traces, bigger locks, applying a heavy
load, using a different machine, similating the system, replay
execution, reverse debugging, and formal methods.
Breakpoints on Single Tasks
A traditional single-task debugger can be applied to a parallel
software system by debugging a single task in the traditional manner,
setting breakpoints and watchpoints, and investigating state. This
sometimes does work well, but carries some obvious risks in a parallel
setting:
1) Breakpoints disturb the
timing. Activating a breakpoint or watchpoint and just observing that
it was triggered (i.e. not stopping execution) will make the timing of
the task different. This can very well hide a timing-related bug.
2) A stopped task can be
swamped with traffic. If a single task is stopped and the rest of the
parallel program continues to run, the other tasks in the system might
keep queuing up requests for work and communications with the stopped
task. This might cause an avalanche of problems in the rest of the
parallel program if queues fill up and other tasks get time-outs on
replies.
3) A stopped processor in a
real-time system will quickly cause a system crash, as the computer
system is expected to keep up with the sensors and actuation needs of
the controlled system.
Many tools in the market today support running multiple debug
sessions in parallel, originally in order to support asymmetric
multiprocessors (one window to the DSP, one window to the control
processor). Such solutions typically feature various forms of
synchronizations between the debuggers, so that when one processor
stops, the other stops.
Such global stops have a certain skid time, since it takes a small
but non-zero time for the debugger to catch one breakpoint and order
for the other processors to stop. Also, doing a global stop might not
be feasible in a system connected to the outside world. For example,
the operating system will still need to run in order to handle
interrupts. Stopping a processor does not mean stopping the world.
Such tools are better than a single sequential debugger in an SMP
environment, but the hardware support is not really there to make them
work as well as expected. Consider the issue of using hardware
breakpoints: A hardware breakpoint set in the debug facilities of one
particular processor will not trigger if the task is scheduled onto
another processor before executing. Debuggers for SMPs must work around
such issues, as the hardware itself does not provide the facilities to
migrate breakpoints around with tasks.
Using traces
Tracing is an indirect debug that gathers knowledge: If an error is
provoked or reproduced when tracing is enabled, the trace will
hopefully contain hints to the sequence of events leading up to the
error. Using this information, you can then construct a hypothesis as
to what went wrong. Traces can be gathered in a number of ways: Printf,
monitor code, instrumented code, bus trace, hardware trace using core
support and simulation.
Printf: The most common trace tool (and
probably debug tool overall) is to use printouts in a program to
determine its execution path. The advantage of using printouts inside
the application code is that the information is typically
application-level and semantically rich; a debug printout generally
states that a program has reached a certain point and shows the values
of relevant data at that point.
The obvious disadvantage is that adding printouts to a program
disturbs its timing, potentially changing its behavior so that the bug
disappears. There have been cases where the printouts had to be left
inside shipping systems (albeit silenced), as the system did not work
without them.
Monitor code: A
more structured way to trace is to use special monitor programs that
monitor system execution, outside of the actual application code. Such
monitors will have less timing impact on the overall system, but also
provide less information compared with a printout approach. Some
multiprocessing libraries provide hooks for monitors. Embedded
operating systems typically offer tools that trace operating
system-level events like task scheduling, pre-emption and messages,
which can be very useful to understand the behavior of a system.
Instrumented
code: Some debug and software analysis tools instrument the
source code or binary code of a program to trace and profile its
behavior. Such instrumentation will change the program behavior, but a
tool can use intelligent algorithms to minimize this effect.
Instrumenting binaries makes it possible to specifically investigate
shared memory communications, as each load and store instruction can be
traced.
Bus trace:
Since basically all multiprocessor systems use caches, tracing system
activity on the memory bus will only provide a partial trace. Also,
some systems employ split buses or rings, where there is no single
point of observation. Continued hardware integration is making bus
analysis less and less feasible as time goes by.
Hardware trace
using core support: A hardware trace buffer drawing on
processor core abilities like embedded trace features and JTAG
interfaces is a standard embedded debug tool. Applying a hardware trace
to one processor in a multiprocessor system will provide a good trace
of its low-level behavior, and provide minimal timing disturbance
(provided the trace can keep up with the processor). The information
will have to be digested by a debugger to provide a good insight into
the system behavior.
Simulation: Simulation
(as discussed earlier offers a way to trace the behavior of a system at
a low level without disturbing it. The overall functionality is very
similar to hardware tracing, with the added ability to correlate traces
from multiple processors running in parallel. Simulation avoids the
correlation problem.
A general problem with tracing is that when tracing execution on a
multiprocessor system, it can be hard to reconstruct a global precise
ordering of events (in fact, such an order might not even theoretically
exist). Thus, even if the event log claims that event A happened before
event B based on local time stamps, the opposite might be true from a
global perspective.
In a distributed system where nodes communicate over a network, a
consistent global state and event history is known to be very hard to
accurately reconstruct.
A limitation of most forms of tracing is that they only capture
certain events. No information is provided as to what happens between
the events in the trace, or to parts of the system that are not being
traced. If more fine-grained or different information is needed, the
tracing has to be updated correspondingly and the program re-executed.
Detailed traces of the inputs and outputs of a particular processor,
program, task or system can be used to replay execution. [8].
Bigger Locks
If the problem in a parallel program appears to be the corruption of
shared data, a common debug technique is to make the locked sections
bigger, reducing available parallelism, until the program works.
Alternatively, the program is first implemented using very coarse
locking to ensure correct function initially. This is typically what
has happened in previous moves to multiprocessing in the computer
industry, for example in the multiprocessor ports and optimization of
Linux, SunOS [9] and MacOS X [10].
After locks have been made coarser, the scope of locking is
carefully reduced in scope to enhance performance, testing after each
change to make sure that execution is still correct. Finally, since
fine-grained locking is usually used to increase performance,
performance has to be analyzed. If the wrong locks are made
finer-grained, no performance increase might result from an extensive
effort. Too fine-grained locking might also create inefficiencies when
execution time is spent managing locks instead of performing useful
computation.
Applying a Heavy Load
Reliably provoking and reproducing errors is one of the major problems
in debugging parallel systems. Since errors typically depend on timing,
stretching the timing is a good way to provoke errors, and this can be
done by increasing the load on a system.
At close to 100 percent CPU load, a system is much more likely to
exhibit errors than at light loads. The best way to achieve this is to
create a test system that can repeatedly apply a very heavy load to a
system by starting a large number of tasks. The tasks should preferably
be a mix of different types of programs, in order to exercise all parts
of the software system. This tactic has proven very successful in
ensuring high reliability in systems like IBM mainframes [11].
Using a Different Machine
For most embedded systems, the machine a piece of code will run on is
usually known. The hardware design and part selection is part of the
system development. Thus, running a piece of code on a different
machine might seem pointless: We want it to work on the target system.
However, testing on different hardware is a good way to find timing
bugs, as the execution timing and task scheduling is likely to be
different. It also serves as some form of future-proofing of the code,
as code written today will be used in future upgraded systems using
faster processors and/or more processor cores.
A machine can be different in several relevant ways: Each processor
might be faster or slower than the actual target. It might also have a
different number of processors, more or fewer. Moving to different
processor architecture or a different operating system is likely to
take as much time under a tight schedule, but if it can be done, it is
an excellent way of flushing out bad assumptions in the code. Portable
code is in general more correct code.
Simulate the System
Simulation offers an interesting variant on using a different machine.
The key problem with debugging a parallel system is lack of
synchronization between parallel processors and determinism in
execution. The inherent chaotic behavior makes cyclical debugging very
hard. There is one technique that overcomes these problems: simulation
of the target system.
Traditionally, simulation has been used as a means to develop and
run software before the real hardware was available. In the area of
parallel computer system execution, simulation remains useful even
after hardware becomes available – simulation of a parallel machine
provides explicit control over the execution of instructions and
propagation of information, which makes for a very powerful debugging
tool.
A simulated system fundamentally provides determinism, as each
simulation run repeats the same execution (unless nondeterministic
inputs or timing variations are provided as external input). This
enables cyclical debugging again. It also makes reproducing errors
easier, as once a bug has been provoked in a simulation, the same bug
scenario can be replayed over and over again in exactly the same
manner.
Simulators will commonly be used as a backend to standard debuggers,
in which a user simply connects to the simulator instead of to a JTAG
probe, remote debug server or other debug port. The simulator makes it
possible to single-step one task while time is virtually standing still
for other parts of the system, which solves the problems with
breakpoints discussed earlier. For a real-time system where a
simulation of the external world is available, simulation also makes it
possible to single-step code and to stop the system without the
external world running ahead.
Full-system simulators capable of running the complete software
stack (from firmware to application software) also provide the ability
to inspect and debug the interaction of an application with the
operating system, as well as low-level code such as operating-system
kernels and device drivers. Fundamentally, a full-system simulator
provides a controlling layer underneath the operating system, which
provides capabilities which cannot be implemented in traditional debug
tools.
A simulator will not be able to faithfully reproduce the detailed
timing of a real system in all but the simplest cases. This is not
really a problem for software development, as the goal is to find and
eliminate software bugs: If they occur in the simulator, they could
have happened in some real circumstance as well, and thus they are
valid bugs that should be fixed.
Furthermore, a simulator offers an interesting variant of running on
a different system. The simulator can be used to increase the number of
processors present beyond that available on any real-world machine in
order to stress the software, and checks its behavior on future
machines with more cores. Simulation can make different processors run
at different speeds in order to provoke errors.
Performing such corner-case chasing in simulators is common in
hardware design, and it has great potential as a tool for software
developers as well.
Several vendors offer simulation solutions for embedded systems of
varying scope. The investment to create a simulation model can be quite
high. But for larger development projects, the benefits typically
outweigh the costs, especially when considering the added value of
reverse debugging.
Replay Execution
A recurring theme in parallel system debug is the problem of
re-executing a multitasking, multiprocessor program, which will result
in a different execution, potentially failing to trigger bugs. If a
program can be forced to execute in the same way multiple times,
debugging will be much easier. Such techniques are known as
record-replay techniques. Replay is different from tracing, as the
amount of data needed to facilitate a deterministic replay can be made
much smaller than a full trace. The trace recorded only needs to
contain the non-deterministic events from the real system, like task
switches, message arrivals and other communication between tasks
[7][8].
To enable replay, the system has to contain a monitor that can
record relevant data as well as force the execution to replay the same
behavior. Typically, this has to work on the operating-system level,
but it can also be applied at the application level if a program has
sufficiently well-defined interfaces to the outside world. (A state
machine is a good example of this: The time and type of messages
arriving will determine its precise behavior).
In a parallel program setting, replay can be done on a single task,
on a set of tasks (an entire application), or even on all software
running on a single computer node. Recording-based replay allows the
debugger to isolate one part of the system, as all input values and
asynchronous timings are known. The rest of the system does not need to
be running in the debug session, simplifying the problem.
Reverse Debugging
One of the key problems in debugging parallel programs in general, and
parallel programs running on multiprocessors in particular, is that
re-executing a program to put a breakpoint to locate an error is likely
not to reproduce an error.
Instead, a programmer would like to be able to go back from the
point at which an error occurs and investigate the events that
immediately preceded the error. What is needed here is a tool that
allows reverse debugging, i.e. the ability to back up in a program from
the current state instead of (attempting to) reproducing the state by
executing from the start [12].
Reverse debugging is particularly useful for bugs corrupting the
state of the machine. If the stack is overwritten or a process has
terminated, there is little material to use for post-mortem bug
analysis. By backing up in time, the state prior to the bug can be
examined.
Tools for reverse debugging do exist today in the market. They can
be based on simulation, trace collection or virtual machines [15]. Such
tools offer the ability to back up the execution to examine the path
that led to a tricky bug. Tools based on traces will by necessity have
a limited window of observation (limited by the size of the trace
buffer), while simulation- and virtual-machine-based tools can make use
of the resources of a powerful workstation to allow much longer
sequences of execution to be observed. Also, most tools do not offer
multiprocessor support.
Compared to replay as discussed earlier, reverse debugging is
typically faster in practical use, as backing up a short distance in
the execution is faster than reloading and replaying a complete
execution. One of the main advantages of reverse debugging is actually
that a programmer can focus on a small context and go back and forth
over it, applying different breakpoints and traces to understand the
bug, without long turnaround times that break the flow.
Formal Methods
No overview of parallel debugging techniques is complete without
mention of formal methods. Researchers have proposed many different
algorithms and techniques for formally proving the absence of errors or
automatically locating errors in parallel programs. Some of these
methods have made it onto the market. One instance is Intel’s
ThreadChecker, a tool that attempts to check that all shared data
accesses follow a correct locking regime. If the locking regime assumed
by the tool is used in your program, it can offer very good help for
locating that particular (and important) class of bugs.
In general, real-world formal method tools applicable to real
programs will be analogous to the classic lint program: They will find
potential problems and initially produce many false warnings. Not all
problems flagged will be real problems, as the tools’ understanding of
the code will always be incomplete. They will also only find the types
of bugs they were designed to find. For example, a wild pointer will
not be found by a tool checking for deadlock in message passing.
To read Part 1 in this series, go to "Programming
parallel machines."
>
Acknowledgements
Thanks to Professor Erik Hagersten, Lars Albertsson, Dan Wallin, Henrik
Löf, and my colleagues at Virtutech for discussions and input on
multiprocessing and software development.
This article is excerpted from a paper of the same name presented at the
Embedded Systems Conference Silicon Valley 2006. Used with permission of
the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.
Jakob Engblom is Business
Development Manager at Virtutech and
is also an adjunct professor at the department of Information
Technology at Uppsala university. He has been speaking about embedded
systems programming at university courses and industry trade shows for
the past five years. His interests include embedded systems
programming, simulation technology, real-time systems, and compiler
technology.
To learn about this general subject on Embedded.com go to
More
about multicores, multiprocessing and tools.
More Reading
If you want to know more about this subject, here are some recommended
articles and books that will provide insight into the issues of
parallel programming:
[1] Luiz Barroso. The Price of Performance, ACM Queue, September
2005.
[2] Jeff Bier. Back to the drawing board on multicores,
multiprocessing,
Embedded.com
, Nov 28, 2005.
[3] Michael Christofferson. Using an asymmetric multiprocessor model to
build hybrid multicore designs, Embedded.com,
Nov 5, 2005.
[4] Scott Gardner. Multicore Everywhere: The x86 and Beyond.
Extremetech.com, Jan 9, 2006.
[5] Kevin Krewell. ARM Opens Up to SMP – MPCore Supports One to Four
CPUs, Microprocessor Report, May 24, 2004.
[6] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar.
Introduction to Parallel Computing (2nd edition), Addison Wesley, 2003.
[7] Joel Huselius. Debugging Parallel Systems: A State of the Art
Report. MRTC Report no. 63, (www.mrtc.mdh.se),
September 2002.
[8] Charles E. McDowell and David P. Helmbold. Debugging Concurrent
Programs, ACM Computing Surveys, December 1989.
[9] Mache Creeger. Multicore CPUs for the Masses, ACM Queue, September
2005. [10] John Siracusa, MacOS X 10.4 Tiger, ArsTechnica
(www.arstechnica.com), April 28, 2005.
[11] S. Loveland, G. Miller, R. Previtt, and M. Shannon: Testing z/OS:
The Premier Operating System for IBM’s zSeries Server. IBM Systems
Journal, Vol 41, No 1, 2002.
[12] Herb Sutter and James Larus. Software and the Concurrency
Revolution, ACM Queue, September 2005.
[13] Herb Sutter. The Free Lunch is Over, Dr. Dobbs Journal, March
2005.
[14] John L Hennessy and David A Pattersson. Computer Architecture: A
Quantitative Approach, 3 rd Edition. Morgan Kaufmann, May 2002.
[15] Samuel King, George Dunlap, and Peter Chen. Debugging operating
systems with time-traveling virtual machines, Proc. USENIX Annual
Technical Conference, April 2005.
[16] Kang Su Gatlin and Pete Isensee. Reap the Benefits of
Multithreading without All the Work, MSDN Magazine, October 2005.