Debugging parallel programs is generally acknowledged as a difficult problem. Most of the literature on parallel programming focuses on the constructive part of creating a parallel algorithm for a particular problem and how to implement it, and mostly ignores issues of debugging and troubleshooting.
Part of the problem could be that parallel programming has been on the horizon for a very long time, always in line as the next big thing but never actually arriving. It’s simply a case of “crying wolf.” Even though multiprocessing has likewise been on the horizon for a long time, very little practical tool support for parallel programming and debugging has been produced.
Debugging a software problem on a multiprocessor system involves three main steps: provoking problems; reproducing problems that have been provoked; and diagnosing and resolving the reproduced problems.
What makes multitasking programming and debugging difficult is that provoking and reproducing problems is much harder than with single-threaded programs. In classic single-tasking programs, most bugs are deterministic and caused by particular variations of input. Such deterministic bugs will still occur in each task in a parallel system, and will be solved in the traditional manner.
The parallel program adds a new category of bugs caused by the interaction of multiple tasks, often depending on the precise timing of their execution and communications. Such bugs are the most difficult bugs to provoke and reproduce, not to mention understand and resolve. They have a tendency to be Heisenbugs, meaning bugs that disappear when you try to observe the program in order to discover them.
Part 2 of this series discusses techniques for debugging multiprocessor programs in spite of these difficulties. Some focus on provoking and reproducing bugs, others on diagnosing the problems. The techniques include breakpoints on single tasks, tracing, bigger locks, applying a heavy load, using a different machine, simulating the system, replaying execution, reverse debugging, and formal methods.
Breakpoints on Single Tasks
A traditional single-task debugger can be applied to a parallel software system by debugging a single task in the traditional manner, setting breakpoints and watchpoints, and investigating state. This sometimes works well, but carries some obvious risks in a parallel setting:
1) Breakpoints disturb the timing. Activating a breakpoint or watchpoint and just observing that it was triggered (i.e. not stopping execution) will make the timing of the task different. This can very well hide a timing-related bug.

2) A stopped task can be swamped with traffic. If a single task is stopped and the rest of the parallel program continues to run, the other tasks in the system might keep queuing up requests for work and communications with the stopped task. This might cause an avalanche of problems in the rest of the parallel program if queues fill up and other tasks get time-outs on replies.

3) A stopped processor in a real-time system will quickly cause a system crash, as the computer system is expected to keep up with the sensors and actuation needs of the controlled system.
Many tools on the market today support running multiple debug sessions in parallel, originally in order to support asymmetric multiprocessors (one window to the DSP, one window to the control processor). Such solutions typically feature various forms of synchronization between the debuggers, so that when one processor stops, the others stop.
Such global stops have a certain skid time, since it takes a small but non-zero time for the debugger to catch one breakpoint and order the other processors to stop. Also, doing a global stop might not be feasible in a system connected to the outside world. For example, the operating system will still need to run in order to handle interrupts. Stopping a processor does not mean stopping the world.
Such tools are better than a single sequential debugger in an SMP environment, but the hardware support is not really there to make them work as well as expected. Consider the issue of using hardware breakpoints: a hardware breakpoint set in the debug facilities of one particular processor will not trigger if the task is scheduled onto another processor before executing. Debuggers for SMPs must work around such issues, as the hardware itself does not provide the facilities to migrate breakpoints around with tasks.
Tracing

Tracing is an indirect debug technique that gathers knowledge: if an error is provoked or reproduced while tracing is enabled, the trace will hopefully contain hints to the sequence of events leading up to the error. Using this information, you can then construct a hypothesis as to what went wrong. Traces can be gathered in a number of ways: printf, monitor code, instrumented code, bus trace, hardware trace using core support, and simulation.
Printf: The most common trace tool (and probably the most common debug tool overall) is to use printouts in a program to determine its execution path. The advantage of using printouts inside the application code is that the information is typically application-level and semantically rich; a debug printout generally states that a program has reached a certain point and shows the values of relevant data at that point.
The obvious disadvantage is that adding printouts to a program disturbs its timing, potentially changing its behavior so that the bug disappears. There have been cases where the printouts had to be left inside shipping systems (albeit silenced), as the system did not work without them.
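As a minimal sketch of how such printouts can be kept in shipping code yet silenced, a trace macro can be guarded by a runtime flag. The macro and flag names below are illustrative, not taken from any particular system:

```c
#include <stdio.h>

/* Illustrative sketch: a trace macro that can be silenced at run time,
 * so printouts can stay in shipping code without producing output. */
static volatile int trace_enabled = 1;

#define TRACE(...) \
    do { if (trace_enabled) fprintf(stderr, __VA_ARGS__); } while (0)

/* Hypothetical task code using the macro. */
int process_request(int id)
{
    TRACE("task %d: entering process_request\n", id);
    int result = id * 2;   /* stand-in for real work */
    TRACE("task %d: result = %d\n", id, result);
    return result;
}
```

Printing to stderr keeps the trace separate from program output, and setting `trace_enabled` to 0 silences the printouts without recompiling, though the (small) cost of the flag test remains.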
Monitor code: A more structured way to trace is to use special monitor programs that monitor system execution, outside of the actual application code. Such monitors will have less timing impact on the overall system, but also provide less information compared with a printout approach. Some multiprocessing libraries provide hooks for monitors. Embedded operating systems typically offer tools that trace operating-system-level events like task scheduling, pre-emption and messages, which can be very useful for understanding the behavior of a system.
Instrumented code: Some debug and software-analysis tools instrument the source code or binary code of a program to trace and profile its behavior. Such instrumentation will change the program behavior, but a tool can use intelligent algorithms to minimize this effect. Instrumenting binaries makes it possible to specifically investigate shared-memory communications, as each load and store instruction can be traced.
Bus trace: Since basically all multiprocessor systems use caches, tracing system activity on the memory bus will only provide a partial trace. Also, some systems employ split buses or rings, where there is no single point of observation. Continued hardware integration is making bus analysis less and less feasible as time goes by.
Hardware trace using core support: A hardware trace buffer drawing on processor-core abilities like embedded trace features and JTAG interfaces is a standard embedded debug tool. Applying a hardware trace to one processor in a multiprocessor system will provide a good trace of its low-level behavior with minimal timing disturbance (provided the trace can keep up with the processor). The information will have to be digested by a debugger to provide good insight into the system behavior.
Simulation: Simulation (as discussed earlier) offers a way to trace the behavior of a system at a low level without disturbing it. The overall functionality is very similar to hardware tracing, with the added ability to correlate traces from multiple processors running in parallel, thus avoiding the correlation problem.
A general problem with tracing execution on a multiprocessor system is that it can be hard to reconstruct a precise global ordering of events (in fact, such an order might not even theoretically exist). Thus, even if the event log claims that event A happened before event B based on local time stamps, the opposite might be true from a global perspective.
In a distributed system where nodes communicate over a network, a consistent global state and event history is known to be very hard to accurately reconstruct.
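One classic way to obtain a usable event ordering without trusting local time stamps is a logical clock in the style of Lamport: each node stamps its events with a counter, and receiving a message pushes the receiver's counter past the sender's stamp, so the stamps respect happened-before. A minimal sketch (names are illustrative):

```c
/* Sketch of a Lamport logical clock for ordering trace events. */
typedef struct { unsigned long clock; } lclock_t;

/* A local event: advance the clock and return the event's stamp. */
unsigned long lclock_event(lclock_t *c)
{
    return ++c->clock;
}

/* Receiving a message stamped by the sender: jump the local clock
 * past the sender's stamp, preserving happened-before order. */
unsigned long lclock_receive(lclock_t *c, unsigned long msg_stamp)
{
    if (msg_stamp > c->clock)
        c->clock = msg_stamp;
    return ++c->clock;
}
```

Logical clocks only give a partial order: stamps of causally unrelated events on different nodes may still compare arbitrarily, which is exactly the caveat noted above.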
A limitation of most forms of tracing is that they only capture certain events. No information is provided as to what happens between the events in the trace, or in parts of the system that are not being traced. If more fine-grained or different information is needed, the tracing has to be updated correspondingly and the program re-executed.
Detailed traces of the inputs and outputs of a particular processor, program, task or system can be used to replay execution.
Bigger Locks

If the problem in a parallel program appears to be the corruption of shared data, a common debug technique is to make the locked sections bigger, reducing available parallelism, until the program works. Alternatively, the program is first implemented using very coarse locking to ensure correct function initially. This is typically what has happened in previous moves to multiprocessing in the computer industry, for example in the multiprocessor ports and optimization of Linux, SunOS, and MacOS X.
After locks have been made coarser, the scope of locking is carefully reduced to enhance performance, testing after each change to make sure that execution is still correct. Finally, since fine-grained locking is usually used to increase performance, performance has to be analyzed. If the wrong locks are made finer-grained, no performance increase might result from an extensive effort. Too fine-grained locking might also create inefficiencies when execution time is spent managing locks instead of performing useful computation.
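The coarse-first strategy can be illustrated with POSIX threads. This is a hypothetical sketch, not code from any of the systems mentioned: one big lock serializes everything at first, and splitting it into per-variable locks is the later, carefully tested refinement:

```c
#include <pthread.h>

/* Shared state guarded by locks. */
static long counter_a, counter_b;

/* Coarse version: one big lock serializes all shared-data access.
 * Correct and easy to reason about, but no parallelism inside it. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

void bump_both_coarse(void)
{
    pthread_mutex_lock(&big_lock);
    counter_a++;
    counter_b++;
    pthread_mutex_unlock(&big_lock);
}

/* Finer-grained refinement: separate locks let the counters be
 * updated in parallel, at the cost of more lock management. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;

void bump_a_fine(void)
{
    pthread_mutex_lock(&lock_a);
    counter_a++;
    pthread_mutex_unlock(&lock_a);
}
```

Note that the refinement must be done consistently: if some code path still touches `counter_a` under `big_lock` while another uses `lock_a`, the data is no longer protected, which is exactly the class of bug the lock-splitting test cycle is meant to catch.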
Applying a Heavy Load
Reliably provoking and reproducing errors is one of the major problems in debugging parallel systems. Since errors typically depend on timing, stretching the timing is a good way to provoke errors, and this can be done by increasing the load on a system.
At close to 100 percent CPU load, a system is much more likely to exhibit errors than at light loads. The best way to achieve this is to create a test system that can repeatedly apply a very heavy load by starting a large number of tasks. The tasks should preferably be a mix of different types of programs, in order to exercise all parts of the software system. This tactic has proven very successful in ensuring high reliability in systems like IBM mainframes.
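A minimal sketch of such a load generator using POSIX threads might look as follows; the worker body and thread count are placeholders for a real mix of task types:

```c
#include <pthread.h>

/* Sketch of a load burst: start many workers at once to push CPU
 * load toward 100 percent and stretch timing windows. */
#define NWORKERS 16

static void *worker(void *arg)
{
    volatile long *sum = arg;   /* per-worker accumulator, no sharing */
    for (long i = 0; i < 100000; i++)
        (*sum)++;               /* busy work standing in for a real task */
    return NULL;
}

long run_load_burst(void)
{
    pthread_t tid[NWORKERS];
    static long sums[NWORKERS];
    long total = 0;

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, &sums[i]);
    for (int i = 0; i < NWORKERS; i++) {
        pthread_join(tid[i], NULL);
        total += sums[i];
    }
    return total;
}
```

In a real test harness, the workers would be the actual application tasks plus background noise, and the burst would be repeated many times while checking system invariants, since it is the repetition under load that flushes out timing bugs.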
Using a Different Machine
For most embedded systems, the machine a piece of code will run on is usually known; the hardware design and part selection is part of the system development. Thus, running a piece of code on a different machine might seem pointless: we want it to work on the target system. However, testing on different hardware is a good way to find timing bugs, as the execution timing and task scheduling are likely to be different. It also serves as a form of future-proofing of the code, as code written today will be used in future upgraded systems using faster processors and/or more processor cores.
A machine can be different in several relevant ways: each processor might be faster or slower than the actual target, and there might be a different number of processors, more or fewer. Moving to a different processor architecture or a different operating system is likely to take too much time under a tight schedule, but if it can be done, it is an excellent way of flushing out bad assumptions in the code. Portable code is in general more correct code.
Simulate the System
Simulation offers an interesting variant on using a different machine. The key problems with debugging a parallel system are the lack of synchronization between parallel processors and the lack of determinism in execution. The inherent chaotic behavior makes cyclical debugging very hard. There is one technique that overcomes these problems: simulation of the target system.
Traditionally, simulation has been used as a means to develop and run software before the real hardware was available. In the area of parallel computer systems, simulation remains useful even after hardware becomes available: simulation of a parallel machine provides explicit control over the execution of instructions and the propagation of information, which makes for a very powerful debugging tool.
A simulated system fundamentally provides determinism, as each simulation run repeats the same execution (unless nondeterministic inputs or timing variations are provided as external input). This enables cyclical debugging again. It also makes reproducing errors easier: once a bug has been provoked in a simulation, the same bug scenario can be replayed over and over again in exactly the same manner.
Simulators are commonly used as a backend to standard debuggers, in which case the user simply connects to the simulator instead of to a JTAG probe, remote debug server or other debug port. The simulator makes it possible to single-step one task while time is virtually standing still for other parts of the system, which solves the problems with breakpoints discussed earlier. For a real-time system where a simulation of the external world is available, simulation also makes it possible to single-step code and to stop the system without the external world running ahead.
Full-system simulators capable of running the complete software stack (from firmware to application software) also provide the ability to inspect and debug the interaction of an application with the operating system, as well as low-level code such as operating-system kernels and device drivers. Fundamentally, a full-system simulator provides a controlling layer underneath the operating system, offering capabilities that cannot be implemented in traditional debug tools.
A simulator will not be able to faithfully reproduce the detailed timing of a real system in all but the simplest cases. This is not really a problem for software development, as the goal is to find and eliminate software bugs: if they occur in the simulator, they could have happened in some real circumstance as well, and thus they are valid bugs that should be fixed.
Furthermore, a simulator offers an interesting variant of running on a different system. The simulator can be used to increase the number of processors beyond that available on any real-world machine in order to stress the software and check its behavior on future machines with more cores. Simulation can also make different processors run at different speeds in order to provoke errors.
Performing such corner-case chasing in simulators is common in hardware design, and it has great potential as a tool for software developers as well.
Several vendors offer simulation solutions for embedded systems of varying scope. The investment to create a simulation model can be quite high, but for larger development projects the benefits typically outweigh the costs, especially when considering the added value of reverse debugging.
Replay Execution

A recurring theme in parallel-system debug is the problem that re-executing a multitasking, multiprocessor program will result in a different execution, potentially failing to trigger bugs. If a program can be forced to execute in the same way multiple times, debugging will be much easier. Such techniques are known as record-replay techniques. Replay is different from tracing, as the amount of data needed to facilitate a deterministic replay can be made much smaller than a full trace. The recording only needs to contain the non-deterministic events from the real system, like task switches, message arrivals and other communication between tasks.
To enable replay, the system has to contain a monitor that can record relevant data as well as force the execution to replay the same behavior. Typically, this has to work at the operating-system level, but it can also be applied at the application level if a program has sufficiently well-defined interfaces to the outside world. (A state machine is a good example of this: the time and type of messages arriving will determine its precise behavior.)
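An application-level record-replay monitor can be sketched as follows. This toy version logs each nondeterministic event (here, a message arrival) in record mode and feeds the same sequence back in replay mode; all names are illustrative:

```c
/* Toy sketch of application-level record-replay: log nondeterministic
 * events in record mode, feed them back unchanged in replay mode. */
enum mode { RECORD, REPLAY };

#define LOG_MAX 64
static int event_log[LOG_MAX];
static int log_len, replay_pos;

/* In RECORD mode, pass the live event through and remember it.
 * In REPLAY mode, ignore the live value and return the logged one,
 * so the program sees exactly the recorded sequence again. */
int next_event(enum mode m, int live_value)
{
    if (m == RECORD) {
        if (log_len < LOG_MAX)
            event_log[log_len++] = live_value;
        return live_value;
    }
    return event_log[replay_pos++];
}
```

A real monitor would also record timing-related events such as task switches and would force the scheduler to honor them on replay, which is why operating-system support is usually needed.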
In a parallel-program setting, replay can be done on a single task, on a set of tasks (an entire application), or even on all software running on a single computer node. Recording-based replay allows the debugger to isolate one part of the system, as all input values and asynchronous timings are known. The rest of the system does not need to be running in the debug session, simplifying the problem.
Reverse Debugging

One of the key problems in debugging parallel programs in general, and parallel programs running on multiprocessors in particular, is that re-executing a program in order to stop at a breakpoint and locate an error is likely not to reproduce the error.
Instead, a programmer would like to be able to go back from the point at which an error occurs and investigate the events that immediately preceded the error. What is needed here is a tool that allows reverse debugging, i.e. the ability to back up in a program from the current state instead of (attempting to) reproduce the state by executing from the start.
Reverse debugging is particularly useful for bugs corrupting the state of the machine. If the stack is overwritten or a process has terminated, there is little material to use for post-mortem bug analysis. By backing up in time, the state prior to the bug can be examined.
Tools for reverse debugging do exist on the market today. They can be based on simulation, trace collection or virtual machines. Such tools offer the ability to back up the execution to examine the path that led to a tricky bug. Tools based on traces will by necessity have a limited window of observation (limited by the size of the trace buffer), while simulation- and virtual-machine-based tools can make use of the resources of a powerful workstation to allow much longer sequences of execution to be observed. Also, most tools do not offer multiprocessor support.
Compared to replay as discussed earlier, reverse debugging is typically faster in practical use, as backing up a short distance in the execution is faster than reloading and replaying a complete execution. One of the main advantages of reverse debugging is that a programmer can focus on a small context and go back and forth over it, applying different breakpoints and traces to understand the bug, without long turnaround times that break the flow.
Formal Methods

No overview of parallel debugging techniques is complete without a mention of formal methods. Researchers have proposed many different algorithms and techniques for formally proving the absence of errors or automatically locating errors in parallel programs. Some of these methods have made it onto the market. One instance is Intel’s Thread Checker, a tool that attempts to check that all shared-data accesses follow a correct locking regime. If the locking regime assumed by the tool is used in your program, it can offer very good help in locating that particular (and important) class of bugs.
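The kind of check such a tool performs can be sketched with the lockset idea (as in the Eraser race detector, which this class of tools resembles): for each shared variable, track the set of locks held at every access; if the running intersection ever becomes empty, no single lock consistently protects the variable. A simplified sketch using bitmask lock sets (all names illustrative):

```c
/* Simplified lockset check: locks are bits in a mask, and each shared
 * variable has a shadow record of the locks seen at every access. */
typedef unsigned locks_held_t;  /* bitmask of currently held locks */

typedef struct {
    unsigned candidate_locks;   /* locks held at every access so far */
    int accessed;               /* has the variable been touched yet? */
} shadow_t;

/* Record an access with the given set of held locks.
 * Returns 1 while some lock still protects every access, 0 once the
 * intersection is empty (a potential data race to report). */
int check_access(shadow_t *s, locks_held_t held)
{
    if (!s->accessed) {
        s->candidate_locks = held;
        s->accessed = 1;
    } else {
        s->candidate_locks &= held;
    }
    return s->candidate_locks != 0;
}
```

This also shows why such tools produce false warnings: code that is safe for other reasons (e.g. accesses confined to an initialization phase) still shows an empty lockset and gets flagged.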
In general, real-world formal-method tools applicable to real programs will be analogous to the classic lint program: they will find potential problems and initially produce many false warnings. Not all problems flagged will be real problems, as the tools’ understanding of the code will always be incomplete. They will also only find the types of bugs they were designed to find. For example, a wild pointer will not be found by a tool checking for deadlock in message passing.
To read Part 1 in this series, go to “Programming parallel machines.”
Thanks to Professor Erik Hagersten, Lars Albertsson, Dan Wallin, Henrik Löf, and my colleagues at Virtutech for discussions and input on multiprocessing and software development.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.
Jakob Engblom is Business Development Manager at Virtutech and is also an adjunct professor at the Department of Information Technology at Uppsala University. He has been speaking about embedded systems programming at university courses and industry trade shows for the past five years. His interests include embedded systems programming, simulation technology, real-time systems, and compiler technology.
To learn about this general subject on Embedded.com, go to More about multicores, multiprocessing and tools.
If you want to know more about this subject, here are some recommended articles and books that will provide insight into the issues of parallel programming:
Luiz Barroso. The Price of Performance, ACM Queue, September 2005.

Jeff Bier. Back to the drawing board on multicores, multiprocessing, Embedded.com, Nov 28, 2005.

Michael Christofferson. Using an asymmetric multiprocessor model to build hybrid multicore designs, Embedded.com, Nov 5, 2005.

Scott Gardner. Multicore Everywhere: The x86 and Beyond. Extremetech.com, Jan 9, 2006.

Kevin Krewell. ARM Opens Up to SMP – MPCore Supports One to Four CPUs, Microprocessor Report, May 24, 2004.

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing (2nd edition), Addison Wesley, 2003.

Joel Huselius. Debugging Parallel Systems: A State of the Art Report. MRTC Report no. 63 (www.mrtc.mdh.se), September 2002.

Charles E. McDowell and David P. Helmbold. Debugging Concurrent Programs, ACM Computing Surveys, December 1989.

Mache Creeger. Multicore CPUs for the Masses, ACM Queue, September 2005.

John Siracusa. MacOS X 10.4 Tiger, Ars Technica (www.arstechnica.com), April 28, 2005.

S. Loveland, G. Miller, R. Previtt, and M. Shannon. Testing z/OS: The Premier Operating System for IBM’s zSeries Server. IBM Systems Journal, Vol 41, No 1, 2002.

Herb Sutter and James Larus. Software and the Concurrency Revolution, ACM Queue, September 2005.

Herb Sutter. The Free Lunch is Over, Dr. Dobb’s Journal, March 2005.

John L Hennessy and David A Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann, May 2002.

Samuel King, George Dunlap, and Peter Chen. Debugging operating systems with time-traveling virtual machines, Proc. USENIX Annual Technical Conference, April 2005.

Kang Su Gatlin and Pete Isensee. Reap the Benefits of Multithreading without All the Work, MSDN Magazine, October 2005.