CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Debugging real-time multiprocessor systems: Part 2
Debugging Parallel Programs



Embedded.com
Debugging parallel programs is generally acknowledged as a difficult problem. Most of the literature on parallel programming focuses on the constructive part of creating a parallel algorithm for a particular problem and how to implement it, and mostly ignores issues of debugging and troubleshooting [6].

Part of the problem could be that parallel programming has been on the horizon for a very long time, always in line as the next big thing but never actually arriving. It’s simply a case of “crying wolf.” And despite that multiprocessing has also been on the horizon for a long time, very little in terms of practical tool support for parallel programming and debugging has been produced [2][7][8][9][12].

Debugging a software problem on a multiprocessor system involves three main steps: provoking problems; reproducing problems that have been provoked; and, diagnosing and resolving the reproduced problems.

What makes multitasking programming and debugging difficult is that provoking and reproducing problems is much harder than with single-threaded programs. In classic single-tasking programs, most bugs are deterministic and caused by particular variations of input. Such deterministic bugs will still occur in each task in a parallel system, and will be solved in a traditional manner.

The parallel program adds a new category of bugs caused by the interaction of multiple tasks, often depending on the precise timing of their execution and communications. Such bugs are the most difficult bugs to provoke and reproduce, not to mention understand and resolve. They have a tendency to be Heisenbugs, meaning bugs that disappear when you try to observe the program in order to discover them [7][8].

Discussed in Part 2 are techniques for debugging multiprocessor programs in spite of this, some of which focus on provoking and reproducing bugs and others on diagnosis of the problems, including breakpoints on single tasks, traces, bigger locks, applying a heavy load, using a different machine, similating the system, replay execution, reverse debugging, and formal methods.

Breakpoints on Single Tasks
A traditional single-task debugger can be applied to a parallel software system by debugging a single task in the traditional manner, setting breakpoints and watchpoints, and investigating state. This sometimes does work well, but carries some obvious risks in a parallel setting:

1) Breakpoints disturb the timing. Activating a breakpoint or watchpoint and just observing that it was triggered (i.e. not stopping execution) will make the timing of the task different. This can very well hide a timing-related bug.

2) A stopped task can be swamped with traffic. If a single task is stopped and the rest of the parallel program continues to run, the other tasks in the system might keep queuing up requests for work and communications with the stopped task. This might cause an avalanche of problems in the rest of the parallel program if queues fill up and other tasks get time-outs on replies.

3) A stopped processor in a real-time system will quickly cause a system crash, as the computer system is expected to keep up with the sensors and actuation needs of the controlled system.

Many tools in the market today support running multiple debug sessions in parallel, originally in order to support asymmetric multiprocessors (one window to the DSP, one window to the control processor). Such solutions typically feature various forms of synchronizations between the debuggers, so that when one processor stops, the other stops.

Such global stops have a certain skid time, since it takes a small but non-zero time for the debugger to catch one breakpoint and order for the other processors to stop. Also, doing a global stop might not be feasible in a system connected to the outside world. For example, the operating system will still need to run in order to handle interrupts. Stopping a processor does not mean stopping the world.

Such tools are better than a single sequential debugger in an SMP environment, but the hardware support is not really there to make them work as well as expected. Consider the issue of using hardware breakpoints: A hardware breakpoint set in the debug facilities of one particular processor will not trigger if the task is scheduled onto another processor before executing. Debuggers for SMPs must work around such issues, as the hardware itself does not provide the facilities to migrate breakpoints around with tasks.

Using traces
Tracing is an indirect debug that gathers knowledge: If an error is provoked or reproduced when tracing is enabled, the trace will hopefully contain hints to the sequence of events leading up to the error. Using this information, you can then construct a hypothesis as to what went wrong. Traces can be gathered in a number of ways: Printf, monitor code, instrumented code, bus trace, hardware trace using core support and simulation.

Printf: The most common trace tool (and probably debug tool overall) is to use printouts in a program to determine its execution path. The advantage of using printouts inside the application code is that the information is typically application-level and semantically rich; a debug printout generally states that a program has reached a certain point and shows the values of relevant data at that point.

The obvious disadvantage is that adding printouts to a program disturbs its timing, potentially changing its behavior so that the bug disappears. There have been cases where the printouts had to be left inside shipping systems (albeit silenced), as the system did not work without them.

Monitor code: A more structured way to trace is to use special monitor programs that monitor system execution, outside of the actual application code. Such monitors will have less timing impact on the overall system, but also provide less information compared with a printout approach. Some multiprocessing libraries provide hooks for monitors. Embedded operating systems typically offer tools that trace operating system-level events like task scheduling, pre-emption and messages, which can be very useful to understand the behavior of a system.

Instrumented code: Some debug and software analysis tools instrument the source code or binary code of a program to trace and profile its behavior. Such instrumentation will change the program behavior, but a tool can use intelligent algorithms to minimize this effect. Instrumenting binaries makes it possible to specifically investigate shared memory communications, as each load and store instruction can be traced.

Bus trace: Since basically all multiprocessor systems use caches, tracing system activity on the memory bus will only provide a partial trace. Also, some systems employ split buses or rings, where there is no single point of observation. Continued hardware integration is making bus analysis less and less feasible as time goes by.

Hardware trace using core support: A hardware trace buffer drawing on processor core abilities like embedded trace features and JTAG interfaces is a standard embedded debug tool. Applying a hardware trace to one processor in a multiprocessor system will provide a good trace of its low-level behavior, and provide minimal timing disturbance (provided the trace can keep up with the processor). The information will have to be digested by a debugger to provide a good insight into the system behavior.

Simulation: Simulation (as discussed earlier offers a way to trace the behavior of a system at a low level without disturbing it. The overall functionality is very similar to hardware tracing, with the added ability to correlate traces from multiple processors running in parallel. Simulation avoids the correlation problem.

A general problem with tracing is that when tracing execution on a multiprocessor system, it can be hard to reconstruct a global precise ordering of events (in fact, such an order might not even theoretically exist). Thus, even if the event log claims that event A happened before event B based on local time stamps, the opposite might be true from a global perspective.

In a distributed system where nodes communicate over a network, a consistent global state and event history is known to be very hard to accurately reconstruct.

A limitation of most forms of tracing is that they only capture certain events. No information is provided as to what happens between the events in the trace, or to parts of the system that are not being traced. If more fine-grained or different information is needed, the tracing has to be updated correspondingly and the program re-executed.

Detailed traces of the inputs and outputs of a particular processor, program, task or system can be used to replay execution. [8].

Bigger Locks
If the problem in a parallel program appears to be the corruption of shared data, a common debug technique is to make the locked sections bigger, reducing available parallelism, until the program works. Alternatively, the program is first implemented using very coarse locking to ensure correct function initially. This is typically what has happened in previous moves to multiprocessing in the computer industry, for example in the multiprocessor ports and optimization of Linux, SunOS [9] and MacOS X [10].

After locks have been made coarser, the scope of locking is carefully reduced in scope to enhance performance, testing after each change to make sure that execution is still correct. Finally, since fine-grained locking is usually used to increase performance, performance has to be analyzed. If the wrong locks are made finer-grained, no performance increase might result from an extensive effort. Too fine-grained locking might also create inefficiencies when execution time is spent managing locks instead of performing useful computation.

Applying a Heavy Load
Reliably provoking and reproducing errors is one of the major problems in debugging parallel systems. Since errors typically depend on timing, stretching the timing is a good way to provoke errors, and this can be done by increasing the load on a system.

At close to 100 percent CPU load, a system is much more likely to exhibit errors than at light loads. The best way to achieve this is to create a test system that can repeatedly apply a very heavy load to a system by starting a large number of tasks. The tasks should preferably be a mix of different types of programs, in order to exercise all parts of the software system. This tactic has proven very successful in ensuring high reliability in systems like IBM mainframes [11].

Using a Different Machine
For most embedded systems, the machine a piece of code will run on is usually known. The hardware design and part selection is part of the system development. Thus, running a piece of code on a different machine might seem pointless: We want it to work on the target system. However, testing on different hardware is a good way to find timing bugs, as the execution timing and task scheduling is likely to be different. It also serves as some form of future-proofing of the code, as code written today will be used in future upgraded systems using faster processors and/or more processor cores.

A machine can be different in several relevant ways: Each processor might be faster or slower than the actual target. It might also have a different number of processors, more or fewer. Moving to different processor architecture or a different operating system is likely to take as much time under a tight schedule, but if it can be done, it is an excellent way of flushing out bad assumptions in the code. Portable code is in general more correct code.

Simulate the System
Simulation offers an interesting variant on using a different machine. The key problem with debugging a parallel system is lack of synchronization between parallel processors and determinism in execution. The inherent chaotic behavior makes cyclical debugging very hard. There is one technique that overcomes these problems: simulation of the target system.

Traditionally, simulation has been used as a means to develop and run software before the real hardware was available. In the area of parallel computer system execution, simulation remains useful even after hardware becomes available – simulation of a parallel machine provides explicit control over the execution of instructions and propagation of information, which makes for a very powerful debugging tool.

A simulated system fundamentally provides determinism, as each simulation run repeats the same execution (unless nondeterministic inputs or timing variations are provided as external input). This enables cyclical debugging again. It also makes reproducing errors easier, as once a bug has been provoked in a simulation, the same bug scenario can be replayed over and over again in exactly the same manner.

Simulators will commonly be used as a backend to standard debuggers, in which a user simply connects to the simulator instead of to a JTAG probe, remote debug server or other debug port. The simulator makes it possible to single-step one task while time is virtually standing still for other parts of the system, which solves the problems with breakpoints discussed earlier. For a real-time system where a simulation of the external world is available, simulation also makes it possible to single-step code and to stop the system without the external world running ahead.

Full-system simulators capable of running the complete software stack (from firmware to application software) also provide the ability to inspect and debug the interaction of an application with the operating system, as well as low-level code such as operating-system kernels and device drivers. Fundamentally, a full-system simulator provides a controlling layer underneath the operating system, which provides capabilities which cannot be implemented in traditional debug tools.

A simulator will not be able to faithfully reproduce the detailed timing of a real system in all but the simplest cases. This is not really a problem for software development, as the goal is to find and eliminate software bugs: If they occur in the simulator, they could have happened in some real circumstance as well, and thus they are valid bugs that should be fixed.

Furthermore, a simulator offers an interesting variant of running on a different system. The simulator can be used to increase the number of processors present beyond that available on any real-world machine in order to stress the software, and checks its behavior on future machines with more cores. Simulation can make different processors run at different speeds in order to provoke errors.

Performing such corner-case chasing in simulators is common in hardware design, and it has great potential as a tool for software developers as well.

Several vendors offer simulation solutions for embedded systems of varying scope. The investment to create a simulation model can be quite high. But for larger development projects, the benefits typically outweigh the costs, especially when considering the added value of reverse debugging.

Replay Execution
A recurring theme in parallel system debug is the problem of re-executing a multitasking, multiprocessor program, which will result in a different execution, potentially failing to trigger bugs. If a program can be forced to execute in the same way multiple times, debugging will be much easier. Such techniques are known as record-replay techniques. Replay is different from tracing, as the amount of data needed to facilitate a deterministic replay can be made much smaller than a full trace. The trace recorded only needs to contain the non-deterministic events from the real system, like task switches, message arrivals and other communication between tasks [7][8].

To enable replay, the system has to contain a monitor that can record relevant data as well as force the execution to replay the same behavior. Typically, this has to work on the operating-system level, but it can also be applied at the application level if a program has sufficiently well-defined interfaces to the outside world. (A state machine is a good example of this: The time and type of messages arriving will determine its precise behavior).

In a parallel program setting, replay can be done on a single task, on a set of tasks (an entire application), or even on all software running on a single computer node. Recording-based replay allows the debugger to isolate one part of the system, as all input values and asynchronous timings are known. The rest of the system does not need to be running in the debug session, simplifying the problem.

Reverse Debugging
One of the key problems in debugging parallel programs in general, and parallel programs running on multiprocessors in particular, is that re-executing a program to put a breakpoint to locate an error is likely not to reproduce an error.

Instead, a programmer would like to be able to go back from the point at which an error occurs and investigate the events that immediately preceded the error. What is needed here is a tool that allows reverse debugging, i.e. the ability to back up in a program from the current state instead of (attempting to) reproducing the state by executing from the start [12].

Reverse debugging is particularly useful for bugs corrupting the state of the machine. If the stack is overwritten or a process has terminated, there is little material to use for post-mortem bug analysis. By backing up in time, the state prior to the bug can be examined.

Tools for reverse debugging do exist today in the market. They can be based on simulation, trace collection or virtual machines [15]. Such tools offer the ability to back up the execution to examine the path that led to a tricky bug. Tools based on traces will by necessity have a limited window of observation (limited by the size of the trace buffer), while simulation- and virtual-machine-based tools can make use of the resources of a powerful workstation to allow much longer sequences of execution to be observed. Also, most tools do not offer multiprocessor support.

Compared to replay as discussed earlier, reverse debugging is typically faster in practical use, as backing up a short distance in the execution is faster than reloading and replaying a complete execution. One of the main advantages of reverse debugging is actually that a programmer can focus on a small context and go back and forth over it, applying different breakpoints and traces to understand the bug, without long turnaround times that break the flow.

Formal Methods
No overview of parallel debugging techniques is complete without mention of formal methods. Researchers have proposed many different algorithms and techniques for formally proving the absence of errors or automatically locating errors in parallel programs. Some of these methods have made it onto the market. One instance is Intel’s ThreadChecker, a tool that attempts to check that all shared data accesses follow a correct locking regime. If the locking regime assumed by the tool is used in your program, it can offer very good help for locating that particular (and important) class of bugs.

In general, real-world formal method tools applicable to real programs will be analogous to the classic lint program: They will find potential problems and initially produce many false warnings. Not all problems flagged will be real problems, as the tools’ understanding of the code will always be incomplete. They will also only find the types of bugs they were designed to find. For example, a wild pointer will not be found by a tool checking for deadlock in message passing.

To read Part 1 in this series, go to "Programming parallel machines."

Acknowledgements
Thanks to Professor Erik Hagersten, Lars Albertsson, Dan Wallin, Henrik Löf, and my colleagues at Virtutech for discussions and input on multiprocessing and software development.

This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.

Jakob Engblom is Business Development Manager at Virtutech and is also an adjunct professor at the department of Information Technology at Uppsala university. He has been speaking about embedded systems programming at university courses and industry trade shows for the past five years. His interests include embedded systems programming, simulation technology, real-time systems, and compiler technology.

  To learn about this general subject on Embedded.com go to More about multicores, multiprocessing and tools.

More Reading
If you want to know more about this subject, here are some recommended articles and books that will provide insight into the issues of parallel programming:

[1] Luiz Barroso. The Price of Performance, ACM Queue, September 2005.
[2] Jeff Bier. Back to the drawing board on multicores, multiprocessing, Embedded.com , Nov 28, 2005.
[3] Michael Christofferson. Using an asymmetric multiprocessor model to build hybrid multicore designs, Embedded.com, Nov 5, 2005.
[4] Scott Gardner. Multicore Everywhere: The x86 and Beyond. Extremetech.com, Jan 9, 2006.
[5] Kevin Krewell. ARM Opens Up to SMP – MPCore Supports One to Four CPUs, Microprocessor Report, May 24, 2004.
[6] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing (2nd edition), Addison Wesley, 2003.
[7] Joel Huselius. Debugging Parallel Systems: A State of the Art Report. MRTC Report no. 63, (www.mrtc.mdh.se), September 2002.
[8] Charles E. McDowell and David P. Helmbold. Debugging Concurrent Programs, ACM Computing Surveys, December 1989.
[9] Mache Creeger. Multicore CPUs for the Masses, ACM Queue, September 2005. [10] John Siracusa, MacOS X 10.4 Tiger, ArsTechnica (www.arstechnica.com), April 28, 2005.
[11] S. Loveland, G. Miller, R. Previtt, and M. Shannon: Testing z/OS: The Premier Operating System for IBM’s zSeries Server. IBM Systems Journal, Vol 41, No 1, 2002.
[12] Herb Sutter and James Larus. Software and the Concurrency Revolution, ACM Queue, September 2005.
[13] Herb Sutter. The Free Lunch is Over, Dr. Dobbs Journal, March 2005.
[14] John L Hennessy and David A Pattersson. Computer Architecture: A Quantitative Approach, 3 rd Edition. Morgan Kaufmann, May 2002.
[15] Samuel King, George Dunlap, and Peter Chen. Debugging operating systems with time-traveling virtual machines, Proc. USENIX Annual Technical Conference, April 2005.
[16] Kang Su Gatlin and Pete Isensee. Reap the Benefits of Multithreading without All the Work, MSDN Magazine, October 2005.

1

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :