Simulating and debugging multicore behavior

Multicore microprocessor chips are on their way, and they're going to further complicate the task facing embedded software developers. Of course, multiprocessor systems aren't new. Chips with multiple heterogeneous (different) processors, such as a RISC and a DSP, have been around for years. In fact, nearly every modern cell phone contains just such a pair.

What's new is that the number of microprocessors is dramatically increasing in order to handle the equally dramatic increase in system-on-a-chip (SoC) software content, and that these processors generally share cache memory. This approach, known as shared-memory multiprocessing or symmetric multiprocessing (SMP), adds a whole new level of complexity because software will normally need to be dynamically partitioned across the processors. Traditional static partitioning won't work.

Moreover, design teams are frequently using parallel processing, or true concurrency, to meet the system's performance specification within its power constraints. The combination of SMP and true concurrency further exacerbates the software development, validation, and debug problems to the point where traditional software-development approaches are breaking.

In this article I'll discuss these trends, explore their development problems, and describe a behavior-accurate simulation that you can use to solve them.

Poor visibility
Software developers generally think that system simulation is too slow. Of course, to a software developer, anything slower than real time often is too slow. The traditional approach to developing and debugging software for multiprocessor systems is slow, and it's excruciatingly so when applied to SMP systems. Worse, using traditional approaches, the “bug escape” problem in SMP systems is exponentially greater than it is in comparatively simple multiprocessor systems with local cache architectures.

According to the Embedded Market Forecasters' report, approximately one quarter of embedded system designs missed the project schedule by at least 50%; about one third missed functional specifications by at least 50%; and more than 70% missed performance specifications by at least 30% [1, 2]. This dismal performance hardly constitutes a robust defense of traditional approaches versus new methods.

The report also cites the reasons for failure: about two thirds of the survey sample identified “limited visibility into the complete system” as their major problem; more than half identified “limited ability to trace”; and nearly half identified “limited ability to control execution.”

System simulation provides the systemwide controllability, observability, and determinism necessary to solve these problems. Moreover, a simulation technology that executes backward as well as forward has a clear advantage over traditional debug approaches. Overall, unlike traditional approaches, system simulation can enable software development and debug to commence long before a hardware prototype is available; scale with the size and complexity of the system; and solve up front the problems that are all too often detected only toward the end of the project–when it's sometimes too late and always costly to fix them.

Performance woes
What are the development problems, specifically? We'll start with the performance challenge. The usual way to get better performance is to migrate to a processor built in next-generation semiconductor technology and crank up the clock rate, relying upon process scaling to keep power consumption within limits. This technique no longer suffices. According to Bernard Meyerson, chief technologist of IBM's Systems and Technology Group, process scaling “broke” at 130nm (0.13 micron) because “atoms don't scale.” [3] The result is a power density nearly an order of magnitude greater than would have been achieved had such scaling remained intact. To quote Pat Gelsinger, CTO at Intel, cranking up the clock rate would produce “. . . a heat generator with the intensity of a nuclear reactor. I don't think we know how to cool a 5,000-watt processor.” [4]

Moreover, standard microprocessor cores often lack the parallel hardware resources needed to execute multiple–and ever more complicated–algorithms with the requisite performance. The combination of broken process scaling and inadequate single-processor parallelism forces us into parallel processing by some other means, namely, the multiprocessor approach. Consequently, semiconductor manufacturers such as Cavium, Freescale, IBM, and PMC-Sierra are introducing multicore processors. Even the workhorse PC is deploying multicore chips from Intel and AMD.

The power-consumption constraints of a multiprocessor chip rule out the luxury of the multiple local cache memories that a board-level solution might have. In an SoC, multiple processors must share the cache, with its increased risk of data traffic jams and cache misses. Clearly, this aggravates the already complex nature of software development for multiprocessor systems. The SMP software challenge is particularly acute because, although the hardware resources deliver true concurrency, most software routines assume pseudo-concurrent architectures. In other words, the software was developed for single-processor execution. In fact, ensuring that software code executes reliably on multiprocessor systems is a much more difficult problem than simply hitting performance targets.

For instance, in a truly concurrent system several instances of the same process, accessing the same data, might execute simultaneously. Running field-proven multitasking software on an SMP architecture may well expose concurrent-access conflicts that were latent in the original single-processor code.

Another example is that no given processor in a truly concurrent system has exclusive access to shared data, although a pseudo-concurrent routine often assumes such access simply by disabling all interrupts. Disabling interrupts is not a locking mechanism, however: it prevents preemption only on the processor that does it, while the other processors keep running. SMP locking mechanisms are needed, which often require time-consuming redevelopment of the operating system kernel and drivers and even of the application software itself if it manages interrupts.
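The lost-update problem described above can be sketched in a few lines. This is a minimal, deterministic model rather than real SMP code: each "core" is a Python generator that yields after every load, modify, or store step, and a round-robin loop stands in for true concurrency. The names `increment` and `run_two_cores` and the dictionary-based lock are illustrative assumptions.

```python
def increment(mem, lock=None):
    """One core's read-modify-write; each yield lets the other core run."""
    if lock is not None:
        while lock["held"]:
            yield                 # spin until the other core releases the lock
        lock["held"] = True       # check-and-set is atomic here: no yield in between
    value = mem["counter"]        # load shared data into a private register
    yield
    value += 1                    # modify the private copy
    yield
    mem["counter"] = value        # store back to shared memory
    if lock is not None:
        lock["held"] = False      # release


def run_two_cores(use_lock):
    """Interleave two cores step by step and return the final counter."""
    mem = {"counter": 0}
    lock = {"held": False} if use_lock else None
    cores = [increment(mem, lock), increment(mem, lock)]
    while cores:
        for core in list(cores):  # round-robin: one step per core per pass
            try:
                next(core)
            except StopIteration:
                cores.remove(core)
    return mem["counter"]


print(run_two_cores(use_lock=False))  # 1: one increment is lost
print(run_two_cores(use_lock=True))   # 2: the lock serializes the updates
```

Without the lock, both cores load the value 0 before either stores, so one update disappears. Disabling interrupts on either core would not change that interleaving, because the other core is never interrupted in the first place.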

A further point is that the single-processor software-locking mechanisms that safeguard access to data shared by different tasks often don't work in a truly concurrent system. Code in an SMP system must enable reentrant execution. That is, data shared by different tasks must be locked, while data specific to a particular task remains freely accessible. Clearly, in an SMP architecture far more tasks will try to access shared data at the same time than in a single-processor implementation.

From the foregoing, it's obvious that software partitioning and scheduling for truly concurrent systems require that software developers possess a knowledge and understanding of system architecture that was unnecessary for single-processor systems and a lot less necessary in board-level multiprocessor systems. Moreover, software partitioning is no longer a static exercise. The complexity of multithreaded processes necessitates dynamic partitioning, an iterative process that requires a great deal of system analysis.

Now, what about debug? It consumes about half of the software-development effort because new chip prototypes have even less controllability and observability than their board-level brothers. Consequently, your rooms full of test and measurement equipment for debugging board-level systems might be of limited use for SoC debugging. In board-level systems, the test-and-measurement approach certainly enables the design team to eradicate every bug that it can detect, but it often cannot detect every bug–not least because bugs in complex systems are often intermittent and difficult to reproduce over multiple tests. In other words, the systems aren't deterministic. Earlier availability of hardware prototypes and more development time can't solve this problem because high-confidence debug requires higher levels of system controllability and observability than hardware prototypes and test-and-measurement equipment can achieve.

What is required to develop and debug SMP software? Basically, you need:

• Greater visibility of the system-level architecture

• Greater controllability and observability of system behavior

• Behavioral determinism

If hardware prototypes, emulators, and the best test-and-measurement equipment money can buy can't meet these requirements, what then is the solution?

I believe the solution to all of these problems is behavioral simulation and simulation-based debug. System simulation using a behavioral model of the hardware provides the requisite system-level view, comprehensive controllability and observability, and determinism that are unachievable using hardware prototypes. The behavioral model can be ready much earlier than hardware prototypes can, enabling you to start developing the software and thoroughly debugging the system much earlier.

A behavioral model provides a real-behavior view of the system, accurate at the software/hardware boundary and operating at the level of abstraction understood by software developers, such as registers and interrupts. Model complexity is independent of system complexity, so there is no practical limit to the system size and complexity that can be modeled. The model is independent of hardware implementation, and can therefore be applied to both chip-level SMP designs and board-level multiprocessor designs (from big chips to big boxes). Moreover, an abstract model of the environment in which the system operates may be used to provide near real-world stimulus.

The behavioral simulator runs this model with real binary code for comprehensive analysis and debug of the system, from applications software through the operating system kernel and device drivers. This behavior-accurate simulation and analysis is deterministic; multiple executions of the same test case will yield the same results, eliminating problems of bug reproducibility. Because production code is executed without modification, the simulation can also be used in much of the regression testing, final testing, and even in certification.

Because of the behavioral model's high level of abstraction, there is no hard limit to the number of processor cores, with the same or different operating systems, executing in parallel that the simulator can run, analyze, and debug. Speed degradation from simulating more processors can generally be alleviated by using more host computers.

The behavioral simulator's controllability allows single-stepping code on a given processor, line by line. The simulator displays the step-wise progress of every system node state, in contrast to clock-driven real hardware prototypes. It also enables breakpoints at any arbitrary point in the process, something that's difficult or impossible with hardware prototypes. The simulator supports debugging a single process executing on multiple processors, using the appropriate operating-system support package, together with GDB, or other such debuggers. It can also debug multiple processes, using process-specific debuggers, any number of which may be deployed.

Note that API-level operating-system simulators that are “behavior-approximate” don't run the actual binary code and can't perform such accurate simulations and analyses. This accuracy deficit limits verification and debug confidence. In other words, it allows bug escapes.

The behavioral simulator's controllability enables the designer to inject faults to test the system's fault detection and recovery mechanisms. For instance, in one project the device model used to measure processor temperature was scripted to simulate an excessively high temperature to validate correct system shutdown. Triggering a similar event in real hardware would have required a blowtorch. Another example is that of intentionally delaying data writes between processors to detect parallel code errors. Similar delays can be inserted to stress locking behavior.
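The temperature example can be illustrated with a small sketch of a scriptable device model. Everything here is hypothetical: `TemperatureSensor`, `ThermalMonitor`, and the 100-degree limit are invented for illustration, and the monitor is a stand-in for the real target software, which in an actual simulation would be the unmodified binary.

```python
class TemperatureSensor:
    """Hypothetical device model exposing one temperature register."""

    def __init__(self, celsius=45.0):
        self.celsius = celsius
        self.override = None          # set by a test script to inject a fault

    def inject(self, celsius):
        """Force the register to report a chosen value."""
        self.override = celsius

    def read(self):
        return self.celsius if self.override is None else self.override


class ThermalMonitor:
    """Stand-in for the software under test (assumed shutdown policy)."""

    SHUTDOWN_LIMIT = 100.0            # assumed threshold, degrees Celsius

    def __init__(self, sensor):
        self.sensor = sensor
        self.running = True

    def poll(self):
        if self.sensor.read() >= self.SHUTDOWN_LIMIT:
            self.running = False      # orderly shutdown path
        return self.running


sensor = TemperatureSensor()
monitor = ThermalMonitor(sensor)
print(monitor.poll())    # True: normal temperature, system keeps running
sensor.inject(130.0)     # scripted fault, no blowtorch required
print(monitor.poll())    # False: shutdown triggered
```

The point of the design is that the fault lives entirely in the device model, so the software being validated needs no test hooks of its own.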

Simulation enables stress testing on a scale not achievable by hardware prototypes. For instance, a routine of 128 threads executing on eight microprocessors can be stressed by executing it on more processors, even to the extreme of one processor per thread. This approach future-proofs the software for execution on the greater number of processors that would typically be deployed in next-generation designs. Try doing that quickly and cost-effectively in hardware.

The most impressive difference, however, between simulation-enabled debug and traditional debug is that the system can execute backward as well as forward. Backward execution enables the system to be stepped back into a fault condition after the system has run over it. It reverses the whole system, even disks, network devices, and terminals, and is effective with operating-system crashes, segmentation faults, and accidental file deletions. By comparison, forward tracing is an iterative process of “run over and restart” and, in nondeterministic multiprocessor hardware, even that doesn't guarantee bug reproduction.

Running real hardware backwards is obviously impossible. Historically, backward simulation has been impractical because of the voluminous state data, or “checkpoints,” that must be recorded. The combination of a very high-performance simulator with less-frequent capture of state data, however, enables you to return to a checkpoint and then run forward in a way that appears to be instantaneous.
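The checkpoint-and-replay idea can be sketched as follows. This is a toy model under stated assumptions, not the simulator's actual implementation: the class name `ReversibleSim`, the checkpoint interval, and the arithmetic in `step` are all invented for illustration. The only real requirement is that stepping is deterministic, so re-executing from a checkpoint reproduces exactly the same states.

```python
import copy


class ReversibleSim:
    """Toy deterministic simulator with sparse checkpoints (hypothetical)."""

    def __init__(self, step_fn, state, checkpoint_interval=100):
        self.step_fn = step_fn
        self.state = state
        self.time = 0
        self.interval = checkpoint_interval
        self.checkpoints = {0: copy.deepcopy(state)}

    def run_to(self, target):
        """Execute forward until virtual time reaches target."""
        while self.time < target:
            self.step_fn(self.state, self.time)
            self.time += 1
            if self.time % self.interval == 0:
                # Record full state only occasionally to limit storage.
                self.checkpoints.setdefault(self.time, copy.deepcopy(self.state))

    def reverse_to(self, target):
        """'Run backward' by restoring the nearest earlier checkpoint
        and deterministically re-executing forward to target."""
        base = max(t for t in self.checkpoints if t <= target)
        self.state = copy.deepcopy(self.checkpoints[base])
        self.time = base
        self.run_to(target)


def step(state, t):
    # Stand-in for executing one instruction; any deterministic update works.
    state["acc"] = (state["acc"] * 31 + t) % 1009


sim = ReversibleSim(step, {"acc": 1})
sim.run_to(250)
seen = dict(sim.state)
sim.run_to(1000)          # run past the point of interest
sim.reverse_to(250)       # restore checkpoint 200, replay 50 steps
print(sim.state == seen)  # True
```

The trade-off is visible in `checkpoint_interval`: a larger interval stores less state but replays more steps per reverse operation, which is affordable only if forward simulation is fast.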

One project that demonstrates the effectiveness of this approach simulated and debugged sixty-four 64-bit processors on a midrange PC with 8GB of memory. Another project had 160 PowerPC cores simulated on a PC with only 4GB of RAM. A more impressive debug example is that of a router that crashed before it had booted far enough to be operated. Three engineers spent two weeks analyzing the software to detect the fault, without success. The simulation technology found the fault in 30 minutes; it was due to misaligned data structures.

Simulating behavior
So, how does behavioral simulation work, and how does it achieve reasonable speeds? The simulator is event-driven and models and analyzes on-chip communication at the transaction level, eliminating the need to use actual, bit-level bus traffic. By modeling I/O accesses as single, synchronous events, and transmitting and receiving data in the form of packets, the simulator rapidly simulates processor workloads using the real operating system, network protocol stacks, and applications.

The simulator uses a single global virtual time base with which all processors and other devices are synchronized. It thus provides “global synchronization and stop,” whereby it halts the whole system simulation when one part of the system is stopped. This enables single-step operation and deterministic debugging. Every system node is observable and traceable, and its state and behavior over virtual time can be logged. System tracing can be used to comprehensively profile the software and perform code coverage analysis.
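A single virtual time base with "global synchronization and stop" can be sketched with an ordinary event queue. This is a simplified illustration, not the product's design: `Simulator`, `make_cpu`, and the tick periods are hypothetical. Events are ordered by virtual time and then by insertion order, which makes every run produce the same trace.

```python
import heapq
import itertools


class Simulator:
    """Minimal event-driven kernel with one global virtual clock."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = itertools.count()   # tie-breaker keeps ordering deterministic
        self.stopped = False
        self.trace = []                 # (virtual time, node) log

    def post(self, delay, node, action):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), node, action))

    def stop(self):
        self.stopped = True             # halts every node, not just the caller

    def run(self):
        self.stopped = False
        while self._queue and not self.stopped:
            when, _, node, action = heapq.heappop(self._queue)
            self.now = when
            self.trace.append((when, node))
            action(self)


def make_cpu(name, period, break_at=None):
    """Hypothetical processor model that 'executes' once per period."""
    def tick(sim):
        if break_at is not None and sim.now >= break_at:
            sim.stop()                  # breakpoint on one node stops the system
            return
        sim.post(period, name, tick)
    return tick


def demo():
    sim = Simulator()
    sim.post(0, "cpu0", make_cpu("cpu0", 10))
    sim.post(0, "cpu1", make_cpu("cpu1", 15, break_at=30))
    sim.run()
    return sim.now, sim.trace


print(demo())
```

When `cpu1` hits its breakpoint at virtual time 30, `cpu0` stops as well, with its next event still queued at time 30. Calling `demo()` repeatedly returns an identical trace, which is the property that makes intermittent bugs reproducible.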

Just-in-time (JIT) compiler approaches further optimize the most intensively executed code. In addition, the simulator can control time to accelerate both repetitive and idle processes. For instance, in a repetitive process such as a register-polling loop, simulation can achieve speeds of five billion instructions per second. In the case of a mostly idle network of 100 small sensors, the simulation executed at five times real-world speed.
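How controlling time speeds up an idle system can be shown with a rough sketch. It assumes a single node that does nothing except service timer wakeups at known virtual times; the function name `simulate` and the one-cycle cost of a wakeup are illustrative assumptions. Instead of stepping through every idle cycle, the simulator jumps the virtual clock straight to the next event.

```python
def simulate(wakeups, horizon, skip_idle):
    """Return (handled wakeup times, host loop iterations) for one idle node."""
    pending = sorted({t for t in wakeups if 0 <= t < horizon})
    now = 0
    handled = []
    host_steps = 0
    i = 0
    while now < horizon:
        host_steps += 1
        if i < len(pending) and pending[i] == now:
            handled.append(now)   # service the wakeup
            i += 1
            now += 1              # assumed cost: one virtual cycle
        elif skip_idle:
            # Nothing to do: advance the clock directly to the next event.
            now = pending[i] if i < len(pending) else horizon
        else:
            now += 1              # naive mode: burn one host step per idle cycle
    return handled, host_steps


print(simulate([100, 5000], 10_000, skip_idle=False))  # ([100, 5000], 10000)
print(simulate([100, 5000], 10_000, skip_idle=True))   # ([100, 5000], 5)
```

Both runs handle the same wakeups at the same virtual times, so the software sees identical behavior; only the host work differs. That is why a mostly idle system can be simulated faster than real time.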

The behavior-accurate system model is functionally correct, with timing annotations such as fixed execution time for processor instructions and a simplified timing schema to define transaction completion times. This coarse-grained timing enables validation of the software code without the need to build accurate, real-hardware timing models.
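As a rough illustration of such coarse-grained timing, the sketch below charges a fixed cost per instruction and a fixed completion time per transaction type. The constants and the names `CYCLES_PER_INSTRUCTION`, `TRANSACTION_CYCLES`, and `virtual_cycles` are made-up assumptions, not values from any real model.

```python
# Assumed timing annotations (illustrative values only).
CYCLES_PER_INSTRUCTION = 1
TRANSACTION_CYCLES = {"mem_read": 20, "mem_write": 20, "io": 100}


def virtual_cycles(activity):
    """Sum virtual time for a list of (kind, count) activity records."""
    total = 0
    for kind, count in activity:
        if kind == "instr":
            total += count * CYCLES_PER_INSTRUCTION
        else:
            total += count * TRANSACTION_CYCLES[kind]
    return total


# 1000 instructions, 10 memory reads, 2 I/O transactions
print(virtual_cycles([("instr", 1000), ("mem_read", 10), ("io", 2)]))  # 1400
```

A table like this is cheap to maintain and keeps virtual time advancing plausibly, which is enough to exercise timeouts and scheduling in the software without a cycle-accurate hardware model.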

These modeling and simulation attributes are of value not only in software development and debug, but also in final testing and certification. A model of the system's environment can be used as a near real-world test bench to provide the electronic stimulus necessary to test the system. The simulator can also analyze the system's behavior under the influence of nonelectronic system components, such as mechanical components, through direct interfaces to component models. These abstract models use the same virtual time as the simulator, which computes the environmental behavior in the same time domain as that used to execute the software.

Clearly, simulation cannot completely replace real hardware and those rooms full of test and measurement equipment. After all, nobody wants to drive a car with airbag software that has never been tested in a vehicle driven at high speed against a concrete wall. Real hardware is still required to verify the product under real operating conditions.

What simulation can do is execute those front-end software-development and debug tasks that real hardware is incapable of and accelerate those tasks for which real hardware is adequate.

System simulation offers the system controllability, observability, and determinism that are critical to solving the endemic problems of “limited visibility into the complete system,” “limited ability to trace,” and “limited ability to control execution.” Real hardware has been the default development platform because no viable alternative was available–until now.

Peter S. Magnusson is founder and chief technology officer at Virtutech, Inc. He served as CEO of Virtutech from its founding in 1998 to 2005. After concluding studies at the Stockholm School of Economics and the Royal Institute of Technology (Computer Science), he conducted research in simulation technology at the Swedish Institute of Computer Science beginning in 1991, with a number of international publications in the field. For several years he was a columnist for the computer magazine Datateknik and an advisor for the SalusAnsvar öhman IT-Fond, a mutual fund.

1. Embedded Market Forecasters. “Embedded market forecast: Embedded Software Development: Issues and Challenges,” July 2003.

2. McDonald, Jon. “How ESL becomes a business imperative,” EE Times, 01/17/2005.

3. Fuller, Brian. “Designs better be holistic, says DAC keynote speaker,” EE Times, 06/14/2005.

4. Gelsinger, Pat. ISSCC keynote, as reported in many places, including Ohr, Stephan. “Analog guys take the heat at Intel forum,” EE Times, 03/26/2001.

Reader Response

The article rightly brings out the relevance of system-level (behavioural) models for debugging complex SMP-based systems that are otherwise difficult to debug. I would, however, like to stress the value of this approach for modern embedded systems even when they do not use SMP, for the very reason that it reduces total design and development time. A large community of researchers is attempting to fill the gap in migrating from untimed behavioural models (also called executable specifications) to timed behavioural models; the latter can provide a more accurate picture of system performance, albeit simulated, and thus increase our confidence.

– Rajveer S Shekhawat
Technical Consultant
