CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Debugging real-time multiprocessor systems: Part 1
Programming parallel machines



Embedded.com
Since the early 1980s, embedded software developers have been taking advantage of the steady progress of processor performance. You could count on new processors being made available that would increase the performance of the system without disturbing the software structure overly much. If you used a 500 MHz processor last year, you could buy a 1000 MHz processor next year. You would get a doubling in your application performance without any other work than designing in a new, software-compatible chip.

This performance increase was often factored into application and system design: The software designers planned and implemented application sets that were too heavy for the currently available processors, counting on a new, faster processor to come on the market and solve their performance problems before the product got to market. Such a strategy was necessary to stay competitive.

This comfortable state was driven by the steady progress of the semiconductor industry that kept packing more transistors into smaller packages at higher clock frequencies, enabling performance to increase by both architectural innovations like more parallel pipelines, by adding resources like on-chip caches and memory controllers, and by increasing the clock frequency.

The 32-bit PowerPC family offers a typical case study on this progression. Starting with a 603 processor, users have been able to upgrade with full application-software compatibility to the “G3”/750 processor series (rising from 166 to 1100 MHz over time). Next came the “G4”/7400 series, and then the 64-bit “G5”/970 series, continuing to provide software engineers a performance increase in a single processor.

However, in 2004, it became clear that the progress in single-processor performance began slowing considerably. Due to a number of chip design and manufacturing issues, the clock-frequency increase slowed down, and we could not expect to get twice-as-fast processing from clock speed increases. Instead, the semiconductor industry turned to parallelism to increase performance.

Using multiple processor cores (known as a multicore) on the same chip, the theoretical performance per chip can increase dramatically, even if the performance per core is only improving slowly. Also, the power efficiency of a multicore implementation is much better than traditional single-core implementations, which is a factor as important as absolute performance [1].

Consequently, every high-performance processor family is moving to multiprocessor designs. Freescale has announced the PowerPC 8641D, with two G4-class cores on one chip. IBM has the PowerPC 970MP, with two G5 cores. Intel is pitching dual-core Pentium-M machines to the embedded market, and PMC-Sierra has been selling dual-core RM9200 MIPS64 processors for some time. Cavium has announced up to 16-way parallel MIPS64-based processors. Parallel machines are here in force.

The move to parallel hardware, however, creates problems for software developers. Applications that have traditionally used single processors will now have to be parallelized over multiple processors in order to take advantage of the multiple cores. Programs that used to run on single processors will have to run on multiprocessors. Apart from the act of creating parallel software, debugging is greatly complicated by the combination of parallel execution and tighter packaging [2][8].

This will be the greatest challenge to the embedded software industry for a very long time. Both tools and thinking need to change. The level of conceptual change is comparable to the introduction of multitasking operating systems, virtual memory and object-oriented programming.

Multi-what?
Multiprocessing systems can be built in a number of ways, and there are several terms referring to various forms of packaging and implementing multiprocessor systems.

1) Multitasking means that a single processor can run several different software tasks, scheduled by a (real-time) operating system.

2) A multiprocessor is any computer system using more than one processor to perform its work.

3) A shared-memory or symmetric multiprocessor (SMP) is a design where the processors in a multiprocessor share the same memory, and all can access the same data and run the same tasks. In contrast, an asymmetric multiprocessor (AMP) uses several processors with their own local memories, often using different architectures.

4) A multicore processor is a single processor chip containing multiple processor cores, which can be identical (as on the Freescale MPC8641D) or of different types (as on a Texas Instruments OMAP). Essentially, it is a multiprocessor on a single chip. There is some debate on precisely what deserves to be called multicore [4], but in this paper we suppose that anything with more than one processing core on a single chip qualifies.

5) Hardware/symmetric multithreading (S/MT) is a technique where a single processor core is used to execute several parallel threads of computation. This provides less performance than a true multicore design, but also has a much lower implementation cost. Examples include the IBM Power5 and Intel Pentium4 HyperThreading.

6) A chip multiprocessor (CMP) is a processor employing multithreading and/or multicore techniques to create a parallel computer on a single chip. An example is the Sun “Niagara” UltraSparc T1. CMP designs are typically SMPs, as they mainly come from the mainstream PC/server market.

In the rest of this set of articles, we will use the term “multiprocessor” to denote any computer system containing more than one thread of control in hardware.

Software Structure. On the software side, we prefer to use the generic term “task” to mean a single thread of computation. We want to avoid confusion between “process” and “processor.” A number of tasks can share memory in order to implement an application, and this is really what is meant by a parallel program: A number of tasks cooperate to perform the function of a particular application.

Many operating systems use the term “process” to mean a single memory space in which one or more “threads” can run. Between processes, memory is not shared. Commonly, there is only one thread of computation inside a process, and people tend to use “process” to mean a single thread of execution.

Programming Parallel Machines
There are several ways in which to write code to run on a multiprocessor machine. The most important distinction is between message passing and shared memory styles. In message-passing, communication is explicitly programmed, while shared-memory makes communication implicit in the reading and writing of variables. Note that message-passing programming can make sense even on a shared-memory machine, where messages are implemented using shared memory.

Classic supercomputing (weather prediction, astronomy simulations, airplane design) typically uses sequential languages like C or FORTRAN with a library (pthreads) or compiler directives (OpenMP, MPI) to divide a computation task up into parallel tasks. Shared memory and message passing are both in use, depending on the nature of the machines used.

This programming tradition usually creates programs that use a number of tasks in parallel to solve a small part of a single computation problem. Typically, parallelism is applied to single kernel loops. This model works well for the target domain, and supercomputing computations have been successfully parallelized to use 10,000s of processors in parallel.

In the desktop and server market, commercial multitasking and multiprocessing workloads are programmed using sequential languages (C, C++, Ada, Java, COBOL), using libraries, OS APIs or language support. Synchronization and communication can be provided as part of the language (Java and Ada), the operating system API, a library like pthreads or compiler directives like OpenMP and MPI.

To contrast with supercomputing, parallel tasks are typically not homogeneous, but rather perform quite different computations based on a common data set. The most successful class of parallel programs in this space is servers, where each task handles an individual independent session with clients. Another successful example is using a task to run a user interface in parallel to the core application logic.

OpenMP is proving to be a popular way to provide parallel programming support for new machines. For example, the ARM MPcore has an OpenMP implementation available, and Microsoft provides it for their Xbox 360 game console [6][16]. Another trend is that virtual machine-based languages like Java and C# have started to provide support for managing parallelism as part of the virtual machine definition.

There are also parallel languages which go beyond the simplistic approach in Ada, Java and C# to make parallelism an integrated feature of the language. OCCAM is designed for supercomputing, where specific computation parts can be specified to execute in parallel (using constructs such as parallel loops).

Erlang is an interesting language designed for control-plane applications, using hundreds or thousands of tasks each with their own local storage and communicating using message-passing. Tasks are used as the primary means of program structuring — rather like objects are used on object-oriented programming.

Writing parallel programs seems to be easier where large data sets are being manipulated, like supercomputing and databases. Parallelizing programs with smaller data sets is harder by experience. However, there is one mitigating factor in the current move to multicore implementations: Onboard a multicore chip, communication is much faster than in a traditional multi-chip multiprocessor. This helps programmers write efficient programs, and should allow beneficial parallelization of more types of programs.

In this series of articles, however, we will mainly consider the case of using C or C++ with shared memory, as that is the model with the most subtle debugging problems, and the one that most embedded programmers will encounter when multicore processors replace old single-core processors at the core of their products.

Overall, the move to multiprocessing constitutes a break with the mainstream (and embedded) software-development tradition. Most programmers have been taught how to program single-threaded programs. Few real-time operating systems support SMP processing. Debuggers (with a few exceptions) operate under the assumption that we are debugging a program that is serial, repeatable and self-contained.

Embedded Multiprocessing
Parallel processing per se is nothing new to embedded systems. Many embedded applications are using parallel processing with great success.

For example, signal-processing in telecommunications makes use of arrays of digital signal processors (DSPs) to handle wireless and wired signals. Routers use multiple input and output ports with multiple traffic engines to scale to extreme traffic volumes. Automotive applications use large numbers of small processors spread throughout a vehicle. Control applications are distributed across multiple processor cards. Mobile phones contain a multitude of processors in parallel to handle signal processing, applications, Bluetooth ports, etc.

The motivation for multiprocessor designs have mainly been to achieve required absolute performance, as well as cost and power efficiency of the overall solution.

However, most such applications have used asymmetric multiprocessing (AMP), where each processor has its own local memory and performs work that has been statically assigned to it. Each processor has its own private set of tasks running (or tasks that can run), and running tasks do not migrate between processors. The computations to be performed naturally divide into independent tasks and are “embarrassingly parallel” [12][13].

For example, in a mobile phone the user interface can run in parallel to the radio interface, and a mobile phone base station has one thread (or more) per active mobile terminal. Such parallel computation is easier to understand and design than general shared-memory processing [3].

However, debugging an asymmetric parallel system is still harder than debugging a single-processor system. Fundamentally, humans are poor at thinking about parallel systems; we seem to be wired to handle a single flow of events better than multiple simultaneous flows.

The problem now is that the embedded systems software industry has to figure out how to employ shared-memory symmetric multiprocessing to handle applications that have been traditionally implemented on single processors as single tasks.

Processor designers have assumed that delivering multicore as a way to avoid the power implications of simply increasing the clock rate is acceptable and that software engineers will have no difficulty making use of processor power delivered in this way. While true for servers, it is not true for most applications of interest, and this creates the current paradigm shift in the industry.

Seymour Cray says, “If you were plowing a field, which would you rather use: two strong oxen or 1,024 chickens?” It seems that we are forced to start harnessing these chickens.

In a shared-memory system, a multiprocessor runs a single instance of an operating system, and tasks can execute on any processor (as decided by the operating system scheduler). Shared-memory systems have many advantages, particularly in balancing loads and migrating tasks between processors. It is more power-efficient to use several processors running at a low clock frequency than a single processor at a high frequency. For example, the ARM11 MPCore multiprocessor varies both the number of active cores and their clock frequency to process a workload with maximum efficiency [5].

Figure 1

As illustrated in Figure 1 above, we can expect future embedded systems to contain several shared-memory multi-processor nodes (on individual boards or blades or cards), connected using a network. Each individual processor core in the system has a local level 1 cache, and shares level 2 cache (if present) and the main memory and device set with the other cores on the same node. Within each SMP node, the caches are kept coherent so that the node presents a single unified memory from the perspective of the software, despite the memory being spread out in several processor-local caches.

The Software Breaks
Ignoring the issues of creating efficient parallel programs for the moment, even getting programs to function correctly in an SMP environment is harder than for a single processor. Existing software that has been developed on a single processor might not work correctly when transitioned onto a multiprocessor. That an application works correctly in a multitasking environment does not imply that it works correctly in a multiprocessing environment; serious new issues occur with true concurrency.

True parallel execution (or concurrency) makes it hard to establish the precise order of events in different concurrent tasks. The propagation time of information between processors in shared-memory multiprocessors is not zero (even if it is fairly short, maybe a few hundred clock cycles at most), and this is sufficient to create subtle random variations in the execution, which can snowball. A multiprocessor by nature exhibits chaotic behavior where a small change in initial parameters gives rise to large differences in system state over time.

The system is actually inherently unpredictable, at least in terms of timing, and correct function can only be achieved by allowing for this and designing code, which works even in what seems like bizarre circumstances.

The following catalogue of problems attempts to highlight the many new and interesting ways in which software can break on a multiprocessor.

Latent Concurrency Problems. There can be latent problems in an existing, proven, multitasking workload that runs just fine on a single processor. The presence of true concurrency makes mistakes in protecting against concurrent accesses much more likely to trigger and cause program errors. As the timing of tasks becomes more variable, and they run in parallel for longer periods of time, the task set is simply subjected to more stress. This effect is similar to optimizing a program in a C compiler: optimizations might expose bugs in the program that were previously hidden. The error was always there; it just didn’t manifest itself.

Missing Reentrancy. To make efficient use of a multiprocessor, all code that is shared between multiple tasks has to support reentrant execution. This means using locks to protect shared data and to allocate local data for each time a function is called. Shared state between invocations of the same function has to be avoided. In a multi-processor environment, actual occurrences of multiple tasks using the same shared function simultaneously will occur much more frequently (and thus trigger bugs. This effect is especially important for operating systems and shared libraries, as such code will be used heavily by multiple tasks.

Priorities do not Provide Mutual Exclusion. In application code written for a traditional single-processor, strictly priority-scheduled RTOS, a common design pattern to protect shared data is to make all tasks that access the same data run at the same priority level.

With a strict priority-driven scheduler, each process will run to completion before the next process can run, giving exclusive access without any locking overhead. This will fail when true concurrency is introduced in a multiprocessor, as shown below in Figure 2 of a typical case:

Figure 2

This code is multitasking-safe on a single processor, but will break on a multiprocessor. This means that even existing proven code will have to be reviewed and tested before it can be assumed to run correctly on a multiprocessor. Explicit locking has to be introduced in order to handle the access to shared data.

One solution proposed for maintaining the semantics of single-processor priority scheduling on an SMP is to only run the task(s) with highest priority. Thus, if there is only a single highest-priority task in a system, only one processor will be used and the rest left idle. This ensures that high-priority tasks do not need to worry about simultaneous execution of lower-priority tasks, but does not solve the problem for tasks with the same priority when priority is used (incorrectly) as a serialization device.

Interrupts are not Locks. In the operating system and device driver code, you can no longer simply assume that you get exclusive access to shared data by turning off all interrupts. Instead, SMP-safe locking mechanisms have to be used. Redesigning the locking mechanisms in an OS kernel or driver (or user application, in case it makes use of interrupt management) is a major undertaking in the change to multiprocessors, and getting an operating system to run efficiently on an SMP will take time [9][10].

Race Conditions. Race conditions are situations where the outcome of a computation differs depending on which participating task gets to a certain point first. They typically trigger when some piece of code takes longer than expected to execute or timing is disturbed in some other way. Since race conditions are inherently timing-related, they are among the hardest bugs to find. The name comes from the fact that the tasks are racing forward in parallel, and the result depends on who gets to a certain point first.

Figure 3

Shown on the left in Figure 3, above, is the classic data race, where a piece of data shared between two tasks and the tasks do not protect the common data with a lock. In this case, both tasks can be editing the data at the same time and will not correctly account for the updates from the other task.

If task 1 was fast enough, it could finish its editing before task 2 begins, but there are no guarantees for this. Note that shared data is often a complex data structure, and the net result of a race is that different parts of the structure have been updated by different tasks, leading to an inconsistent data state.

On the right in Figure 3 above is another type of race, the message race, where one task is expecting a series of messages from other tasks. Such messages can come in any order, as there is no synchronization between the tasks; if task 2 implicitly expects a certain order, it will get the wrong results if task 3 happens to send before task 1. Whenever the ordering of events is important, explicit synchronization has to be in place.

Note that races occur in multitasking single-processor systems too, but there they are less likely to trigger, as they require a task switch to occur at an unlucky time. In a true concurrent system, races will happen more often, as discussed in earlier.

Deadlocks. When all shared data is protected by locks, you get into another situation where the locking itself can be the cause of errors. If two tasks are taking multiple locks in different order, they can get stuck, both waiting for the other task to release the other lock. A simple but fairly typical example is shown in Figure 4 below:

Figure 4

Task T1 locks lock L1 first, which protects the variable V1. After awhile, T1 also needs to work on variable V2, protected by lock L2, and thus needs to lock L2 while still holding L1. This code in itself is sound, as all accesses to shared data is correctly protected by locks. In task T2, work is being performed on variable V2, and lock L2 is taken. When calling the function foo(), V1 is also accessed, and locking is in place.

Assume a scenario where T1 and T2 start at the same time, and manage to obtain one lock each. When T1 gets to the point lock(L2), it stops and waits as T2 is holding that lock. Slightly later, T2 gets to the call to function foo(), and duly tries to lock L1. At this point, we are stuck, as T1 and T2 are mutually waiting for the other task to release a lock, which will never happen. We have a deadlock.

The example above illustrates a common cause of deadlocks: calling into functions like foo(), which do locking invisible to the caller. In a complex software system, it happens easily that locks get taken in a bad order if no measures are taken to ensure consistent locking orders.

Partial Crashes. When a computation involves multiple tasks that compute different parts of an overall solution, it is possible that one (or more) of the tasks crash during the computation. This problem in a single task will then escalate to a parallel problem, as the other tasks wait forever for the crashed task to get back with its part of the result.

Silent Parallelization Errors. Parallel code written with explicit calls to threading libraries will naturally have to check whether thread creation and other operations succeeded, but when using indirect parallel programming by compiler directives like OpenMP, error handling is often suboptimal.

The compiler will create extra code in the application to start tasks and handle synchronization, and this code does not have a way to report problems to the main user code. If a program fails to create the tasks it needs at run-time, it might just crash or hang or fail silently with no message to the user or programmer indicating a problem. Such behavior makes it unsuitable for high-reliability code where error recovery is necessary.

Bad Timing Assumptions. Missing synchronization is a common theme in parallel programming bugs. One particular variant to be wary of is the assumption that one task is going to finish its next work long before some other task needs the results of this work. In essence, as a task T1 is doing something short and simple, we expect that work to be completed before a task T2 finishes something long and complicated, and we know that they start out at the same time. A simple example is shown below in Figure 5:

Figure 5

Here, we expect T1 to finish writing V long before T2 is done with its initialization. However, it is possible that T2 in some situation completes its initialization before T1 can complete its write. This sort of bug typically triggers under heavy load or unusual situations in a shipping system.

Relaxed Memory Ordering Wreaks Havoc
For performance reasons, a multiprocessor employs various forms of relaxed memory orders (also known as weak memory consistency models). Basically, the computer architecture specifically allows memory operations performed to be reordered in order to avoid stalling a processor when the memory system is slow.

In most such models, memory operations from one processor may be observed in a different order from the viewpoint of another processor. This is necessary in order to get any efficiency out of a multiprocessor system. There are a number of relaxed memory orderings1, differing in how operations can bypass each other [14].

(Note that the potential operation reorderings allowed by a specific memory consistency model are precisely defined, but that even so, understanding the implications of a particular memory consistency model requires deep considerations. University students usually find memory consistency the hardest part of computer architecture. )

Understanding relaxed memory orders is probably one of the hardest parts of understanding parallel computing [13], and unfortunately, they are visible to a programmer since they affect data-transfer operations between parallel threads. Many data exchange algorithms that are correct on multitasking single processor systems break when used on multiprocessors.

For example, reads can almost always be performed out of order with respect to the program, and sometimes writes might be seen in different orders on different processors, especially when they originate from different processors.

In general, it is necessary to use special operations like memory barriers to guarantee that all processors in a system have seen a certain memory operation and that the state is thus globally consistent. Failing to deal with memory consistency will result in intermittent timing-sensitive bugs caused by processors observing the same execution in different orders.

One example is the classic Dekker locking algorithm shown in Figure 6 below, which is perfectly valid on a single processor (assuming that each read of a flag variable is atomic), no matter how tasks are interleaved.

Figure 6

Assuming that all variables are zero to begin with, it is quite possible to read old values for x and y in task T2 despite the use of a “lock,” since there is no guarantee that the writes to variables x and y are seen by task T2 before the write to flag1. With a relaxed memory ordering, you cannot assume that writes complete in the same order as they are specified in the program text.

For example, if the write to flag1 hit the local cache, and x and y missed the cache, the net result could be that the write to flag1 takes a longer time to propagate. The execution is illustrated on the right in the picture above (in which we assume that each task runs on its own processor).

Note that an additional task 3 can see the writes in the “expected” order – which is the common case. The rare case is the problem case. The programming solution is to put in barriers, forcing memory writes and reads to complete before the tasks continue beyond the lock. The cost is in performance, as processors stall when memory operations run to completion.

Also note that using the C keyword “volatile” has no effect on memory consistency; it does guarantee that the compiled code writes variable values back to memory, but provides no guarantees as to when other processors will see the change. This is a symptom of a general problem in that most language definitions do not consider the implications of multiprocessors.

A more complex example that fails for the same reason is the following scheduling code from the SPLASH program in Figure 7 below:

Figure 7

Here, the worker tasks T2…Tn wait for the linked lists of work units to become available from the master thread T1, and then grab work units off of it protected by a lock. However, considering the possibility of write reordering, the work unit data read by a worker task need not be complete just because the linked list has been updated. And if the locking is not multiprocessor-aware, the whole task queue can be destroyed by processors having a different idea of the value of Head.

To read Part 2 in this series go to "Debugging parallel programs."

This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.

Jakob Engblom is Business Development Manager at Virtutech and is also an adjunct professor at the department of Information Technology at Uppsala university. He has been speaking about embedded systems programming at university courses and industry trade shows for the past five years. His interests include embedded systems programming, simulation technology, real-time systems, and compiler technology.


1

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :