Since the early 1980s, embedded software developers have been taking advantage of the steady progress of processor performance. You could count on new processors being made available that would increase the performance of the system without disturbing the software structure overly much. If you used a 500 MHz processor last year, you could buy a 1000 MHz processor next year. You would get a doubling in your application performance without any other work than designing in a new, software-compatible chip.
This performance increase was often factored into application and system design: The software designers planned and implemented application sets that were too heavy for the currently available processors, counting on a new, faster processor to come on the market and solve their performance problems before the product got to market. Such a strategy was necessary to stay competitive.
This comfortable state was driven by the steady progress of the semiconductor industry, which kept packing more transistors into smaller packages at higher clock frequencies. Performance increased through architectural innovations like more parallel pipelines, through added resources like on-chip caches and memory controllers, and through rising clock frequencies.
The 32-bit PowerPC family offers a typical case study of this progression. Starting with a 603 processor, users have been able to upgrade with full application-software compatibility to the “G3”/750 processor series (rising from 166 to 1100 MHz over time). Next came the “G4”/7400 series, and then the 64-bit “G5”/970 series, continuing to provide software engineers a performance increase in a single processor.
In 2004, however, it became clear that progress in single-processor performance was slowing considerably. Due to a number of chip design and manufacturing issues, the clock-frequency increase slowed down, and we could no longer expect twice-as-fast processing from clock speed increases. Instead, the semiconductor industry turned to parallelism to increase performance.
Using multiple processor cores (known as a multicore) on the same chip, the theoretical performance per chip can increase dramatically, even if the performance per core is improving only slowly. Also, the power efficiency of a multicore implementation is much better than that of traditional single-core implementations, a factor as important as absolute performance.
Consequently, every high-performance processor family is moving to multiprocessor designs. Freescale has announced the PowerPC 8641D, with two G4-class cores on one chip. IBM has the PowerPC 970MP, with two G5 cores. Intel is pitching dual-core Pentium-M machines to the embedded market, and PMC-Sierra has been selling dual-core RM9200 MIPS64 processors for some time. Cavium has announced up to 16-way parallel MIPS64-based processors. Parallel machines are here in force.
The move to parallel hardware, however, creates problems for software developers. Applications that have traditionally used single processors will now have to be parallelized over multiple processors in order to take advantage of the multiple cores. Programs that used to run on single processors will have to run on multiprocessors. Apart from the act of creating parallel software, debugging is greatly complicated by the combination of parallel execution and tighter packaging.
This will be the greatest challenge the embedded software industry has faced in a very long time. Both tools and thinking need to change. The level of conceptual change is comparable to the introduction of multitasking operating systems, virtual memory and object-oriented programming.
Multiprocessing systems can be built in a number of ways, and there are several terms referring to various forms of packaging and implementing multiprocessor systems.
1) Multitasking means that a single processor can run several different software tasks, scheduled by a (real-time) operating system.
2) A multiprocessor is any computer system using more than one processor to perform its work.
3) A shared-memory or symmetric multiprocessor (SMP) is a design where the processors in a multiprocessor share the same memory, and all can access the same data and run the same tasks. In contrast, an asymmetric multiprocessor (AMP) uses several processors with their own local memories, often using different architectures.
4) A multicore processor is a single processor chip containing multiple processor cores, which can be identical (as on the Freescale MPC8641D) or of different types (as on a Texas Instruments OMAP). Essentially, it is a multiprocessor on a single chip. There is some debate on precisely what deserves to be called multicore, but in this paper we suppose that anything with more than one processing core on a single chip qualifies.
5) Hardware/simultaneous multithreading (SMT) is a technique where a single processor core is used to execute several parallel threads of computation. This provides less performance than a true multicore design, but also has a much lower implementation cost. Examples include the IBM Power5 and Intel Pentium 4 HyperThreading.
6) A chip multiprocessor (CMP) is a processor employing multithreading and/or multicore techniques to create a parallel computer on a single chip. An example is the Sun “Niagara” UltraSparc T1. CMP designs are typically SMPs, as they mainly come from the mainstream PC/server market.
In the rest of this set of articles, we will use the term “multiprocessor” to denote any computer system containing more than one thread of control in hardware.
Software Structure. On the software side, we prefer to use the generic term “task” to mean a single thread of computation. We want to avoid confusion between “process” and “processor.” A number of tasks can share memory in order to implement an application, and this is really what is meant by a parallel program: A number of tasks cooperate to perform the function of a particular application.
Many operating systems use the term “process” to mean a single memory space in which one or more “threads” can run. Between processes, memory is not shared. Commonly, there is only one thread of computation inside a process, and people tend to use “process” to mean a single thread of execution.
Programming Parallel Machines
There are several ways in which to write code to run on a multiprocessor machine. The most important distinction is between message-passing and shared-memory styles. In message passing, communication is explicitly programmed, while shared memory makes communication implicit in the reading and writing of variables. Note that message-passing programming can make sense even on a shared-memory machine, where messages are implemented using shared memory.
Classic supercomputing (weather prediction, astronomy simulations, airplane design) typically uses sequential languages like C or FORTRAN with a library (pthreads, MPI) or compiler directives (OpenMP) to divide a computation task up into parallel tasks. Shared memory and message passing are both in use, depending on the nature of the machines used.
This programming tradition usually creates programs that use a number of tasks in parallel to solve a small part of a single computation problem. Typically, parallelism is applied to single kernel loops. This model works well for the target domain, and supercomputing computations have been successfully parallelized to use tens of thousands of processors in parallel.
In the desktop and server market, commercial multitasking and multiprocessing workloads are programmed using sequential languages (C, C++, Ada, Java, COBOL), using libraries, OS APIs or language support. Synchronization and communication can be provided as part of the language (Java and Ada), the operating system API, a library like pthreads or MPI, or compiler directives like OpenMP.
In contrast with supercomputing, parallel tasks here are typically not homogeneous, but rather perform quite different computations based on a common data set. The most successful class of parallel programs in this space is servers, where each task handles an individual independent session with clients. Another successful example is using a task to run a user interface in parallel with the core application logic.
OpenMP is proving to be a popular way to provide parallel programming support for new machines. For example, the ARM MPCore has an OpenMP implementation available, and Microsoft provides it for their Xbox 360 game console. Another trend is that virtual machine-based languages like Java and C# have started to provide support for managing parallelism as part of the virtual machine definition.
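As a concrete illustration of the directive style, here is a minimal sketch of an OpenMP loop in C (the function name `parallel_sum` is ours, not taken from any product mentioned above). A useful property of the directive approach is graceful degradation: if the compiler does not support OpenMP, the pragma is ignored and the loop simply runs sequentially with the same result.

```c
#include <stddef.h>

/* Sum an array using an OpenMP parallel-for with a reduction.
 * The reduction clause gives each thread a private partial sum,
 * combined at the end, so no explicit locking is needed.
 * Compiled without -fopenmp, the pragma is ignored and the loop
 * runs sequentially, producing the same answer. */
double parallel_sum(const double *a, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += a[i];
    return sum;
}
```

Note how the parallelism is attached to a single kernel loop, exactly the supercomputing pattern described earlier.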
There are also parallel languages which go beyond the simplistic approach in Ada, Java and C# to make parallelism an integrated feature of the language. OCCAM is designed for supercomputing, where specific computation parts can be specified to execute in parallel (using constructs such as parallel loops).
Erlang is an interesting language designed for control-plane applications, using hundreds or thousands of tasks, each with its own local storage, communicating using message passing. Tasks are used as the primary means of program structuring, rather like objects are used in object-oriented programming.
Writing parallel programs seems to be easier where large data sets are being manipulated, as in supercomputing and databases. Experience shows that parallelizing programs with smaller data sets is harder. However, there is one mitigating factor in the current move to multicore implementations: On a multicore chip, communication is much faster than in a traditional multi-chip multiprocessor. This helps programmers write efficient programs, and should allow beneficial parallelization of more types of programs.
In this series of articles, however, we will mainly consider the case of using C or C++ with shared memory, as that is the model with the most subtle debugging problems, and the one that most embedded programmers will encounter when multicore processors replace old single-core processors at the core of their products.
Overall, the move to multiprocessing constitutes a break with the mainstream (and embedded) software-development tradition. Most programmers have been taught how to program single-threaded programs. Few real-time operating systems support SMP processing. Debuggers (with a few exceptions) operate under the assumption that we are debugging a program that is serial, repeatable and self-contained.
Parallel processing per se is nothing new to embedded systems. Many embedded applications are using parallel processing with great success.
For example, signal processing in telecommunications makes use of arrays of digital signal processors (DSPs) to handle wireless and wired signals. Routers use multiple input and output ports with multiple traffic engines to scale to extreme traffic volumes. Automotive applications use large numbers of small processors spread throughout a vehicle. Control applications are distributed across multiple processor cards. Mobile phones contain a multitude of processors in parallel to handle signal processing, applications, Bluetooth ports, etc.
The motivation for multiprocessor designs has mainly been to achieve required absolute performance, as well as cost and power efficiency of the overall solution.
However, most such applications have used asymmetric multiprocessing (AMP), where each processor has its own local memory and performs work that has been statically assigned to it. Each processor has its own private set of tasks running (or tasks that can run), and running tasks do not migrate between processors. The computations to be performed naturally divide into independent tasks and are “embarrassingly parallel.”
For example, in a mobile phone the user interface can run in parallel with the radio interface, and a mobile phone base station has one thread (or more) per active mobile terminal. Such parallel computation is easier to understand and design than general shared-memory processing.
However, debugging an asymmetric parallel system is still harder than debugging a single-processor system. Fundamentally, humans are poor at thinking about parallel systems; we seem to be wired to handle a single flow of events better than multiple simultaneous flows.
The problem now is that the embedded systems software industry has to figure out how to employ shared-memory symmetric multiprocessing to handle applications that have traditionally been implemented on single processors as single tasks.
Processor designers have assumed that delivering multicore chips as a way to avoid the power implications of simply increasing the clock rate is acceptable, and that software engineers will have no difficulty making use of processor power delivered in this way. While true for servers, it is not true for most applications of interest, and this creates the current paradigm shift in the industry.
Seymour Cray famously asked, “If you were plowing a field, which would you rather use: two strong oxen or 1,024 chickens?” It seems that we are now forced to start harnessing these chickens.
In a shared-memory system, a multiprocessor runs a single instance of an operating system, and tasks can execute on any processor (as decided by the operating system scheduler). Shared-memory systems have many advantages, particularly in balancing loads and migrating tasks between processors. It is more power-efficient to use several processors running at a low clock frequency than a single processor at a high frequency. For example, the ARM11 MPCore multiprocessor varies both the number of active cores and their clock frequency to process a workload with maximum efficiency.
As illustrated in Figure 1 above, we can expect future embedded systems to contain several shared-memory multiprocessor nodes (on individual boards or blades or cards), connected using a network. Each individual processor core in the system has a local level 1 cache, and shares the level 2 cache (if present), the main memory and the device set with the other cores on the same node. Within each SMP node, the caches are kept coherent so that the node presents a single unified memory from the perspective of the software, despite the memory being spread out over several processor-local caches.
The Software Breaks
Ignoring the issues of creating efficient parallel programs for the moment, even getting programs to function correctly in an SMP environment is harder than on a single processor. Existing software that has been developed on a single processor might not work correctly when transitioned onto a multiprocessor. That an application works correctly in a multitasking environment does not imply that it works correctly in a multiprocessing environment; serious new issues occur with true concurrency.
True parallel execution (or concurrency) makes it hard to establish the precise order of events in different concurrent tasks. The propagation time of information between processors in a shared-memory multiprocessor is not zero (even if it is fairly short, maybe a few hundred clock cycles at most), and this is sufficient to create subtle random variations in the execution, which can snowball. A multiprocessor by nature exhibits chaotic behavior, where a small change in initial parameters gives rise to large differences in system state over time.
The system is actually inherently unpredictable, at least in terms of timing, and correct function can only be achieved by allowing for this and designing code that works even in what seem like bizarre circumstances.
The following catalogue of problems attempts to highlight the many new and interesting ways in which software can break on a multiprocessor.
Latent Concurrency Problems. There can be latent problems in an existing, proven, multitasking workload that runs just fine on a single processor. The presence of true concurrency makes mistakes in protecting against concurrent accesses much more likely to trigger and cause program errors. As the timing of tasks becomes more variable, and they run in parallel for longer periods of time, the task set is simply subjected to more stress. This effect is similar to optimizing a program in a C compiler: optimizations might expose bugs in the program that were previously hidden. The error was always there; it just didn’t manifest itself.
Missing Reentrancy. To make efficient use of a multiprocessor, all code that is shared between multiple tasks has to support reentrant execution. This means using locks to protect shared data and allocating local data for each invocation of a function. Shared state between invocations of the same function has to be avoided. In a multiprocessor environment, multiple tasks will actually use the same shared function simultaneously much more frequently (and thus trigger bugs). This effect is especially important for operating systems and shared libraries, as such code will be used heavily by multiple tasks.
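A minimal C sketch of the difference (the function names are illustrative, not from any particular library): the first variant keeps its result in a static buffer shared by every caller, while the reentrant variant makes the caller supply the storage, so no state survives between invocations.

```c
#include <stdio.h>
#include <string.h>

/* Non-reentrant: all callers share one static buffer. Two tasks
 * calling this simultaneously on different cores will overwrite
 * each other's results. */
static char shared_buf[32];
const char *format_id_unsafe(int id) {
    snprintf(shared_buf, sizeof shared_buf, "task-%d", id);
    return shared_buf;
}

/* Reentrant: the caller supplies the storage, so concurrent calls
 * from different tasks cannot interfere. */
int format_id_r(int id, char *buf, size_t len) {
    return snprintf(buf, len, "task-%d", id);
}
```

This is the same design shift the C library made with functions like `strtok_r` replacing `strtok`.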
Priorities do not Provide Mutual Exclusion. In application code written for a traditional single-processor, strictly priority-scheduled RTOS, a common design pattern to protect shared data is to make all tasks that access the same data run at the same priority level.
With a strict priority-driven scheduler, each process will run to completion before the next process can run, giving exclusive access without any locking overhead. This will fail when true concurrency is introduced in a multiprocessor, as Figure 2 below shows for a typical case:
This code is multitasking-safe on a single processor, but will break on a multiprocessor. This means that even existing proven code will have to be reviewed and tested before it can be assumed to run correctly on a multiprocessor. Explicit locking has to be introduced in order to handle the access to shared data.
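Using POSIX threads as a stand-in for a specific RTOS API (the names `worker` and `counter_lock` are ours), the explicit locking might be sketched like this:

```c
#include <pthread.h>

/* Shared data, now guarded by an explicit lock instead of relying on
 * equal task priorities for mutual exclusion. */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                       /* critical section */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}
```

With the mutex, two instances of `worker` running truly in parallel still produce a deterministic final count; without it, updates would be lost on a multiprocessor regardless of task priorities.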
One solution proposed for maintaining the semantics of single-processor priority scheduling on an SMP is to only run the task(s) with the highest priority. Thus, if there is only a single highest-priority task in a system, only one processor will be used and the rest left idle. This ensures that high-priority tasks do not need to worry about simultaneous execution of lower-priority tasks, but does not solve the problem for tasks with the same priority when priority is used (incorrectly) as a serialization device.
Interrupts are not Locks. In operating system and device driver code, you can no longer simply assume that you get exclusive access to shared data by turning off all interrupts. Instead, SMP-safe locking mechanisms have to be used. Redesigning the locking mechanisms in an OS kernel or driver (or user application, in case it makes use of interrupt management) is a major undertaking in the change to multiprocessors, and getting an operating system to run efficiently on an SMP will take time.
Race Conditions. Race conditions are situations where the outcome of a computation differs depending on which participating task gets to a certain point first. They typically trigger when some piece of code takes longer than expected to execute or timing is disturbed in some other way. Since race conditions are inherently timing-related, they are among the hardest bugs to find. The name comes from the fact that the tasks are racing forward in parallel, and the result depends on who gets to a certain point first.
Shown on the left in Figure 3, above, is the classic data race, where a piece of data is shared between two tasks and the tasks do not protect the common data with a lock. In this case, both tasks can be editing the data at the same time and will not correctly account for the updates from the other task.
If task 1 were fast enough, it could finish its editing before task 2 begins, but there are no guarantees of this. Note that shared data is often a complex data structure, and the net result of a race is that different parts of the structure have been updated by different tasks, leading to an inconsistent data state.
On the right in Figure 3 above is another type of race, the message race, where one task is expecting a series of messages from other tasks. Such messages can come in any order, as there is no synchronization between the tasks; if task 2 implicitly expects a certain order, it will get the wrong results if task 3 happens to send before task 1. Whenever the ordering of events is important, explicit synchronization has to be in place.
Note that races occur in multitasking single-processor systems too, but there they are less likely to trigger, as they require a task switch to occur at an unlucky time. In a truly concurrent system, races will happen more often, as discussed earlier.
Deadlocks. When all shared data is protected by locks, you get into another situation where the locking itself can be the cause of errors. If two tasks take multiple locks in different orders, they can get stuck, each waiting for the other task to release the other lock. A simple but fairly typical example is shown in Figure 4 below:
Task T1 locks lock L1 first, which protects the variable V1. After a while, T1 also needs to work on variable V2, protected by lock L2, and thus needs to lock L2 while still holding L1. This code in itself is sound, as all accesses to shared data are correctly protected by locks. In task T2, work is being performed on variable V2, and lock L2 is taken. When calling the function foo(), V1 is also accessed, and locking is in place.
Assume a scenario where T1 and T2 start at the same time, and manage to obtain one lock each. When T1 gets to the point lock(L2), it stops and waits, as T2 is holding that lock. Slightly later, T2 gets to the call to function foo(), and duly tries to lock L1. At this point, we are stuck, as T1 and T2 are mutually waiting for the other task to release a lock, which will never happen. We have a deadlock.
The example above illustrates a common cause of deadlocks: calling into functions like foo(), which do locking invisible to the caller. In a complex software system, it easily happens that locks get taken in a bad order if no measures are taken to ensure consistent locking orders.
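The standard defense is to impose a single global lock order. A small C sketch, assuming POSIX threads and reusing the L1/L2/V1/V2 names from Figure 4: every task that needs both locks takes L1 before L2, so the circular wait can never form.

```c
#include <pthread.h>

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;  /* protects V1 */
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;  /* protects V2 */
static int V1 = 0, V2 = 0;

/* Both tasks acquire locks in the same global order: L1, then L2.
 * With a fixed order, no task can hold L2 while waiting for L1,
 * so the deadlock cycle of Figure 4 is impossible. */
static void *update_both(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&L1);
        pthread_mutex_lock(&L2);
        V1++;
        V2++;
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
    }
    return NULL;
}
```

The hard part in practice is enforcing the order across functions like foo() that lock internally; documenting each lock's place in the global order is one common convention.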
Partial Crashes. When a computation involves multiple tasks that compute different parts of an overall solution, it is possible that one (or more) of the tasks crashes during the computation. This problem in a single task will then escalate to a parallel problem, as the other tasks wait forever for the crashed task to get back with its part of the result.
Silent Parallelization Errors. Parallel code written with explicit calls to threading libraries will naturally have to check whether thread creation and other operations succeeded, but when using indirect parallel programming by compiler directives like OpenMP, error handling is often suboptimal.
The compiler will create extra code in the application to start tasks and handle synchronization, and this code does not have a way to report problems to the main user code. If a program fails to create the tasks it needs at run-time, it might just crash, hang or fail silently, with no message to the user or programmer indicating a problem. Such behavior makes this approach unsuitable for high-reliability code where error recovery is necessary.
Bad Timing Assumptions. Missing synchronization is a common theme in parallel programming bugs. One particular variant to be wary of is the assumption that one task is going to finish its next piece of work long before some other task needs the results of that work. In essence, as a task T1 is doing something short and simple, we expect that work to be completed before a task T2 finishes something long and complicated, given that they start at the same time. A simple example is shown below in Figure 5:
Here, we expect T1 to finish writing V long before T2 is done with its initialization. However, it is possible that T2 in some situation completes its initialization before T1 can complete its write. This sort of bug typically triggers under heavy load or in unusual situations in a shipping system.
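The cure is explicit synchronization rather than a timing assumption. A sketch with POSIX condition variables, reusing the T1/T2/V names from Figure 5 (the helper names are ours): T2 blocks until T1 has actually announced the write, however long either side takes.

```c
#include <pthread.h>

static int V = 0;
static int v_written = 0;
static pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  sync_cond = PTHREAD_COND_INITIALIZER;

/* T1: write V, then announce the write explicitly. */
static void *t1_write_v(void *arg) {
    (void)arg;
    pthread_mutex_lock(&sync_lock);
    V = 42;
    v_written = 1;
    pthread_cond_signal(&sync_cond);
    pthread_mutex_unlock(&sync_lock);
    return NULL;
}

/* T2: instead of assuming "T1 must be done by now", block until
 * the write has actually happened. */
static int t2_read_v(void) {
    pthread_mutex_lock(&sync_lock);
    while (!v_written)
        pthread_cond_wait(&sync_cond, &sync_lock);
    int v = V;
    pthread_mutex_unlock(&sync_lock);
    return v;
}
```

The while-loop around the wait matters: condition waits can wake spuriously, so the condition itself must be rechecked.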
Relaxed Memory Ordering Wreaks Havoc
For performance reasons, a multiprocessor employs various forms of relaxed memory ordering (also known as weak memory consistency models). Basically, the computer architecture specifically allows memory operations to be reordered in order to avoid stalling a processor when the memory system is slow.
In most such models, memory operations from one processor may be observed in a different order from the viewpoint of another processor. This is necessary in order to get any efficiency out of a multiprocessor system. There are a number of relaxed memory orderings, differing in how operations can bypass each other.
(Note that the potential operation reorderings allowed by a specific memory consistency model are precisely defined, but that even so, understanding the implications of a particular memory consistency model requires deep consideration. University students usually find memory consistency the hardest part of computer architecture.)
Understanding relaxed memory orders is probably one of the hardest parts of understanding parallel computing, and unfortunately, they are visible to the programmer, since they affect data-transfer operations between parallel threads. Many data exchange algorithms that are correct on multitasking single-processor systems break when used on multiprocessors.
For example, reads can almost always be performed out of order with respect to the program, and sometimes writes might be seen in different orders on different processors, especially when they originate from different processors.
In general, it is necessary to use special operations like memory barriers to guarantee that all processors in a system have seen a certain memory operation and that the state is thus globally consistent. Failing to deal with memory consistency will result in intermittent timing-sensitive bugs caused by processors observing the same execution in different orders.
One example is the classic Dekker locking algorithm, shown in Figure 6 below, which is perfectly valid on a single processor (assuming that each read of a flag variable is atomic), no matter how tasks are interleaved.
Assuming that all variables are zero to begin with, it is quite possible to read old values for x and y in task T2 despite the use of a “lock,” since there is no guarantee that the writes to variables x and y are seen by task T2 before the write to flag1. With a relaxed memory ordering, you cannot assume that writes complete in the same order as they are specified in the program text.
For example, if the write to flag1 hits the local cache while the writes to x and y miss the cache, the net result could be that the writes to x and y take much longer to propagate than the write to flag1, which other processors then observe first. The execution is illustrated on the right in the picture above (in which we assume that each task runs on its own processor).
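On a modern toolchain, the portable fix is to publish data through atomic operations with explicit ordering. The following sketch uses C11 `<stdatomic.h>` (which postdates the kind of code in Figure 6) rather than raw barrier instructions; a release store on flag1 forces the writes to x and y to become visible before the flag, and an acquire load on the reader side pairs with it.

```c
#include <pthread.h>
#include <stdatomic.h>

static int x, y;                /* payload, plain variables */
static atomic_int flag1;        /* publication flag */

/* Publisher: the release store orders the writes to x and y before
 * the store to flag1, as observed by any acquiring reader. */
static void *publish(void *arg) {
    (void)arg;
    x = 1;
    y = 2;
    atomic_store_explicit(&flag1, 1, memory_order_release);
    return NULL;
}

/* Reader: the acquire load pairs with the release store, so once
 * flag1 reads as 1, the values of x and y are guaranteed visible. */
static int read_payload(void) {
    while (atomic_load_explicit(&flag1, memory_order_acquire) == 0)
        ;  /* spin until the publisher has announced the data */
    return x + y;
}
```

The release/acquire pair is exactly the guarantee the unsynchronized Figure 6 code was implicitly (and wrongly) assuming the hardware would provide.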
Note that an additional task 3 can see t