Since the early 1980s, embedded software developers have been taking
advantage of the steady progress of processor performance. You could
count on new processors being made available that would increase the
performance of the system without disturbing the software structure
overly much. If you used a 500 MHz processor last year, you could buy a
1000 MHz processor next year. You would get a doubling in your
application performance without any other work than designing in a new,
software-compatible chip.
This performance increase was often factored into application and
system design: The software designers planned and implemented
application sets that were too heavy for the currently available
processors, counting on a new, faster processor to come on the market
and solve their performance problems before the product got to market.
Such a strategy was necessary to stay competitive.
This comfortable state was driven by the steady progress of the
semiconductor industry that kept packing more transistors into smaller
packages at higher clock frequencies, enabling performance to increase
by both architectural innovations like more parallel pipelines, by
adding resources like on-chip caches and memory controllers, and by
increasing the clock frequency.
The 32-bit PowerPC family offers a typical case study on this
progression. Starting with a 603 processor, users have been able to
upgrade with full application-software compatibility to the “G3”/750
processor series (rising from 166 to 1100 MHz over time). Next came the
“G4”/7400 series, and then the 64-bit “G5”/970 series, continuing to
provide software engineers a performance increase in a single
processor.
However, in 2004, it became clear that the progress in
single-processor performance began slowing considerably. Due to a
number of chip design and manufacturing issues, the clock-frequency
increase slowed down, and we could not expect to get twice-as-fast
processing from clock speed increases. Instead, the semiconductor
industry turned to parallelism to increase performance.
Using multiple processor cores (known as a multicore) on the same
chip, the theoretical performance per chip can increase dramatically,
even if the performance per core is only improving slowly. Also, the
power efficiency of a multicore implementation is much better than
traditional single-core implementations, which is a factor as important
as absolute performance [1].
Consequently, every high-performance processor family is moving to
multiprocessor designs. Freescale has announced the PowerPC 8641D, with
two G4-class cores on one chip. IBM has the PowerPC 970MP, with two G5
cores. Intel is pitching dual-core Pentium-M machines to the embedded
market, and PMC-Sierra has been selling dual-core RM9200 MIPS64
processors for some time. Cavium has announced up to 16-way parallel
MIPS64-based processors. Parallel machines are here in force.
The move to parallel hardware, however, creates problems for
software developers. Applications that have traditionally used single
processors will now have to be parallelized over multiple processors in
order to take advantage of the multiple cores. Programs that used to
run on single processors will have to run on multiprocessors. Apart
from the act of creating parallel software, debugging is greatly
complicated by the combination of parallel execution and tighter
packaging [2][8].
This will be the greatest challenge to the embedded software
industry for a very long time. Both tools and thinking need to change.
The level of conceptual change is comparable to the introduction of
multitasking operating systems, virtual memory and object-oriented
programming.
Multi-what?
Multiprocessing systems can be built in a number of ways, and there are
several terms referring to various forms of packaging and implementing
multiprocessor systems.
1) Multitasking means that
a single processor can run several different software tasks, scheduled
by a (real-time) operating system.
2) A multiprocessor is any
computer system using more than one processor to perform its work.
3) A shared-memory or
symmetric multiprocessor (SMP) is a design where the processors in a
multiprocessor share the same memory, and all can access the same data
and run the same tasks. In contrast, an asymmetric multiprocessor (AMP)
uses several processors with their own local memories, often using
different architectures.
4) A multicore processor is
a single processor chip containing multiple processor cores, which can
be identical (as on the Freescale MPC8641D) or of different types (as
on a Texas Instruments OMAP). Essentially, it is a multiprocessor on a
single chip. There is some debate on precisely what deserves to be
called multicore [4], but in this paper we suppose that anything with
more than one processing core on a single chip qualifies.
5) Hardware/symmetric
multithreading (S/MT) is a technique where a single processor core is
used to execute several parallel threads of computation. This provides
less performance than a true multicore design, but also has a much
lower implementation cost. Examples include the IBM Power5 and Intel
Pentium4 HyperThreading.
6) A chip multiprocessor
(CMP) is a processor employing multithreading and/or multicore
techniques to create a parallel computer on a single chip. An example
is the Sun “Niagara” UltraSparc T1. CMP designs are typically SMPs, as
they mainly come from the mainstream PC/server market.
In the rest of this set of articles, we will use the term
“multiprocessor” to denote any computer system containing more than one
thread of control in hardware.
Software
Structure. On the software side, we prefer to use the generic
term “task” to mean a single thread of computation. We want to avoid
confusion between “process” and “processor.” A number of tasks can
share memory in order to implement an application, and this is really
what is meant by a parallel program: A number of tasks cooperate to
perform the function of a particular application.
Many operating systems use the term “process” to mean a single
memory space in which one or more “threads” can run. Between processes,
memory is not shared. Commonly, there is only one thread of computation
inside a process, and people tend to use “process” to mean a single
thread of execution.
Programming Parallel Machines
There are several ways in which to write code to run on a
multiprocessor machine. The most important distinction is between
message passing and shared memory styles. In message-passing,
communication is explicitly programmed, while shared-memory makes
communication implicit in the reading and writing of variables. Note
that message-passing programming can make sense even on a shared-memory
machine, where messages are implemented using shared memory.
Classic supercomputing (weather prediction, astronomy simulations,
airplane design) typically uses sequential languages like C or FORTRAN
with a library (pthreads) or compiler directives (OpenMP, MPI) to
divide a computation task up into parallel tasks. Shared memory and
message passing are both in use, depending on the nature of the
machines used.
This programming tradition usually creates programs that use a
number of tasks in parallel to solve a small part of a single
computation problem. Typically, parallelism is applied to single kernel
loops. This model works well for the target domain, and supercomputing
computations have been successfully parallelized to use 10,000s of
processors in parallel.
In the desktop and server market, commercial multitasking and
multiprocessing workloads are programmed using sequential languages (C,
C++, Ada, Java, COBOL), using libraries, OS APIs or language support.
Synchronization and communication can be provided as part of the
language (Java and Ada), the operating system API, a library like
pthreads or compiler directives like OpenMP and MPI.
To contrast with supercomputing, parallel tasks are typically not
homogeneous, but rather perform quite different computations based on a
common data set. The most successful class of parallel programs in this
space is servers, where each task handles an individual independent
session with clients. Another successful example is using a task to run
a user interface in parallel to the core application logic.
OpenMP is proving to be a popular way to provide parallel
programming support for new machines. For example, the ARM MPcore has
an OpenMP implementation available, and Microsoft provides it for their
Xbox 360 game console [6][16]. Another trend is that virtual
machine-based languages like Java and C# have started to provide
support for managing parallelism as part of the virtual machine
definition.
There are also parallel languages which go beyond the simplistic
approach in Ada, Java and C# to make parallelism an integrated feature
of the language. OCCAM is designed for supercomputing, where specific
computation parts can be specified to execute in parallel (using
constructs such as parallel loops).
Erlang is an interesting language designed for control-plane
applications, using hundreds or thousands of tasks each with their own
local storage and communicating using message-passing. Tasks are used
as the primary means of program structuring — rather like objects are
used on object-oriented programming.
Writing parallel programs seems to be easier where large data sets
are being manipulated, like supercomputing and databases. Parallelizing
programs with smaller data sets is harder by experience. However, there
is one mitigating factor in the current move to multicore
implementations: Onboard a multicore chip, communication is much faster
than in a traditional multi-chip multiprocessor. This helps programmers
write efficient programs, and should allow beneficial parallelization
of more types of programs.
In this series of articles, however, we will mainly consider the
case of using C or C++ with shared memory, as that is the model with
the most subtle debugging problems, and the one that most embedded
programmers will encounter when multicore processors replace old
single-core processors at the core of their products.
Overall, the move to multiprocessing constitutes a break with the
mainstream (and embedded) software-development tradition. Most
programmers have been taught how to program single-threaded programs.
Few real-time operating systems support SMP processing. Debuggers (with
a few exceptions) operate under the assumption that we are debugging a
program that is serial, repeatable and self-contained.
Embedded Multiprocessing
Parallel processing per se is nothing new to embedded systems. Many
embedded applications are using parallel processing with great success.
For example, signal-processing in telecommunications makes use of
arrays of digital signal processors (DSPs) to handle wireless and wired
signals. Routers use multiple input and output ports with multiple
traffic engines to scale to extreme traffic volumes. Automotive
applications use large numbers of small processors spread throughout a
vehicle. Control applications are distributed across multiple processor
cards. Mobile phones contain a multitude of processors in parallel to
handle signal processing, applications, Bluetooth ports, etc.
The motivation for multiprocessor designs have mainly been to
achieve required absolute performance, as well as cost and power
efficiency of the overall solution.
However, most such applications have used asymmetric multiprocessing
(AMP), where each processor has its own local memory and performs work
that has been statically assigned to it. Each processor has its own
private set of tasks running (or tasks that can run), and running tasks
do not migrate between processors. The computations to be performed
naturally divide into independent tasks and are “embarrassingly
parallel” [12][13].
For example, in a mobile phone the user interface can run in
parallel to the radio interface, and a mobile phone base station has
one thread (or more) per active mobile terminal. Such parallel
computation is easier to understand and design than general
shared-memory processing [3].
However, debugging an asymmetric parallel system is still harder
than debugging a single-processor system. Fundamentally, humans are
poor at thinking about parallel systems; we seem to be wired to handle
a single flow of events better than multiple simultaneous flows.
The problem now is that the embedded systems software industry has
to figure out how to employ shared-memory symmetric multiprocessing to
handle applications that have been traditionally implemented on single
processors as single tasks.
Processor designers have assumed that delivering multicore as a way
to avoid the power implications of simply increasing the clock rate is
acceptable and that software engineers will have no difficulty making
use of processor power delivered in this way. While true for servers,
it is not true for most applications of interest, and this creates the
current paradigm shift in the industry.
Seymour Cray says, “If you were plowing a field, which would you
rather use: two strong oxen or 1,024 chickens?” It seems that we are
forced to start harnessing these chickens.
In a shared-memory system, a multiprocessor runs a single instance
of an operating system, and tasks can execute on any processor (as
decided by the operating system scheduler). Shared-memory systems have
many advantages, particularly in balancing loads and migrating tasks
between processors. It is more power-efficient to use several
processors running at a low clock frequency than a single processor at
a high frequency. For example, the ARM11 MPCore multiprocessor varies
both the number of active cores and their clock frequency to process a
workload with maximum efficiency [5].
 |
| Figure
1 |
As illustrated in Figure 1 above,
we can expect future embedded systems to contain several shared-memory
multi-processor nodes (on individual boards or blades or cards),
connected using a network. Each individual processor core in the system
has a local level 1 cache, and shares level 2 cache (if present) and
the main memory and device set with the other cores on the same node.
Within each SMP node, the caches are kept coherent so that the node
presents a single unified memory from the perspective of the software,
despite the memory being spread out in several processor-local caches.
The Software Breaks
Ignoring the issues of creating efficient parallel programs for the
moment, even getting programs to function correctly in an SMP
environment is harder than for a single processor. Existing software
that has been developed on a single processor might not work correctly
when transitioned onto a multiprocessor. That an application works
correctly in a multitasking environment does not imply that it works
correctly in a multiprocessing environment; serious new issues occur
with true concurrency.
True parallel execution (or concurrency) makes it hard to establish
the precise order of events in different concurrent tasks. The
propagation time of information between processors in shared-memory
multiprocessors is not zero (even if it is fairly short, maybe a few
hundred clock cycles at most), and this is sufficient to create subtle
random variations in the execution, which can snowball. A
multiprocessor by nature exhibits chaotic behavior where a small change
in initial parameters gives rise to large differences in system state
over time.
The system is actually inherently unpredictable, at least in terms
of timing, and correct function can only be achieved by allowing for
this and designing code, which works even in what seems like bizarre
circumstances.
The following catalogue of problems attempts to highlight the many
new and interesting ways in which software can break on a
multiprocessor.
Latent
Concurrency Problems. There can be latent problems in an
existing, proven, multitasking workload that runs just fine on a single
processor. The presence of true concurrency makes mistakes in
protecting against concurrent accesses much more likely to trigger and
cause program errors. As the timing of tasks becomes more variable, and
they run in parallel for longer periods of time, the task set is simply
subjected to more stress. This effect is similar to optimizing a
program in a C compiler: optimizations might expose bugs in the program
that were previously hidden. The error was always there; it just didn’t
manifest itself.
Missing
Reentrancy. To make efficient use of a multiprocessor, all code
that is shared between multiple tasks has to support reentrant
execution. This means using locks to protect shared data and to
allocate local data for each time a function is called. Shared state
between invocations of the same function has to be avoided. In a
multi-processor environment, actual occurrences of multiple tasks using
the same shared function simultaneously will occur much more frequently
(and thus trigger bugs. This effect is especially important for
operating systems and shared libraries, as such code will be used
heavily by multiple tasks.
Priorities do
not Provide Mutual Exclusion. In application code written for a
traditional single-processor, strictly priority-scheduled RTOS, a
common design pattern to protect shared data is to make all tasks that
access the same data run at the same priority level.
With a strict priority-driven scheduler, each process will run to
completion before the next process can run, giving exclusive access
without any locking overhead. This will fail when true concurrency is
introduced in a multiprocessor, as shown below in Figure 2 of a typical
case:
 |
| Figure
2 |
This code is multitasking-safe on a single processor, but will break
on a multiprocessor. This means that even existing proven code will
have to be reviewed and tested before it can be assumed to run
correctly on a multiprocessor. Explicit locking has to be introduced in
order to handle the access to shared data.
One solution proposed for maintaining the semantics of
single-processor priority scheduling on an SMP is to only run the
task(s) with highest priority. Thus, if there is only a single
highest-priority task in a system, only one processor will be used and
the rest left idle. This ensures that high-priority tasks do not need
to worry about simultaneous execution of lower-priority tasks, but does
not solve the problem for tasks with the same priority when priority is
used (incorrectly) as a serialization device.
Interrupts are
not Locks. In the operating system and device driver code, you
can no longer simply assume that you get exclusive access to shared
data by turning off all interrupts. Instead, SMP-safe locking
mechanisms have to be used. Redesigning the locking mechanisms in an OS
kernel or driver (or user application, in case it makes use of
interrupt management) is a major undertaking in the change to
multiprocessors, and getting an operating system to run efficiently on
an SMP will take time [9][10].
Race Conditions. Race
conditions are situations where the outcome of a computation differs
depending on which participating task gets to a certain point first.
They typically trigger when some piece of code takes longer than
expected to execute or timing is disturbed in some other way. Since
race conditions are inherently timing-related, they are among the
hardest bugs to find. The name comes from the fact that the tasks are
racing forward in parallel, and the result depends on who gets to a
certain point first.
 |
| Figure
3 |
Shown on the left in Figure 3,
above, is the classic data race, where a piece of data shared
between two tasks and the tasks do not protect the common data with a
lock. In this case, both tasks can be editing the data at the same time
and will not correctly account for the updates from the other task.
If task 1 was fast enough, it could finish its editing before task 2
begins, but there are no guarantees for this. Note that shared data is
often a complex data structure, and the net result of a race is that
different parts of the structure have been updated by different tasks,
leading to an inconsistent data state.
On the right in Figure 3 above
is another type of race, the message race, where one task is expecting
a series of messages from other tasks. Such messages can come in any
order, as there is no synchronization between the tasks; if task 2
implicitly expects a certain order, it will get the wrong results if
task 3 happens to send before task 1. Whenever the ordering of events
is important, explicit synchronization has to be in place.
Note that races occur in multitasking single-processor systems too,
but there they are less likely to trigger, as they require a task
switch to occur at an unlucky time. In a true concurrent system, races
will happen more often, as discussed in earlier.
Deadlocks. When
all shared data is protected by locks, you get into another situation
where the locking itself can be the cause of errors. If two tasks are
taking multiple locks in different order, they can get stuck, both
waiting for the other task to release the other lock. A simple but
fairly typical example is shown in Figure 4 below:
 |
| Figure
4 |
Task T1 locks lock L1 first, which protects the variable V1. After
awhile, T1 also needs to work on variable V2, protected by lock L2, and
thus needs to lock L2 while still holding L1. This code in itself is
sound, as all accesses to shared data is correctly protected by locks.
In task T2, work is being performed on variable V2, and lock L2 is
taken. When calling the function foo(), V1 is also accessed, and
locking is in place.
Assume a scenario where T1 and T2 start at the same time, and manage
to obtain one lock each. When T1 gets to the point lock(L2), it stops
and waits as T2 is holding that lock. Slightly later, T2 gets to the
call to function foo(), and duly tries to lock L1. At this point, we
are stuck, as T1 and T2 are mutually waiting for the other task to
release a lock, which will never happen. We have a deadlock.
The example above illustrates a common cause of deadlocks: calling
into functions like foo(), which do locking invisible to the caller. In
a complex software system, it happens easily that locks get taken in a
bad order if no measures are taken to ensure consistent locking orders.
Partial Crashes.
When a computation involves multiple tasks that compute different parts
of an overall solution, it is possible that one (or more) of the tasks
crash during the computation. This problem in a single task will then
escalate to a parallel problem, as the other tasks wait forever for the
crashed task to get back with its part of the result.
Silent
Parallelization Errors. Parallel code written with explicit
calls to threading libraries will naturally have to check whether
thread creation and other operations succeeded, but when using indirect
parallel programming by compiler directives like OpenMP, error handling
is often suboptimal.
The compiler will create extra code in the application to start
tasks and handle synchronization, and this code does not have a way to
report problems to the main user code. If a program fails to create the
tasks it needs at run-time, it might just crash or hang or fail
silently with no message to the user or programmer indicating a
problem. Such behavior makes it unsuitable for high-reliability code
where error recovery is necessary.
Bad Timing
Assumptions. Missing synchronization is a common theme in
parallel programming bugs. One particular variant to be wary of is the
assumption that one task is going to finish its next work long before
some other task needs the results of this work. In essence, as a task
T1 is doing something short and simple, we expect that work to be
completed before a task T2 finishes something long and complicated, and
we know that they start out at the same time. A simple example is shown
below in Figure 5:
 |
| Figure
5 |
Here, we expect T1 to finish writing V long before T2 is done with
its initialization. However, it is possible that T2 in some situation
completes its initialization before T1 can complete its write. This
sort of bug typically triggers under heavy load or unusual situations
in a shipping system.
Relaxed Memory Ordering Wreaks
Havoc
For performance reasons, a multiprocessor employs various forms of
relaxed memory orders (also known as weak memory consistency models).
Basically, the computer architecture specifically allows memory
operations performed to be reordered in order to avoid stalling a
processor when the memory system is slow.
In most such models, memory operations from one processor may be
observed in a different order from the viewpoint of another processor.
This is necessary in order to get any efficiency out of a
multiprocessor system. There are a number of relaxed memory orderings1,
differing in how operations can bypass each other [14].
(Note that the potential operation reorderings allowed by a specific
memory consistency model are precisely defined, but that even so,
understanding the implications of a particular memory consistency model
requires deep considerations. University students usually find memory
consistency the hardest part of computer architecture. )
Understanding relaxed memory orders is probably one of the hardest
parts of understanding parallel computing [13], and unfortunately, they
are visible to a programmer since they affect data-transfer operations
between parallel threads. Many data exchange algorithms that are
correct on multitasking single processor systems break when used on
multiprocessors.
For example, reads can almost always be performed out of order with
respect to the program, and sometimes writes might be seen in different
orders on different processors, especially when they originate from
different processors.
In general, it is necessary to use special operations like memory
barriers to guarantee that all processors in a system have seen a
certain memory operation and that the state is thus globally
consistent. Failing to deal with memory consistency will result in
intermittent timing-sensitive bugs caused by processors observing the
same execution in different orders.
One example is the classic Dekker locking algorithm shown in Figure 6 below, which is perfectly
valid on a single processor (assuming that each read of a flag variable
is atomic), no matter how tasks are interleaved.
 |
| Figure
6 |
Assuming that all variables are zero to begin with, it is quite
possible to read old values for x and y in task T2 despite the use of a
“lock,” since there is no guarantee that the writes to variables x and
y are seen by task T2 before the write to flag1. With a relaxed memory
ordering, you cannot assume that writes complete in the same order as
they are specified in the program text.
For example, if the write to flag1 hit the local cache, and x and y
missed the cache, the net result could be that the write to flag1 takes
a longer time to propagate. The execution is illustrated on the right
in the picture above (in which we assume that each task runs on its own
processor).
Note that an additional task 3 can see the writes in the “expected”
order – which is the common case. The rare case is the problem case.
The programming solution is to put in barriers, forcing memory writes
and reads to complete before the tasks continue beyond the lock. The
cost is in performance, as processors stall when memory operations run
to completion.
Also note that using the C keyword “volatile” has no effect on
memory consistency; it does guarantee that the compiled code writes
variable values back to memory, but provides no guarantees as to when
other processors will see the change. This is a symptom of a general
problem in that most language definitions do not consider the
implications of multiprocessors.
A more complex example that fails for the same reason is the
following scheduling code from the SPLASH program in Figure 7 below:
 |
| Figure
7 |
Here, the worker tasks T2…Tn wait for the linked lists of work units
to become available from the master thread T1, and then grab work units
off of it protected by a lock. However, considering the possibility of
write reordering, the work unit data read by a worker task need not be
complete just because the linked list has been updated. And if the
locking is not multiprocessor-aware, the whole task queue can be
destroyed by processors having a different idea of the value of Head.
To read Part 2 in this series go to
"Debugging parallel programs."
This article is
excerpted from a paper of the same name
presented at the Embedded Systems Conference Silicon Valley 2006. Used
with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.
Jakob Engblom is Business
Development Manager at Virtutech
and is also an adjunct professor at the department of Information
Technology at Uppsala university. He has been speaking about embedded
systems programming at university courses and industry trade shows for
the past five years. His interests include embedded systems
programming, simulation technology, real-time systems, and compiler
technology.