Applying the fundamentals of parallel programming to multiprocessor designs: Part 1
By Shameem Akhter and Jason Roberts, Intel Corp.
Parallel programming uses threads to enable multiple operations to
proceed simultaneously. The entire concept of
parallel programming
centers on the design, development, and deployment of threads within an
application and the coordination between threads and their respective
operations.
In Part 1 we will examine how to do multitasking
in a parallel
programming applications by breaking up programming tasks into chunks
that are suitable for threading. In Part 2, these techniques will be
applied to the apparently serial problem of error diffusion.
Designing for Threads
Developers who are unacquainted with parallel programming generally
feel comfortable with traditional programming models, such as
object-oriented programming (OOP).
In this case, a program begins at a defined point, such as the
main() function, and works through a series of tasks in succession. If
the program relies on user interaction, the main processing instrument
is a loop in which user events are handled.
From each allowed event - a button click, for example, the program
performs an established sequence of actions that ultimately ends with a
wait for the next user action.
When designing such programs, developers enjoy a relatively simple
programming world because only one thing is happening at any given
moment. If program tasks must be scheduled in a specific way, it's
because the developer imposes a certain order on the activities. At any
point in the process, one step generally flows into the next, leading
up to a predictable conclusion, based on predetermined parameters.
To move from this linear model to a parallel programming model,
designers must rethink the idea of process flow. Rather than being
constrained by a sequential execution sequence, programmers should
identify those activities that can be executed in parallel.
To do so, they must see their programs as a set of tasks with
dependencies between them. Breaking programs down into these individual
tasks and identifying dependencies is known as decomposition.
A problem may be decomposed in several ways: by task, by data, or
by data flow. Table 3.1 below
summarizes these forms of decomposition. As you shall see shortly,
these different forms of decomposition mirror different types of
programming activities.
 |
| Table
3.1 Summary of the Major Forms of Decomposition |
Task Decomposition
Decomposing a program by the functions that it performs is called task
decomposition. It is one of the simplest ways to achieve parallel
execution. Using this approach, individual tasks are catalogued.
If two of them can run concurrently, they are scheduled to do so by
the developer. Running tasks in parallel this way usually requires
slight modifications to the individual functions to avoid conflicts and
to indicate that these tasks are no longer sequential.
If we were discussing gardening, task decomposition would suggest
that gardeners be assigned tasks based on the nature of the activity:
if two gardeners arrived at a client's home, one might mow the lawn
while the other weeded. Mowing and weeding are separate functions
broken out as such.
To accomplish them, the gardeners would make sure to have some
coordination between them, so that the weeder is not sitting in the
middle of a lawn that needs to be mowed.
In programming terms, a good example of task decomposition is word
processing software, such as Microsoft Word. When the user opens a very
long document, he or she can begin entering text right away. While the
user enters text, document pagination occurs in the background, as one
can readily see by the quickly increasing page count that appears in
the status bar.
Text entry and pagination are two separate tasks that its
programmers broke out by function to run in parallel. Had programmers
not designed it this way, the user would be obliged to wait for the
entire document to be paginated before being able to enter any text.
Many of you probably recall that this wait was common on early PC word
processors.
Data Decomposition
Data
decomposition, also known as data-level
parallelism,
breaks down tasks by the data they work on rather than by the nature of
the task. Programs that are broken down via data decomposition
generally have many threads performing the same work, just on different
data items.
For example, consider recalculating the values in a large
spreadsheet. Rather than have one thread perform all the calculations,
data decomposition would suggest having two threads, each performing
half the calculations, or n threads performing 1/nth the work.
If the gardeners used the principle of data decomposition to divide
their work, they would both mow half the property and then both weed
half the flower beds. As in computing, determining which form of
decomposition is more effective depends a lot on the constraints of the
system.
For example, if the area to mow is so small that it does not need
two mowers, that task would be better done by just one gardener - that
is, task decomposition is the best choice - and data decomposition
could be applied to other task sequences, such as when the mowing is
done and both gardeners begin weeding in parallel.
As the number of processor cores increases, data decomposition
allows the problem size to be increased. This allows for more work to
be done in the same amount of time.
To illustrate, consider the gardening example. Two more gardeners
are added to the work crew. Rather than assigning all four gardeners to
one yard, we can we can assign the two new gardeners to another yard,
effectively increasing our total problem size.
Assuming that the two new gardeners can perform the same amount of
work as the original two, and that the two yard sizes are the same,
we've doubled the amount of work done in the same amount of time.
Data Flow Decomposition
Many times, when decomposing a problem, the critical issue isn't what
tasks should do the work, but how the data flows between the different
tasks. In these cases, data flow decomposition breaks up a problem by
how data flows between tasks.
The producer/consumer problem is a well known example of how data
flow impacts a programs ability to execute in parallel. Here, the
output of one task, the producer, becomes the input to another, the
consumer. The two tasks are performed by different threads, and the
second one, the consumer, cannot start until the producer finishes some
portion of its work.
Using the gardening example, one gardener prepares the tools - that
is, he puts gas in the mower, cleans the shears, and other similar
tasks - for both gardeners to use. No gardening can occur until this
step is mostly finished, at which point the true gardening work can
begin.
The delay caused by the first task creates a pause for the second
task, after which both tasks can continue in parallel. In computer
terms, this particular model occurs frequently.
In common programming tasks, the producer/consumer problem occurs
in several typical scenarios. For example, programs that must rely on
the reading of a file fit this scenario: the results of the file I/O
become the input to the next step, which might be threaded.
However, that step cannot begin until the reading is either
complete or has progressed sufficiently for other processing to kick
off. Another common programming example is parsing: an input file must
be parsed, or analyzed semantically, before the back-end activities,
such as code generation in a compiler, can begin.
The producer/consumer problem has several interesting dimensions:
1) The
dependence created between consumer and producer can cause significant
delays if this model is not implemented correctly. A
performance-sensitive design seeks to understand the exact nature of
the dependence and diminish the delay it imposes. It also aims to avoid
situations in which consumer threads are idling while waiting for
producer threads.
2) In the ideal
scenario, the hand-off between producer and consumer is completely
clean, as in the example of the file parser. The output is
context-independent and the consumer has no need to know anything about
the producer. Many times, however, the producer and consumer components
do not enjoy such a clean division of labor, and scheduling their
interaction requires careful planning.
3) If the
consumer is finishing up while the producer is completely done, one
thread remains idle while other threads are busy working away. This
issue violates an important objective of parallel processing, which is
to balance loads so that all available threads are kept busy. Because
of the logical relationship between these threads, it can be very
difficult to keep threads equally occupied.
Next, we'll take a look at the pipeline pattern that allows
developers to solve the producer/consumer problem in a scalable
fashion.
Implications of Different
Decompositions
Different decompositions provide different benefits. If the goal, for
example, is ease of programming and tasks can be neatly partitioned by
functionality, then task decomposition is more often than not the
winner.
Data decomposition adds some additional code-level complexity to
tasks, so it is reserved for cases where the data is easily divided and
performance is important.
The most common reason for threading an application is performance.
And in this case, the choice of decompositions is more difficult. In
many instances, the choice is dictated by the problem domain: some
tasks are much better suited to one type of decomposition.
But some tasks have no clear bias. Consider for example, processing
images in a video stream. In formats with no dependency between frames,
you'll have a choice of decompositions. Should they choose task
decomposition, in which one thread does decoding, another color
balancing, and so on, or data decomposition, in which each thread does
all the work on one frame and then moves on to the next?
To return to the analogy of the gardeners, the decision would take
this form: If two gardeners need to mow two lawns and weed two flower
beds, how should they proceed? Should one gardener only mow - that is,
they choose task based decomposition - or should both gardeners mow
together then weed together?
In some cases, the answer emerges quickly - for instance when a
resource constraint exists, such as only one mower. In others where
each gardener has a mower, the answer comes only through careful
analysis of the constituent activities. In the case of the gardeners,
task decomposition looks better because the start-up time for mowing is
saved if only one mower is in use.
Ultimately, you determine the right answer for your application's
use of parallel programming by careful planning and testing. The
empirical timing and evaluation plays a more significant role in the
design choices you make in parallel programming than it does in
standard single-threaded programming.
Challenges You'll Face
The use of threads enables you to improve performance significantly by
allowing two or more activities to occur simultaneously. However,
developers cannot fail to recognize that threads add a measure of
complexity that requires thoughtful consideration to navigate
correctly.
This complexity arises from the inherent fact that more than one
activity is occurring in the program. Managing simultaneous activities
and their possible interaction leads you to confronting four types of
problems:
1) Synchronization
is the process by which two or more threads coordinate their
activities. For example, one thread waits for another to finish a task
before continuing.
2) Communication
refers to the bandwidth and latency issues associated with exchanging
data between threads.
3) Load balancing
refers to the distribution of work across multiple threads so that they
all perform roughly the same amount of work.
4) Scalability
is
the challenge of making efficient use of a larger number of threads
when software is run on more-capable systems. For example, if a program
is written to make good use of four processor cores, will it scale
properly when run on a system with eight processor cores?
Each of these issues must be handled carefully to maximize
application performance. Subsequent chapters describe many aspects of
these problems and how best to address them on multi-core systems.
Parallel Programming Patterns
For years object-oriented programmers have been using design patterns
to logically design their applications. Parallel programming is no
different than object-oriented programming - parallel programming
problems generally fall into one of several well known patterns.
A few of the more common parallel programming patterns and their
relationship to the aforementioned decompositions are shown in Table 3.2 below.
 |
| Table
3.2 Common Parallel Programming Patterns |
In this section, we'll provide a brief overview of each pattern and
the types of problems that each pattern may be applied to, including:
1) Task-level
Parallelism Pattern. In many cases, the best way to achieve
parallel execution is to focus directly on the tasks themselves. In
this case, the task-level parallelism pattern makes the most sense. In
this pattern, the problem is decomposed into a set of tasks that
operate independently.
It is often necessary remove dependencies between tasks or separate
dependencies using replication. Problems that fit into this pattern
include the so-called embarrassingly parallel problems, those where
there are no dependencies between threads, and replicated data
problems, those where the dependencies between threads may be removed
from the individual threads.
2) Divide and
Conquer Pattern. In the divide and conquer pattern, the problem
is divided into a number of parallel sub-problems. Each sub-problem is
solved independently. Once each sub-problem is solved, the results are
aggregated into the final solution. Since each sub-problem can be
independently solved, these sub-problems may be executed in a parallel
fashion.
The divide and conquer approach is widely used on sequential
algorithms such as merge sort. These algorithms are very easy to
parallelize. This pattern typically does a good job of load balancing
and exhibits good locality; which is important for effective cache
usage.
3) Geometric
Decomposition Pattern. The geometric decomposition pattern is
based on the parallelization of the data structures used in the problem
being solved. In geometric decomposition, each thread is responsible
for operating on data 'chunks'. This pattern may be applied to problems
such as heat flow and wave propagation.
4) Pipeline
Pattern. The idea behind the pipeline pattern is identical to
that of an assembly line. The way to find concurrency here is to break
down the computation into a series of stages and have each thread work
on a different stage simultaneously.
5) Wavefront
Pattern. The wavefront pattern is useful when processing data
elements along a diagonal in a two-dimensional grid. This is shown in Figure 3.1 below.
 |
| Figure
3.1 Wavefront Data Access Pattern |
The numbers in Figure 3.1 above illustrate the order in which
the data elements are processed. For example, elements in the diagonal
that contains the number "3" are dependent on data elements "1" and "2"
being processed previously.
The shaded data elements in Figure
3.1 indicate data that has already been processed. In this
pattern, it is critical to minimize the idle time spent by each thread.
Load balancing is the key to success with this pattern.
For a more extensive and thorough look at parallel programming
design patterns, refer to the book Patterns
for Parallel Programming (Mattson 2004).
Next in Part 2: A motivating
problem - Error Diffusion
Used
with the permission of the publisher, Intel Press, this series of two
articles is based on copyrighted material from "Multicore
Programming," by Shameem Akhter and Jason Roberts. This book can be
purchased on line.
Shameem Akhter is a platform
architect at Intel Corp., focusing on single
socket multicore architecture and performance analysis. Jason Roberts
is a senior software engineer at Intel, working on a number of
different multithreaded software products ranging from desktop,
handheld and embedded DSP platforms.