By Shameem Akhter and Jason Roberts, Intel Corp.
The major CPU vendors are shifting gears, choosing to add parallelism
support on-chip with multi-core processors in order to avoid many of
the technological hurdles in boosting speeds, while still offering a
better performing processor.
However, if your software does not take advantage of these multiple
cores, it may not run any faster. That is where OpenMP
plays a key role by providing an easy method for threading applications
without burdening the programmer with the complications of creating,
synchronizing, load balancing, and destroying threads.
The OpenMP standard was formulated in 1997 as an API for writing portable,
multithreaded applications. It started as a Fortran-based standard, but
later
grew to include C and C++. The current version is OpenMP Version
2.5,
which supports Fortran, C, and C++.
Intel C++ and Fortran compilers support the OpenMP Version 2.5
standard. The OpenMP programming model provides a platform-independent
set of compiler pragmas, directives, function calls, and environment
variables that explicitly instruct the compiler how and where to use
parallelism in the application.
Many loops can be threaded by inserting only one pragma right before
the loop, as demonstrated by examples in this chapter. By leaving the
nitty-gritty details to the compiler and OpenMP runtime library, you
can spend more time determining which loops should be threaded and how
to best restructure the algorithms for performance on multi-core
processors. The full potential of OpenMP is realized when it is used to
thread the most time-consuming loops; that is, the hot spots.
Tackling the topic of OpenMP is an intimidating task. Therefore,
this series of articles serves as a bridge for you, allowing you to
reach a point where you have a fundamental understanding of threading
with OpenMP from which you can build your broader practical knowledge.
The power and simplicity of OpenMP can be demonstrated by looking at
an example. The following loop converts each 32-bit RGB (red, green,
blue) pixel in an array into an 8-bit grayscale pixel. The one pragma,
which has been inserted immediately before the loop, is all that is
needed for parallel execution under OpenMP.
Let's take a closer look at the loop. First, the example uses
work-sharing, which is the general term that OpenMP uses to describe
distributing work across threads. When work-sharing is used with the
for construct, as shown in this example, the iterations of the loop are
distributed among multiple threads.
The OpenMP implementation determines how many threads to create and
how best to manage them. All the programmer needs to do is to tell
OpenMP which loop should be threaded. No need for programmers to add a
lot of codes for creating, initializing, managing, and killing threads
in order to exploit parallelism.
OpenMP compiler and runtime library take care of these and many
other details behind the scenes. In the current OpenMP specification
Version 2.5, OpenMP places the following five restrictions on which
loops can be threaded:
1) The loop variable must
be of type signed integer. Unsigned integers will not work. (Note: this restriction is to be removed in
the future OpenMP specification Version 3.0).
2) The comparison operation
must be in the form loop_variable <, <=, >, or >=
loop_invariant_integer.
3) The third
expression or increment portion of the for loop must be either integer
addition or integer subtraction and by a loop-invariant value.
4) If the comparison
operation is < or <=, the loop variable must increment on every
iteration; conversely, if the comparison operation is > or >=,
the loop variable must decrement on every iteration.
5) The loop must be a
single entry and single exit loop, meaning no jumps from the inside of
the loop to the outside or outside to the inside are permitted with the
exception of the exit statement, which terminates the whole
application. If the statements goto or break are used, they must jump
within the loop, not outside it. The same goes for exception handling;
exceptions must be caught within the loop.
Although these restrictions may sound somewhat limiting, most loops
can easily be rewritten to conform to them. The restrictions listed
above must be observed so that the compiler can parallelize loops via
OpenMP. However, even when the compiler parallelizes the loop, you must
still ensure the loop is functionally correct by watching out for the
issues in the next section.
Challenges in Threading a
parallized loop
Threading a loop is to convert independent loop iterations to threads
and run these threads in parallel. In some sense, this is a re-ordering
transformation in which the original order of loop iterations can be
converted to into an undetermined order.
In addition, because the loop body is not an atomic operation,
statements in the two different iterations may run simultaneously. In
theory, it is valid to convert a sequential loop to a threaded loop if
the loop carries no dependence.
Therefore, the first challenge for you
is to identify or restructure the hot loop to make sure that it has no
loop-carried dependence before adding OpenMP pragmas.
Loop-carried
Dependence.
Even if the loop meets all five loop criteria and the compiler threaded
the loop, it may still not work correctly, given the existence of data
dependencies that the compiler ignores due to the presence of OpenMP
pragmas.
The theory of data dependence imposes two requirements that must be
met for a statement S2 and to be data dependent on statement S1:
1) There must exist a
possible execution path such that statement S1 and S2 both reference
the same memory location L.
2) The execution of S1 that
references L occurs before the execution of S2 that references L.
In order for S2 to depend upon S1, it is necessary for some
execution of S1 to write to a memory location L that is later read by
an execution of S2. This is also called flow dependence.
Other dependencies exist when two statements write the same memory
location L, called an output dependence, or a read occurs before a
write, called an anti-dependence. This pattern can occur in one of two
ways:
1) S1 can reference the
memory location L on one iteration of a loop; on a subsequent iteration
S2 can reference the same memory location L.
2) S1 and S2 can reference
the same memory location L on the same loop iteration, but with S1
preceding S2 during execution of the loop iteration.
The first case is an example of loop-carried dependence, since the
dependence exists when the loop is iterated. The second case is an
example of loop-independent dependence; the dependence exists because
of the position of the code within the loops.
Table 6.1 below shows three
cases of loop-carried dependences with dependence distance d, where 1
<_ d >_ n, and n is the loop upper bound.
 |
| Table
6.1 The Different Cases of Loop-carried Dependences |
Let's take a look at the following example where d = 1 and n = 99.
The write operation is to location x[k] at iteration k in S1, and a
read from it at iteration k+1 in S2, thus a loop-carried flow
dependence occurs.
Furthermore, with the read from location y[k"1] at iteration k in
S1, a write to it is performed at iteration k+1 in S2, hence, the
loop-carried anti-dependence exists. In this case, if a parallel for
pragma is inserted for threading this loop, you will get a wrong
result.

Because OpenMP directives are commands to the compiler, the compiler
will thread this loop. However, the threaded code will fail because of
loop-carried dependence. The only way to fix this kind of problem is to
rewrite the loop or to pick a different algorithm that does not contain
the loop-carried dependence.
With this example, you can first predetermine the initial value of
x[49] and y[49]; then, you can apply the loop strip-mining technique to
create a loop-carried dependence-free loop m.
Finally, you can insert the parallel for to parallelize the loop m.
By applying this transformation, the original loop can be executed by
two threads on a dual-core processor system.
Besides using parallel for pragma, for the same example, you
can also use the parallel sections pragma to parallelize the original
open loop that has loop-carried dependence for a dual core processor
system.

Data-race
Conditions. Some
data-race conditions are often
due to output dependences, in which
multiple threads attempt to update the same memory location, or
variable, after threading.
In general, the OpenMP C++ and Fortran compilers do honor OpenMP
pragmas or directives while encountering them during compilation phase,
however, the compiler does not perform or ignores the detection of
data-race conditions.
Thus, a loop similar to the following example, in which multiple
threads are updating the variable x will lead to undesirable results.
In such a situation, the code needs to be modified via privatization or
synchronized using mechanisms like mutexes.
For example, you can simply add the private(x) clause to the
parallel for pragma to eliminate the data-race condition on variable x
for this loop.

As discussed previously, data-race conditions can sometimes be
difficult to spot; that is, more difficult than shown in this example.
When using the full thread synchronization tools in the Windows API or
in Pthreads,
developers are
more likely to avoid these issues because
data is designed from the start to be managed with threading contention
and race conditions in mind.
However, when using OpenMP, it is easier to overlook data-race
conditions. One tool that helps identify such situations is Intel
Thread Checker, which is an add-on to Intel VTune Performance Analyzer.
Managing Shared and Private Data
In writing multithreaded programs, understanding which data is shared
and which is private becomes extremely important, not only to
performance, but also for program correctness.
OpenMP makes this distinction apparent to the programmer through a
set of clauses such as shared, private, and default, and it is
something that you can set manually. With OpenMP, it is the developer's
responsibility to indicate to the compiler which pieces of memory
should be shared among the threads and which pieces should be kept
private.
When memory is identified as shared, all threads access the exact
same memory location. When memory is identified as private, however, a
separate copy of the variable is made for each thread to access in
private. When the loop exits, these private copies become undefined. By
default, all the variables in a parallel region are shared, with three
exceptions.
1) In parallel for loops, the
loop index is private. In the next
example, the k variable is private.
2) Variables that are local to
the block of the parallel region
are private.
3) Any variables listed in the
private,
firstprivate, lastprivate, or reduction clauses are private. The
privatization is done by making a distinct copy of each of these
variables for each thread.
Each of the four clauses takes a
list of
variables, but their
semantics are all different. The private clause says that each variable
in the list should have a private copy made for each thread.
This
private copy is initialized with its default value, using its
default constructor where appropriate. For example, the default value
for variables of type int is 0. In OpenMP, memory can be declared as
private in the following three ways:
1) Use the private,
firstprivate, lastprivate, or reduction clause
to specify variables that need to be private for each thread.
2) Use the threadprivate
pragma to specify the global variables that
need to be private for each thread.
3) Declare the variable
inside the loop - really inside the OpenMP
parallel region - without the static keyword. Because static variables
are statically allocated in a designated memory area by the compiler
and linker, they are not truly private like other variables declared
within a function, which are allocated within the stack frame for the
function.
The following loop fails to function correctly because the variable
x is shared. It needs to be private. Given example below, it fails due
to the loop-carried output dependence on the variable x.
The x is shared among all threads based on OpenMP default shared
rule, so there is a data-race condition on the x while one thread is
reading x, another thread might be writing to it

This problem can be fixed in either of the following two ways, which
both declare the variable x as private memory.

Every time you use OpenMP to parallelize a loop, you should
carefully examine all memory references, including the references made
by called functions.
Next in Part 2: Loop scheduling,
partitioning and thread overhead
This article was excerpted from Multi-Core
Programming by Shameem
Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All
rights reserved.
Shameem Akhter is a platform
architect at Intel Corporation,
focusing on single socket multi-core architecture and performance
analysis. Jason Roberts is a senior software engineer at Intel, and has
worked on a number of different multi-threaded software products that
span a wide range of applications targeting desktop, handheld, and
embedded DSP platforms.
To read more on
Embedded.com about
the topics discussed in this article, go to "More
about multicores,
multiprocessors and multithreading."