By Shameem Akhter and Jason Roberts, Intel Corp.
In writing multithreaded programs, understanding which data is shared
and which is private becomes extremely important, not only to
performance, but also for program correctness.
OpenMP makes this distinction apparent to the programmer through a
set of clauses such as shared, private, and default, and it is
something that you can set manually.
With OpenMP, it is the developer's responsibility to indicate to the
compiler which pieces of memory should be shared among the threads and
which pieces should be kept private.
When memory is identified as shared, all threads access the exact
same memory location. When memory is identified as private, however, a
separate copy of the variable is made for each thread to access in
private. When the loop exits, these private copies become undefined.
By default, all the variables in a parallel region are shared, with
three exceptions. First, in parallel for loops, the loop index is
private. In the next example, the k variable is private. Second,
variables that are local to the block of the parallel region are
private. And third, any variables listed in the private, firstprivate,
lastprivate, or reduction clauses are private. The privatization is
done by making a distinct copy of each of these variables for each
thread.
Each of the four clauses takes a list of variables, but their
semantics are all different. The private clause says that each variable
in the list should have a private copy made for each thread.
Declaring Memory Private
This private copy is initialized with its default value, using its
default constructor where appropriate. For example, the default value
for variables of type int is 0. In OpenMP, memory can be declared as
private in the following three ways:
1) Use the private,
firstprivate, lastprivate, or reduction clause to specify variables
that need to be private for each thread.
2) Use the threadprivate
pragma to specify the global variables that need to be private for each
thread.
3) Declare the variable
inside the loop - really inside the OpenMP parallel region - without
the static keyword. Because static variables are statically allocated
in a designated memory area by the compiler and linker, they are not
truly private like other variables declared within a function, which
are allocated within the stack frame for the function.
The following loop fails to function correctly because the variable
x is shared. It needs to be private. Given example below, it fails due
to the loop-carried output dependence on the variable x. The x is
shared among all threads based on OpenMP default shared rule, so there
is a data-race condition on the x while one thread is reading x,
another thread might be writing to it
This problem can be fixed in either of the following two ways, which
both declare the variable x as private memory.
Every time you use OpenMP to parallelize a loop, you should
carefully examine all memory references, including the references made
by called functions.
Loop Scheduling and Partitioning
To have good load balancing and thereby achieve optimal performance in
a multithreaded application, you must have effective loop scheduling
and partitioning.
The ultimate goal is to ensure that the execution cores are busy
most, if not all, of the time, with minimum overhead of scheduling,
context switching and synchronization. With a poorly balanced workload,
some threads may finish significantly before others, leaving processor
resources idle and wasting performance opportunities.
In order to provide an easy way for you to adjust the workload among
cores, OpenMP offers four scheduling schemes that are appropriate for
many situations: static, dynamic, runtime, and guided. The Intel C++
and Fortran compilers support all four of these scheduling schemes.
A poorly balanced workload is often caused by variations in compute
time among loop iterations. It is usually not too hard to determine the
variability of loop iteration compute time by examining the source
code.
In most cases, you will see that loop iterations consume a uniform
amount of time. When that's not true, it may be possible to find a set
of iterations that do consume similar amounts of time.
For example, sometimes the set of all even iterations consumes about
as much time as the set of all odd iterations, or the set of the first
half of the loop consumes about as much time as the second half. On the
other hand, it may be impossible to find sets of loop iterations that
have a uniform execution time.
In any case, you can provide loop scheduling information via the
schedule(kind [, chunksize]) clause, so that the compiler and runtime
library can better partition and distribute the iterations of the loop
across the threads, and therefore the cores, for optimal load
balancing.
Static-even scheduling
By default, an OpenMP parallel for or worksharing for loop uses
static-even scheduling. This means the iterations of a loop are
distributed among the threads in a roughly equal number of iterations.
If m iterations and N threads are in the thread team, each thread gets
m/N iterations, and the compiler and runtime library correctly handles
the case when m is not evenly divisible by N.
With the static-even scheduling scheme, you could minimize the
chances of memory conflicts that can arise when more than one processor
is trying to access the same piece of memory.
This approach is workable because loops generally touch memory
sequentially, so splitting up the loop into large chunks results in
little chance of overlapping memory and a reasonable chance of good
processor cache efficiency. Consider the following simple loop when
executed using static-even scheduling and two threads.
#pragma omp parallel for for ( k =
0; k < 1000; k++ ) do_work(k);
OpenMP will execute loop iterations 0 to 499 on one thread and 500
to 999 on the other thread. While this partition of work might be a
good choice for memory issues, it could be bad for load balancing.
Unfortunately, the converse is also true: what might be good for load
balancing could be bad for memory performance.
Therefore, performance engineers must strike a balance between
optimal memory usage and optimal load balancing by measuring
performance to see what method produces the best results.
Loop-scheduling and partitioning information is conveyed to the
compiler and runtime library on the OpenMP for construct with the
schedule clause.
#pragma omp for schedule(kind [,
chunk-size])
The four schedule schemes specified in the OpenMP standard are
summarized in Table 6.2 below.
The optional parameter chunk-size, when specified, must be a
loop-invariant positive integer constant or integer expression.
Be careful when you adjust the chunk size, because performance can
be adversely affected. As the chunk size shrinks, the number of times a
thread needs to retrieve work from the work queue increases. As a
result, the overhead of going to the work queue increases, thereby
reducing performance and possibly offsetting the benefits of load
balancing.
 |
| Table
6.2 The Four Schedule Schemes in OpenMP |
For dynamic scheduling, the chunks are handled with the first-come,
first-serve scheme, and the default chunk size is 1. Each time, the
number of iterations grabbed is equal to the chunk size specified in
the schedule clause for each thread, except the last chunk.
After a thread has finished executing the iterations given to it, it
requests another set of chunk-size iterations. This continues until all
of the iterations are completed. The last set of iterations may be less
than the chunk size.
For example, if the chunk size is specified as 16 with the
schedule(dynamic,16) clause and the total number of iterations is 100,
the partition would be 16,16,16,16,16,16,4 with a total of seven
chunks.
For guided scheduling, the partitioning of a loop is done based on
the following formula with a start value of B0 = number of
loop iterations.
where N is the number of threads, (pi)k denotes the size
of the
(pi)'th chunk, starting from the 0'th chunk, and (pi)k
denotes the
number of remaining unscheduled loop iterations while computing the
size of k'th chunk.
When (pi)k gets too small, the value gets clipped to the chunk size
S that is specified in the schedule (guided, chunk-size) clause. The
default chunk size setting is 1, if it is not specified in the schedule
clause. Hence, for the guided scheduling, the way a loop is partitioned
depends on the number of threads (N), the number of iterations (B0) and the chunk
size (S).
For example, given a loop with B0 =
800, N = 2, and S = 80, the loop partition is {200, 150, 113, 85, 80,
80, 80, 12}. When (pi)4 is smaller than 80, it gets clipped
to 80.
When the number of remaining unscheduled iterations is smaller than
S, the upper bound of the last chunk is trimmed whenever it is
necessary. The guided scheduling supported in the Intel C++ and Fortran
compilers are a compliant implementation specified in the OpenMP Version 2.5 standard.
Dynamic and guided scheduling
With dynamic and guided scheduling mechanisms, you can tune your
application to deal with those situations where each iteration has
variable amounts of work or where some cores (or processors) are faster
than others. Typically, guided scheduling performs better than dynamic
scheduling due to less overhead associated with scheduling.
The runtime scheduling scheme is actually not a scheduling scheme
per se. When runtime is specified in the schedule clause, the OpenMP
runtime uses the scheduling scheme specified in the OMP_SCHEDULE
environment variable for this particular for loop. The format for the
OMP_SCHEDULE environment variable is schedule-type[,chunk-size]. For
example:
export OMP_SCHEDULE=dynamic,16
Using runtime scheduling gives the end-user some flexibility in
selecting the type of scheduling dynamically among three previously
mentioned scheduling mechanisms through the OMP_SCHEDULE environment
variable, which is set to static by default.
Furthermore, understanding the loop scheduling and partitioning
schemes will significantly help you to choose the right scheduling
scheme, help you to avoid false-sharing for your applications at
runtime, and lead to good load balancing. Considere the following
example:

Assume you have a dual-core processor system and the cache line
size is 64 bytes. For the sample code shown above, two chunks (or array
sections) can be in the same cache line because the chunk size is set
to 8 in the schedule clause. So each chunk of array x takes 32 bytes
per cache line, which leads to two chunks placed in the same cache
line.
Because two chunks can be read and written by two threads at the
same time, this will result in many cache line invalidations, although
two threads do not read/write the same chunk. This is called
false-sharing, as it is not necessary to actually share the same cache
line between two threads.
A simple tuning method is to use schedule(dynamic,16),
so one chunk
takes the entire cache line to eliminate the false-sharing. Eliminating
false-sharing through the use of a chunk size setting that is aware of
cache line size will significantly improve your application
performance.
Effective Use of Reductions
In large applications, you can often see the reduction operation inside
hot loops. Loops that reduce a collection of values to a single value
are fairly common. Consider the following simple loop that calculates
the sum of the return value of the integer-type function call func(k)
with the loop index value as input data.
sum = 0;
for ( k = 0; k < 100; k++ ){
sum = sum +
func(k); // "func" has no side-effects
}
It looks as though the loop-carried dependence on sum would prevent
threading. However, if you have a dual-core processor system, you can
perform the privatization - that is, create a stack variable "temp"
from which memory is allocated from automatic storage for each thread -
and perform loop partitioning to sum up the value of two sets of calls
in parallel, as shown in the following example.

At the synchronization point, you can combine the partial sum
results from each thread to generate the final sum result. In order to
perform this form of recurrence calculation in parallel, the operation
must be mathematically associative and commutative.
You may notice that the variable sum in the original sequential loop
must be shared to guarantee the correctness of the multithreaded
execution of the loop, but it also must be private to permit access by
multiple threads using a lock or a critical section for the atomic
update on the variable sum to avoid data-race condition.
To solve the problem of both sharing and protecting sum without
using a lock inside the threaded loop, OpenMP provides the reduction
clause that is used to efficiently combine certain associative
arithmetical reductions of one or more variables in a loop. The
following loop uses the reduction clause to generate the correct
results.

Given the reduction clause, the compiler creates private copies of
the variable sum for each thread, and when the loop completes, it adds
the values together and places the result in the original variable sum.
Other reduction operators
Other reduction operators besides "+" exist.
Table 6.3 below lists those C++
reduction operators specified in the OpenMP standard, along with the
initial values - which are also the mathematical identity value - for
the temporary private variables. You can also find a list of Fortran
reduction operators along with their initial values in OpenMP
specification Version 2.5.
 |
| Table
6.3 |
For each variable specified in a reduction clause, a private copy
is created, one for each thread, as if the private clause is used. The
private copy is then initialized to the initialization value for the
operator, as specified in Table 6.3
above.
At the end of the region or the loop for which the reduction clause
was specified, the original reduction variable is updated by combining
its original value with the final value of each of the private copies,
using the operator specified.
While identifying the opportunities to explore the use of the
reduction clause for threading, you should keep the following three
points in mind:
1) The value of the original
reduction variable becomes undefined when the first thread reaches the
region or loop that specifies the reduction clause and remains so until
the reduction computation is completed.
2) If the reduction clause
is used on a loop to which the nowait is also applied, the value of
original reduction variable remains undefined until a barrier
synchronization is performed to ensure that all threads have completed
the reduction.
3) The order in which the
values are combined is unspecified. Therefore, comparing sequential and
parallel runs, even between two parallel runs, does not guarantee that
bit-identical results will be obtained or that side effects, such as
floating-point exceptions, will be identical.
Minimizing Threading Overhead
Using OpenMP, you can parallelize loops, regions, and sections or
straight-line code blocks, whenever dependences do not forbids them
being executed in parallel.
In addition, because OpenMP employs the simple fork-join execution
model, it allows the compiler and run-time library to compile and run
OpenMP programs efficiently with lower threading overhead. However, you
can improve your application performance by further reducing threading
overhead.
Table 6.4 below provides
measured costs of a set of OpenMP constructs and clauses on a 4-way
Intel Xeon processor-based system running at 3.0 gigahertz with the
Intel compiler and runtime library. You can see that the cost for each
construct or clause is small.
Most of them are less than 7 microseconds except the
schedule(dynamic) clause. The schedule (dynamic) clause takes 50
microseconds, because its default chunk size is 1, which is too small.
If you use schedule(dynamic,16), its cost is reduced to 5.0
microseconds. Note that all measured costs are subject to change if you
measure these costs on a different processor or under a different
system configuration.
The key point is that no matter how well the compiler and runtime
are developed and tuned to minimize the overhead of OpenMP constructs
and clauses, you can always find ways to reduce the overhead by
exploring the use of OpenMP in a more effective way.
 |
| Table
6.4 |
Earlier, you saw how the parallel for pragma could be used to split
the iterations of a loop across multiple threads. When the compiler
generated thread is executed, the iterations of the loop are
distributed among threads. At the end of the parallel region, the
threads are suspended and they wait for the next parallel region, loop,
or sections.
A suspend or resume operation, while significantly lighter weight
than create or terminate operations, still creates overhead and may be
unnecessary when two parallel regions, loops, or sections are adjacent
as shown in the following example.

The overhead can be removed by entering a parallel region once,
then dividing the work within the parallel region. The following code
is functionally identical to the preceding code but runs faster,
because the overhead of entering a parallel region is performed only
once.

Ideally, all performance-critical portions of the application would
be executed within a parallel region. Since very few programs are
comprised only of loops, additional constructs are used to handle
non-loop code. A work-sharing section is one such construct.
Work-sharing Sections
The work-sharing sections construct directs the OpenMP compiler and
runtime to distribute the identified sections of your application among
threads in the team created for the parallel region. The following
example uses work-sharing for loops and work-sharing sections together
within a single parallel region. In this case, the overhead of forking
or resuming threads for parallel sections is eliminated.

Here, OpenMP first creates several threads. Then, the iterations of
the loop are divided among the threads. Once the loop is finished, the
sections are divided among the threads so that each section is executed
exactly once, but in parallel with the other sections.
If the program contains more sections than threads, the remaining
sections get scheduled as threads finish their previous sections.
Unlike loop scheduling, the schedule clause is not defined for
sections.
Therefore, OpenMP is in complete control of how, when, and in what
order threads are scheduled to execute the sections. You can still
control which variables are shared or private, using the private and
reduction clauses in the same fashion as the loop construct.
Next in Part 3: Performance-oriented
Programming
To read Part 1 go to The
challenges of threading a loop
This
article was excerpted from Multi-Core
Programming by Shameem Akhter and Jason Roberts.
Copyright © 2006 Intel Corporation. All rights reserved.
Shameem Akhter is a platform
architect at Intel
Corporation, focusing on single socket multi-core architecture and
performance analysis. Jason Roberts is a senior software engineer at
Intel, and has worked on a number of different multi-threaded software
products that span a wide range of applications targeting desktop,
handheld, and embedded DSP platforms.
To read more on
Embedded.com about the topics discussed in this article, go to "More
about multicores, multiprocessors and multithreading."