By Shameem Akhter and Jason Roberts, Intel Corp.
OpenMP provides a set of important pragmas and runtime functions that
enable thread synchronization and related actions to facilitate correct
parallel programming. Using these pragmas and runtime functions
effectively with minimum overhead and thread waiting time is extremely
important for achieving optimal performance from your applications.
Using Barrier and Nowait
Barriers are a form of synchronization method that OpenMP employs to
synchronize threads. Threads will wait at a barrier until all the
threads in the parallel region have reached the same point.
You have been using implied barriers without realizing it in the
work-sharing for and work-sharing sections constructs. At the end of
the parallel, for, sections, and single constructs, an implicit barrier
is generated by the compiler or invoked in the runtime library.
The barrier causes execution to wait for all threads to finish the
work of the loop, sections, or region before any go on to execute
additional work. This barrier can be removed with the nowait clause, as
shown in the following code sample.
In this example, since data is not dependent between the first
work-sharing for loop and the second work-sharing sections code block,
the threads that process the first work-sharing for loop continue
immediately to the second work-sharing sections without waiting for all
threads to finish the first loop.
Depending upon your situation, this behavior may be beneficial,
because it can make full use of available resources and reduce the
amount of time that threads are idle. The nowait clause can also be
used with the work-sharing sections construct and single construct to
remove its implicit barrier at the end of the code block.
Adding an explicit barrier is also supported by OpenMP as shown in
the following example through the barrier pragma.

In this example, the OpenMP code is to be executed by two threads;
one thread writes the result to the variable y, and another thread
writes the result to the variable z. Both y and z are read in the
work-sharing for loop, hence, two flow dependences exist.
In order to obey the data dependence constraints in the code for
correct threading, you need to add an explicit barrier pragma right
before the work-sharing for loop to guarantee that the value of both y
and z are ready for read.
In real applications, the barrier pragma is especially useful when
all threads need to finish a task before any more work can be
completed, as would be the case, for example, when updating a graphics
frame buffer before displaying its contents.
Interleaving Single-thread and
Multi-thread Execution
In large real-world applications, a program may consist of both serial
and parallel code segments due to various reasons such as data
dependence constraints and I/O operations.
A need to execute something only once by only one thread will
certainly be required within a parallel region, especially because you
are making parallel regions as large as possible to reduce overhead.
To handle the need for single-thread execution, OpenMP provides a
way to specify that a sequence of code contained within a parallel
section should only be executed one time by only one thread.
The OpenMP runtime library decides which single thread will do the
execution. If need be, you can specify that you want only the master
thread, the thread that actually started the program execution, to
execute the code, as in the following example.
As can be seen from the comments in this code, a remarkable amount
of synchronization and management of thread execution is available in a
comparatively compact lexicon of pragmas.
Note that all low-level details are taken care of by the OpenMP
compiler and runtime. What you need to focus on is to specify parallel
computation and synchronization behaviors you expected for correctness
and performance.
In other words, using single and master pragmas along with the
barrier pragma and nowait clause in a clever way, you should be able to
maximize the scope of a parallel region and the overlap of computations
to reduce threading overhead effectively, while obeying all data
dependences and I/O constraints in your programs.
Data Copy-in and Copy-out
When you parallelize a program, you would normally have to deal with
how to copy in the initial value of a private variable to initialize
its private copy for each thread in the team.
You would also copy out the value of the private variable computed
in the last iteration/section to its original variable for the master
thread at the end of parallel region.
OpenMP standard provides four clauses - firstprivate,
lastprivate, copyin, and copyprivate - for you to accomplish the data
copy-in and copy-out operations whenever necessary based on your
program and parallelization scheme. The following descriptions
summarize the semantics of these four clauses:
* firstprivate
provides a way to initialize the value of a private variable for each
thread with the value of variable from the master thread. Normally,
temporary private variables have an undefined initial value saving the
performance overhead of the copy.
* lastprivate
provides a way to copy out the value of the private variable computed
in the last iteration/section to the copy of the variable in the master
thread. Variables can be declared both firstprivate and lastprivate at
the same time.
* copyin
provides a way to copy the master thread's threadprivate variable to
the threadprivate variable of each other member of the team executing
the parallel region.
* copyprivate
provides a way to use a private variable to broadcast a value from one
member of threads to other members of the team executing the parallel
region. The copyprivate clause is allowed to associate with the single
construct; the broadcast action is completed before any of threads in
the team left the barrier at the end of construct.
Considering the code example, let's see how it works. The following
code converts a color image to black and white.

The issue is how to move the pointers pGray and pRGB to the correct
place within the bitmap while threading the outer "row" loop. The
address computation for each pixel can be done with the following code:
pDestLoc = pGray + col + row *
GrayStride;
pSrcLoc = pRGB + col + row * RGBStride;
The above code, however, executes extra math on each pixel for the
address computation. Instead, the firstprivate clause can be used to
perform necessary initialization to get the initial address of pointer
pGray and pRGB for each thread.
You may notice that the initial addresses of the pointer pGray and
pRGB have to be computed only once based on the "row" number and their
initial addresses in the master thread for each thread; the pointer
pGray and pRGB are induction pointers and updated in the outer loop for
each "row" iteration.
This is the reason the bool-type variable doInit is introduced with
an initial value TRUE to make sure the initialization is done only once
for each to compute the initial address of pointer pGray and pRGB. The
parallelized code follows:

If you take a close look at this code, you may find that the four
variables GrayStride, RGBStride, height, and width are read-only
variables. In other words, no write operation is performed to these
variables in the parallel loop. Thus, you can also specify them on the
parallel for loop by adding the code below:
firstprivate (GrayStride,
RGBStride, height, width)
You may get better performance in some cases, as the privatization
helps the compiler to perform more aggressive registerization and code
motion as their loop invariants reduce memory traffic.
Protecting Updates of Shared
Variables
The critical and atomic pragmas are supported by the OpenMP standard
for you to protect the updating of shared variables for avoiding
data-race conditions. The code block enclosed by a critical section and
an atomic pragma are areas of code that may be entered only when no
other thread is executing in them. The following example uses an
unnamed critical section.
#pragma omp critical
{
if ( max <
new_value ) max = new_value
}
Global, or unnamed, critical sections will likely and unnecessarily
affect performance because every thread is competing to enter the same
global critical section, as the execution of every thread is
serialized. This is rarely what you want.
For this reason, OpenMP offers named critical sections. Named
critical sections enable fine-grained synchronization, so only the
threads that need to block on a particular section will do so.
The following example shows the code that improves the previous
example. In practice, named critical sections are used when more than
one thread is competing for more than one critical resource.
#pragma omp critical(maxvalue)
{
if ( max <
new_value ) max = new_value
}
With named critical sections, applications can have multiple
critical sections, and threads can be in more than one critical section
at a time. It is important to remember that entering nested critical
sections runs the risk of deadlock. The following code example code
shows a deadlock situation:

In the previous code, the dynamically nested critical sections are
used. When the function do_work is called inside a parallel loop,
multiple threads compete to enter the outer critical section.
The thread that succeeds in entering the outer critical section will
call the dequeue function; however, the dequeue function cannot make
any further progress, as the inner critical section attempts to enter
the same critical section in the do_work function.
Thus, the do_work function could never complete. This is a deadlock
situation. The simple way to fix the problem in the previous code is to
do the inlining of the dequeue function in the do_work function as
follows:

When using multiple critical sections, be very careful to examine
critical sections that might be lurking in subroutines. In addition to
using critical sections, you can also use the atomic pragma for
updating shared variables.
When executing code in parallel, it is impossible to know when an
operation will be interrupted by the thread scheduler. However,
sometimes you may require that a statement in a high-level language
complete in its entirety before a thread is suspended. For example, a
statement x++ is translated into a sequence of machine instructions
such as:
load reg, [x];
add reg 1;
store [x], reg;
It is possible that the thread is swapped out between two of these
machine instructions. The atomic pragma directs the compiler to
generate code to ensure that the specific memory storage is updated
atomically. The following code example shows a usage of the atomic
pragma.

An expression statement that is allowed to use the atomic pragma
must be with one of the following forms:
x binop = expr
x++
++x
x --
-- x
In the preceding expressions, x is an lvalue expression with scalar
type; expr is an expression with scalar type and does not reference the
object designed by x; binop is not an overloaded operator and is one of
+, *, -, /, &, ^, <<, or
>> for the C/C++
language.
It is worthwhile to point out that in the preceding code example,
the advantage of using the atomic pragma is that it allows update of
two different elements of array y to occur in parallel. If a critical
section were used instead, then all updates to elements of array y
would be executed serially, but not in a guaranteed order.
Furthermore, in general, the OpenMP compiler and runtime library
select the most efficient method to implement the atomic pragma given
operating system features and hardware capabilities. Thus, whenever it
is possible you should use the atomic pragma before using the critical
section in order to avoid data-race conditions on statements that
update a shared memory location.
Intel Taskqueuing Extension to
OpenMP
The Intel Taskqueuing extension to OpenMP allows a programmer to
parallelize control structures such as recursive function, dynamic-tree
search, and pointer-chasing while loops that are beyond the scope of
those supported by the current OpenMP model, while still fitting into
the framework defined by the OpenMP specification.
In particular, the taskqueuing model is a flexible programming model
for specifying units of work that are not pre-computed at the start of
the work-sharing construct. Take a look the following example.
The parallel taskq pragma directs the compiler to generate code to
create a team of threads and an environment for the while loop to
enqueue the units of work specified by the enclosed task pragma.
The loop's control structure and the enqueuing are executed by one
thread, while the other threads in the team participate in dequeuing
the work from the taskq queue and executing it.
The captureprivate clause ensures that a private copy of the link
pointer p is captured at the time each task is being enqueued, hence
preserving the sequential semantics. The taskqueuing execution model is
shown in Figure 6.1 below.
 |
| Figure
6.1 Taskqueuing Execution Model |
Essentially, for any given program with parallel taskq constructs,
a team of threads is created by the runtime library when the main
thread encounters a parallel region.
The runtime thread scheduler chooses one thread TK to execute
initially from all the threads that encounter a taskq pragma. All the
other threads wait for work to be put on the task queue. Conceptually,
the taskq pragma triggers this sequence of actions:
1. Causes an empty queue to
be created by the chosen thread TK
2. Enqueues each task that it
encounters
3. Executes the code inside the
taskq block as a single thread
The task pragma specifies a unit of work, potentially to be executed
by a different thread. When a task pragma is encountered lexically
within a taskq block, the code inside the task block is placed on the
queue associated with the taskq pragma.
The conceptual queue is disbanded when all work enqueued on it
finishes and the end of the taskq block is reached. The Intel C++
compiler has been extended throughout its various components to support
the taskqueuing model for generating multithreaded codes corresponding
to taskqueuing constructs.
Next in Part 4: Library functions,
compilation and debugging
To read Part 1 go to The
challenges of threading a loop
To read Part 2, go to Managing Shared and
Private Data
This
article was excerpted from Multi-Core
Programming by Shameem Akhter and Jason Roberts.
Copyright © 2006 Intel Corporation. All rights reserved.
Shameem
Akhter is a platform
architect at Intel
Corporation, focusing on single socket multi-core architecture and
performance analysis. Jason Roberts is
a senior software engineer at
Intel, and has worked on a number of different multi-threaded software
products that span a wide range of applications targeting desktop,
handheld, and embedded DSP platforms.
To read more on
Embedded.com about the topics discussed in this article, go to "More
about multicores, multiprocessors and multithreading."