Use OpenMP for programming parallel threads in multicore applications: Part 1

Shameem Akhter and Jason Roberts, Intel Corp.

August 15, 2007

Shameem Akhter and Jason Roberts, Intel Corp.August 15, 2007

The major CPU vendors are shifting gears, choosing to add parallelism support on-chip with multi-core processors in order to avoid many of the technological hurdles in boosting speeds, while still offering a better performing processor.

However, if your software does not take advantage of these multiple cores, it may not run any faster. That is where OpenMP plays a key role by providing an easy method for threading applications without burdening the programmer with the complications of creating, synchronizing, load balancing, and destroying threads.

The OpenMP standard was formulated in 1997 as an API for writing portable, multithreaded applications. It started as a Fortran-based standard, but later grew to include C and C++. The current version is OpenMP Version 2.5, which supports Fortran, C, and C++.

Intel C++ and Fortran compilers support the OpenMP Version 2.5 standard. The OpenMP programming model provides a platform-independent set of compiler pragmas, directives, function calls, and environment variables that explicitly instruct the compiler how and where to use parallelism in the application.

Many loops can be threaded by inserting only one pragma right before the loop, as demonstrated by examples in this chapter. By leaving the nitty-gritty details to the compiler and OpenMP runtime library, you can spend more time determining which loops should be threaded and how to best restructure the algorithms for performance on multi-core processors. The full potential of OpenMP is realized when it is used to thread the most time-consuming loops; that is, the hot spots.

Tackling the topic of OpenMP is an intimidating task. Therefore, this series of articles serves as a bridge for you, allowing you to reach a point where you have a fundamental understanding of threading with OpenMP from which you can build your broader practical knowledge.

The power and simplicity of OpenMP can be demonstrated by looking at an example. The following loop converts each 32-bit RGB (red, green, blue) pixel in an array into an 8-bit grayscale pixel. The one pragma, which has been inserted immediately before the loop, is all that is needed for parallel execution under OpenMP.

Let's take a closer look at the loop. First, the example uses work-sharing, which is the general term that OpenMP uses to describe distributing work across threads. When work-sharing is used with the for construct, as shown in this example, the iterations of the loop are distributed among multiple threads.

The OpenMP implementation determines how many threads to create and how best to manage them. All the programmer needs to do is to tell OpenMP which loop should be threaded. No need for programmers to add a lot of codes for creating, initializing, managing, and killing threads in order to exploit parallelism.

OpenMP compiler and runtime library take care of these and many other details behind the scenes. In the current OpenMP specification Version 2.5, OpenMP places the following five restrictions on which loops can be threaded:

1) The loop variable must be of type signed integer. Unsigned integers will not work. (Note: this restriction is to be removed in the future OpenMP specification Version 3.0).

2) The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_integer.

3)  The third expression or increment portion of the for loop must be either integer addition or integer subtraction and by a loop-invariant value.

4) If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.

5) The loop must be a single entry and single exit loop, meaning no jumps from the inside of the loop to the outside or outside to the inside are permitted with the exception of the exit statement, which terminates the whole application. If the statements goto or break are used, they must jump within the loop, not outside it. The same goes for exception handling; exceptions must be caught within the loop.

Although these restrictions may sound somewhat limiting, most loops can easily be rewritten to conform to them. The restrictions listed above must be observed so that the compiler can parallelize loops via OpenMP. However, even when the compiler parallelizes the loop, you must still ensure the loop is functionally correct by watching out for the issues in the next section.

Challenges in Threading a parallized loop
Threading a loop is to convert independent loop iterations to threads and run these threads in parallel. In some sense, this is a re-ordering transformation in which the original order of loop iterations can be converted to into an undetermined order.

In addition, because the loop body is not an atomic operation, statements in the two different iterations may run simultaneously. In theory, it is valid to convert a sequential loop to a threaded loop if the loop carries no dependence.

Therefore, the first challenge for you is to identify or restructure the hot loop to make sure that it has no loop-carried dependence before adding OpenMP pragmas.

Loop-carried Dependence. Even if the loop meets all five loop criteria and the compiler threaded the loop, it may still not work correctly, given the existence of data dependencies that the compiler ignores due to the presence of OpenMP pragmas.

The theory of data dependence imposes two requirements that must be met for a statement S2 and to be data dependent on statement S1:

1) There must exist a possible execution path such that statement S1 and S2 both reference the same memory location L.

2) The execution of S1 that references L occurs before the execution of S2 that references L.

In order for S2 to depend upon S1, it is necessary for some execution of S1 to write to a memory location L that is later read by an execution of S2. This is also called flow dependence.

Other dependencies exist when two statements write the same memory location L, called an output dependence, or a read occurs before a write, called an anti-dependence. This pattern can occur in one of two ways:

1) S1 can reference the memory location L on one iteration of a loop; on a subsequent iteration S2 can reference the same memory location L.

2) S1 and S2 can reference the same memory location L on the same loop iteration, but with S1 preceding S2 during execution of the loop iteration.

The first case is an example of loop-carried dependence, since the dependence exists when the loop is iterated. The second case is an example of loop-independent dependence; the dependence exists because of the position of the code within the loops.

Table 6.1 below shows three cases of loop-carried dependences with dependence distance d, where 1 <_ d >_ n, and n is the loop upper bound.

Table 6.1 The Different Cases of Loop-carried Dependences

Let's take a look at the following example where d = 1 and n = 99. The write operation is to location x[k] at iteration k in S1, and a read from it at iteration k+1 in S2, thus a loop-carried flow dependence occurs.

Furthermore, with the read from location y[k"1] at iteration k in S1, a write to it is performed at iteration k+1 in S2, hence, the loop-carried anti-dependence exists. In this case, if a parallel for pragma is inserted for threading this loop, you will get a wrong result.

Because OpenMP directives are commands to the compiler, the compiler will thread this loop. However, the threaded code will fail because of loop-carried dependence. The only way to fix this kind of problem is to rewrite the loop or to pick a different algorithm that does not contain the loop-carried dependence.

With this example, you can first predetermine the initial value of x[49] and y[49]; then, you can apply the loop strip-mining technique to create a loop-carried dependence-free loop m.

Finally, you can insert the parallel for to parallelize the loop m. By applying this transformation, the original loop can be executed by two threads on a dual-core processor system.

Besides using parallel for  pragma, for the same example, you can also use the parallel sections pragma to parallelize the original open loop that has loop-carried dependence for a dual core processor system.

Data-race Conditions.
Some data-race conditions are often due to output dependences, in which multiple threads attempt to update the same memory location, or variable, after threading.

In general, the OpenMP C++ and Fortran compilers do honor OpenMP pragmas or directives while encountering them during compilation phase, however, the compiler does not perform or ignores the detection of data-race conditions.

Thus, a loop similar to the following example, in which multiple threads are updating the variable x will lead to undesirable results. In such a situation, the code needs to be modified via privatization or synchronized using mechanisms like mutexes.

For example, you can simply add the private(x) clause to the parallel for pragma to eliminate the data-race condition on variable x for this loop.

As discussed previously, data-race conditions can sometimes be difficult to spot; that is, more difficult than shown in this example. When using the full thread synchronization tools in the Windows API or in Pthreads, developers are more likely to avoid these issues because data is designed from the start to be managed with threading contention and race conditions in mind.

However, when using OpenMP, it is easier to overlook data-race conditions. One tool that helps identify such situations is Intel Thread Checker, which is an add-on to Intel VTune Performance Analyzer.

Managing Shared and Private Data
In writing multithreaded programs, understanding which data is shared and which is private becomes extremely important, not only to performance, but also for program correctness.

OpenMP makes this distinction apparent to the programmer through a set of clauses such as shared, private, and default, and it is something that you can set manually. With OpenMP, it is the developer's responsibility to indicate to the compiler which pieces of memory should be shared among the threads and which pieces should be kept private.

When memory is identified as shared, all threads access the exact same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread to access in private. When the loop exits, these private copies become undefined. By default, all the variables in a parallel region are shared, with three exceptions.

1) In parallel for loops, the loop index is private. In the next example, the k variable is private.
2) Variables that are local to the block of the parallel region are private.
3) Any variables listed in the private, firstprivate, lastprivate, or reduction clauses are private. The privatization is done by making a distinct copy of each of these variables for each thread.

Each of the four clauses takes a list of variables, but their semantics are all different. The private clause says that each variable in the list should have a private copy made for each thread.

This private copy is initialized with its default value, using its default constructor where appropriate. For example, the default value for variables of type int is 0. In OpenMP, memory can be declared as private in the following three ways:

1) Use the private, firstprivate, lastprivate, or reduction clause to specify variables that need to be private for each thread.

2) Use the threadprivate pragma to specify the global variables that need to be private for each thread.

3) Declare the variable inside the loop - really inside the OpenMP parallel region - without the static keyword. Because static variables are statically allocated in a designated memory area by the compiler and linker, they are not truly private like other variables declared within a function, which are allocated within the stack frame for the function.

The following loop fails to function correctly because the variable x is shared. It needs to be private. Given example below, it fails due to the loop-carried output dependence on the variable x.

The x is shared among all threads based on OpenMP default shared rule, so there is a data-race condition on the x while one thread is reading x, another thread might be writing to it

This problem can be fixed in either of the following two ways, which both declare the variable x as private memory.

Every time you use OpenMP to parallelize a loop, you should carefully examine all memory references, including the references made by called functions.

To read Part 2, go to Loop scheduling, partitioning and thread overhead
To read Part 3, go to Performance-oriented Programming
To read Part 4, go to OpenMP Library Functions

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on about the topics discussed in this article, go to "More about multicores, multiprocessors and multithreading."

Loading comments...