The major CPU vendors are shifting gears, choosing to add parallelism support on-chip with multi-core processors in order to avoid many of the technological hurdles of boosting clock speeds, while still offering a better-performing processor.
However, if your software does not take advantage of these multiple cores, it may not run any faster. That is where OpenMP plays a key role by providing an easy method for threading applications without burdening the programmer with the complications of creating, synchronizing, load balancing, and destroying threads.
The OpenMP standard was formulated in 1997 as an API for writing portable, multithreaded applications. It started as a Fortran-based standard, but later grew to include C and C++. The current version is OpenMP Version 2.5, which supports Fortran, C, and C++.
The Intel C++ and Fortran compilers support the OpenMP Version 2.5 standard. The OpenMP programming model provides a platform-independent set of compiler pragmas, directives, function calls, and environment variables that explicitly instruct the compiler how and where to use parallelism in the application.
Many loops can be threaded by inserting only one pragma right before the loop, as demonstrated by examples in this chapter. By leaving the nitty-gritty details to the compiler and the OpenMP runtime library, you can spend more time determining which loops should be threaded and how best to restructure the algorithms for performance on multi-core processors. The full potential of OpenMP is realized when it is used to thread the most time-consuming loops; that is, the hot spots.
Tackling the topic of OpenMP is an intimidating task. Therefore, this series of articles serves as a bridge for you, allowing you to reach a point where you have a fundamental understanding of threading with OpenMP from which you can build your broader practical knowledge.
The power and simplicity of OpenMP can be demonstrated by looking at an example. The following loop converts each 32-bit RGB (red, green, blue) pixel in an array into an 8-bit grayscale pixel. The one pragma, which has been inserted immediately before the loop, is all that is needed for parallel execution under OpenMP.
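A minimal sketch of such a loop is shown below; the struct and function names, and the standard luma weights used for the conversion, are illustrative rather than taken from the original listing.

```c
#include <assert.h>

typedef struct {
    unsigned char red, green, blue, alpha;  /* one 32-bit RGB pixel */
} rgb_pixel;

/* Convert each 32-bit RGB pixel into an 8-bit grayscale pixel. The single
 * pragma before the loop is all that OpenMP needs to distribute the
 * iterations of the loop across threads. */
void convert_to_grayscale(const rgb_pixel *rgb, unsigned char *gray,
                          int numPixels)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < numPixels; i++) {
        gray[i] = (unsigned char)(rgb[i].red   * 0.299 +
                                  rgb[i].green * 0.587 +
                                  rgb[i].blue  * 0.114);
    }
}
```

Compiled without OpenMP support, the pragma is simply ignored and the loop runs serially with identical results, which is one of OpenMP's conveniences.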
Let's take a closer look at the loop. First, the example uses work-sharing, which is the general term that OpenMP uses to describe distributing work across threads. When work-sharing is used with the for construct, as shown in this example, the iterations of the loop are distributed among multiple threads.
The OpenMP implementation determines how many threads to create and how best to manage them. All the programmer needs to do is to tell OpenMP which loop should be threaded. Programmers need not add a lot of code for creating, initializing, managing, and killing threads in order to exploit parallelism.
The OpenMP compiler and runtime library take care of these and many other details behind the scenes. The current OpenMP specification, Version 2.5, places the following five restrictions on which loops can be threaded:
1) The loop variable must be of type signed integer. Unsigned integers will not work. (Note: this restriction is to be removed in the future OpenMP specification, Version 3.0.)
2) The comparison operation must be of the form loop_variable <, <=, >, or >= loop_invariant_integer.
3) The third expression, or increment portion, of the for loop must be either integer addition or integer subtraction, by a loop-invariant value.
4) If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
5) The loop must be a single-entry, single-exit loop, meaning no jumps from the inside of the loop to the outside or from the outside to the inside are permitted, with the exception of the exit statement, which terminates the whole application. If the statements goto or break are used, they must jump within the loop, not outside it. The same goes for exception handling; exceptions must be caught within the loop.
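As a minimal sketch, the following loop (with illustrative names, not from the original text) satisfies all five restrictions and can be threaded with a single pragma.

```c
#include <assert.h>

/* Multiply every element of a[] by factor. The loop has a signed int
 * index (restriction 1), a '<' comparison against the loop-invariant
 * bound n (restriction 2), a unit increment (restrictions 3 and 4), and
 * a single entry and single exit (restriction 5). */
void scale_array(double *a, int n, double factor)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] *= factor;
}
```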
Although these restrictions may sound somewhat limiting, most loops can easily be rewritten to conform to them. The restrictions listed above must be observed so that the compiler can parallelize loops via OpenMP. However, even when the compiler parallelizes the loop, you must still ensure the loop is functionally correct by watching out for the issues in the next section.
Challenges in Threading a Parallelized Loop
Threading a loop means converting independent loop iterations into threads and running these threads in parallel. In some sense, this is a re-ordering transformation in which the original order of loop iterations is converted into an undetermined order.
In addition, because the loop body is not an atomic operation, statements in two different iterations may run simultaneously. In theory, it is valid to convert a sequential loop to a threaded loop if the loop carries no dependence.
Therefore, the first challenge for you is to identify or restructure the hot loop to make sure that it has no loop-carried dependence before adding OpenMP pragmas.
Loop-carried Dependence. Even if the loop meets all five loop criteria and the compiler threads the loop, it may still not work correctly, given the existence of data dependencies that the compiler ignores due to the presence of OpenMP pragmas.
The theory of data dependence imposes two requirements that must be met for a statement S2 to be data dependent on statement S1:
1) There must exist a possible execution path such that statements S1 and S2 both reference the same memory location L.
2) The execution of S1 that references L occurs before the execution of S2 that references L.
In order for S2 to depend upon S1, it is necessary for some execution of S1 to write to a memory location L that is later read by an execution of S2. This is also called flow dependence.
Other dependencies exist when two statements write the same memory location L, called an output dependence, or a read occurs before a write, called an anti-dependence. This pattern can occur in one of two ways:
1) S1 can reference the memory location L on one iteration of a loop; on a subsequent iteration S2 can reference the same memory location L.
2) S1 and S2 can reference the same memory location L on the same loop iteration, but with S1 preceding S2 during execution of the loop iteration.
The first case is an example of loop-carried dependence, since the dependence exists when the loop is iterated. The second case is an example of loop-independent dependence; the dependence exists because of the position of the code within the loops.
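The two cases can be sketched with a pair of illustrative loops (names are not from the original text); the first cannot be safely threaded as written, while the second can.

```c
#include <assert.h>

/* Case 1: a loop-carried flow dependence. Iteration k reads a[k-1],
 * which iteration k-1 wrote, so the iterations cannot be reordered. */
void prefix_add(int *a, int n)
{
    int k;
    for (k = 1; k < n; k++)
        a[k] = a[k-1] + a[k];
}

/* Case 2: a loop-independent dependence. S1 writes t and S2 reads it
 * within the same iteration; iterations remain independent of each
 * other (t would simply need to be private if the loop were threaded). */
void double_plus_one(const int *b, int *c, int n)
{
    int k;
    for (k = 0; k < n; k++) {
        int t = b[k] * 2;    /* S1 writes t */
        c[k] = t + 1;        /* S2 reads t in the same iteration */
    }
}
```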
Table 6.1 below shows three cases of loop-carried dependences with dependence distance d, where 1 ≤ d ≤ n, and n is the loop upper bound.
Table 6.1 The Different Cases of Loop-carried Dependences
| Case | Iteration k | Iteration k + d | Dependence |
| 1 | S1 writes location L | S2 reads location L | Loop-carried flow dependence |
| 2 | S1 reads location L | S2 writes location L | Loop-carried anti-dependence |
| 3 | S1 writes location L | S2 writes location L | Loop-carried output dependence |
Let's take a look at the following example, where d = 1 and n = 99. The write operation is to location x[k] at iteration k in S1, and a read from it occurs at iteration k+1 in S2; thus, a loop-carried flow dependence occurs.
Furthermore, with the read from location y[k-1] at iteration k in S1, a write to it is performed at iteration k+1 in S2; hence, the loop-carried anti-dependence exists. In this case, if a parallel for pragma is inserted to thread this loop, you will get a wrong result.
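A reconstruction consistent with the description above is shown here; the constants and initial values are illustrative assumptions, with S1 writing x[k] (read as x[k-1] by S2 on the next iteration) and S1 reading y[k-1] (the location S2 writes).

```c
#include <assert.h>

/* Do NOT thread this loop as written: inserting
 * "#pragma omp parallel for" would compile, but the loop-carried
 * dependences between iterations would make the result wrong. */
void dependent_loop(int x[100], int y[100])
{
    int k;
    x[0] = 0;
    y[0] = 1;
    for (k = 1; k < 100; k++) {
        x[k] = y[k-1] + 1;   /* S1 */
        y[k] = x[k-1] + 2;   /* S2 */
    }
}
```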
Because OpenMP directives are commands to the compiler, the compiler will thread this loop. However, the threaded code will fail because of the loop-carried dependence. The only way to fix this kind of problem is to rewrite the loop or to pick a different algorithm that does not contain the loop-carried dependence.
With this example, you can first predetermine the initial values of x and y; then, you can apply the loop strip-mining technique to create a loop-carried dependence-free loop m.
Finally, you can insert the parallel for pragma to parallelize loop m. By applying this transformation, the original loop can be executed by two threads on a dual-core processor system.
Besides using the parallel for pragma, for the same example you can also use the parallel sections pragma to parallelize the original loop that has the loop-carried dependence for a dual-core processor system.
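Assuming the recurrence sketched earlier (x[k] = y[k-1] + 1 and y[k] = x[k-1] + 2, with x[0] = 0 and y[0] = 1), both sequences advance by 3 every two iterations, so the mid-point values x[50] = 75 and y[50] = 76 can be predetermined from that closed form. With the boundary values in place, the two strips become independent and loop m over them can be threaded; the details below are an illustrative sketch, not the book's exact listing.

```c
#include <assert.h>

/* Strip-mine the 99-iteration recurrence into two independent strips.
 * The boundary values at index 50 are predetermined, so strip m = 1
 * never reads anything strip m = 0 writes. */
void compute_strips(int x[100], int y[100])
{
    int m;
    x[0]  = 0;   y[0]  = 1;
    x[50] = 75;  y[50] = 76;   /* predetermined from the closed form */
    #pragma omp parallel for
    for (m = 0; m < 2; m++) {
        int k;
        for (k = m * 50 + 1; k < m * 50 + 50; k++) {
            x[k] = y[k-1] + 1;
            y[k] = x[k-1] + 2;
        }
    }
}
```

Strip m = 0 computes indices 1 through 49, and strip m = 1 computes indices 51 through 99, so on a dual-core system each strip can run on its own core.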
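The same two independent strips can be expressed with parallel sections instead of parallel for; this sketch assumes the same predetermined boundary values discussed above.

```c
#include <assert.h>

/* Each section handles one strip; on a dual-core system the two
 * sections can execute on different cores. */
void compute_sections(int x[100], int y[100])
{
    x[0]  = 0;   y[0]  = 1;
    x[50] = 75;  y[50] = 76;   /* predetermined boundary values */
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            int k;
            for (k = 1; k < 50; k++) {
                x[k] = y[k-1] + 1;
                y[k] = x[k-1] + 2;
            }
        }
        #pragma omp section
        {
            int k;
            for (k = 51; k < 100; k++) {
                x[k] = y[k-1] + 1;
                y[k] = x[k-1] + 2;
            }
        }
    }
}
```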
Data-race Conditions. Data-race conditions are often due to output dependences, in which multiple threads attempt to update the same memory location, or variable, after threading.
In general, the OpenMP C++ and Fortran compilers do honor OpenMP pragmas and directives when they encounter them during the compilation phase; however, the compiler does not perform detection of data-race conditions.
Thus, a loop similar to the following example, in which multiple threads update the variable x, will lead to undesirable results. In such a situation, the code needs to be modified via privatization or synchronized using mechanisms such as mutexes.
For example, you can simply add the private(x) clause to the parallel for pragma to eliminate the data-race condition on variable x for this loop.
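A sketch of the pattern described above (the computation itself is illustrative): in the first loop x is shared by default, so one thread can overwrite it between another thread's write and read; adding private(x) gives each thread its own copy.

```c
#include <assert.h>

/* Racy version: x is shared among all threads by default. */
void racy(int *out, int n)
{
    int k, x;
    #pragma omp parallel for
    for (k = 0; k < n; k++) {
        x = k * 2 + 1;       /* write to the shared x ...             */
        out[k] = x;          /* ... may observe another thread's value */
    }
}

/* Race-free version: private(x) removes the data-race condition. */
void race_free(int *out, int n)
{
    int k, x;
    #pragma omp parallel for private(x)
    for (k = 0; k < n; k++) {
        x = k * 2 + 1;
        out[k] = x;
    }
}
```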
As discussed previously, data-race conditions can sometimes be difficult to spot; that is, more difficult than shown in this example. When using the full thread synchronization tools in the Windows API or in Pthreads, developers are more likely to avoid these issues because data is designed from the start to be managed with threading contention and race conditions in mind.
However, when using OpenMP, it is easier to overlook data-race conditions. One tool that helps identify such situations is Intel Thread Checker, which is an add-on to the Intel VTune Performance Analyzer.
Managing Shared and Private Data
In writing multithreaded programs, understanding which data is shared and which is private becomes extremely important, not only for performance, but also for program correctness.
OpenMP makes this distinction apparent to the programmer through a set of clauses such as shared, private, and default, and it is something that you can set manually. With OpenMP, it is the developer's responsibility to indicate to the compiler which pieces of memory should be shared among the threads and which pieces should be kept private.
When memory is identified as shared, all threads access the exact same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread to access in private. When the loop exits, these private copies become undefined. By default, all the variables in a parallel region are shared, with three exceptions.
1) In parallel for loops, the loop index is private. In the next example, the k variable is private.
2) Variables that are local to the block of the parallel region are private.
3) Any variables listed in the private, firstprivate, lastprivate, or reduction clauses are private. The privatization is done by making a distinct copy of each of these variables for each thread.
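The default rules and the first two exceptions can be sketched as follows (the function and variable names are illustrative): the array a and the scalar gain are shared by default, the loop index k is automatically private, and tmp is private because it is declared inside the block of the parallel region.

```c
#include <assert.h>

void apply_gain(double *a, int n, double gain)
{
    int k;                             /* private: the loop index       */
    #pragma omp parallel for
    for (k = 0; k < n; k++) {
        double tmp = a[k] * gain;      /* private: local to the block   */
        a[k] = tmp;                    /* a and gain are shared         */
    }
}
```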
Each of the four clauses takes a list of variables, but their semantics are all different. The private clause says that each variable in the list should have a private copy made for each thread.
This private copy is initialized with its default value, using its default constructor where appropriate. For example, the default value for variables of type int is 0. In OpenMP, memory can be declared as private in the following three ways:
1) Use the private, firstprivate, lastprivate, or reduction clause to specify variables that need to be private for each thread.
2) Use the threadprivate pragma to specify the global variables that need to be private for each thread.
3) Declare the variable inside the loop (really, inside the OpenMP parallel region) without the static keyword. Because static variables are statically allocated in a designated memory area by the compiler and linker, they are not truly private like other variables declared within a function, which are allocated within the stack frame for the function.
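The three ways can be sketched in one fragment (all names here are illustrative assumptions):

```c
#include <assert.h>

int g_scratch;                        /* way 2: a global variable given  */
#pragma omp threadprivate(g_scratch)  /* one copy per thread             */

void three_ways(double *a, int n)
{
    int k;
    double x;
    #pragma omp parallel for private(x)  /* way 1: the private clause    */
    for (k = 0; k < n; k++) {
        double t = a[k];                 /* way 3: declared inside the   */
        x = t * 2.0;                     /* region, no static keyword    */
        a[k] = x;
    }
}
```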
The following loop fails to function correctly because the variable x is shared; it needs to be private. The example below fails due to the loop-carried output dependence on the variable x.
The x is shared among all threads based on the OpenMP default-shared rule, so there is a data-race condition on x: while one thread is reading x, another thread might be writing to it.
This problem can be fixed in either of the following two ways, both of which declare the variable x as private memory.
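The failing pattern and the two fixes can be sketched as below; do_work and the array names are illustrative stand-ins for whatever the loop body actually computes.

```c
#include <assert.h>

static int do_work(int v) { return v + 10; }   /* illustrative work */

/* Broken: x is shared, giving a loop-carried output dependence and a
 * data race between the write and the read of x. */
void broken(int *array, int n)
{
    int k, x;
    #pragma omp parallel for
    for (k = 0; k < n; k++) {
        x = array[k];
        array[k] = do_work(x);
    }
}

/* Fix 1: declare x private with the private clause. */
void fix_private_clause(int *array, int n)
{
    int k, x;
    #pragma omp parallel for private(x)
    for (k = 0; k < n; k++) {
        x = array[k];
        array[k] = do_work(x);
    }
}

/* Fix 2: declare x inside the loop body, without the static keyword. */
void fix_local_decl(int *array, int n)
{
    int k;
    #pragma omp parallel for
    for (k = 0; k < n; k++) {
        int x = array[k];
        array[k] = do_work(x);
    }
}
```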
Every time you use OpenMP to parallelize a loop, you should carefully examine all memory references, including the references made by called functions.
This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.
Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.
To read more on Embedded.com about the topics discussed in this article, go to "More about multicores, multiprocessors and multithreading."