
Using OpenMP for programming parallel threads in multicore applications: Part 2

In writing multithreaded programs, understanding which data is shared and which is private becomes extremely important, not only for performance, but also for program correctness.

OpenMP makes this distinction apparent to the programmer through a set of clauses such as shared, private, and default, and it is something that you can set manually.

With OpenMP, it is the developer's responsibility to indicate to the compiler which pieces of memory should be shared among the threads and which pieces should be kept private.

When memory is identified as shared, all threads access the exact same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread to access in private. When the loop exits, these private copies become undefined.

By default, all the variables in a parallel region are shared, with three exceptions. First, in parallel for loops, the loop index is private. In the next example, the k variable is private. Second, variables that are local to the block of the parallel region are private. And third, any variables listed in the private, firstprivate, lastprivate, or reduction clauses are private. The privatization is done by making a distinct copy of each of these variables for each thread.
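A minimal sketch of these defaults is shown below; do_work, array, and N are hypothetical names used only for illustration.

#pragma omp parallel for
for ( k = 0; k < N; k++ ) {        /* k is the loop index, so it is private              */
    int temp = do_work(k);         /* temp is local to the parallel block: also private  */
    array[k] = temp;               /* array and N are declared outside the region: shared */
}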

Each of the four clauses takes a list of variables, but their semantics are all different. The private clause says that each variable in the list should have a private copy made for each thread.

Declaring Memory Private
This private copy is initialized with its default value, using its default constructor where appropriate. For example, the default value for variables of type int is 0. In OpenMP, memory can be declared as private in the following three ways:

1) Use the private, firstprivate, lastprivate, or reduction clause to specify variables that need to be private for each thread.

2) Use the threadprivate pragma to specify the global variables that need to be private for each thread.

3) Declare the variable inside the loop (really, inside the OpenMP parallel region) without the static keyword. Because static variables are statically allocated in a designated memory area by the compiler and linker, they are not truly private like other variables declared within a function, which are allocated within the stack frame for the function.
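For illustration, a sketch of the three approaches might look like the following; counter, scratch, do_work, and array are hypothetical names.

int counter;                        /* 2) a global made private per thread     */
#pragma omp threadprivate(counter)  /*    with the threadprivate pragma        */

int x, k;

#pragma omp parallel for private(x) /* 1) x listed in a private clause         */
for ( k = 0; k < 100; k++ ) {
    x = do_work(k);                 /*    each thread uses its own copy of x   */
    array[k] = x;
}

#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    int scratch = do_work(k);       /* 3) declared inside the region without   */
    array[k] = scratch;             /*    the static keyword, so it is private */
}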

The following loop fails to function correctly because the variable x is shared; it needs to be private. The loop fails due to the loop-carried output dependence on the variable x. The variable x is shared among all threads under OpenMP's default sharing rules, so there is a data-race condition on x: while one thread is reading x, another thread might be writing to it.
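A loop of this form might look like the following sketch; array and do_work are placeholder names.

#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    x = array[k];              /* x is shared by default, so the threads race on it */
    array[k] = do_work(x);
}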

This problem can be fixed in either of the following two ways, which both declare the variable x as private memory.
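Sketches of the two fixes, continuing the placeholder names above:

/* Fix 1: list x in a private clause */
#pragma omp parallel for private(x)
for ( k = 0; k < 100; k++ ) {
    x = array[k];
    array[k] = do_work(x);
}

/* Fix 2: declare x inside the loop, without the static keyword */
#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    int x = array[k];
    array[k] = do_work(x);
}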

Every time you use OpenMP to parallelize a loop, you should carefully examine all memory references, including the references made by called functions.

Loop Scheduling and Partitioning
To have good load balancing and thereby achieve optimal performance in a multithreaded application, you must have effective loop scheduling and partitioning.

The ultimate goal is to ensure that the execution cores are busy most, if not all, of the time, with minimum overhead of scheduling, context switching and synchronization. With a poorly balanced workload, some threads may finish significantly before others, leaving processor resources idle and wasting performance opportunities.

In order to provide an easy way for you to adjust the workload among cores, OpenMP offers four scheduling schemes that are appropriate for many situations: static, dynamic, runtime, and guided. The Intel C++ and Fortran compilers support all four of these scheduling schemes.

A poorly balanced workload is often caused by variations in compute time among loop iterations. It is usually not too hard to determine the variability of loop iteration compute time by examining the source code.

In most cases, you will see that loop iterations consume a uniform amount of time. When that's not true, it may be possible to find a set of iterations that do consume similar amounts of time.

For example, sometimes the set of all even iterations consumes about as much time as the set of all odd iterations, or the set of the first half of the loop consumes about as much time as the second half. On the other hand, it may be impossible to find sets of loop iterations that have a uniform execution time.

In any case, you can provide loop scheduling information via the schedule(kind [, chunksize]) clause, so that the compiler and runtime library can better partition and distribute the iterations of the loop across the threads, and therefore the cores, for optimal load balancing.

Static-even scheduling
By default, an OpenMP parallel for or worksharing for loop uses static-even scheduling. This means the iterations of a loop are distributed among the threads in a roughly equal number of iterations. If m iterations and N threads are in the thread team, each thread gets m/N iterations, and the compiler and runtime library correctly handle the case when m is not evenly divisible by N.

With the static-even scheduling scheme, you can minimize the chances of memory conflicts that arise when more than one processor is trying to access the same piece of memory.

This approach is workable because loops generally touch memory sequentially, so splitting up the loop into large chunks results in little chance of overlapping memory and a reasonable chance of good processor cache efficiency. Consider the following simple loop when executed using static-even scheduling and two threads.

#pragma omp parallel for
for ( k = 0; k < 1000; k++ )
    do_work(k);

OpenMP will execute loop iterations 0 to 499 on one thread and 500 to 999 on the other thread. While this partition of work might be a good choice for memory issues, it could be bad for load balancing. Unfortunately, the converse is also true: what might be good for load balancing could be bad for memory performance.

Therefore, performance engineers must strike a balance between optimal memory usage and optimal load balancing by measuring performance to see what method produces the best results.

Loop-scheduling and partitioning information is conveyed to the compiler and runtime library on the OpenMP for construct with the schedule clause.

#pragma omp for schedule(kind [, chunk-size])

The four schedule schemes specified in the OpenMP standard are summarized in Table 6.2 below. The optional parameter chunk-size, when specified, must be a loop-invariant positive integer constant or integer expression.

Be careful when you adjust the chunk size, because performance can be adversely affected. As the chunk size shrinks, the number of times a thread needs to retrieve work from the work queue increases. As a result, the overhead of going to the work queue increases, thereby reducing performance and possibly offsetting the benefits of load balancing.

Table 6.2 The Four Schedule Schemes in OpenMP

For dynamic scheduling, the chunks are handled with the first-come, first-serve scheme, and the default chunk size is 1. Each time, the number of iterations grabbed is equal to the chunk size specified in the schedule clause for each thread, except the last chunk.

After a thread has finished executing the iterations given to it, it requests another set of chunk-size iterations. This continues until all of the iterations are completed. The last set of iterations may be less than the chunk size.

For example, if the chunk size is specified as 16 with the schedule(dynamic,16) clause and the total number of iterations is 100, the partition would be 16,16,16,16,16,16,4 with a total of seven chunks.
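As a sketch, such a loop might be written:

#pragma omp parallel for schedule(dynamic, 16)
for ( k = 0; k < 100; k++ )
    do_work(k);      /* iterations are handed out 16 at a time; the last chunk gets 4 */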

For guided scheduling, the partitioning of a loop is done based on the following formula, with a start value of β0 = number of loop iterations:

πk = ⌈βk / (2N)⌉

where N is the number of threads, πk denotes the size of the k'th chunk, starting from the 0'th chunk, and βk denotes the number of remaining unscheduled loop iterations while computing the size of the k'th chunk (so βk+1 = βk − πk).

When πk gets too small, the value gets clipped to the chunk size S that is specified in the schedule(guided, chunk-size) clause. The default chunk size setting is 1, if it is not specified in the schedule clause. Hence, for guided scheduling, the way a loop is partitioned depends on the number of threads (N), the number of iterations (β0), and the chunk size (S).

For example, given a loop with β0 = 800, N = 2, and S = 80, the loop partition is {200, 150, 113, 85, 80, 80, 80, 12}. Since π4 would otherwise be smaller than 80, it gets clipped to 80.
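A loop matching this example might be written as the following sketch; do_work is a placeholder.

#pragma omp parallel for schedule(guided, 80) num_threads(2)
for ( k = 0; k < 800; k++ )
    do_work(k);      /* chunk sizes shrink from 200 down to the 80 minimum */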

When the number of remaining unscheduled iterations is smaller than S, the upper bound of the last chunk is trimmed whenever it is necessary. The guided scheduling supported in the Intel C++ and Fortran compilers is a compliant implementation of the OpenMP Version 2.5 standard.

Dynamic and guided scheduling
With dynamic and guided scheduling mechanisms, you can tune your application to deal with those situations where each iteration has variable amounts of work or where some cores (or processors) are faster than others. Typically, guided scheduling performs better than dynamic scheduling due to less overhead associated with scheduling.

The runtime scheduling scheme is actually not a scheduling scheme per se. When runtime is specified in the schedule clause, the OpenMP runtime uses the scheduling scheme specified in the OMP_SCHEDULE environment variable for this particular for loop. The format for the OMP_SCHEDULE environment variable is schedule-type[,chunk-size]. For example:

export OMP_SCHEDULE=dynamic,16

Using runtime scheduling gives the end-user some flexibility in selecting the type of scheduling dynamically among the three previously mentioned scheduling mechanisms through the OMP_SCHEDULE environment variable, which is set to static by default.
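A loop compiled with the runtime scheme simply picks up whatever OMP_SCHEDULE specifies, as in this sketch:

#pragma omp parallel for schedule(runtime)
for ( k = 0; k < 1000; k++ )
    do_work(k);      /* the actual scheme and chunk size come from OMP_SCHEDULE */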

Furthermore, understanding the loop scheduling and partitioning schemes will significantly help you to choose the right scheduling scheme, help you to avoid false-sharing in your applications at runtime, and lead to good load balancing. Consider the following example:
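The loop under discussion might look like the following sketch, assuming x is an array of 4-byte floats:

float x[1000];

#pragma omp parallel for schedule(dynamic, 8)
for ( k = 0; k < 1000; k++ )
    x[k] = x[k] * 2.0f;      /* each chunk of 8 floats occupies only 32 bytes */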

Assume you have a dual-core processor system and the cache line size is 64 bytes. For the sample code shown above, two chunks (or array sections) can be in the same cache line because the chunk size is set to 8 in the schedule clause. Each chunk of array x takes only 32 bytes of a cache line, which leads to two chunks being placed in the same cache line.

Because two chunks can be read and written by two threads at the same time, this will result in many cache line invalidations, even though the two threads do not read or write the same chunk. This is called false-sharing, because it is not necessary to actually share the same cache line between the two threads.

A simple tuning method is to use schedule(dynamic,16), so one chunk takes the entire cache line and the false-sharing is eliminated. Eliminating false-sharing through the use of a chunk-size setting that is aware of cache line size will significantly improve your application performance.

Effective Use of Reductions
In large applications, you can often see the reduction operation inside hot loops. Loops that reduce a collection of values to a single value are fairly common. Consider the following simple loop that calculates the sum of the return value of the integer-type function call func(k) with the loop index value as input data.

sum = 0;
for ( k = 0; k < 100; k++ ) {
    sum = sum + func(k);    // "func" has no side-effects
}

It looks as though the loop-carried dependence on sum would prevent threading. However, if you have a dual-core processor system, you can perform the privatization (that is, create a stack variable "temp", allocated from automatic storage for each thread) and perform loop partitioning to sum up the values of the two sets of calls in parallel, as shown in the following example.
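One way to express this in OpenMP is sketched below: each thread accumulates into its own temp, and the partial sums are combined in a critical section.

sum = 0;
#pragma omp parallel shared(sum)
{
    int k;
    int temp = 0;                /* private: declared inside the parallel region */

    #pragma omp for
    for ( k = 0; k < 100; k++ )
        temp = temp + func(k);   /* each thread accumulates a partial sum */

    #pragma omp critical
    sum = sum + temp;            /* combine the partial sums, one thread at a time */
}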

At the synchronization point, you can combine the partial sum results from each thread to generate the final sum result. In order to perform this form of recurrence calculation in parallel, the operation must be mathematically associative and commutative.

You may notice that the variable sum in the original sequential loop must be shared to guarantee the correctness of the multithreaded execution of the loop, but it must also be protected with a lock or a critical section so that updates from multiple threads are atomic and do not create a data-race condition.

To solve the problem of both sharing and protecting sum without using a lock inside the threaded loop, OpenMP provides the reduction clause that is used to efficiently combine certain associative arithmetical reductions of one or more variables in a loop. The following loop uses the reduction clause to generate the correct results.
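A version of the loop using the reduction clause is sketched below:

sum = 0;
#pragma omp parallel for reduction(+:sum)
for ( k = 0; k < 100; k++ ) {
    sum = sum + func(k);    /* each thread gets a private sum; the copies are added at the end */
}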

Given the reduction clause, the compiler creates private copies of the variable sum for each thread, and when the loop completes, it adds the values together and places the result in the original variable sum.

Other reduction operators
Other reduction operators besides "+" exist. Table 6.3 below lists those C++ reduction operators specified in the OpenMP standard, along with the initial values (which are also the mathematical identity values) for the temporary private variables. You can also find a list of Fortran reduction operators along with their initial values in OpenMP specification Version 2.5.

Table 6.3 Reduction Operators and Their Initial Values in OpenMP

For each variable specified in a reduction clause, a private copy is created, one for each thread, as if the private clause is used. The private copy is then initialized to the initialization value for the operator, as specified in Table 6.3 above.

At the end of the region or the loop for which the reduction clause was specified, the original reduction variable is updated by combining its original value with the final value of each of the private copies, using the operator specified.

While identifying the opportunities to explore the use of the reduction clause for threading, you should keep the following three points in mind:

1) The value of the original reduction variable becomes undefined when the first thread reaches the region or loop that specifies the reduction clause and remains so until the reduction computation is completed.

2) If the reduction clause is used on a loop to which the nowait clause is also applied, the value of the original reduction variable remains undefined until a barrier synchronization is performed to ensure that all threads have completed the reduction (see the sketch after this list).

3) The order in which the values are combined is unspecified. Therefore, comparing a sequential run with a parallel run, or even two parallel runs with each other, is not guaranteed to produce bit-identical results or identical side effects, such as floating-point exceptions.
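The second point can be illustrated with a sketch like the following, reusing the func placeholder from earlier:

sum = 0;
#pragma omp parallel shared(sum)
{
    #pragma omp for reduction(+:sum) nowait
    for ( k = 0; k < 100; k++ )
        sum = sum + func(k);

    /* sum is not guaranteed to hold the final value here */
    #pragma omp barrier
    /* after the barrier, every thread sees the completed reduction */
}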

Minimizing Threading Overhead
Using OpenMP, you can parallelize loops, regions, and sections or straight-line code blocks, whenever dependences do not forbid their being executed in parallel.

In addition, because OpenMP employs the simple fork-join execution model, it allows the compiler and run-time library to compile and run OpenMP programs efficiently with lower threading overhead. However, you can improve your application performance by further reducing threading overhead.

Table 6.4 below provides measured costs of a set of OpenMP constructs and clauses on a 4-way Intel Xeon processor-based system running at 3.0 gigahertz with the Intel compiler and runtime library. You can see that the cost for each construct or clause is small.

Most of them are less than 7 microseconds, except the schedule(dynamic) clause. The schedule(dynamic) clause takes 50 microseconds because its default chunk size is 1, which is too small.

If you use schedule(dynamic,16), its cost is reduced to 5.0 microseconds. Note that all measured costs are subject to change if you measure these costs on a different processor or under a different system configuration.

The key point is that no matter how well the compiler and runtime are developed and tuned to minimize the overhead of OpenMP constructs and clauses, you can always find ways to reduce the overhead by exploring the use of OpenMP in a more effective way.

Table 6.4 Measured Costs of OpenMP Constructs and Clauses

Earlier, you saw how the parallel for pragma could be used to split the iterations of a loop across multiple threads. When the compiler-generated threads are executed, the iterations of the loop are distributed among them. At the end of the parallel region, the threads are suspended and they wait for the next parallel region, loop, or sections.

A suspend or resume operation, while significantly lighter weight than create or terminate operations, still creates overhead and may be unnecessary when two parallel regions, loops, or sections are adjacent, as shown in the following example.
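For instance, two back-to-back parallel loops like the following sketch each pay the region overhead separately; do_work1 and do_work2 are placeholder names.

#pragma omp parallel for
for ( k = 0; k < 10000; k++ )
    x[k] = do_work1(k);    /* first parallel region: threads resume here   */

#pragma omp parallel for
for ( k = 0; k < 10000; k++ )
    y[k] = do_work2(k);    /* second parallel region: threads resume again */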

The overhead can be removed by entering a parallel region once, then dividing the work within the parallel region. The following code is functionally identical to the preceding code but runs faster, because the overhead of entering a parallel region is incurred only once.
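A merged version might be sketched as:

#pragma omp parallel
{
    #pragma omp for
    for ( k = 0; k < 10000; k++ )
        x[k] = do_work1(k);

    #pragma omp for
    for ( k = 0; k < 10000; k++ )
        y[k] = do_work2(k);    /* same work, but the region is entered only once */
}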

Ideally, all performance-critical portions of the application would be executed within a parallel region. Since very few programs consist only of loops, additional constructs are used to handle non-loop code. A work-sharing section is one such construct.

Work-sharing Sections
The work-sharing sections construct directs the OpenMP compiler and runtime to distribute the identified sections of your application among threads in the team created for the parallel region. The following example uses work-sharing for loops and work-sharing sections together within a single parallel region. In this case, the overhead of forking or resuming threads for parallel sections is eliminated.
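A sketch of such a combination, with hypothetical function names, might look like:

#pragma omp parallel
{
    #pragma omp for
    for ( k = 0; k < 1000; k++ )
        x[k] = do_work(k);      /* iterations divided among the threads */

    #pragma omp sections
    {
        #pragma omp section
        phase_one(x);           /* each section runs exactly once,      */

        #pragma omp section
        phase_two(x);           /* in parallel with the other sections  */

        #pragma omp section
        phase_three(x);
    }
}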

Here, OpenMP first creates several threads. Then, the iterations of the loop are divided among the threads. Once the loop is finished, the sections are divided among the threads so that each section is executed exactly once, but in parallel with the other sections.

If the program contains more sections than threads, the remaining sections get scheduled as threads finish their previous sections. Unlike loop scheduling, the schedule clause is not defined for sections.

Therefore, OpenMP is in complete control of how, when, and in what order threads are scheduled to execute the sections. You can still control which variables are shared or private, using the private and reduction clauses in the same fashion as the loop construct.

Next in Part 3: Performance-oriented Programming
To read Part 1, go to The challenges of threading a loop

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single-socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to "More about multicores, multiprocessors and multithreading."
