Using OpenMP for programming parallel threads in multicore applications: Part 3 - Embedded.com

Using OpenMP for programming parallel threads in multicore applications: Part 3

OpenMP provides a set of important pragmas and runtime functions that enable thread synchronization and related actions to facilitate correct parallel programming. Using these pragmas and runtime functions effectively with minimum overhead and thread waiting time is extremely important for achieving optimal performance from your applications.

Using Barrier and Nowait
Barriers are a form of synchronization method that OpenMP employs to synchronize threads. Threads will wait at a barrier until all the threads in the parallel region have reached the same point.

You have been using implied barriers without realizing it in the work-sharing for and work-sharing sections constructs. At the end of the parallel, for, sections, and single constructs, an implicit barrier is generated by the compiler or invoked in the runtime library.

The barrier causes execution to wait for all threads to finish the work of the loop, sections, or region before any go on to execute additional work. This barrier can be removed with the nowait clause, as shown in the following code sample.
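The code sample itself did not survive in this excerpt; the sketch below, using hypothetical routines fn1, fn2, sectionA, sectionB, fn7, and fn8 and assumed variable declarations, illustrates the pattern being described:

#pragma omp parallel
{
    // nowait removes the implicit barrier at the end of this loop,
    // so a thread that finishes its share of iterations moves on.
    #pragma omp for nowait
    for ( k = 0; k < m; k++ ) {
        fn1(k);
        fn2(k);
    }

    // Threads begin picking up sections without waiting for the loop.
    #pragma omp sections private(y, z)
    {
        #pragma omp section
        { y = sectionA(x); fn7(y); }

        #pragma omp section
        { z = sectionB(x); fn8(z); }
    }
}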

In this example, since data is not dependent between the first work-sharing for loop and the second work-sharing sections code block, the threads that process the first work-sharing for loop continue immediately to the second work-sharing sections without waiting for all threads to finish the first loop.

Depending upon your situation, this behavior may be beneficial, because it can make full use of available resources and reduce the amount of time that threads are idle. The nowait clause can also be used with the work-sharing sections construct and single construct to remove its implicit barrier at the end of the code block.

Adding an explicit barrier is also supported by OpenMP, as shown in the following example through the barrier pragma.
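The example referenced here is also missing from this excerpt; a sketch of the structure being described (fn70, fn80, fn10, and fn20 are hypothetical functions, and x, y, z, and k are assumed to be declared in the enclosing scope) might look like this:

#pragma omp parallel num_threads(2) shared(x, y, z) private(k)
{
    int tid = omp_get_thread_num();   // requires <omp.h>

    if ( tid == 0 )
        y = fn70(tid);    // one thread computes y
    else
        z = fn80(tid);    // the other thread computes z

    // Explicit barrier: both y and z must be ready before the loop reads them.
    #pragma omp barrier

    #pragma omp for
    for ( k = 0; k < 100; k++ )
        x[k] = y + z + fn10(k) + fn20(k);
}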

In this example, the OpenMP code is to be executed by two threads; one thread writes the result to the variable y, and another thread writes the result to the variable z. Both y and z are read in the work-sharing for loop, hence, two flow dependences exist.

In order to obey the data dependence constraints in the code for correct threading, you need to add an explicit barrier pragma right before the work-sharing for loop to guarantee that the values of both y and z are ready for read.

In real applications, the barrier pragma is especially useful when all threads need to finish a task before any more work can be completed, as would be the case, for example, when updating a graphics frame buffer before displaying its contents.

Interleaving Single-thread and Multi-thread Execution
In large real-world applications, a program may consist of both serial and parallel code segments due to various reasons such as data dependence constraints and I/O operations.

Within a parallel region, you will certainly need to execute some code only once, by only one thread, especially because you are making parallel regions as large as possible to reduce overhead.

To handle the need for single-thread execution, OpenMP provides a way to specify that a sequence of code contained within a parallel section should only be executed one time by only one thread.

The OpenMP runtime library decides which single thread will do the execution. If need be, you can specify that you want only the master thread, the thread that actually started the program execution, to execute the code, as in the following example.
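The example itself is not reproduced in this excerpt; the following sketch (with hypothetical routines fn1, fn2, fn_input_only, fn_single_print, and fn_print_array) shows how the master and single pragmas can be interleaved with work-sharing loops and barriers:

#pragma omp parallel shared(x, y) private(k)
{
    int tid = omp_get_thread_num();

    // This loop is divided among the threads; nowait removes its barrier.
    #pragma omp for nowait
    for ( k = 0; k < N; k++ )
        x[k] = fn1(tid, k);

    // Only the master thread (thread 0) executes this statement.
    #pragma omp master
    y = fn_input_only();

    // Explicit barrier: every thread must see the final x[] and y.
    #pragma omp barrier

    // This loop is again divided among the threads.
    #pragma omp for nowait
    for ( k = 0; k < N; k++ )
        x[k] = y + fn2(tid, x[k]);

    // Exactly one thread (not necessarily the master) prints y;
    // the implicit barrier at the end of single synchronizes the team.
    #pragma omp single
    fn_single_print(y);

    // Only the master thread prints the array; master implies no barrier.
    #pragma omp master
    fn_print_array(x);
}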

As can be seen from the comments in this code, a remarkable amount of synchronization and management of thread execution is available in a comparatively compact lexicon of pragmas.

Note that all low-level details are taken care of by the OpenMP compiler and runtime. What you need to focus on is specifying the parallel computation and synchronization behaviors you expect for correctness and performance.

In other words, using the single and master pragmas along with the barrier pragma and nowait clause in a clever way, you should be able to maximize the scope of a parallel region and the overlap of computations to reduce threading overhead effectively, while obeying all data dependences and I/O constraints in your programs.

Data Copy-in and Copy-out
When you parallelize a program, you would normally have to deal with how to copy in the initial value of a private variable to initialize its private copy for each thread in the team.

You would also copy out the value of the private variable computed in the last iteration/section to its original variable for the master thread at the end of the parallel region.

The OpenMP standard provides four clauses – firstprivate, lastprivate, copyin, and copyprivate – for you to accomplish the data copy-in and copy-out operations whenever necessary, based on your program and parallelization scheme. The following descriptions summarize the semantics of these four clauses:

* firstprivate provides a way to initialize the value of a private variable for each thread with the value of the variable from the master thread. Normally, temporary private variables have an undefined initial value, saving the performance overhead of the copy.

* lastprivate provides a way to copy out the value of the private variable computed in the last iteration/section to the copy of the variable in the master thread. Variables can be declared both firstprivate and lastprivate at the same time.

* copyin provides a way to copy the master thread's threadprivate variable to the threadprivate variable of each other member of the team executing the parallel region.

* copyprivate provides a way to use a private variable to broadcast a value from one member of the team to the other members of the team executing the parallel region. The copyprivate clause may only be associated with the single construct; the broadcast action is completed before any of the threads in the team leave the barrier at the end of the construct. A brief sketch follows this list.
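As an illustration of the last of these clauses, the sketch below (read_next_token_from_input is a hypothetical routine, and the function is assumed to be called from inside a parallel region) shows copyprivate broadcasting a value read by one thread to the rest of the team:

float read_token(void)
{
    float token;

    // One thread reads the value; copyprivate broadcasts its private copy
    // of token to every other thread in the team before any thread leaves
    // the barrier at the end of the single construct.
    #pragma omp single copyprivate(token)
    token = read_next_token_from_input();

    return token;
}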

Let's see how these clauses work by considering a code example. The following code converts a color image to black and white.
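The conversion code is not included in this excerpt; a serial sketch of the kind of loop being discussed (the pixel structure, the BYTE type, and the weighting factors are assumptions) is:

for ( row = 0; row < height; row++ ) {
    for ( col = 0; col < width; col++ ) {
        pGray[col] = (BYTE)( pRGB[col].red   * 0.299 +
                             pRGB[col].green * 0.587 +
                             pRGB[col].blue  * 0.114 );
    }
    pGray += GrayStride;   // advance to the next destination row
    pRGB  += RGBStride;    // advance to the next source row
}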

The issue is how to move the pointers pGray and pRGB to the correct place within the bitmap while threading the outer “row” loop. The address computation for each pixel can be done with the following code:

pDestLoc = pGray + col + row * GrayStride;
pSrcLoc = pRGB + col + row * RGBStride;

The above code, however, executes extra math on each pixel for the address computation. Instead, the firstprivate clause can be used to perform the necessary initialization to get the initial addresses of the pointers pGray and pRGB for each thread.

You may notice that the initial addresses of the pointers pGray and pRGB have to be computed only once for each thread, based on the “row” number and their initial addresses in the master thread; the pointers pGray and pRGB are induction pointers, updated in the outer loop for each “row” iteration.

This is the reason the bool-type variable doInit is introduced with an initial value TRUE, to make sure the initialization is done only once per thread to compute the initial addresses of the pointers pGray and pRGB. The parallelized code follows:
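The parallelized code did not survive in this excerpt; the following sketch shows how the loop above could be threaded with firstprivate and the doInit flag (it assumes the default static schedule, so each thread processes a contiguous, increasing block of rows):

#pragma omp parallel for private(col) firstprivate(doInit, pGray, pRGB)
for ( row = 0; row < height; row++ ) {
    // Each thread computes its starting addresses exactly once,
    // based on the first row it is assigned.
    if ( doInit == TRUE ) {
        doInit = FALSE;
        pGray += row * GrayStride;
        pRGB  += row * RGBStride;
    }
    for ( col = 0; col < width; col++ ) {
        pGray[col] = (BYTE)( pRGB[col].red   * 0.299 +
                             pRGB[col].green * 0.587 +
                             pRGB[col].blue  * 0.114 );
    }
    pGray += GrayStride;
    pRGB  += RGBStride;
}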

If you take a close look at this code, you may find that the four variables GrayStride, RGBStride, height, and width are read-only variables. In other words, no write operation is performed to these variables in the parallel loop. Thus, you can also specify them on the parallel for loop by adding the code below:

firstprivate(GrayStride, RGBStride, height, width)

You may get better performance in some cases, as the privatization helps the compiler to perform more aggressive registerization and code motion, since these variables are loop invariants, which reduces memory traffic.

Protecting Updates of Shared Variables
The critical and atomic pragmas are supported by the OpenMP standard for you to protect the updating of shared variables and avoid data-race conditions. The code blocks enclosed by a critical section or an atomic pragma are areas of code that may be entered only when no other thread is executing in them. The following example uses an unnamed critical section.

#pragma omp critical
{
    if ( max < new_value ) max = new_value;
}

Global, or unnamed, critical sections will likely and unnecessarily affect performance, because every thread competes to enter the same global critical section and the execution of every thread is serialized. This is rarely what you want.

For this reason, OpenMP offers named critical sections. Named critical sections enable fine-grained synchronization, so only the threads that need to block on a particular section will do so.

The following example shows the code that improves the previous example. In practice, named critical sections are used when more than one thread is competing for more than one critical resource.

#pragma omp critical(maxvalue)
{
    if ( max < new_value ) max = new_value;
}

With named critical sections, applications can have multiple critical sections, and threads can be in more than one critical section at a time. It is important to remember that entering nested critical sections runs the risk of deadlock. The following code example shows a deadlock situation:
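The deadlock example itself is missing from this excerpt; a sketch of the situation being described (NODE and fn1 are hypothetical) is:

void dequeue(NODE *node)
{
    // Inner critical section with the same name as the one in do_work.
    #pragma omp critical (x)
    {
        node = node->next;
    }
}

void do_work(NODE *node)
{
    // Outer critical section; the call to dequeue() below tries to
    // re-enter the same named critical section, which deadlocks.
    #pragma omp critical (x)
    {
        node->next->data = fn1(node->data);
        dequeue(node);
    }
}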

In the previous code, dynamically nested critical sections are used. When the function do_work is called inside a parallel loop, multiple threads compete to enter the outer critical section.

The thread that succeeds in entering the outer critical section will call the dequeue function; however, the dequeue function cannot make any further progress, because its inner critical section attempts to enter the same named critical section that is already held in the do_work function.

Thus, the do_work function can never complete. This is a deadlock situation. The simple way to fix the problem in the previous code is to inline the dequeue function in the do_work function, as follows:
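A sketch of the fixed version, with the body of dequeue inlined so no nested critical section is entered:

void do_work(NODE *node)
{
    #pragma omp critical (x)
    {
        node->next->data = fn1(node->data);
        node = node->next;   // body of dequeue() inlined; no nested critical
    }
}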

When using multiple critical sections, be very careful to examine critical sections that might be lurking in subroutines. In addition to using critical sections, you can also use the atomic pragma for updating shared variables.

When executing code in parallel, it is impossible to know when an operation will be interrupted by the thread scheduler. However, sometimes you may require that a statement in a high-level language complete in its entirety before a thread is suspended. For example, a statement x++ is translated into a sequence of machine instructions such as:

load reg, [x];
add reg, 1;
store [x], reg;

It is possible that the thread is swapped out between two of these machine instructions. The atomic pragma directs the compiler to generate code to ensure that the specific memory storage is updated atomically. The following code example shows a usage of the atomic pragma.
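The code example referenced here is not reproduced; a sketch of the kind of usage being described (work1 and work2 are hypothetical functions, and y, q, index, n, and i are assumed to be declared appropriately) follows:

#pragma omp parallel for shared(y, q, index, n)
for ( i = 0; i < n; i++ ) {
    // Each update of y[index[i]] is performed atomically, so two threads
    // may safely update two different elements of y at the same time.
    #pragma omp atomic
    y[index[i]] += work1(i);

    q[i] = work2(i);   // independent work that needs no protection
}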

An expression statement that is allowed to use the atomic pragma must have one of the following forms:

x binop= expr
x++
++x
x--
--x

In the preceding expressions, x is an lvalue expression with scalar type; expr is an expression with scalar type that does not reference the object designated by x; and binop is not an overloaded operator and is one of +, *, -, /, &, ^, |, <<, or >> for the C/C++ language.

It is worthwhile to point out that, in the preceding code example, the advantage of using the atomic pragma is that it allows updates of two different elements of array y to occur in parallel. If a critical section were used instead, then all updates to elements of array y would be executed serially, but not in a guaranteed order.

Furthermore, in general, the OpenMP compiler and runtime library select the most efficient method to implement the atomic pragma given operating system features and hardware capabilities. Thus, whenever it is possible you should use the atomic pragma before using the critical section in order to avoid data-race conditions on statements that update a shared memory location.

Intel Taskqueuing Extension to OpenMP
The Intel Taskqueuing extension to OpenMP allows a programmer to parallelize control structures such as recursive functions, dynamic-tree searches, and pointer-chasing while loops that are beyond the scope of those supported by the current OpenMP model, while still fitting into the framework defined by the OpenMP specification.

In particular, the taskqueuing model is a flexible programming model for specifying units of work that are not pre-computed at the start of the work-sharing construct. Take a look at the following example.
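The example is not reproduced in this excerpt; the sketch below (do_work1 is a hypothetical routine and NODE a hypothetical list node type) shows the general shape of a pointer-chasing loop expressed with the Intel taskqueuing pragmas:

void tq_process(NODE *p)
{
    // One thread executes the while loop and enqueues tasks;
    // the other threads in the team dequeue and execute them.
    #pragma intel omp parallel taskq shared(p)
    {
        while ( p != NULL ) {
            // captureprivate captures a private copy of p at enqueue
            // time, preserving the sequential semantics of the loop.
            #pragma intel omp task captureprivate(p)
            {
                do_work1(p);
            }
            p = p->next;
        }
    }
}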

The parallel taskq pragma directs the compiler to generate code to create a team of threads and an environment for the while loop to enqueue the units of work specified by the enclosed task pragma.

The loop's control structure and the enqueuing are executed by one thread, while the other threads in the team participate in dequeuing the work from the taskq queue and executing it.

The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is being enqueued, hence preserving the sequential semantics. The taskqueuing execution model is shown in Figure 6.1 below.

Figure 6.1 Taskqueuing Execution Model

Essentially, for any given program with parallel taskq constructs, a team of threads is created by the runtime library when the main thread encounters a parallel region.

The runtime thread scheduler chooses one thread TK to execute initially from all the threads that encounter a taskq pragma. All the other threads wait for work to be put on the task queue. Conceptually, the taskq pragma triggers this sequence of actions:

1. Causes an empty queue to be created by the chosen thread TK
2. Enqueues each task that it encounters
3. Executes the code inside the taskq block as a single thread

The task pragma specifies a unit of work, potentially to be executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is placed on the queue associated with the taskq pragma.

The conceptual queue is disbanded when all work enqueued on it finishes and the end of the taskq block is reached. The Intel C++ compiler has been extended throughout its various components to support the taskqueuing model for generating multithreaded code corresponding to taskqueuing constructs.

Next in Part 4: Library functions, compilation and debugging
To read Part 1, go to The challenges of threading a loop
To read Part 2, go to Managing Shared and Private Data

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to “More about multicores, multiprocessors and multithreading.”
