
Using OpenMP for programming parallel threads in multicore applications: Part 4

In addition to the pragmas discussed earlier in this series that make parallel programming a bit easier, OpenMP provides a set of function calls and environment variables. So far, only the pragmas have been described. The pragmas are the key to OpenMP because they provide the highest degree of simplicity and portability, and the pragmas can be easily switched off to generate a non-threaded version of the code.

In contrast, the OpenMP function calls require you to add conditional compilation to your programs, as shown below, if you want to be able to generate a serial version.

#include <omp.h>

#ifdef _OPENMP
    omp_set_num_threads(4);
#endif

When in doubt, always try to use the pragmas and keep the function calls for the times when they are absolutely necessary. To use the function calls, include the <omp.h> header file. The compiler automatically links to the correct libraries.

The four most heavily used OpenMP library functions are shown in Table 6.5 below. They retrieve the total number of threads, set the number of threads, return the current thread number, and return the number of available cores, logical processors or physical processors, respectively. To view the complete list of OpenMP library functions, please see the OpenMP Specification Version 2.5, which is available from the OpenMP web site at www.openmp.org.

Table 6.5 The Most Heavily Used OpenMP Library Functions
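The table itself is not reproduced here. As a minimal sketch (not from the original text), the four calls it describes, presumably omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), and omp_get_num_procs(), might be used together like this:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);   /* request a team of four threads */

    #pragma omp parallel
    {
        /* Printed once per team: omp_get_num_threads() reports the team size,
           omp_get_num_procs() the number of processors available. */
        #pragma omp single
        printf("Team size: %d, processors: %d\n",
               omp_get_num_threads(), omp_get_num_procs());

        /* Each thread reports its own identifier. */
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}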

Figure 6.2 below uses these functions to perform data processing for each element in array x. This example illustrates a few important concepts when using the function calls instead of pragmas. First, your code must be rewritten, and with any rewrite comes extra documentation, debugging, testing, and maintenance work. Second, it becomes difficult or impossible to compile without OpenMP support.

Finally, because thread values have been hard coded, you lose the ability to have loop scheduling adjusted for you, and this threaded code is not scalable beyond four cores or processors, even if you have more than four cores or processors in the system.

Figure 6.2 Loop that Uses OpenMP Functions and Illustrates the Drawbacks
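The figure itself is not reproduced here. As a rough sketch of the kind of loop it describes (the names NUM_ELEMENTS and process_item() are hypothetical), hard coding the thread count and dividing the work by hand might look like this:

#include <omp.h>

#define NUM_THREADS  4      /* hard-coded thread count: one of the drawbacks noted above */
#define NUM_ELEMENTS 1024   /* hypothetical array size */

static float x[NUM_ELEMENTS];

static void process_item(float *item)   /* hypothetical per-element work */
{
    *item = *item * 2.0f;
}

void process_all(void)
{
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel
    {
        int id    = omp_get_thread_num();
        int nthr  = omp_get_num_threads();
        int chunk = NUM_ELEMENTS / nthr;     /* manual work division: no loop scheduling */
        int start = id * chunk;
        int end   = (id == nthr - 1) ? NUM_ELEMENTS : start + chunk;

        for (int i = start; i < end; i++)
            process_item(&x[i]);
    }
}

Written this way, the partitioning logic assumes exactly four threads and must be rewritten if the thread count or scheduling strategy ever changes, which is precisely the drawback the figure illustrates.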

OpenMP Environment Variables
The OpenMP specification defines a few environment variables. Occasionally, the two shown in Table 6.6 may be useful during development.

Additional compiler-specific environment variables are usually available. Be sure to review your compiler's documentation to become familiar with additional variables.

Table 6.6 Most Commonly Used Environment Variables for OpenMP
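The table itself is not reproduced here. As an aside not taken from the original text, the effect of OMP_NUM_THREADS (which also appears in the debugging steps later in this article) can be observed with a small program such as this sketch; run it with OMP_NUM_THREADS=2 set in the environment and both numbers report 2:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* omp_get_max_threads() reflects OMP_NUM_THREADS, or the runtime's
       default if the variable is not set. */
    printf("Default team size: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("Team size inside the region: %d\n", omp_get_num_threads());
    }
    return 0;
}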

Compilation
Using the OpenMP pragmas requires an OpenMP-compatible compiler and thread-safe runtime libraries. The Intel C++ Compiler version 7.0 or later and the Intel Fortran compiler both support OpenMP on Linux and Windows. This discussion of compilation and debugging will focus on these compilers.

Several other choices are available as well. For instance, Microsoft supports OpenMP in Visual C++ 2005 for Windows and the Xbox 360 platform, and has also made OpenMP work with managed C++ code. In addition, OpenMP compilers for C/C++ and Fortran on Linux and Windows are available from the Portland Group.

The /Qopenmp command-line option given to the Intel C++ Compiler instructs it to pay attention to the OpenMP pragmas and to create multithreaded code. If you omit this switch from the command line, the compiler will ignore the OpenMP pragmas.

This action provides a very simple way to generate a single-threaded version without changing any source code. Table 6.7 below provides a summary of invocation options for using OpenMP. The thread-safe runtime libraries are selected and linked automatically when the OpenMP-related compilation switch is used.

Table 6.7 Summary of Invocation Options for Using OpenMP
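The table itself is not reproduced here. One related detail worth knowing: OpenMP-compliant compilers define the _OPENMP macro only when the OpenMP switch is supplied, so a build can report for itself whether it is the threaded or the serial version. A minimal sketch (not from the original text):

#include <stdio.h>

int main(void)
{
    /* _OPENMP is defined only when the compiler is invoked with its OpenMP
       switch (for example /Qopenmp with the Intel compilers); its value
       encodes the specification date, e.g. 200505 for version 2.5. */
#ifdef _OPENMP
    printf("Compiled with OpenMP support (_OPENMP = %d)\n", _OPENMP);
#else
    printf("Compiled as a single-threaded (serial) version\n");
#endif
    return 0;
}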

The Intel compilers support the OpenMP Specification Version 2.5 except the workshare construct. Be sure to browse the release notes and compatibility information supplied with the compiler for the latest information. The complete OpenMP specification is available from the OpenMP Web site.

Debugging
Debugging multithreaded applications has always been a challenge due to the nondeterministic execution of multiple instruction streams caused by runtime thread scheduling and context switching.

Also, debuggers may change the runtime performance and thread scheduling behaviors, which can mask race conditions and other forms of thread interaction. Even print statements can mask issues because they use synchronization and operating system functions to guarantee thread safety.

Debugging an OpenMP program adds some difficulty because OpenMP compilers must communicate all the necessary information about private variables, shared variables, threadprivate variables, and the various constructs to the debugger after threaded code generation. The additional generated code is effectively impossible to examine and step through without a specialized OpenMP-aware debugger.

Therefore, the key is to narrow the problem down to a small code section that exhibits the same failure. It is even better if you can come up with a very small test case that reproduces the problem. The following list provides guidelines for debugging OpenMP programs:

1. Use the binary search method to identify the parallel construct causing the failure by enabling and disabling the OpenMP pragmas in the program.

2. Compile the routine causing the problem without the /Qopenmp switch and with the /Qopenmp_stubs switch; then check whether the code fails in a serial run. If it does, you are debugging a serial code problem. If not, go to Step 3.

3. Compile the routine causing the problem with the /Qopenmp switch and set the environment variable OMP_NUM_THREADS=1; then check whether the threaded code fails in a serial run. If it does, you are debugging the threaded code running on a single thread. If not, go to Step 4.

4. Identify the failing scenario at the lowest compiler optimization level by compiling it with /Qopenmp and one of the switches such as /Od, /O1, /O2, /O3, and/or /Qipo.

5. Examine the code section causing the failure and look for problems such as violation of data dependence after parallelization, race conditions, deadlock, missing barriers, and uninitialized variables. If you cannot spot any problem, go to Step 6.

6. Compile the code using /Qtcheck to perform the OpenMP code instrumentation and run the instrumented code inside the Intel Thread Checker.

Problems are often due to race conditions. Most race conditions are caused by shared variables that really should have been declared private, made part of a reduction, or declared threadprivate.

Sometimes, race conditions are also caused by missing necessary synchronization, such as critical or atomic protection when updating shared variables. Start by looking at the variables inside the parallel regions and make sure that the variables are declared private when necessary. Also, check functions called within parallel constructs.
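As an illustration (not taken from the original text), the first loop below races on the shared variable sum, while the reduction clause in the second gives each thread a private copy and removes the race:

#include <omp.h>

/* Race condition: sum is shared, so concurrent "sum += a[i]" updates collide. */
float total_racy(const float *a, int n)
{
    float sum = 0.0f;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Fix: the reduction clause gives each thread a private copy of sum and
   combines the partial results at the end of the loop. */
float total_fixed(const float *a, int n)
{
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}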

By default, variables declared on the stack are private, but the C/C++ keyword static causes the variable to be placed on the global heap, and such variables are therefore shared in OpenMP loops.
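A short sketch of the difference (not from the original text):

#include <omp.h>

void example(void)
{
    #pragma omp parallel
    {
        int my_copy = omp_get_thread_num();   /* stack variable: private, one copy per thread  */
        static int shared_count = 0;          /* static: a single copy shared by all threads   */

        (void)my_copy;                        /* each thread works on its own value here       */

        #pragma omp critical                  /* shared data needs protection when updated     */
        shared_count++;
    }
}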

The default(none) clause, shown in the following code sample, can be used to help find those hard-to-spot variables. If you specify default(none), then every variable must be declared with a data-sharing attribute clause.

#pragma omp parallel for default(none) private(x,y) shared(a,b)
Another common mistake is uninitialized variables. Remember that private variables do not have initial values upon entering or exiting a parallel construct. Use the firstprivate or lastprivate clauses discussed previously to initialize or copy them. But do so only when necessary because this copying adds overhead.
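A brief sketch of both clauses (the names factor and last_i are hypothetical):

#include <stdio.h>

void scale(float *a, int n)
{
    float factor = 2.0f;   /* initialized before the parallel region */
    int   last_i = 0;

    /* firstprivate: each thread's private copy of factor starts at 2.0f.
       lastprivate: after the loop, last_i holds the value assigned in the
       sequentially last iteration. */
    #pragma omp parallel for firstprivate(factor) lastprivate(last_i)
    for (int i = 0; i < n; i++) {
        a[i] *= factor;
        last_i = i;
    }

    printf("Last iteration executed: %d\n", last_i);
}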

If you still can't find the bug, perhaps you are working with just too much parallel code. It may be useful to make some sections execute serially by disabling the parallel code. This will at least identify the location of the bug. An easy way to make a parallel region execute in serial is to use the if clause, which can be added to any parallel construct, as shown in the following two examples:

#pragma omp parallel if(0)
    printf("Executed by thread %d\n", omp_get_thread_num());

#pragma omp parallel for if(0)
    for ( x = 0; x < 15; x++ ) fn1(x);

In its general form, the if clause can take any scalar expression, like the one shown in the following example, which causes serial execution when the number of iterations is less than 16.

#pragma omp parallel for if(n>=16)
    for ( k = 0; k < n; k++ ) fn2(k);

Another method is to pick the region of the code that contains the bug and place it within a critical section, a single construct, or a master construct. Try to find the section of code that suddenly works when it is within a critical section and fails without the critical section, or when it is executed by a single thread.
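A sketch of this isolation technique, with hypothetical functions standing in for the surrounding work and for the suspect block:

void do_parallel_work(void);      /* hypothetical surrounding work              */
void update_shared_state(void);   /* hypothetical code section under suspicion  */

void debug_isolation(void)
{
    #pragma omp parallel
    {
        do_parallel_work();

        /* "single" runs the suspect block on one thread only; "critical"
           would instead let every thread run it, but serialized. If the
           failure disappears here, the block (or the data it touches) is
           involved in the thread interaction being debugged. */
        #pragma omp single
        update_shared_state();
    }
}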

The goal is to use the abilities of OpenMP to quickly shift code back and forth between parallel and serial states so that you can identify the location of the bug. This approach only works if the program does in fact function correctly when run completely in serial mode.

Notice that only OpenMP gives you the possibility of testing code this way without rewriting it substantially. Standard programming techniques used in the Windows API or Pthreads irretrievably commit the code to a threaded model and so make this debugging approach more difficult.

Performance
OpenMP provides a simple and portable way for you to parallelize your applications or to develop threaded applications. The performance of a threaded application built with OpenMP is largely dependent upon the following factors:

* The underlying performance of the single-threaded code.
* The percentage of the program that is run in parallel and its scalability.
* CPU utilization, effective data sharing, data locality and load balancing.
* The amount of synchronization and communication among the threads.
* The overhead introduced to create, resume, manage, suspend, destroy, and synchronize the threads, made worse by the number of serial-to-parallel or parallel-to-serial transitions.
* Memory conflicts caused by shared memory or falsely shared memory.
* Performance limitations of shared resources such as memory, write combining buffers, bus bandwidth, and CPU execution units.

Essentially, threaded code performance boils down to two issues: 1) how well does the single-threaded version run, and 2) how well can the work be divided up among multiple processors with the least amount of overhead?

Performance always begins with a well-designed parallel algorithm or well-tuned application. The wrong algorithm, even one written in hand-optimized assembly language, is just not a good place to start. Creating a program that runs well on two cores or processors is not as desirable as creating one that runs well on any number of cores or processors.

Remember, by default, with OpenMP the number of threads is chosen by the compiler and runtime library, not by you, so programs that work well regardless of the number of threads are far more desirable.

Once the algorithm is in place, it is time to make sure that the code runs efficiently on the Intel Architecture, and a single-threaded version can be a big help.

By turning off the OpenMP compiler option you can generate a single-threaded version and run it through the usual set of optimizations. A good reference for optimizations is The Software Optimization Cookbook (Gerber 2006). Once you have gotten the single-threaded performance that you desire, then it is time to generate the multithreaded version and start doing some analysis.

First look at the amount of time spent in the operating system's idle loop. The Intel VTune Performance Analyzer is a great tool to help with the investigation. Idle time can indicate unbalanced loads, lots of blocked synchronization, and serial regions.

Fix those issues, then go back to the VTune Performance Analyzer to look for excessive cache misses and memory issues like false sharing. Solve these basic problems, and you will have a well-optimized parallel program that will run well on multi-core systems as well as multiprocessor SMP systems.

Optimizations are really a combination of patience, trial and error, and practice. Make little test programs that mimic the way your application uses the computer's resources to get a feel for what things are faster than others. Be sure to try the different scheduling clauses for the parallel sections.
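As a starting point (a sketch, with the hypothetical work() standing in for the real per-element computation), the static, dynamic, and guided scheduling clauses look like this:

#include <omp.h>

float work(float v);    /* hypothetical per-element function */

void try_schedules(float *a, int n)
{
    int i;

    /* static: iterations divided into equal chunks up front */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) a[i] = work(a[i]);

    /* dynamic: chunks of 16 handed out as threads finish; better for uneven work */
    #pragma omp parallel for schedule(dynamic, 16)
    for (i = 0; i < n; i++) a[i] = work(a[i]);

    /* guided: chunk size starts large and shrinks as the loop progresses */
    #pragma omp parallel for schedule(guided)
    for (i = 0; i < n; i++) a[i] = work(a[i]);
}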

Key Points
Keep the following key points in mind while programming with OpenMP:

* The OpenMP programming model provides an easy and portable way to parallelize serial code with an OpenMP-compliant compiler.

* OpenMP consists of a rich set of pragmas, environment variables, and a runtime API for threading.

* The environment variables and APIs should be used sparingly because they can affect performance detrimentally. The pragmas represent the real added value of OpenMP.

* With the rich set of OpenMP pragmas, you can incrementally parallelize loops and straight-line code blocks such as sections without re-architecting the applications. The Intel Task queuing extension makes OpenMP even more powerful in covering more application domains for threading.

* If your application's performance is saturating a core or processor, threading it with OpenMP will almost certainly increase the application's performance on a multi-core or multiprocessor system.

* You can easily use pragmas and clauses to create critical sections, identify private and public variables, copy variable values, and control the number of threads operating in one section.

* OpenMP automatically uses an appropriate number of threads for the target system so, where possible, developers should consider using OpenMP to ease their transition to parallel code and to make their programs more portable and simpler to maintain. Native and quasi-native options, such as the Windows threading API and Pthreads, should be considered only when this is not possible.

To read Part 1, go to The challenges of threading a loop
To read Part 2, go to Managing Shared and Private Data
To read Part 3, go to Performance-oriented programming

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to "More about multicores, multiprocessors and multithreading."
