Using OpenMP for programming parallel threads in multicore applications: Part 4

Shameem Akhter and Jason Roberts, Intel Corp.

September 4, 2007

Another common mistake is uninitialized variables. Remember that private variables are uninitialized on entry to a parallel construct, and their final values are discarded on exit. Use the firstprivate and lastprivate clauses discussed previously to initialize them on entry or copy them out on exit, but do so only when necessary, because this copying adds overhead.
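
For example, a minimal sketch along the following lines (bias and last are illustrative names; fn1 and n are as in the other examples in this article) shows both clauses at work:

int bias = 5;     /* given its value before the parallel region */
int last = 0;
int k;

#pragma omp parallel for firstprivate(bias) lastprivate(last)
for ( k = 0; k < n; k++ ) {
    /* firstprivate: each thread's private copy of bias starts at 5 */
    last = fn1(k) + bias;
}
/* lastprivate: after the loop, last holds the value computed
   by the sequentially final iteration (k == n-1)                */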

If you still can't find the bug, perhaps you are simply working with too much parallel code at once. It may be useful to make some sections execute serially by disabling their parallel code; this will at least narrow down the location of the bug. An easy way to make a parallel region execute serially is the if clause, which can be added to any parallel construct, as shown in the following two examples:

#pragma omp parallel if(0)
    printf("Executed by thread %d\n", omp_get_thread_num());

#pragma omp parallel for if(0)
for ( x = 0; x < 15; x++ )
    fn1(x);

In its general form, the if clause accepts any scalar expression, like the one shown in the following example, which causes serial execution when the number of iterations is less than 16.

#pragma omp parallel for if(n>=16)
for ( k = 0; k < n; k++ )
    fn2(k);

Another method is to pick the region of code that contains the bug and place it within a critical section, a single construct, or a master construct. Try to find the section of code that suddenly works when it is inside a critical section and fails without it, or that works when executed by a single thread.
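
For example, a single construct can serialize just the suspect block while the rest of the region stays parallel; in this sketch, do_independent_work and suspect_code stand in for your own routines:

#pragma omp parallel
{
    do_independent_work();   /* still runs on every thread */

    #pragma omp single
    {
        suspect_code();      /* executed by only one thread; the
                                others wait at the implicit barrier */
    }
}

If suspect_code works here but fails when the single construct is removed, a data race inside that block is a likely culprit.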

The goal is to use the abilities of OpenMP to quickly shift code back and forth between parallel and serial states so that you can identify the location of the bug. This approach only works if the program does in fact function correctly when run completely in serial mode.

Notice that only OpenMP gives you the possibility of testing code this way without substantially rewriting it. Standard programming techniques in the Windows API or Pthreads irretrievably commit the code to a threaded model and so make this debugging approach much more difficult.

Performance
OpenMP provides a simple and portable way to parallelize your applications or to develop threaded applications from scratch. The performance of a threaded application built with OpenMP largely depends on the following factors:

* The underlying performance of the single-threaded code.
* The percentage of the program that is run in parallel and its scalability.
* CPU utilization, effective data sharing, data locality and load balancing.
* The amount of synchronization and communication among the threads.
* The overhead introduced to create, resume, manage, suspend, destroy, and synchronize the threads, which grows with the number of serial-to-parallel and parallel-to-serial transitions.
* Memory conflicts caused by shared memory or falsely shared memory.
* Performance limitations of shared resources such as memory, write combining buffers, bus bandwidth, and CPU execution units.

Essentially, threaded code performance boils down to two issues: 1) how well does the single-threaded version run, and 2) how well can the work be divided up among multiple processors with the least amount of overhead?

Performance always begins with a well-designed parallel algorithm or well-tuned application. The wrong algorithm, even one written in hand-optimized assembly language, is just not a good place to start. Creating a program that runs well on two cores or processors is not as desirable as creating one that runs well on any number of cores or processors.

Remember that, by default, the number of threads is chosen by the OpenMP runtime library, not by you, so programs that work well regardless of the number of threads are far more desirable.
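
For instance, the following sketch never sets a thread count itself; it only reports the team size the runtime selected:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* one thread reports the team size chosen by the runtime */
        #pragma omp master
        printf("The runtime chose %d threads\n", omp_get_num_threads());
    }
    return 0;
}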

Once the algorithm is in place, it is time to make sure that the code runs efficiently on the Intel Architecture, and a single-threaded version can be a big help here.

By turning off the OpenMP compiler option you can generate a single-threaded version and run it through the usual set of optimizations. A good reference for optimizations is The Software Optimization Cookbook (Gerber 2006). Once you have achieved the single-threaded performance you desire, it is time to generate the multithreaded version and start doing some analysis.
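
One convenient way to keep both versions in a single source tree is the standard _OPENMP macro, which OpenMP-compliant compilers define only when the OpenMP option is enabled; the helper below is an illustrative sketch:

#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads the runtime may use: the real
   team size in an OpenMP build, 1 in a plain serial build.     */
int threads_available(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}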

First, look at the amount of time spent in the operating system's idle loop. The Intel VTune Performance Analyzer is a great tool to help with this investigation. Idle time can indicate unbalanced loads, excessive blocked synchronization, and serial regions.

Fix those issues, then go back to the VTune Performance Analyzer to look for excessive cache misses and memory issues like false-sharing. Solve these basic problems, and you will have a well-optimized parallel program that will run well on multi-core systems as well as multiprocessor SMP systems.
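
False sharing arises when threads repeatedly write to distinct variables that happen to sit on the same cache line. One common fix, sketched below with an assumed 64-byte line size and an illustrative MAX_THREADS bound, is to pad each thread's data onto its own line:

#include <omp.h>

#define MAX_THREADS 16               /* illustrative upper bound    */

/* Without the padding, neighboring counters share a cache line and
   every increment invalidates that line for the other threads.     */
struct padded_counter {
    long count;
    char pad[64 - sizeof(long)];     /* assume 64-byte cache lines  */
};

struct padded_counter hits[MAX_THREADS];

void count_iterations(int n)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int i;

        #pragma omp for
        for ( i = 0; i < n; i++ )
            hits[id].count++;        /* each thread writes only to
                                        its own, padded slot        */
    }
}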

Optimizations are really a combination of patience, trial and error, and practice. Make little test programs that mimic the way your application uses the computer's resources to get a feel for which approaches are faster than others. Be sure to try the different scheduling clauses for the parallel sections.
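
For example, the earlier loop can be tried under each of the three main schedules; the chunk size of 8 is only a starting point to experiment with:

/* equal-sized chunks, handed out up front */
#pragma omp parallel for schedule(static)
for ( k = 0; k < n; k++ )
    fn2(k);

/* threads grab eight iterations at a time as they finish */
#pragma omp parallel for schedule(dynamic, 8)
for ( k = 0; k < n; k++ )
    fn2(k);

/* chunks start large and shrink as the work runs out */
#pragma omp parallel for schedule(guided)
for ( k = 0; k < n; k++ )
    fn2(k);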

Key Points
Keep the following key points in mind while programming with OpenMP:

* The OpenMP programming model provides an easy and portable way to parallelize serial code with an OpenMP-compliant compiler.

* OpenMP consists of a rich set of pragmas, environment variables, and a runtime API for threading.

* The environment variables and APIs should be used sparingly because they can affect performance detrimentally. The pragmas represent the real added value of OpenMP.

* With the rich set of OpenMP pragmas, you can incrementally parallelize loops and straight-line code blocks such as sections without re-architecting your applications. The Intel task queuing extension makes OpenMP even more powerful, covering additional application domains for threading.

* If your application's performance is saturating a core or processor, threading it with OpenMP will almost certainly increase the application's performance on a multi-core or multiprocessor system.

* You can easily use pragmas and clauses to create critical sections, identify private and shared variables, copy variable values, and control the number of threads operating in one section.

* OpenMP automatically uses an appropriate number of threads for the target system so, where possible, developers should consider using OpenMP to ease their transition to parallel code and to make their programs more portable and simpler to maintain. Native and quasi-native options, such as the Windows threading API and Pthreads, should be considered only when this is not possible.

To read Part 1, go to The challenges of threading a loop
To read Part 2, go to Managing Shared and Private Data
To read Part 3, go to Performance-oriented programming

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to "More about multicores, multiprocessors and multithreading."

