By Shameem Akhter and Jason Roberts, Intel Corp.
Another common mistake is using uninitialized variables. Remember that
private variables are uninitialized on entry to a parallel construct,
and their values are not copied back to the original variables on exit.
Use the firstprivate or lastprivate clauses discussed previously to
initialize them or copy them out. But do so only when necessary because
this copying adds overhead.
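As a minimal sketch of the two clauses, consider the following loop. The function name fill_scaled and its parameters are hypothetical and chosen only for illustration. The sketch assumes n is greater than zero, since the value of a lastprivate variable is only meaningful when the final iteration actually runs.

```c
/* Hypothetical helper: writes i + offset into out[i] and returns the
   value computed by the sequentially last iteration. Assumes n > 0. */
int fill_scaled(int *out, int n, int offset)
{
    int last = -1;  /* overwritten on exit through lastprivate */
    int i;

    /* firstprivate: each thread's copy of offset starts with the
       caller's value instead of being uninitialized.
       lastprivate: the copy of last from iteration n-1 is copied
       back to the original variable when the loop ends. */
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++) {
        last = i + offset;
        out[i] = last;
    }
    return last;
}
```

Calling fill_scaled(out, 8, 10) on an eight-element array should return 17 whether or not the code is compiled with OpenMP enabled.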
If you still can't find the bug, perhaps you are working with just
too much parallel code. It may be useful to make some sections execute
serially, by disabling the parallel code. This will at least identify
the location of the bug. An easy way to make a parallel region execute
in serial is to use the if clause, which can be added to any parallel
construct as shown in the following two examples:
#pragma omp parallel if(0)
printf("Executed by thread %d\n", omp_get_thread_num());
#pragma omp parallel for if(0)
for ( x = 0; x < 15; x++ ) fn1(x);
In the general form, the if clause accepts any scalar expression,
like the one shown in the following example, which causes serial
execution when the number of iterations is less than 16:
#pragma omp parallel for if(n>=16)
for ( k = 0; k < n; k++ ) fn2(k);
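One way to check what the if clause does is to ask the runtime how many threads are in the team. The sketch below is illustrative only; the function name team_size_for is hypothetical. The omp.h include and the runtime call are guarded by the standard _OPENMP macro so that the same source also builds with OpenMP turned off.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the number of threads that execute a
   region guarded by if(n >= 16). */
int team_size_for(int n)
{
    int team = 1;  /* shared; written by exactly one thread below */

    #pragma omp parallel if(n >= 16)
    {
        #pragma omp single
        {
#ifdef _OPENMP
            team = omp_get_num_threads();
#endif
        }
    }
    return team;
}
```

When the condition is false, the region is executed by a team of one thread, so team_size_for(4) returns 1. For larger n the result depends on the runtime and the machine.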
Another method is to pick the region of the code that contains the
bug and place it within a critical section, a single construct, or a
master construct. Try to find the section of code that suddenly works
when it is placed within a critical section or executed by a single
thread, and fails otherwise.
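A sketch of this technique, using a hypothetical summation routine: the update of the shared variable total is the suspect region. Without the critical construct the update races and the result may be wrong. With it, only one thread at a time executes the update.

```c
/* Hypothetical example: sum an array while debugging a suspected race. */
long sum_with_critical(const int *data, int n)
{
    long total = 0;  /* shared by all threads in the loop */
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        /* Suspect region: wrapping it in a critical section
           serializes the update. If the bug disappears, the race
           is located here. */
        #pragma omp critical
        {
            total += data[i];
        }
    }
    return total;
}
```

The critical section is a diagnostic aid here, not necessarily the final fix. Once the race is located, a reduction clause is usually a cheaper remedy for this particular pattern.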
The goal is to use the abilities of OpenMP to quickly shift code
back and forth between parallel and serial states so that you can
identify the locale of the bug. This approach only works if the program
does in fact function correctly when run completely in serial mode.
Notice that only OpenMP gives you the possibility of testing code
this way without rewriting it substantially. Standard programming
techniques used in the Windows API or Pthreads irretrievably commit the
code to a threaded model and so make this debugging approach more
difficult.
Performance
OpenMP paves a simple and portable way for you to parallelize your
applications or to develop threaded applications. The threaded
application performance with OpenMP is largely dependent upon the
following factors:
* The underlying performance of the single-threaded code.
* The percentage of the program that is run in parallel and its
scalability.
* CPU utilization, effective data sharing, data locality and load
balancing.
* The amount of synchronization and communication among the threads.
* The overhead introduced to create, resume, manage, suspend, destroy,
and synchronize the threads, and made worse by the number of
serial-to-parallel or parallel-to-serial transitions.
* Memory conflicts caused by shared memory or falsely shared memory.
* Performance limitations of shared resources such as memory, write
combining buffers, bus bandwidth, and CPU execution units.
Essentially, threaded code performance boils down to two issues:
1) how well does the single-threaded version run, and 2) how well
can the work be divided up among multiple processors with the least
amount of overhead?
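The point in the list above about serial-to-parallel transitions can be illustrated with a small sketch. Instead of opening two separate parallel regions for two loops, the hypothetical function below opens one region and places both work-sharing loops inside it, so the team of threads is set up once. Whether this gives a measurable gain depends on the runtime and the amount of work per loop.

```c
/* Hypothetical example: two loops share a single parallel region. */
void scale_then_shift(double *a, int n, double scale, double shift)
{
    #pragma omp parallel
    {
        int i;  /* declared inside the region, so private per thread */

        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] *= scale;
        /* The implicit barrier at the end of the first loop ensures
           every element is scaled before any element is shifted. */

        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] += shift;
    }
}
```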
Performance always begins with a well-designed parallel algorithm or
well-tuned application. The wrong algorithm, even one written in
hand-optimized assembly language, is just not a good place to start.
Creating a program that runs well on two cores or processors is not as
desirable as creating one that runs well on any number of cores or
processors.
Remember, by default, with OpenMP the number of threads is chosen by
the compiler and runtime library - not you - so programs that work well
regardless of the number of threads are far more desirable.
Once the algorithm is in place, it is time to make sure that the
code runs efficiently on the Intel Architecture. A single-threaded
version can be a big help here.
By turning off the OpenMP compiler option you can generate a
single-threaded version and run it through the usual set of
optimizations. A good reference for optimizations is The Software
Optimization Cookbook (Gerber 2006).
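For the same source to build with the OpenMP option turned off, calls into the OpenMP runtime library need a serial fallback. Pragmas are ignored by a compiler that is not in OpenMP mode, but runtime functions are not. A common approach, sketched here with hypothetical function names, is to test the _OPENMP macro, which an OpenMP-compliant compiler defines only when OpenMP is enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: reports how many threads a parallel region
   could use, or 1 in a build without OpenMP. */
int max_threads_available(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Hypothetical example: the same loop serves both builds because the
   pragma is simply ignored when OpenMP is turned off. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```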
Once you have gotten the single-threaded performance that you desire,
then it is time to generate the multithreaded version and start doing
some analysis.
First look at the amount of time spent in the operating system's
idle loop. The Intel VTune Performance Analyzer is a great tool to help
with the investigation. Idle time can indicate unbalanced loads, lots
of blocked synchronization, and serial regions.
Fix those issues, then go back to the VTune Performance Analyzer to
look for excessive cache misses and memory issues like false sharing.
Solve these basic problems, and you will have a well-optimized parallel
program that will run well on multi-core systems as well as
multiprocessor SMP systems.
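As an illustration of false sharing, the sketch below gives each thread its own counter and pads every counter out to a full cache line so that two threads never write to the same line. The 64-byte line size, the team size of 4, and all names are assumptions made for this example. Check the line size of your target processor before relying on it.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE 64  /* assumed cache-line size in bytes */
#define TEAM_SIZE  4   /* assumed upper bound on the team size */

/* One counter per thread, padded so that consecutive counters are a
   full cache line apart. */
struct padded_count {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* Hypothetical example: count the even values in data. */
long count_even(const int *data, int n)
{
    struct padded_count counts[TEAM_SIZE] = {{0}};
    long total = 0;
    int t;

    /* num_threads requests at most TEAM_SIZE threads, so thread
       numbers stay within the bounds of counts[]. */
    #pragma omp parallel num_threads(TEAM_SIZE)
    {
        int id = 0;
        int i;
#ifdef _OPENMP
        id = omp_get_thread_num();
#endif
        #pragma omp for
        for (i = 0; i < n; i++)
            if (data[i] % 2 == 0)
                counts[id].value++;
    }

    for (t = 0; t < TEAM_SIZE; t++)
        total += counts[t].value;
    return total;
}
```

For a simple count like this one, a reduction clause would be the more idiomatic choice. The padded array is shown only to make the memory layout issue visible.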
Optimizations are really a combination of patience, trial and error,
and practice. Make little test programs that mimic the way your
application uses the computer's resources to get a feel for which
approaches are faster than others. Be sure to try the different
scheduling clauses on your parallel loops.
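A little test program of this kind might look like the following sketch, which runs the same uneven workload under two schedules. The work function and the chunk size of 4 are hypothetical choices for illustration. Which schedule is faster depends on the workload and the machine, so time both.

```c
/* Hypothetical uneven workload: the cost of work(k) grows with k. */
static long work(int k)
{
    long s = 0;
    int j;
    for (j = 0; j <= k; j++)
        s += j;
    return s;
}

/* Static schedule: iterations are divided among threads up front. */
long run_static(int n)
{
    long total = 0;
    int k;

    #pragma omp parallel for schedule(static) reduction(+:total)
    for (k = 0; k < n; k++)
        total += work(k);
    return total;
}

/* Dynamic schedule: threads grab chunks of 4 iterations as they
   finish, which can help when iteration costs are uneven. */
long run_dynamic(int n)
{
    long total = 0;
    int k;

    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (k = 0; k < n; k++)
        total += work(k);
    return total;
}
```

Both functions return the same total. Only the way iterations are handed to threads differs. Using schedule(runtime) instead lets you switch schedules through the OMP_SCHEDULE environment variable without recompiling.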
Key Points
Keep the following key points in mind while programming with OpenMP:
* The OpenMP programming model provides an easy and portable way to
parallelize serial code with an OpenMP-compliant compiler.
* OpenMP consists of a rich set of pragmas, environment variables,
and a runtime API for threading.
* The environment variables and APIs should be used sparingly
because they can affect performance detrimentally. The pragmas
represent the real added value of OpenMP.
* With the rich set of OpenMP pragmas, you can incrementally
parallelize loops and straight-line code blocks such as sections
without re-architecting the applications. The Intel task queuing
extension makes OpenMP even more powerful by covering more application
domains for threading.
* If your application's performance is saturating a core or
processor, threading it with OpenMP will almost certainly increase the
application's performance on a multi-core or multiprocessor system.
* You can easily use pragmas and clauses to create critical
sections, identify private and shared variables, copy variable values,
and control the number of threads operating in one section.
* OpenMP automatically uses an appropriate number of threads for the
target system so, where possible, developers should consider using
OpenMP to ease their transition to parallel code and to make their
programs more portable and simpler to maintain. Native and quasi-native
options, such as the Windows threading API and Pthreads, should be
considered only when this is not possible.
To read Part 1, go to The challenges of threading a loop
To read Part 2, go to Managing Shared and Private Data
To read Part 3, go to Performance-oriented programming
This article was excerpted from Multi-Core Programming by Shameem
Akhter and Jason Roberts.
Copyright © 2006 Intel Corporation. All rights reserved.
Shameem Akhter is a platform architect at Intel Corporation, focusing
on single socket multi-core architecture and performance analysis.
Jason Roberts is a senior software engineer at Intel, and has worked on
a number of different multi-threaded software products that span a wide
range of applications targeting desktop, handheld, and embedded DSP
platforms.