Threading through the parallel code optimization maze: Part 2

As discussed in Part 1 of this series, embedded application development is aided by two components: the right development tools and the right development process. In this part of the series, I will detail a threading development process that can aid developers employing threads in their applications.

The Threading Development Cycle (TDC) is iterative and concludes when the application requirements have been met. Figure 6.6 below depicts the TDC, which is summarized by these four steps, covered in detail in this part of the series:

1. Analysis: determining which portion of the application to thread.
2. Design: deciding how to thread, and implementing.
3. Debug: finding and fixing thread-related stability issues in the code.
4. Tune: optimizing for thread performance issues.
The next four sections provide details on each of the above steps.

Figure 6.6: Threading development cycle

Thread Analysis
The first step of the TDC is analysis, which seeks to determine which portions of the code to thread. Threading should be applied to the portions of the application that execute the most frequently and will thus lead to the largest performance improvement.

Conversely, there is no point in threading portions of the code that are never or rarely executed in practice. So the question of where to thread is important. The analysis phase can be broken down into the following steps:

1. Develop benchmark that is representative of the application
2. Tune for serial performance
3. Obtain an execution time profile
4. Identify top regions of code and top loops
5. Obtain a call graph profile
6. Create a flow chart of the critical algorithms

Benchmarking. Before the question of where to thread can be answered, a reproducible workload, or benchmark, of the application should be created. To conduct the analysis and successive steps of the TDC, the benchmark should have a number of critical qualities, summarized as follows:

1) Representative of common usage patterns
2) Reproducible
3) Simple to execute
4) Reduced time to execute

First, the benchmark should represent typical usage patterns of the application by the user. For example, the benchmark should not be a regression test suite designed to test every possible code path through the application.

Second, steps to exercise the benchmark should be reproducible so that the effects of the threading implementation can be analyzed for both stability and performance.

Third, the benchmark should be simple to execute to minimize the chance for human error. Finally, the benchmark should complete in a short duration which enables quick turnaround on performance experiments.

Tune for Serial Performance. Serial tuning is less labor intensive than parallel tuning and, done properly, can lead to dramatic performance improvements that enhance the benefits of parallel execution. There is a caveat: some serial optimizations can limit the ability to parallelize. The best way to judge whether over-optimization may inhibit parallelism is to determine whether your optimizations introduce the parallelism-inhibiting constructs defined earlier in this series.

Collect an Execution Time Profile. An execution time profile is just as critical for threading as it is for standard scalar optimization. The execution time profile is a report on the areas of highest execution of an application. Threading these frequently executed regions of code typically provides the largest performance improvement.

The benefit of obtaining an execution time profile is that with this information it is possible to calculate the theoretical speedup from parallel execution. For example, suppose an application has the execution time profile shown in Table 6.2 below.

Table 6.2: Execution time profile example

Fifty percent of the execution time occurred in the function heavy_compute2(). If you planned to execute the application on a machine with four processor cores and were able to achieve linear scaling, the expected speedup for the entire application would be 1.6. This estimate is helpful in understanding the return on investment of the effort.

If the project requirements could be met by threading heavy_compute2(), you may choose not to thread other regions. If threading heavy_compute2() did not allow you to reach your project requirements, you could consider threading heavy_compute1() and heavy_compute3() as well.

You can also communicate what speedup is reasonably possible. If management demanded a 300% speedup but did not allocate the resources necessary to thread heavy_compute1() and heavy_compute3(), you could clearly communicate the infeasibility of the plan.

Collect a Call Graph Profile. A call graph profile helps determine where in the application to introduce threads. A call graph shows the relationships between a function (the caller) and the functions it calls (the callees). A typical application can contain thousands of functions with many levels of nested calls, so a call graph of the application can be very complex.

A useful call graph will contain additional information such as the number of times a function is called, the amount of time spent in the function, and calculations of the most time-consuming paths through the call graph. A call graph benefits analysis of your application in two areas:

1. Easily identify portions of the application that may be amenable to threading via functional decomposition.

2. Gain a better understanding of the effects that a threading decision in one area has on the larger application.

Figure 6.7 below is a sample call graph created by profiling code similar to the code in Figure 6.1. One possibility observed from the call graph is that the three functions called from main may be candidates for parallel execution.

Figure 6.7: Sample call graph

On the second point, a call graph provides an idea of which functions could be affected by the area chosen to be threaded. For example, if you decided to thread a routine that called several other functions which in turn called many others, it may be good to do a cursory analysis of the variables used in the routines to determine if access to shared variables needs to be synchronized. Many synchronization issues could be prevented upfront which saves time during the debug phase.

Flow Chart Hotspots. A flow chart of hotspots is useful in cases where the call graph does not reveal the functional flow of algorithm steps with enough granularity. Furthermore, in flow charting the algorithm, it should become clear whether functional or data decomposition should be employed.

Producing a flow chart is a time-intensive activity that should be applied only to the hotspots. Figure 6.8 below shows a sample flow chart of one of the SPEC CPU2000 [4] benchmarks, which implements a pattern recognition application employing an Adaptive Resonance II [5] neural network, an algorithm that is very amenable to data decomposition. In fact, a threaded version of the pattern recognition application is used in the SPEC OMP [6] benchmarks, where it is known as 330.art_m and 331.art_l.

Figure 6.8: Flow chart of scanner application

The OpenMP version of the benchmark [7] is parallelized based upon data decomposition of the outer loop shown in Figure 6.8. The benefit of the flow chart is that it visually combines knowledge of the frequently executed loops and the call graph into a diagram that aids your multi-threaded design.

Classify Most Frequently Executed Loops. The next step in analysis is to classify the most frequently executed loops in the application. Loop level parallelism via data decomposition is typically an easier undertaking than other types of parallelization and should be considered first.

The term top loop is relative to the application and your performance needs, but typically a loop accounting for 10% or more of the execution time is worthy of consideration. In analyzing loops for threading, focus on both innermost loops with execution time concentrated in a small area and loops with large bodies, potentially containing many nested loops.

This initial focus on both small and large loops differs from typical performance analysis, which focuses first on tight innermost loops. A greater variety of loops is considered because the cost of spawning threads must be amortized at the correct level, or the benefits of threading may fall short of the ideal.

For example, consider the code in Figure 6.9 below. If the function perform_task() were threaded using Pthreads, the overhead of thread creation and thread deletion would be contained inside the function call and incurred on every loop iteration. This thread overhead would negatively impact performance.

Figure 6.9: Sample code before thread pool

A better approach is detailed in Figure 6.10 below, which uses a thread pool and thus incurs the overhead of thread creation and deletion only once across all iterations of the loop.

The thread pool used in this example follows Microsoft syntax; however, the thread pool pattern is not unique to that platform. Prior to the call to QueueUserWorkItem, a function call to set up the pool of threads will have been made, allowing the loop to take advantage of a pool of threads without the setup and teardown overhead of thread creation and deletion.

Figure 6.10: Sample code using thread pool

It is not trivial to find the most frequently executed loops in an application. First of all, the definition of most frequently executed loop is ambiguous. It could mean the loop with the largest number of iterations or it could mean the loop containing the largest amount of execution time. For the purposes of performance analysis, most frequently executed loop implies both definitions.

Consider the code in Figure 6.11 below, which contains a doubly nested loop. The execution time of the outer loop is larger than that of the inner loop, while the inner loop iterates more times than the outer loop.

Figure 6.11: Classify most frequently executed loop

For the purposes of threading code, both loops should be considered. In addition, not all loops are as well structured as a simple for loop. Loops could occur over large areas of code, use more complicated control structures such as while loops or even goto statements, and contain multiple entries into and exits out of the loop.

The current generation of performance analysis tools does not provide a direct method of determining the top loops in a program. Instead, you should analyze the code, supplemented by the information collected in the previous steps, to determine the top loops and classify them appropriately.

The following are questions to consider when determining the top loops:

1) What does the flow chart tell you?
2) What are the top functions in the application?
3) What are the innermost loops in the function?
4) In the call graph, what is the call ratio between a caller and a callee when the call is contained inside a loop?
5) Do the innermost loops account for the largest number of events in an execution profile?

Once the top loops are determined, they should be classified according to several criteria:

1) What is the average execution count of the loop?
2) Is the loop iteration count bounded upon entry into the loop?
3) How well structured is the loop? Multiple entries and exits?

The average execution count of a loop can be estimated in a number of ways. Two simple approaches are observing the loop in a debugger and adding output statements to display the loop iteration counts.

The average execution count is important for a number of reasons. First, if the iteration count is small, it may not make sense to parallelize the region. Second, a large iteration count will guide where threads are created; creating threads inside such a loop would incur a large amount of overhead.

Some threading techniques require that a loop's iteration count is known before entry into the loop. The iteration count can be dynamically determined at run-time, but not dynamically altered once the loop has begun iterating. Some threading techniques require well-structured loops. If the loop has multiple entries and exits, it may prove difficult to thread the loop without significant modifications.

Design and Implementation. Designing for threads depends on a number of constraints, including the thread technology, application and target characteristics, customer requirements, and programmer preferences, to name a few.

It is out of the scope of this series to teach how to design for threads using every technology available. Instead, what is provided is a comparison of different thread technologies to help guide your choice of which to use. The end of this part of the series summarizes a number of references that cover each thread technology in detail.

Part of the design decision is the choice of technology on which to implement. Table 6.3 below provides a high level comparison of various thread technologies.

Table 6.3: Comparison of thread technologies

Low level threads embodied by Pthreads tend to be very flexible in handling different parallelism requirements because the programmer controls the threads explicitly. The programmer is responsible for marshalling the code that needs to be threaded into a routine, explicitly calling the Pthreads routine to spawn the threads, handling communication and synchronization between threads, and destroying the threads.

Flexibility comes at a price, as the level of complexity is higher than with other thread technologies. Explicit threading can be employed to perform both functional and data decomposition, although using it for data decomposition can be challenging.

Portability of Pthreads is widespread, with many modern OSes supporting it; however, many embedded OSes do not. The code impact is somewhat high because the modifications required to multi-thread the code result in code that looks very different from the original serial code. Choose explicit threads if maximum flexibility is desired and you are willing to invest extra effort compared to other thread technologies.

OpenMP handles data decomposition very well but is somewhat limited in handling functional decomposition. OpenMP sections can be used to multi-thread some programs via functional decomposition; however, the semantics require the number of functions and the name of each function to be statically known.

OpenMP is of medium difficulty, as the programmer needs to understand where the threading occurs as well as denote which variables are shared and which are private. The availability of compiler-based threading depends on compiler support and the underlying thread support. One advantage of OpenMP is that the code impact can be low.

The parallel version of the code can look very similar to the original serial code if implemented correctly. An example of OpenMP's low impact on source code is discussed in the next section. Choose OpenMP if your algorithm is amenable to data decomposition.

Autoparallelism and speculative precomputation are similar to OpenMP in many of the categories listed in Table 6.3, with two exceptions. Ease of use is high because the user is only required to specify a compiler option on the command line; the compiler does the rest.

Flexibility is lower than OpenMP's in that the compiler is typically more conservative about threading regions of code, because it lacks the programmer's knowledge of program characteristics. Choose autoparallelism and speculative precomputation if you desire easy access to parallel execution and are willing to accept the limitations.

Domain-specific thread libraries, an example being Intel IPP, are flexible within their particular problem domain. Intel IPP contains threaded routines that perform many kinds of transforms specific to media processing. However, if Intel IPP does not contain a function that meets your exact needs, and your needs cannot be decomposed into lower level Intel IPP functions, then you are out of luck. Choose domain-specific thread libraries if your needs can be expressed using the available library functions.

Intel Threading Building Blocks (TBB) is a fairly recent approach to parallelism. It is very flexible, handling both data and functional decomposition well. Ease of use is moderate, requiring an understanding of C++ templates.

Portability is limited, being restricted to Intel platforms and specific OSes. Code impact is moderate due to the need to modify the serial code to call TBB template functions and use special TBB-provided objects. Choose Intel TBB if you are programming in C++ and need both data decomposition and functional decomposition.

Code Modifications. Once the design is ready, the code modifications are made. Traditional best-known software development practices should be employed, such as using source control tools, code reviews, etc. One tip when implementing is to code a non-threaded version alongside the threaded version. The form of the non-threaded version will differ depending on the particular thread technology applied.

For example, one strength of OpenMP is the low impact to the original serial source code. If coded properly, your source code that has been multi-threaded using OpenMP pragmas can be turned back into serial code by simply omitting the OpenMP option from the compiler command line.

OpenMP has this ability because the default C compiler behavior when encountering an unrecognized pragma is to emit a warning and continue compilation. Using the code sample in Figure 6.12 below, let's observe how to write OpenMP code so that it can execute correctly without the OpenMP option.

Figure 6.12: Code sample with unprotected OpenMP usage

First, Figure 6.12 above contains code that has been multi-threaded using OpenMP and has two issues that prevent it from running serially:

1. Unguarded call of OpenMP library routines.
2. Correct execution depends upon OpenMP library return value.

Figure 6.13 below lists a modified version of the code in Figure 6.12 that is easily converted back and forth between a serial and OpenMP version.

Figure 6.13: Sample OpenMP code with protected OpenMP usage

The call to omp_get_num_threads is protected via an #ifdef directive and is only called when OpenMP is specified on the command line. The OpenMP specification states that the symbol _OPENMP should be defined when OpenMP is specified on the command line.

The value num_steps is dependent on the return value of omp_get_num_threads(). A solution is to modify the code so that no critical values depend on OpenMP API function calls. One technique is to use an #else directive as an alternative compilation path when OpenMP is not specified on the command line.

Now, let's consider the compilation and link sequence, along with the code size and dependency impacts. A typical compilation sequence for the code in Figure 6.13 is:

icc -c code.c -openmp
icc -o code code.o -openmp

Executing the size command on the executable reveals the following:

size code

Without the -openmp option on the command line, the size command returns the following:

As you can see, the use of the OpenMP option results in a larger executable. Please note that this example is a small program that uses a small set of OpenMP features. A typical application is much larger and the additional code size attributed to OpenMP will be amortized over the overall size of the application. Nevertheless, this size increase should be understood in the context of your embedded project.

Performing an ldd shows the libraries that the implementation is dependent upon:

ldd openmp
        => /lib/tls/ (0x40028000)
        => /opt/spdtools/compiler/ia32/cc-9.1.046/lib/ (0x4004a000)
        => /lib/ (0x40088000)
        => /lib/tls/ (0x40090000)
        => /lib/tls/ (0x42000000)
        => /lib/ (0x4009d000)
/lib/ => /lib/ (0x40000000)

As you can see, the use of OpenMP depends on a library provided by the compiler and on the underlying Pthreads implementation on the system. In a cross-compilation environment, you should make sure all of the shared libraries are available on your target.

One other option for creating single-source serial/parallel code is to use the OpenMP stubs library. To do this, compile the code in Figure 6.13 as you did before, but on the link line use -openmp_stubs.

The code modification that handles the call to omp_get_num_threads() becomes unnecessary because the stubs library provides an implementation that returns 1. In general, the stub routines in the library behave with serial semantics.

The concept of using preprocessing to enable single-source serial/parallel code works the same with other thread technologies. Some Pthreads implementations provide a stub library interface that can be used for the same purpose.

To read Part 1, go to A parallelism and threading primer.
Next in Part 3: Debugging in a Multithreaded Environment

Used with permission from Newnes, a division of Elsevier, from “Software Development for Embedded Multi-core Systems” by Max Domeika. Copyright 2008. For more information about this title and other similar books, please visit

Max Domeika is a senior staff software engineer in the Developer Products Division at Intel , creating software tools targeting the Intel Architecture market. Max currently provides technical consulting for a variety of products targeting Embedded Intel Architecture.
