Threading through the parallel code optimization maze: Part 1 -

Threading through the parallel code optimization maze: Part 1

Accomplishing tasks in parallel comes in many guises. At the highest level, multiple computers can divide a task and work on it collectively. An example of this type of parallelism is the SETI@home [1] project where thousands of computers work collectively on analyzing radio signals for patterns indicative of intelligence. At the lowest level, modern microprocessors execute instructions specified in its machine language in parallel.

For example, the Intel Pentium processor has a U execution pipe and a V execution pipe capable of executing different instructions at the same time. The motivation for enabling parallel execution at these different levels of computation is performance.

Furthermore, the use of one type of parallel execution does not negate the benefit of employing other types of parallel execution. Threads are another means of enabling parallelism that complements the already mentioned types and is of smaller granularity than the process level concurrency mentioned in the SETI@home example, but certainly of larger granularity than the instruction level parallelism example of the Pentium processor.

Threads offer the ability to have multiple sections of one task in fl ight at the same time. In technical-speak: threads offer the ability to concurrently execute multiple contexts of the same process.

A real life analogy helps illustrate how threading relates to the other two types of parallelism. Consider the task of harvesting a crop from a farm and a worker as one tractor. The farm is split into multiple fields and a tractor only works on one field.

In this example, harvesting the crop is the task (e.g., the work that needs completion). If there is only one tractor per field, there is no need to explicitly coordinate the work of the tractors. Each tractor has their preassigned space to harvest and can complete the task independently of the other tractors.

Now, assume a second tractor is added to work one field. In this case the two tractors need to be concerned about what each other is doing. There are basic issues of feasibility ” how do the tractors keep from running into each other?

There are issues of optimizing performance ” how is the work split between the tractors to maximize performance and harvest as quickly as possible? This real life example is similar to threading. The threads are akin to multiple tractors tending to one field and similar issues exist. Threads must coordinate their work properly and efficiently or risk crashes and low performance.

In the previous example as long as there was only one tractor on each field they were safe. This is similar to the operating system (OS) concept of a process. Two processes do not share the memory that the OS assigns to them so the chances of them interfering with each other are minimized. ( Processes physically share the memory on the system, but the OS makes it appear to the process to be disjoint memory regions. Modern OSes will throw an exception if a process attempts to access memory outside of its assigned memory.

In terms of scheduling the processes, the OS determines where and when each process will execute. With regard to threads, the software that is either implemented by the programmer or provided by a library that abstracts some of this low level work must coordinate the threads. Nevertheless, designing for threads requires a disciplined process for best effect.

In order to take advantage of threading, the software development process detailed in this two part series is recommended. The Threading Development Cycle (TDC) focuses on the specific needs and potential challenges introduced by threads. It is comprised of the following steps:

1. Analyze the application
2. Design and implement the threading
3. Debug the code
4. Performance tune the code.

The first two parts in this series are specific to cases where the OS provides software threads and shared memory although the steps detailed in the TDC can be applied in the more general case. Before taking a deeper look at the TDC, some basic concepts of parallel programming are explained.

Parallelism and TDC
The typical goal of threading is to improve performance by either reducing latency or improving throughput. Reducing latency is also referred to as reducing turnaround time and means shortening the time period from start to completion of a unit of work. Improving throughput is defined as increasing the number of work items processed per unit of time.

Thread. .A thread is an OS entity that contains an instruction pointer , stack, and a set of register values. To help in understanding, it is good to compare a thread to a process. An OS process contains the same items as a thread such as an instruction pointer and a stack, but in addition has associated with it a memory region or heap.

Logically, a thread fits inside a process in that multiple threads have different instruction pointers and stacks, but share a heap that is associated with a process by the OS. Threads are a feature of the OS and require the OS to share memory which enables sharing of the heap. This is an important fact because not all embedded OSes support shared memory and as a result, not all embedded OSes support threads.

A clarification is that the type of threads discussed in this series is user level software threads. Hardware threads are a feature of many microprocessors and are quite distinct in the meaning and capability.

For example, the term simultaneous multi-threading is a microprocessor feature that enables one processor core to appear and function as multiple processor cores. For the purposes of this series, hardware threads are relevant only in that they may provide the processor cores that the software threads execute upon.

One other clarification is that this series discusses user level threads only. We do not include discussion of kernel level threads or how different OSes map user level threads to kernel threads.

Decomposition . Effectively threading an application requires a plan for assigning the work to multiple threads. Two categories of dividing work are functional decomposition and data decomposition and are summarized as:

1. Functional decomposition: division based upon the type of work.
2. Data decomposition: division based upon the data needing processing.

Functional decomposition is the breaking down of a task into independent steps in your application that can execute concurrently. For example, consider an intrusion detection system that performs the following checks on a packet stream:

1) Check for scanning attacks
2) Check for denial of service attacks

As long as each step above was an independent task, it would be possible to apply functional decomposition and execute each step concurrently. Figure 6.1 below shows sample OpenMP code that uses the section directive to express the parallelism.

Figure 6.1 : Functional decomposition example

When this code is executed, the OpenMP run-time system executes the function in each OpenMP section on a different thread and in parallel (as long as the number of processor cores exceeds the number of threads executing). In practice, attempting to execute multiple threads on the same data at the same time may result in less than ideal performance due to the cost of synchronizing access to the data.

Pipelining is a category of functional decomposition that reduces the synchronization cost while maintaining many of the benefits of concurrent execution. Data decomposition is the breaking down of a task into smaller pieces of work based upon the data that requires processing. For example, consider an image processing application where multiple images need to be converted from one format to another format.

Each conversion takes on the order of seconds to complete and the processing of each image is independent of the processing of the other images. This application lends itself quite naturally to data decomposition. Figure 6.2 below shows sample OpenMP code using the parallel for directive.

Figure 6.2 : Data decomposition example

When this code is executed, the OpenMP run-time system will divide the processing of the images between the allocated threads for parallel execution (assuming the number of processor cores is greater than 1 ). If the processing of each individual image consumed a great deal of time, it may make sense to multi-thread the processing of the individual image and execute the processing of the subimages by different threads.

In general, it is easier to scale applications by employing data decomposition than it is using functional decomposition. In practice you may find a combination of different decompositions works well for your particular application.

Scalability . Scalability is the degree to which an application benefits from additional processor cores. As the number of cores the application uses is increased, it would be nice if performance of the application increased as well.

There are natural limits to how much of an application can be executed in parallel and this limits the obtained performance benefit of parallelism. Amdahl ' s Law below is used to compute the limits of obtained performance and is expressed [2] :

where Fractione is the amount of the application that executes in parallel; and Speedupe is how many times faster the parallel portion executes compared to the original.

For example, consider an application that executes in parallel 50% of the time. Also, assume the application executes on a system with four processor cores and was threaded in such a way that the performance scales with the number of processor cores. In this example, Fractione = 0.5 and Speedupe = 4, and therefore Speedupe = 1.6.

Efficiency is a measure of how effectively the total number of processor cores is being employed in running the application in parallel; the goal is a 100% measure of efficiency. In the previous example, three processor cores are idle for 4/5 of the execution time, which means 12/20 of the four processor cores ' time is idle and thus 8/20 of the time the processor cores are active with 40% efficiency.

Consider another example involving the aforementioned image processing problem with the following constraints:

1) Ten seconds of initialization that must run serially
2) One second to process one image (processing of different images can be accomplished in parallel)
3) Ten seconds of post-processing that must run serially

Table 6.1 below shows the calculations of scalability and efficiency for a number of different processor cores. One observable trend is that as the number of processor cores increases, the corresponding decrease in execution time is not as significant.

Table 6.1 : Image processing scalability and efficiency

In other words, the speedup does not scale linearly with the number of processor cores. For example, the scalability with 32 processor cores is 10.89 and with 300 processors is 15.24, not 108.9 (10 x 10.89). This trend occurs because the serial portion of the application is beginning to dominate the overall execution time. With 16 processor cores, the execution time of the image processing step is 300/16 = 18.75 s.

With 300 processor cores, the execution time of the image processing step is 300/300 = 1 s. Furthermore, 299 processor cores are active for only 1 s out of the 21 s of total execution time. Thus, efficiency is 5.1%. The conclusion is: maximize the benefits of parallelism by parallelizing as much of the application as possible.

One other point worth mentioning is that scalability should always be compared against the best achievable time on one processor core. For example, if the use of a new compiler and optimization settings resulted in a decrease in execution time when run on one processor core, the scalability numbers and efficiency percentages should be recalculated.

Parallel Execution Limiters
All problems do not lend themselves to parallelization and understanding some common limiters is essential. Also, some problems may be executed in parallel only after special handling of these potential limiters. The key limiters to efficient parallel execution are summarized as:

1. Shared variables: variables that are accessible by multiple threads. Typically, these are global variables or local variables in scope of two or more threads.

2. Data dependency: variable A is said to be data dependent upon variable B if the value of A is calculated based in part or whole upon the value of variable B.

3. State in routines: routines with values saved and used across invocations.

4. Memory bandwidth capacity: applications executing in parallel are demanding more data than the path to memory can serve at one time.

The loop in Figure 6.3 below has a data dependency between iterations on the sum variable and would typically limit executing the code in parallel.

Figure 6.3 : Data dependency example

Suppose the loop in Figure 6.3 was threaded by dividing the iteration space between two threads. Thread one executes the loop from i equal to 0 to ( num_steps/2 )-1 and thread two executes from i equal to num_steps/2 to num_steps-1 .

In order to ensure correct parallel execution, the variables referenced inside of the loop body require special handling. As an illustration, consider the threading impact on the variable, sum . For simplicity, assume that (4.0/(1.0+x * x))=1.0 and that sum is initially 0.

Figure 6.4 below shows a potential timeline of execution for the two threads as they execute the loop in parallel.

Figure 6.4 : Data race timeline

In this figure, the C statement, sum=sum+4.0/(1.0+x * x) , is represented by a read operation, a register operation, and a subsequent write operation. The read operation takes the value from memory and places it into a register. The write operation takes the value that was computed in the register and places it back into memory.

Please keep in mind that the register used by thread one and thread two are distinct. The execution of the C statement is not guaranteed to be atomic, which means that thread one may be in the middle of reading and writing when thread two starts its own read and write.

From Figure 6.4 , you see that the sum variable is incremented two times; however, at time 3 after the two increments, sum is still equal to one which is obviously incorrect.

This simple example illustrates one of the fundamental challenges when threading applications ” the determination of which variables need to be shared between threads and how to effectively coordinate access to them. To add to the difficulty of this challenge, there is no guarantee that the problem detailed above will actually exhibit itself every time you execute the application. The execution order of threads is nondeterministic.

This means that the problem may not occur during the development and subsequent debug of the code and only rears itself in the field once a customer is using the application. In other words, some of the time thread one will read and write the sum variable uninterrupted as will thread two.

This issue is termed a race condition and techniques to handle this type of issue are detailed later in this series. A more complicated example is shown in Figure 6.5 below .

Figure 6.5 : Data dependency example 2

The complication is due to the dependency between the variable sum and the variable extra that spans across iterations of the loop. A method of testing if a particular loop can be parallelized is to run the loop backward in terms of its loop indexes.

In this simple example, sum will have a different value if executed with i=num_steps-1 to 0 versus i=0 to num_steps-1 .

While this example is conceptually simple, the issue manifests itself in many types of code, particularly those involving matrix operations. In addition, some forms of hand optimizations commonly performed such as loop unrolling and software pipelining can introduce these dependencies and make threading problematic.

An example of state in routines is a routine that maintains data between invocations of the routine using statically defined variables. Examples of routines with state include typical memory allocation routines, random number generators, and I/O routines. Three solutions to calling functions that are not thread safe in your parallel code are:

1. Restrict access to these functions to allow only one thread to call the function at a time.
2. Use thread safe versions of the routines.
3. Develop your routines to be reentrant.

Many library providers ensure thread safety as a feature of their library. Consult your library ' s documentation to ensure that they are thread safe before using them in your multi-threaded code.

Routines that are made reentrant are capable of being executed in parallel. These routines would not have dependencies between other invocations of the same routines. A typical sign that a routine is not reentrant is if the routine defines and uses static variables .

Global variables pose a challenge because the values are shared between the threads. First, there may be correctness issues once different threads access these values. The correctness issues are handled using one of two methods:

1. Minimize the use of global variables.
2. Use synchronization to limit access to each global variable.

Unnecessary use of global variables is characteristic of older software developed before modern software engineering practices. Reducing the use of global variables in these cases can reduce the need for synchronization.

Memory bandwidth can also limit performance of multi-threaded applications because the bus is a shared resource between the processor cores. If the threads and other applications on the system are memory demanding with many reads from and writes to large amounts of data, it is possible for the bus to become the bottleneck.

This is a difficult issue to diagnose; symptoms that this issue may be occurring include:

1) Central processing units (CPU) utilization is close to 100%
2) Poor performance scaling is observed moving to a four or eight processor core system
3) Low number of context switches per second

It is also possible to compare the performance of your multi-threaded application when executing one thread with the application executing many threads and obtain an event-based sampling (EBS) profile on clock ticks for both.

Hotspots in the profile with many threads that do not occur in the profile with one thread may be indicative of high bandwidth demanding portions of your code.

Intel Architecture processors also have performance monitoring counters (PMCs) with events to measure usage of the bus. Information on employing the PMCs to determine if the bus is saturated in this manner can be found in an article by Levinthal [3] .

Threading Technology Requirements
The several technologies in use today to thread applications and these technologies can be classified into two broad categories, library-based and compiler-based threads. All of these threading technologies place requirements on the software development environment, the host and target system, and the end applications.

Library-based thread application programming interfaces (APIs) like POSIX threads or Win32 threads require the linking in of these thread libraries into your code. During compilation, the source files will reference the interfaces through the thread library header files.

Your source code can make calls to routines in the library and therefore during the link phase, the appropriate thread library must be referenced. The steps are:

1) Source code will contain include directives referencing the thread API header files
2) Source code will contain the appropriate calls to the thread API routines that are used
3) The thread API routines will be linked in during the link phase either dynamically or statically

These steps are no different than using any other type of library. However, the impact is worth detailing. The library routines that are linked into the application will add to the code size of your application in the case of static linking. In the case of dynamic linking the shared object will need to be available and supported on your target system.

Domain-specific thread libraries often rely upon lower level thread libraries. As a result, use of domain-specific libraries can require access to the underlying lower level thread library as well as the domain-specific library.

For example, the multi-threading routines available in Intel Integrated Performance Primitives (IPP) rely upon POSIX threads on Linux systems. Therefore, an application that makes calls to multi-threaded Intel IPP routines may have dependencies on Pthreads as well as Intel IPP.

The situation with compiler-based threading is similar to domain-specific libraries. In addition to requiring headers and libraries provided by the compiler, there may be dependencies on underlying thread libraries when features such as auto-parallelization and OpenMP are used.

Next in Part 2: The Threading Development Cycle .

Used with permission from Newnes, a division of Elsevier, from “Software Development for Embedded Multi-core Systems” by Max Domeika. Copyright 2008. For more information about this title and other similar books, please visit

Max Domeika is a senior staff software engineer in the Developer Products Division at Intel , creating software tools targeting the Intel Architecture market. Max currently provides technical consulting for a variety of products targeting Embedded Intel Architecture.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.