The C language has no native syntaxsupport for multithreading. There are twoopen standard APIs, POSIX threads and OpenMP . POSIXthreads, commonly known as Pthreads, are a low level API, while OpenMPrelies on compiler directives.
A Pthreads implementation is linked in as a normal library, so nospecial compiler support is required. The low level implementationallows precise control over threading and synchronization, but the needfor explicit function calls requires significant modification to thesequential code to express the parallelism.
OpenMP is a higher level API. OpenMP attempts to express parallelismusing compiler directives to annotate the source code without modifyingit. Because it relies on compiler directives, specific compiler supportis required.
The higher level approach of OpenMP may result in less efficiencythan Pthreads in some cases, but the syntax is more concise and easierto learn. A significant advantage is that the OpenMP annotation may bedisabled to run the code sequentially. OpenMP will be used in theexamples which follow.
OpenMP annotations declare parallel regions rather than explicitlydefining threads (Figure 4 below ).When run, OpenMP implicitly begins execution in a master thread. Whenexecution enters a parallel region, OpenMP forks a team of threads toshare work within the region. When all threads have run to completion,the master thread is rejoined.
|Figure4: OpenMP Threads|
Work is typically shared across datasets by partitioning across loopiterations or across tasks by identifying code sections to run inparallel.
For the edge detection example, the implementation spends nearly allits execution time in one loop or another. The smoothing function takesthe most execution time and contains three nested loops to process theimage by row, process a row by pixel, and process a pixel by filterproduct term.
Generally, the outer loops are the best opportunity for parallelism,and also tend to have more disjoint data access patterns which improvecache coherency. Figure 5 below represents the different loop levels. The lowest level requiresexcessive overhead to repeatedly fork and join for every pixel.
|Figure5: Inner versus Outer Loop Parallelization|
Listing 3 below shows aninitial OpenMP parallelization by row. The OpenMP compiler directive isshown in red .The parallel directive identifies a parallel region andthe for directive specifies worksharing across the for loop. The private clause identifies variables r, c, and i as localto eachthread. Variables defined outside the pragma scopeare shared acrossthreads by default. The private clause redefines these variableslocally without requiring them to be explicitly moved inside thepragma.
|Listing3: Parallel Smoothing by Row|
The incorrect results of the parallelization are shown in Figure 6 below.
|Figure6: Incorrect Data Parallel Edge Detection|
Disabling the parallel pragma yields correct results, so it's likelythat some shared variables are being incorrectly updated. Examining theloops, it's clear that sum and offset shouldhave been local to theiterations.
These variables could be defined inside the loop, as suggested bythe commented line, or they could be added to the private clause asshown in Listing 4 below .Testing this version yields an edge detection image which exactlymatches the original sequential computation.
|Figure6: Incorrect Data Parallel Edge Detection|
Data parallelization across loops is the most common form ofparallelization used in OpenMP. It can often be described using onlycompiler directives. An alternative partitioning shares work threadsacross tasks.
In the edge detection example, the two tasks are the smoothing andSobel functions. Each task runs in its own thread by using the OpenMPsections directive. As each row of the smoothing function completes,its data must be communicated to the Sobel function, as shown in Figure 7 below .
Since both functions process the image rows monotonically, it issufficient to pass the number of rows completed rather than the row'sdata values as these will be passed via the shared memory. Thesmoothing function must stay at least two rows ahead so that the Sobelhas enough filtered data to calculate the derivatives.
|Figure7: Edge Detection Task Pipeline|
Input and output row counts are added to the function signaturesalong with a lag parameter which specifies how many input rows aheadthe smoothing function must be. The Sobel function spins until enoughinput rows are available.
The active spinning shown in this example is a simple approach, butmore efficient alternatives, such as conditional variables found inpthreads, enable work to be done while waiting. Even though the rowcounters are shared, the monotonic property is insensitive to threadinterleaving, so locks can be avoided. The modified Sobel function isshown in Listing 5 below .
|Listing5: Sobel Function with row control|
Modifications to the smoothing function for counting and waiting onthe number of rows are similar to those added to the Sobel function.The code describing the task parallelism is shown in Listing 6 below. The smoothingfunction must remain at least SOBEL1_EDGE + 1 rows ahead.
|Listing6: Task Pipeline Description|
Testing the task pipeline implementation yields the image in Figure 8 below . The image is mostlycorrect but clearly shows some spurious results. Initial testing didnot show these errors until some arbitrary delays were added such as the u sleep(50) function shown near thebottom of Listing 4 earlier .Run time scheduling policies and data dependent work loads can maskthese types of bugs and underscore the need for thorough testing.
|Figure8: Incorrect Task Parallel Edge Detection|
Row counts are the only shared variables, and though they appear tobe correct, it is likely that they are somehow involved in the error. Anext step is to examine dependencies to see if any relationships areviolated. Figure 9 below showsa relevant portion of dependency analysis from the CriticalBlue MulticoreCascade tool.
|Figure9: WAR Dependency in Original Code|
The tool identifies a write-after-read anti-dependency between theoutput image of the Sobel function and the input function of thesmoothing function. An examination of the source code in Listing 6 reveals that theinput image is overwritten with the final edge detection image. Thesequential code ran correctly because the smoothing function ran tocompletion before the Sobel function started.
When parallelized, it is possible for the Sobel function tooverwrite a row still required by the smoothing filter. One solution isto add a separate output buffer for the final image. Another solutionsimply delays the Sobel processing by an additional row. Listing 7 below shows the Sobel lagincreased to SMOOTH_EDGE + 1. The edge detection image is now identicalto the original sequential result.
|Listing7: Task Pipeline with Corrected Lag|
Starting with an annotative style such as OpenMP fosters an incrementalparallelization methodology with frequent testing. The data and taskparallelization examples show only a small subset of the OpenMP API.There are additional constructs which can handle more complexscenarios, for example dependencies and critical sections within loops.
There are many ways to further specialize data and taskdecomposition to match concurrency available in the algorithms beingrefactored for parallelism. There is a lot of experience which may beleveraged by studying available patterns and examples.
The greatest challenge is usually not in finding the parallelism,but in exploiting it in a correct and efficient manner. An initialmethodology starts with sequential code and incrementally addsparallelism, testing the changes at each step of the way.
To read Part 1, go to “What makesparallelizing C-code so hard.“
Skip Hovsmith is the Director ofApplication Engineering in the US for CriticalBlue , working on accelerating embedded software through synthesis ofprogrammable multicore processors for structured ASICs, SoCs, andFPGAs. Prior to CriticalBlue, Skip worked for several start ups informal verification, FPGA design, and enterprise software services, andat National Semiconductor working on virtual prototyping, hw/swco-design, reconfigurable systems, and standard cell product design.
 Sutter, Herb. “The Free Lunch Is Over: A Fundamental TurnToward Concurrency in Software.” Dr. Dobb's Journal, 30(3),March 2005.
 Mattson, Timothy G.,Beverly A. Sanders, and Berna L. Massingill, Patternsfor Parallel Programming. Boston: Addison-Wesley, 2005.
 Moreabout multicores and multiprocessors