Applying the fundamentals of parallel programming to multiprocessor designs: Part 2

To see how you might apply the methods outlined in Part 1 to a practical computing problem, consider the error diffusion algorithm that is used in many computer graphics and image processing programs.

Originally proposed by Floyd and Steinberg, error diffusion is a technique for displaying continuous-tone digital images on devices that have limited color (tone) range.

Printing an 8-bit grayscale image to a black-and-white printer is problematic. The printer, being a bi-level device, cannot print the 8-bit image natively. It must simulate multiple shades of gray by using an approximation technique.

An example of an image before and after the error diffusion process is shown in Figure 3.2 below. The original image, composed of 8-bit grayscale pixels, is shown on the left, and the result of processing the image with the error diffusion algorithm is shown on the right.

The output image is composed of pixels of only two colors: black and white. The original 8-bit image is on the left and the resultant 1-bit image is on the right. At the resolution of this printing, they look similar.

The same images as above, but zoomed to 400 percent and cropped to 25 percent to show pixel detail. Now you can clearly see the 1-bit black-and-white rendering on the right and the 8-bit grayscale on the left.

Figure 3.2. Error Diffusion Algorithm Output

The basic error diffusion algorithm does its work in a simple three-step process:

Step #1. Determine the output value given the input value of the current pixel. This step often uses quantization, or in the binary case, thresholding. For an 8-bit grayscale image that is displayed on a 1-bit output device, all input values in the range [0, 127] are to be displayed as a 0 and all input values in the range [128, 255] are to be displayed as a 1 on the output device.

Step #2. Once the output value is determined, the code computes the error between what should be displayed on the output device and what is actually displayed. As an example, assume that the current input pixel value is 168. Given that it is greater than our threshold value (128), we determine that the output value will be a 1.

This value is stored in the output array. To compute the error, the program must first normalize the output so it is on the same scale as the input value. That is, for the purposes of computing the display error, the output pixel is treated as 0 if its value is 0, or as 255 if its value is 1. In this case, the display error is the difference between the value that should have been displayed (168) and the output value (255), which is -87.
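As a quick illustration of steps 1 and 2, the fragment below (a sketch with hypothetical variable names, not an excerpt from Listing 3.1) computes the output value and the display error for this example pixel:

    int input_value = 168;                                  /* error-adjusted input pixel */
    int output_value = (input_value >= 128) ? 1 : 0;        /* step 1: threshold, gives 1 */
    int display_error = input_value - output_value * 255;   /* step 2: 168 - 255 = -87    */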

Step #3. Finally, the error value is distributed on a fractional basis to the neighboring pixels in the region, as shown in Figure 3.3 below.

Figure 3.3. Distributing Error Values to Neighboring Pixels

This example uses the Floyd-Steinberg error weights to propagate errors to neighboring pixels. 7/16ths of the error is computed and added to the pixel to the right of the current pixel that is being processed. 5/16ths of the error is added to the pixel in the next row, directly below the current pixel.

The remaining errors propagate in a similar fashion. While you can use other error weighting schemes, all error diffusion algorithms follow this general method.

The three-step process is applied to all pixels in the image. Listing 3.1 below shows a simple C implementation of the error diffusion algorithm, using Floyd-Steinberg error weights.

Listing 3.1. C-language Implementation of the Error Diffusion Algorithm
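For reference, a straightforward C routine along these lines might look like the following sketch. It assumes the 8-bit grayscale input is stored row by row in input, that output receives a 0 or 1 per pixel, and that error-adjusted values are kept in a signed working buffer. The function name, buffer layout, and boundary handling are illustrative choices rather than the exact code of Listing 3.1.

    #include <stdlib.h>

    /* Sketch of Floyd-Steinberg error diffusion (illustrative, not the
       book's exact listing). input holds 8-bit grayscale pixels, row by
       row; output receives 0 or 1 per pixel. A signed working copy is
       used because error-adjusted values can fall outside 0..255. */
    void error_diffusion(int width, int height,
                         const unsigned char *input, unsigned char *output)
    {
        int *work = malloc((size_t)width * height * sizeof *work);
        if (work == NULL)
            return;
        for (int i = 0; i < width * height; i++)
            work[i] = input[i];

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int idx = y * width + x;

                /* Step 1: determine the output value by thresholding. */
                int out = (work[idx] >= 128) ? 1 : 0;
                output[idx] = (unsigned char)out;

                /* Step 2: compute the display error. */
                int err = work[idx] - out * 255;

                /* Step 3: distribute the error to unprocessed neighbors
                   using the weights 7/16, 3/16, 5/16, and 1/16. */
                if (x + 1 < width)
                    work[idx + 1] += (err * 7) / 16;
                if (y + 1 < height) {
                    if (x > 0)
                        work[idx + width - 1] += (err * 3) / 16;
                    work[idx + width] += (err * 5) / 16;
                    if (x + 1 < width)
                        work[idx + width + 1] += (err * 1) / 16;
                }
            }
        }
        free(work);
    }

Note that the error-adjusted working values may drift outside the 0-255 range, which is why this sketch thresholds a signed working buffer rather than the original input directly.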

Analysis of the Error Diffusion Algorithm
At first glance, one might think that the error diffusion algorithm is an inherently serial process. The conventional approach distributes errors to neighboring pixels as they are computed.

As a result, the previous pixel's error must be known in order to compute the value of the next pixel. This interdependency implies that the code can only process one pixel at a time. It's not that difficult, however, to approach this problem in a way that is better suited to a multithreaded implementation.

An Alternate Approach: Parallel Error Diffusion
To transform the conventional error diffusion algorithm into an approach that is more conducive to a parallel solution, consider the different decompositions that were covered previously in this chapter.

Which would be appropriate in this case? As a hint, consider Figure 3.4 below, which revisits the error distribution illustrated in Figure 3.3 from a slightly different perspective.

Figure 3.4. Error-Diffusion Error Computation from the Receiving Pixel's Perspective

Given that a pixel may not be processed until its spatial predecessors have been processed, the problem appears to lend itself to an approach where we have a producer – or in this case, multiple producers – producing data (error values) which a consumer (the current pixel) will use to compute the proper output pixel.

The flow of error data to the current pixel is critical. Therefore, the problem seems to break down into a data-flow decomposition.

Now that we have identified the approach, the next step is to determine the best pattern that can be applied to this particular problem. Each independent thread of execution should process an equal amount of work (load balancing).

How should the work be partitioned? One way, based on the algorithm presented in the previous section, would be to have one thread process the even pixels in a given row and another thread process the odd pixels in the same row. This approach is ineffective, however; each thread will be blocked waiting for the other to complete, and the performance could be worse than in the sequential case.

To effectively subdivide the work among threads, we need a way to reduce (or ideally eliminate) the dependency between pixels.

Figure 3.4 above illustrates an important point that's not obvious in Figure 3.3 earlier – that in order for a pixel to be processed, it must have three error values – labeled eA, eB, and eC in Figure 3.4 – from the previous row, and one error value from the pixel immediately to the left on the current row. [We assume eA = eD = 0 at the left edge of the page (for pixels in column 0), and that eC = 0 at the right edge of the page (for pixels in column W-1, where W is the number of pixels per row in the image).]

Thus, once these pixels are processed, the current pixel may complete its processing. This ordering suggests an implementation where each thread processes a row of data. Once a row has completed processing of the first few pixels, the thread responsible for the next row may begin its processing. Figure 3.5 below shows this sequence.

Figure 3.5. Parallel Error Diffusion for Multi-thread, Multi-row Situation

Notice that a small latency occurs at the start of each row. This latency is due to the fact that the previous row's error data must be calculated before the current row can be processed. These types of latency are generally unavoidable in producer-consumer implementations; however, you can minimize the impact of the latency as illustrated here.

The trick is to derive the proper workload partitioning so that each thread of execution works as efficiently as possible. In this case, you incur a two-pixel latency before the next thread can begin processing.

An 8.5 x 11-inch page, assuming 1,200 dots per inch (dpi), would have 10,200 pixels per row. The two-pixel latency is insignificant here. The sequence in Figure 3.5 above illustrates the data flow common to the wavefront pattern.
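To make the row-to-row dependency concrete, the fragment below is a minimal sketch of one way to coordinate this wavefront; it is not code from Listing 3.1, and the names row_progress, wait_for_previous_row, and mark_pixel_done are hypothetical. Each row keeps a counter of how many of its pixels have been fully processed, and a thread working on row y may process pixel x only after the thread on row y-1 has moved past column x+1, so that the three error values arriving from the previous row are final.

    #include <stdatomic.h>

    /* One counter per row, initialized to 0; row_progress[y] counts how many
       pixels of row y have been fully processed (output written, error spread). */
    extern atomic_int row_progress[];

    /* Block until the previous row is far enough ahead that the error values
       feeding pixel (x, y) can no longer change. */
    static void wait_for_previous_row(int y, int x, int width)
    {
        if (y == 0)
            return;                      /* the first row has no predecessor */
        int needed = x + 2;              /* previous row must be past column x+1 */
        if (needed > width)
            needed = width;
        while (atomic_load(&row_progress[y - 1]) < needed)
            ;                            /* spin; a condition variable also works */
    }

    /* Call after a pixel's output value is stored and its error distributed. */
    static void mark_pixel_done(int y)
    {
        atomic_fetch_add(&row_progress[y], 1);
    }

With this bookkeeping in place, each image row can simply be handed to the next available thread, and each thread pays only the small per-row startup latency described above.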

Other Alternatives
In the previous section, we proposed a method of error diffusion where each thread processed a row of data at a time. However, one might consider subdividing the work at a higher level of granularity.

Instinctively, when partitioning work between threads, one tends to look for independent tasks. The simplest way of parallelizing this problem would be to process each page separately. Generally speaking, each page would be an independent data set, and thus, it would not have any interdependencies.

So why did we propose a row-based solution instead of processing individual pages? The three key reasons are:

1) An image may span multiple pages. This implementation would impose a restriction of one image per page, which might or might not be suitable for the given application.

2) Increased memory usage. An 8.5 x 11-inch page at 1,200 dpi is 10,200 x 13,200 pixels, which at one byte per pixel consumes roughly 131 megabytes of RAM. Intermediate results must be saved; therefore, this approach would be less memory efficient.

3) An application might, in a common use-case, print only a single page at a time. Subdividing the problem at the page level would offer no performance improvement over the sequential case.

A hybrid approach would be to subdivide the pages and process regions of a page in a thread, as illustrated in Figure 3.6 below.

Figure 3.6. Parallel Error Diffusion for Multi-thread, Multi-page Situation

Note that each thread must work on sections from different pages. This increases the startup latency involved before the threads can begin work. In Figure 3.6 above, Thread 2 incurs a 1/3 page startup latency before it can begin to process data, while Thread 3 incurs a 2/3 page startup latency.

While somewhat improved, the hybrid approach suffers from limitations similar to those of the page-based partitioning scheme described above. To avoid these limitations, you should focus on the row-based error diffusion implementation illustrated in Figure 3.5.

Summary
The key points to keep in mind when developing solutions for parallel computing architectures are:

1) Decompositions fall into one of three categories: task, data, and data flow.
2) Task-level parallelism partitions the work between threads based on tasks.
3) Data decomposition breaks down tasks based on the data that the threads work on.
4) Data flow decomposition breaks down the problem in terms of how data flows between the tasks.
5) Most parallel programming problems fall into one of several well-known patterns.
6) The constraints of synchronization, communication, load balancing, and scalability must be dealt with to get the most benefit out of a parallel program.

Many problems that appear to be serial may, through a simple transformation, be adapted to a parallel implementation.

To read Part 1, go to Breaking up tasks for multicore multithreading.

This series of two articles was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts, published by Intel Press. Copyright © 2006 Intel Corporation. All rights reserved. This book can be purchased online.

Shameem Akhter is a platform architect at Intel Corp., focusing on single socket multicore architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, working on a number of different multithreaded software products spanning desktop, handheld, and embedded DSP platforms.

To learn more about designing with multicores, go to More About Multicores and Multiprocessing.
