Applying the fundamentals of parallel programming to multiprocessor designs: Part 2
By Shameem Akhter and Jason Roberts, Intel Corp.
To see how you might apply the methods outlined in
Part 1 to a practical computing
problem, consider the
error diffusion
algorithm that is used in many
computer graphics and image processing programs.
Originally proposed by Floyd
and Steinberg, error
diffusion is a technique for displaying continuous-tone digital images
on devices that have limited color (tone) range.
Printing an 8-bit grayscale image to a black-and-white printer is
problematic. The printer, being a bi-level device, cannot print the
8-bit image natively. It must simulate multiple shades of gray by using
an approximation technique.
An example of an image before and after the error diffusion process
is shown in
Figure 3.2 below.
The original image, composed of 8-bit grayscale pixels, is shown on the
left, and the same image after processing with the error diffusion
algorithm is shown on the right.
The output image is composed of pixels of only two colors: black and
white. Original 8-bit image on the left, resultant 1-bit (bi-level)
image on the right. At the resolution of this printing, they look similar.
The same images as above but zoomed to 400 percent and cropped to
25 percent to show pixel detail. Now you can clearly see the 1-bit
black-and-white rendering on the right and the 8-bit grayscale on the left.
Figure 3.2. Error Diffusion Algorithm Output
The basic error diffusion algorithm does its work in a simple three-step
process:
Step #1. Determine
the output value given the input value of the current pixel. This step
often uses quantization, or in the binary case, thresholding. For an
8-bit grayscale image that is displayed on a 1-bit output device, all
input values in the range [0, 127] are displayed as a 0 and all
input values in the range [128, 255] are displayed as a 1 on the
output device.
Step #2. Once the output
value is determined, the code computes the error between what should be
displayed on the output device and what is actually displayed. As an
example, assume that the current input pixel value is 168. Given that
it is greater than our threshold value (128), we determine that the
output value will be a 1.
This value is stored in the output array. To compute the error, the
program must normalize output first, so it is in the same scale as the
input value. That is, for the purposes of computing the display error,
the output pixel must be 0 if the output pixel is 0 or 255 if the
output pixel is 1. In this case, the display error is the difference
between the actual value that should have been displayed (168) and the
output value (255), which is -87.
Step #3.
Finally, the error value is distributed on a fractional basis to the
neighboring pixels in the region, as shown in Figure 3.3 below.
Figure 3.3. Distributing Error Values to Neighboring Pixels
This example uses the Floyd-Steinberg error weights to propagate
errors to neighboring pixels. 7/16ths of the error is computed and
added to the pixel to the right of the current pixel that is being
processed. 5/16ths of the error is added to the pixel in the next row,
directly below the current pixel.
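The remaining errors propagate in a similar fashion: 3/16ths of the
error is added to the pixel below and to the left, and 1/16th to the
pixel below and to the right. While you can use other error weighting
schemes, all error diffusion algorithms follow this general method.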
The three-step process is applied to all pixels in the image. Listing 3.1 below shows a simple C
implementation of the error diffusion algorithm, using Floyd-Steinberg
error weights.
Listing 3.1. C-language Implementation of the Error Diffusion Algorithm
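The following is a minimal sketch of a serial implementation along these lines, not a reproduction of the authors' listing. It assumes a row-major 8-bit input buffer and one output byte per pixel (0 or 1). The function name `error_diffusion`, the two padded error rows, and the truncating integer division are illustrative choices.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Serial error diffusion with Floyd-Steinberg weights (illustrative sketch).
 * in:  width*height 8-bit grayscale pixels, row-major
 * out: width*height output pixels, each 0 or 1
 * Returns 0 on success, -1 if the working buffers cannot be allocated. */
int error_diffusion(const unsigned char *in, unsigned char *out,
                    int width, int height)
{
    assert(in && out && width > 0 && height > 0);

    /* Accumulated error for the current and the next row. Each row is
     * padded by one entry per side so edge pixels need no special cases;
     * pixel x lives at index x + 1. */
    int *cur = calloc((size_t)width + 2, sizeof *cur);
    int *next = calloc((size_t)width + 2, sizeof *next);
    if (!cur || !next) { free(cur); free(next); return -1; }

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            size_t i = (size_t)y * (size_t)width + (size_t)x;

            /* Input value plus the error diffused from earlier pixels. */
            int value = in[i] + cur[x + 1];

            /* Step 1: threshold to a bi-level output. */
            unsigned char bit = (value >= 128) ? 1 : 0;
            out[i] = bit;

            /* Step 2: error against the normalized output (0 or 255). */
            int err = value - (bit ? 255 : 0);

            /* Step 3: distribute the error to the four neighbors. */
            cur[x + 2]  += err * 7 / 16;  /* right       */
            next[x]     += err * 3 / 16;  /* below left  */
            next[x + 1] += err * 5 / 16;  /* below       */
            next[x + 2] += err * 1 / 16;  /* below right */
        }
        /* The next row becomes the current row; clear the recycled buffer. */
        int *tmp = cur; cur = next; next = tmp;
        memset(next, 0, ((size_t)width + 2) * sizeof *next);
    }
    free(cur);
    free(next);
    return 0;
}
```

For the worked example in Step #2, an input pixel of 168 yields an output of 1 and an error of -87. In this sketch, 7/16ths of that error (truncated to -38) is carried to the pixel on its right, so a right-hand neighbor of 140 is evaluated as 102 and comes out as 0.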
Analysis of the Error Diffusion Algorithm
At first glance, one might think that the error diffusion algorithm is
an inherently serial process. The conventional approach distributes
errors to neighboring pixels as they are computed.
As a result, the previous pixel's error must be known in order to
compute the value of the next pixel. This interdependency implies that
the code can only process one pixel at a time. It's not that difficult,
however, to recast the problem in a form that is better suited to
multithreading.
An Alternate Approach: Parallel Error Diffusion
To transform the conventional error diffusion algorithm into an
approach that is more conducive to a parallel solution, consider the
different decompositions that were covered in Part 1.
Which would be appropriate in this case? As a hint, consider Figure 3.4 below, which revisits the
error distribution illustrated in Figure 3.3, from a slightly different
perspective.
Figure 3.4. Error-Diffusion Error Computation from the Receiving Pixel's Perspective
Given that a pixel may not be processed until its spatial
predecessors have been processed, the problem appears to lend itself to
an approach where we have a producer - or in this case, multiple
producers - producing data (error values) which a consumer (the current
pixel) will use to compute the proper output pixel.
The flow of error data to the current pixel is critical. Therefore,
the problem seems to break down into a data-flow decomposition.
Now that we have identified the approach, the next step is to determine
the best pattern that can be applied to this particular problem. Each
independent thread of execution should process an equal amount of work
(load balancing).
How should the work be partitioned? One way, based on the algorithm
presented in the previous section, would be to have one thread process
the even pixels in a given row and another thread process the odd
pixels in the same row. This approach is ineffective, however: each
thread would be blocked waiting for the other to complete, and the
performance could be worse than in the sequential case.
To effectively subdivide the work among threads, we need a way to
reduce (or ideally eliminate) the dependency between pixels.
Figure 3.4 above illustrates
an important point that's not obvious in Figure 3.3 earlier:
before a pixel can be processed, it must have
three error values (labeled eA, eB, and eC in Figure 3.4) from the
previous row, and one error value (eD) from the pixel immediately to
the left on the current row. [We assume
eA = eD = 0 at the left edge of the page (for pixels in column 0), and
that eC = 0 at the right edge of the page (for pixels in column W-1,
where W = the number of pixels in a row).]
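The receiving pixel's view can be sketched as a small gather function. This is an illustration only. It assumes that eA, eB, and eC come from the upper-left, upper, and upper-right neighbors and eD from the left neighbor, which is what the edge conditions above imply, and it uses the Floyd-Steinberg weights as seen from the receiver (1/16, 5/16, 3/16, and 7/16 respectively). The name `incoming_error` and the array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Error arriving at pixel x of the current row (illustrative sketch).
 * err_prev: display errors of the previous row, or NULL for the first row
 * err_cur:  display errors of the current row, valid for columns below x
 * width:    W, the number of pixels in a row */
int incoming_error(const int *err_prev, const int *err_cur, int x, int width)
{
    assert(err_cur != NULL && x >= 0 && x < width);

    /* Neighbors that fall off the page contribute zero error. */
    int eA = (err_prev && x > 0)         ? err_prev[x - 1] : 0; /* upper left  */
    int eB = (err_prev)                  ? err_prev[x]     : 0; /* above       */
    int eC = (err_prev && x < width - 1) ? err_prev[x + 1] : 0; /* upper right */
    int eD = (x > 0)                     ? err_cur[x - 1]  : 0; /* left        */

    /* Each sender's weight is the one it assigns to this pixel. */
    return (1 * eA + 5 * eB + 3 * eC + 7 * eD) / 16;
}
```

Written this way, each pixel's error is stored once and only read by its successors, which makes the dependency on the previous row and on the left neighbor explicit.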
Thus, once these pixels are processed, the current pixel may
complete its processing. This ordering suggests an implementation where
each thread processes a row of data. Once a row has completed
processing of the first few pixels, the thread responsible for the next
row may begin its processing. Figure
3.5 below shows this sequence.
Figure 3.5. Parallel Error Diffusion for Multi-thread, Multi-row Situation
Notice that a small latency occurs at the start of each row. This
latency is due to the fact that the previous row's error data must be
calculated before the current row can be processed. These types of
latency are generally unavoidable in producer-consumer implementations;
however, you can minimize the impact of the latency as illustrated
here.
The trick is to derive the proper workload partitioning so that each
thread of execution works as efficiently as possible. In this case, you
incur a two-pixel latency before the thread that processes the next row
can begin.
An 8.5 x 11-inch page, assuming 1,200 dots per inch (dpi), would
have 10,200 pixels per row, so the two-pixel latency is insignificant
here. The sequence in Figure 3.5 above illustrates the data flow common
to the wavefront pattern.
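One way to realize the row-per-thread scheme is sketched below with POSIX threads and C11 atomics. It is an illustration under stated assumptions, not the authors' implementation. Rows are dealt out to threads in round-robin order, each pixel gathers the errors of its four predecessors, and a per-row progress counter enforces the two-pixel lead. The names `job_t`, `row_worker`, and `error_diffusion_wavefront`, the busy-wait with `sched_yield`, and the cancel flag are all choices made for the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    const unsigned char *in;  /* width*height 8-bit grayscale pixels */
    unsigned char *out;       /* width*height results, each 0 or 1 */
    int *err;                 /* display error of each finished pixel */
    atomic_int *done;         /* done[y]: pixels finished so far in row y */
    atomic_int cancel;        /* set if thread creation fails */
    int width, height, nthreads;
} job_t;

typedef struct { job_t *job; int id; } worker_t;

/* Each worker handles rows id, id + nthreads, id + 2*nthreads, ... */
static void *row_worker(void *arg)
{
    worker_t *w = arg;
    job_t *j = w->job;
    const int width = j->width;

    for (int y = w->id; y < j->height; y += j->nthreads) {
        const int *prev = y ? j->err + (size_t)(y - 1) * width : NULL;
        int *cur = j->err + (size_t)y * width;

        for (int x = 0; x < width; x++) {
            int e = 0;
            if (prev) {
                /* Wait until the row above is two pixels ahead (or done). */
                int need = (x + 2 < width) ? x + 2 : width;
                while (atomic_load_explicit(&j->done[y - 1],
                                            memory_order_acquire) < need) {
                    if (atomic_load(&j->cancel)) return NULL;
                    sched_yield();
                }
                e += 5 * prev[x];                         /* above       */
                if (x > 0) e += prev[x - 1];              /* upper left  */
                if (x + 1 < width) e += 3 * prev[x + 1];  /* upper right */
            }
            if (x > 0) e += 7 * cur[x - 1];               /* left        */

            int value = j->in[(size_t)y * width + x] + e / 16;
            unsigned char bit = (value >= 128);
            j->out[(size_t)y * width + x] = bit;
            cur[x] = value - (bit ? 255 : 0);

            /* Publish progress so the thread on row y+1 may advance. */
            atomic_store_explicit(&j->done[y], x + 1, memory_order_release);
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 on allocation or thread-creation failure. */
int error_diffusion_wavefront(const unsigned char *in, unsigned char *out,
                              int width, int height, int nthreads)
{
    assert(in && out && width > 0 && height > 0 && nthreads > 0);

    job_t job;
    job.in = in; job.out = out;
    job.width = width; job.height = height; job.nthreads = nthreads;
    atomic_init(&job.cancel, 0);
    job.err = malloc((size_t)width * (size_t)height * sizeof *job.err);
    job.done = malloc((size_t)height * sizeof *job.done);
    pthread_t *tid = malloc((size_t)nthreads * sizeof *tid);
    worker_t *wk = malloc((size_t)nthreads * sizeof *wk);

    int rc = -1, started = 0;
    if (job.err && job.done && tid && wk) {
        for (int y = 0; y < height; y++) atomic_init(&job.done[y], 0);
        for (; started < nthreads; started++) {
            wk[started].job = &job;
            wk[started].id = started;
            if (pthread_create(&tid[started], NULL, row_worker, &wk[started]))
                break;
        }
        if (started < nthreads) atomic_store(&job.cancel, 1);
        else rc = 0;
        for (int i = 0; i < started; i++) pthread_join(tid[i], NULL);
    }
    free(job.err); free(job.done); free(tid); free(wk);
    return rc;
}
```

Because every pixel's error is written by exactly one thread and read only after the progress counter has been published, the output should not depend on the thread count. A production version would likely replace the spin-wait with a blocking primitive and process pixels in larger batches to reduce synchronization traffic.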
Other Alternatives
In the previous section, we proposed a method of error diffusion where
each thread processed a row of data at a time. However, one might
consider subdividing the work at a higher level of granularity.
Instinctively, when partitioning work between threads, one tends to
look for independent tasks. The simplest way of parallelizing this
problem would be to process each page separately. Generally speaking,
each page would be an independent data set, and thus, it would not have
any interdependencies.
So why did we propose a row-based solution instead of processing
individual pages? The three key reasons are:
1) An image may
span multiple pages. This implementation would impose a restriction of
one image per page, which might or might not be suitable for the given
application.
2) Increased
memory usage. At 8 bits per pixel, an 8.5 x 11-inch page at 1,200 dpi
occupies about 135 million bytes (roughly 128 MiB)
of RAM. Intermediate results must be saved; therefore, this approach
would be less memory efficient.
3) An
application might, in a common use-case, print only a single page at a
time. Subdividing the problem at the page level would offer no
performance improvement from the sequential case.
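The page-size figures used in this discussion follow from simple arithmetic, sketched here with two hypothetical helper functions. The sketch assumes one byte per pixel, matching the 8-bit grayscale input used throughout.

```c
#include <assert.h>
#include <stddef.h>

/* Pixels along one edge of the page: inches times dots per inch. */
size_t pixels_along(double inches, unsigned dpi)
{
    assert(inches > 0.0 && dpi > 0);
    return (size_t)(inches * dpi + 0.5);  /* round to the nearest pixel */
}

/* Bytes needed to hold one full page in memory. */
size_t page_bytes(double width_in, double height_in, unsigned dpi,
                  size_t bytes_per_pixel)
{
    return pixels_along(width_in, dpi) * pixels_along(height_in, dpi)
           * bytes_per_pixel;
}
```

For an 8.5 x 11-inch page at 1,200 dpi this gives 10,200 pixels per row, 13,200 rows, and 134,640,000 bytes per page, in the same range as the memory figure quoted in item 2. A page-per-thread scheme holds one such buffer for every page in flight, whereas the two-pixel startup latency of the row-based scheme is about 0.02 percent of a 10,200-pixel row.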
A hybrid approach would be to subdivide the pages and process
regions of a page in a thread, as illustrated in Figure 3.6 below.
Figure 3.6. Parallel Error Diffusion for Multi-thread, Multi-page Situation
Note that each thread must work on sections from different pages.
This increases the startup latency involved before the threads can
begin work. In Figure 3.6 above,
Thread 2 incurs a 1/3 page startup latency before it can begin to
process data, while Thread 3 incurs a 2/3 page startup latency.
While somewhat improved, the hybrid approach suffers from similar
limitations as the page-based partitioning scheme described above. To
avoid these limitations, you should focus on the row-based error
diffusion implementation illustrated in Figure 3.5.
Summary
The key points to keep in mind when developing solutions for parallel
computing architectures are:
1) Decompositions
fall into one of three categories: task, data, and data flow.
2) Task-level
parallelism partitions the work between threads based on tasks.
3) Data
decomposition breaks down tasks based on the data that the threads work
on.
4) Data flow
decomposition breaks down the problem in terms of how data flows
between the tasks.
5) Most parallel
programming problems fall into one of several well-known patterns.
6) The constraints
of synchronization, communication, load balancing, and scalability must
be dealt with to get the most benefit out of a parallel program.
Many problems that appear to be serial may, through a simple
transformation, be adapted to a parallel implementation.
To read Part 1, go to Breaking
up tasks for multicore multithreading
This series of
two articles was excerpted from Multi-Core
Programming by
Shameem Akhter and Jason Roberts, published by Intel Press. Copyright
© 2006 Intel Corporation. All rights reserved. This book can be
purchased on line.
Shameem Akhter is a platform
architect at Intel Corp., focusing on single socket multicore
architecture and performance analysis. Jason Roberts is a senior
software engineer at Intel, working on a number of multithreaded
software products for platforms ranging from desktop and handheld to
embedded DSP.