Applying the fundamentals of parallel programming to multiprocessor designs: Part 1

Parallel programming uses threads to enable multiple operations to proceed simultaneously. The entire concept of parallel programming centers on the design, development, and deployment of threads within an application and the coordination between threads and their respective operations.

In Part 1 we will examine how to do multitasking in a parallel programming application by breaking up programming tasks into chunks that are suitable for threading. In Part 2, these techniques will be applied to the apparently serial problem of error diffusion.

Designing for Threads
Developers who are unacquainted with parallel programming generally feel comfortable with traditional programming models, such as object-oriented programming (OOP).

In this case, a program begins at a defined point, such as the main() function, and works through a series of tasks in succession. If the program relies on user interaction, the main processing instrument is a loop in which user events are handled.

For each allowed event (a button click, for example), the program performs an established sequence of actions that ultimately ends with a wait for the next user action.

When designing such programs, developers enjoy a relatively simple programming world because only one thing is happening at any given moment. If program tasks must be scheduled in a specific way, it's because the developer imposes a certain order on the activities. At any point in the process, one step generally flows into the next, leading up to a predictable conclusion, based on predetermined parameters.

To move from this linear model to a parallel programming model, designers must rethink the idea of process flow. Rather than being constrained by a sequential order of execution, programmers should identify those activities that can be executed in parallel.

To do so, they must see their programs as a set of tasks with dependencies between them. Breaking programs down into these individual tasks and identifying dependencies is known as decomposition.

A problem may be decomposed in several ways: by task, by data, or by data flow. Table 3.1 below summarizes these forms of decomposition. As you shall see shortly, these different forms of decomposition mirror different types of programming activities.

Table 3.1 Summary of the Major Forms of Decomposition

Decomposition | Design | Comments
Task | Different activities assigned to different threads | Common in GUI applications
Data | Multiple threads performing the same operation on different blocks of data | Common in audio processing, imaging, and scientific programming
Data Flow | One thread's output is the input to a second thread | Special care is needed to eliminate startup and shutdown latencies

Task Decomposition
Decomposing a program by the functions that it performs is called task decomposition. It is one of the simplest ways to achieve parallel execution. Using this approach, individual tasks are catalogued.

If two of them can run concurrently, they are scheduled to do so by the developer. Running tasks in parallel this way usually requires slight modifications to the individual functions to avoid conflicts and to indicate that these tasks are no longer sequential.

If we were discussing gardening, task decomposition would suggest that gardeners be assigned tasks based on the nature of the activity: if two gardeners arrived at a client's home, one might mow the lawn while the other weeded. Mowing and weeding are separate functions broken out as such.

To accomplish them, the gardeners would make sure to have some coordination between them, so that the weeder is not sitting in the middle of a lawn that needs to be mowed.

In programming terms, a good example of task decomposition is word processing software, such as Microsoft Word. When the user opens a very long document, he or she can begin entering text right away. While the user enters text, document pagination occurs in the background, as one can readily see by the quickly increasing page count that appears in the status bar.

Text entry and pagination are two separate tasks that its programmers broke out by function to run in parallel. Had programmers not designed it this way, the user would be obliged to wait for the entire document to be paginated before being able to enter any text. Many of you probably recall that this wait was common on early PC word processors.
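The same structure can be sketched in a few lines of code. The snippet below is a minimal illustration of task decomposition in Python using its standard threading module; the function names and the results dictionary are invented for the sketch, as stand-ins for the text-entry and pagination tasks described above.

```python
import threading

results = {}

def enter_text():
    # One task: simulate capturing the user's text entry.
    results["text"] = "user input captured"

def paginate():
    # An independent task: simulate background pagination
    # over a 100-line document, 10 lines per page.
    results["pages"] = sum(1 for _ in range(100)) // 10

# Each task gets its own thread; neither waits on the other's output.
t1 = threading.Thread(target=enter_text)
t2 = threading.Thread(target=paginate)
t1.start(); t2.start()
t1.join(); t2.join()   # coordinate only at the end
```

Because the two functions touch different data, no further coordination is needed; this is the "slight modification to avoid conflicts" mentioned earlier in its simplest form.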

Data Decomposition
Data decomposition, also known as data-level parallelism, breaks down tasks by the data they work on rather than by the nature of the task. Programs that are broken down via data decomposition generally have many threads performing the same work, just on different data items.

For example, consider recalculating the values in a large spreadsheet. Rather than have one thread perform all the calculations, data decomposition would suggest having two threads, each performing half the calculations, or n threads each performing 1/n of the work.

If the gardeners used the principle of data decomposition to divide their work, they would both mow half the property and then both weed half the flower beds. As in computing, determining which form of decomposition is more effective depends a lot on the constraints of the system.

For example, if the area to mow is so small that it does not need two mowers, that task would be better done by just one gardener (that is, task decomposition is the best choice) and data decomposition could be applied to other task sequences, such as when the mowing is done and both gardeners begin weeding in parallel.

As the number of processor cores increases, data decomposition allows the problem size to be increased. This allows for more work to be done in the same amount of time.

To illustrate, consider the gardening example. Two more gardeners are added to the work crew. Rather than assigning all four gardeners to one yard, we can assign the two new gardeners to another yard, effectively increasing our total problem size.

Assuming that the two new gardeners can perform the same amount of work as the original two, and that the two yard sizes are the same, we've doubled the amount of work done in the same amount of time.
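As a sketch of the spreadsheet example above, the snippet below (Python, using the standard concurrent.futures module; the recalc function and the cell values are invented for illustration) splits one list of cell values into n contiguous chunks and applies the same recalculation to each chunk on its own thread:

```python
from concurrent.futures import ThreadPoolExecutor

values = list(range(1000))      # stand-in for the spreadsheet's cells

def recalc(chunk):
    # The same work, applied to different data items.
    return [v * 2 for v in chunk]

n = 4                           # number of worker threads
size = (len(values) + n - 1) // n
chunks = [values[i * size:(i + 1) * size] for i in range(n)]

with ThreadPoolExecutor(max_workers=n) as pool:
    # One chunk per thread; map preserves chunk order.
    partial = list(pool.map(recalc, chunks))

recalculated = [v for part in partial for v in part]
```

Note that in CPython, threads share a global interpreter lock, so this sketch shows the shape of data decomposition rather than a real speedup for pure computation; the same structure applies with processes, or with native threads in other languages.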

Data Flow Decomposition
Many times, when decomposing a problem, the critical issue isn't what tasks should do the work, but how the data flows between the different tasks. In these cases, data flow decomposition breaks up a problem by how data flows between tasks.

The producer/consumer problem is a well-known example of how data flow impacts a program's ability to execute in parallel. Here, the output of one task, the producer, becomes the input to another, the consumer. The two tasks are performed by different threads, and the second one, the consumer, cannot start until the producer finishes some portion of its work.

Using the gardening example, one gardener prepares the tools (that is, he puts gas in the mower, cleans the shears, and performs other similar tasks) for both gardeners to use. No gardening can occur until this step is mostly finished, at which point the true gardening work can begin.

The delay caused by the first task creates a pause for the second task, after which both tasks can continue in parallel. In computer terms, this particular model occurs frequently.

In common programming tasks, the producer/consumer problem occurs in several typical scenarios. For example, programs that must rely on the reading of a file fit this scenario: the results of the file I/O become the input to the next step, which might be threaded.

However, that step cannot begin until the reading is either complete or has progressed sufficiently for other processing to kick off. Another common programming example is parsing: an input file must be parsed, or analyzed semantically, before the back-end activities, such as code generation in a compiler, can begin.

The producer/consumer problem has several interesting dimensions:

1) The dependence created between consumer and producer can cause significant delays if this model is not implemented correctly. A performance-sensitive design seeks to understand the exact nature of the dependence and diminish the delay it imposes. It also aims to avoid situations in which consumer threads are idling while waiting for producer threads.

2) In the ideal scenario, the hand-off between producer and consumer is completely clean, as in the example of the file parser. The output is context-independent and the consumer has no need to know anything about the producer. Many times, however, the producer and consumer components do not enjoy such a clean division of labor, and scheduling their interaction requires careful planning.

3) If the consumer is finishing up while the producer is completely done, one thread remains idle while other threads are busy working away. This issue violates an important objective of parallel processing, which is to balance loads so that all available threads are kept busy. Because of the logical relationship between these threads, it can be very difficult to keep threads equally occupied.
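The hand-off described above can be sketched with a thread-safe queue. In the Python snippet below (the names and the None sentinel convention are choices made for the sketch, not part of the original text), the consumer blocks whenever the producer has not yet supplied an item, which is exactly the dependence discussed in point 1:

```python
import queue
import threading

q = queue.Queue(maxsize=8)   # bounded buffer between the two threads
consumed = []

def producer():
    for item in range(10):
        q.put(item)          # blocks if the buffer is full
    q.put(None)              # sentinel: tell the consumer we're done

def consumer():
    while True:
        item = q.get()       # blocks until the producer supplies an item
        if item is None:
            break
        consumed.append(item * item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```

The bounded queue also illustrates point 3: if the producer is much faster or slower than the consumer, one of the two threads spends its time blocked rather than doing useful work.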

Next, we'll take a look at the pipeline pattern that allows developers to solve the producer/consumer problem in a scalable fashion.

Implications of DifferentDecompositions
Different decompositions provide different benefits. If the goal, for example, is ease of programming and tasks can be neatly partitioned by functionality, then task decomposition is more often than not the winner.

Data decomposition adds some additional code-level complexity to tasks, so it is reserved for cases where the data is easily divided and performance is important.

The most common reason for threading an application is performance. And in this case, the choice of decompositions is more difficult. In many instances, the choice is dictated by the problem domain: some tasks are much better suited to one type of decomposition.

But some tasks have no clear bias. Consider, for example, processing images in a video stream. In formats with no dependency between frames, you'll have a choice of decompositions. Should you choose task decomposition, in which one thread does decoding, another color balancing, and so on, or data decomposition, in which each thread does all the work on one frame and then moves on to the next?

To return to the analogy of the gardeners, the decision would take this form: if two gardeners need to mow two lawns and weed two flower beds, how should they proceed? Should one gardener only mow (that is, they choose task-based decomposition) or should both gardeners mow together then weed together?

In some cases, the answer emerges quickly, for instance when a resource constraint exists, such as only one mower. In others, where each gardener has a mower, the answer comes only through careful analysis of the constituent activities. In the case of the gardeners, task decomposition looks better because the start-up time for mowing is saved if only one mower is in use.

Ultimately, you determine the right answer for your application's use of parallel programming by careful planning and testing. Empirical timing and evaluation play a more significant role in the design choices you make in parallel programming than they do in standard single-threaded programming.

Challenges You'll Face
The use of threads enables you to improve performance significantly by allowing two or more activities to occur simultaneously. However, developers must recognize that threads add a measure of complexity that requires thoughtful consideration to navigate correctly.

This complexity arises from the inherent fact that more than one activity is occurring in the program. Managing simultaneous activities and their possible interaction leads you to confront four types of problems:

1) Synchronization is the process by which two or more threads coordinate their activities. For example, one thread waits for another to finish a task before continuing.

2) Communication refers to the bandwidth and latency issues associated with exchanging data between threads.

3) Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work.

4) Scalability is the challenge of making efficient use of a larger number of threads when software is run on more-capable systems. For example, if a program is written to make good use of four processor cores, will it scale properly when run on a system with eight processor cores?

Each of these issues must be handled carefully to maximize application performance. Subsequent chapters describe many aspects of these problems and how best to address them on multi-core systems.
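To make the first of these four problems concrete, here is a minimal synchronization sketch using Python's standard threading.Event (the thread and variable names are invented for the example): one thread waits for a signal from another before continuing, so the order of the two log entries is guaranteed regardless of how the threads are scheduled.

```python
import threading

ready = threading.Event()
log = []

def worker():
    ready.wait()                 # synchronization: block until signaled
    log.append("worker ran")

t = threading.Thread(target=worker)
t.start()
log.append("main signals")       # happens strictly before the worker's entry
ready.set()                      # release the waiting thread
t.join()
```

Without the Event, the two appends could interleave in either order; the wait/set pair imposes exactly the "one thread waits for another" coordination described in point 1.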

Parallel Programming Patterns
For years, object-oriented programmers have been using design patterns to logically design their applications. Parallel programming is no different than object-oriented programming: parallel programming problems generally fall into one of several well-known patterns.

A few of the more common parallel programming patterns and their relationship to the aforementioned decompositions are shown in Table 3.2 below.

Table 3.2 Common Parallel Programming Patterns

Pattern | Decomposition
Task-Level Parallelism | Task
Divide and Conquer | Task/Data
Geometric Decomposition | Data
Pipeline | Data Flow
Wavefront | Data

In this section, we'll provide a brief overview of each pattern and the types of problems that each pattern may be applied to, including:

1) Task-level Parallelism Pattern. In many cases, the best way to achieve parallel execution is to focus directly on the tasks themselves. In this case, the task-level parallelism pattern makes the most sense. In this pattern, the problem is decomposed into a set of tasks that operate independently.

It is often necessary to remove dependencies between tasks or separate dependencies using replication. Problems that fit into this pattern include the so-called embarrassingly parallel problems, those where there are no dependencies between threads, and replicated data problems, those where the dependencies between threads may be removed from the individual threads.

2) Divide and Conquer Pattern. In the divide and conquer pattern, the problem is divided into a number of parallel sub-problems. Each sub-problem is solved independently. Once each sub-problem is solved, the results are aggregated into the final solution. Since each sub-problem can be independently solved, these sub-problems may be executed in a parallel fashion.

The divide and conquer approach is widely used on sequential algorithms such as merge sort. These algorithms are very easy to parallelize. This pattern typically does a good job of load balancing and exhibits good locality, which is important for effective cache usage.
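A small sketch of this pattern, using the merge sort example just mentioned: the Python snippet below (the two-way split and the use of threads are illustrative choices, not the book's implementation) sorts each half of a list on its own thread and then merges the results.

```python
import heapq
import threading

def parallel_sort(data):
    mid = len(data) // 2
    halves = [data[:mid], data[mid:]]     # divide
    results = [None, None]

    def solve(i):
        results[i] = sorted(halves[i])    # conquer each sub-problem

    threads = [threading.Thread(target=solve, args=(i,))
               for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return list(heapq.merge(*results))    # aggregate the final solution

unsorted = [5, 3, 8, 1, 9, 2, 7]
```

Because the two halves share no data until the final merge, the sub-problems need no synchronization while they run, which is what makes this pattern easy to parallelize.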

3) Geometric Decomposition Pattern. The geometric decomposition pattern is based on the parallelization of the data structures used in the problem being solved. In geometric decomposition, each thread is responsible for operating on data 'chunks'. This pattern may be applied to problems such as heat flow and wave propagation.

4) Pipeline Pattern. The idea behind the pipeline pattern is identical to that of an assembly line. The way to find concurrency here is to break down the computation into a series of stages and have each thread work on a different stage simultaneously.

5) Wavefront Pattern. The wavefront pattern is useful when processing data elements along a diagonal in a two-dimensional grid. This is shown in Figure 3.1 below.

Figure 3.1 Wavefront Data Access Pattern

The numbers in Figure 3.1 above illustrate the order in which the data elements are processed. For example, elements in the diagonal that contains the number “3” are dependent on data elements “1” and “2” being processed previously.

The shaded data elements in Figure 3.1 indicate data that has already been processed. In this pattern, it is critical to minimize the idle time spent by each thread. Load balancing is the key to success with this pattern.
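The diagonal ordering described for Figure 3.1 can be reproduced with a few lines of code. The sketch below (plain Python; the 4x4 grid size is an arbitrary choice for illustration) labels each cell of a 2D grid with the step in which it would be processed: all cells on the same anti-diagonal share a label and are mutually independent, so each diagonal could be divided among threads once the previous diagonal is complete.

```python
rows, cols = 4, 4
order = [[0] * cols for _ in range(rows)]

# Walk the anti-diagonals; cells with i + j == d form diagonal d.
for d in range(rows + cols - 1):
    for i in range(max(0, d - cols + 1), min(rows, d + 1)):
        j = d - i
        order[i][j] = d + 1   # processing step, numbered as in Figure 3.1
```

For this grid the top-left cell is labeled 1, the next diagonal 2, and so on, matching the dependence above: the cells labeled 3 rely only on cells labeled 1 and 2 having been processed.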

For a more extensive and thorough look at parallel programming design patterns, refer to the book Patterns for Parallel Programming (Mattson 2004).

Next in Part 2: A motivating problem – Error Diffusion

Used with the permission of the publisher, Intel Press, this series of two articles is based on copyrighted material from “Multi-Core Programming,” by Shameem Akhter and Jason Roberts. This book can be purchased online.

Shameem Akhter is a platform architect at Intel Corp., focusing on single socket multicore architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, working on a number of different multithreaded software products ranging from desktop and handheld to embedded DSP platforms.
