The promises and challenges of multicore concurrency

Editor’s Note: In this excerpt from Real World Multicore Embedded Systems, author Bryon Moyer provides a basic outline of the principles of concurrency, the fundamental mechanism by which multicore systems manage and coordinate multiple tasks in parallel to achieve higher performance.

The opportunities and challenges that arise from multicore technology — or any kind of multiple processor arrangement — are rooted in the concept of concurrency. You can loosely conceive of this as 'more than one thing happening at a time'. But when things happen simultaneously, it’s very easy for chaos to ensue. If you create an 'assembly line' to make burgers quickly in a fast food joint, with one guy putting the patty on the bun and the next guy adding a dab of mustard, things will get messy if the mustard guy doesn’t wait for a burger to be in place before applying the mustard. Coordination is key, and yet, as obvious as this may sound, it can be extremely challenging in a complex piece of software.

Concurrency fundamentals
It is important first to separate the notion of inherent concurrency from that of implemented parallelization. A given algorithm or process may be full of opportunities for things to run independently of each other. An actual implementation will typically select a specific subset of those opportunities to parallelize and go forward with that.

In our burger-making example, you could make burgers more quickly if you had multiple assembly lines going at the same time. In theory, given an infinite supply of materials, you could make infinitely many burgers concurrently. However, in reality, you only have a limited number of employees and countertops on which to do the work. So you may actually implement, say, two lines even though the process inherently could allow more. In a similar fashion, the number of processors and other resources drives the decision on how much parallelism to implement.

It’s critical to note, however, that a chosen implementation relies on the inherent opportunities afforded by the algorithm itself. No amount of parallelization will help an algorithm that has little inherent concurrency, as we’ll explore later in this chapter.

So what you end up with is a series of program sections that can be run independently punctuated by places where they need to “check in” with each other to exchange data — an event referred to as “synchronization.”

For example, one fast food employee can lay a patty on a bun completely independently from someone else squirting mustard on a different burger. During the laying and squirting processes, the two can be completely independent. However, after they’re done, each has to pass his or her burger to the next guy, and neither can restart with a new burger until a new one is in place. So if the mustard guy is a lot faster than the patty-laying guy, he’ll have to wait idly until the new burger shows up. That is a synchronization point (Figure 1 ).

Figure 1: Where the two independent processes interact is a synchronization point.

A key characteristic here is the fact that the two independent processes may operate at completely different speeds, and that speed may not be predictable. Different employees on different shifts, for example, may go at different speeds. This is a fundamental issue for parallel execution of programs. While there are steps that can be taken to make the relative speeds more predictable, in the abstract, they need to be considered unpredictable. This concept of a program spawning a set of independent processes with occasional check-in points is shown in Figure 2 .

Figure 2: A series of tasks run mutually asynchronously with occasional synchronization points.

Depending on the specific implementation, the independent portions of the program might be threads or processes (Figure 3). At this stage, we’re really not interested in those specifics, so to avoid getting caught up in that detail, they are often generically referred to as “tasks.” In this article, we will focus on tasks.

Figure 3: Tasks can be different threads within a process or different processes.

Two kinds of concurrency
There are fundamentally two different ways to do more than one thing at a time: bulk up so that you have multiple processors doing the same thing, or use division of labor, where different processors do different things at the same time.

Data parallelism. The first of those is the easiest to explain. Let’s say you’ve got a four-entry vector that you want to operate on. Let’s make it really simple for the sake of example and say that you need to increment the value of every entry in the vector. In a standard program, you would do this with a loop:

     for i=1 to 4 {
        increment the ith value
     }

This problem is exceedingly easy to parallelize. In fact, it belongs to a general category of problems whimsically called 'embarrassingly parallel' (Figure 4). Each vector entry is completely independent and can be incremented completely independently. Given four processors, you could easily have each processor work on one of the entries and do the entire vector in 1/4 the time it takes to do it on a single processor.

Figure 4: Embarrassingly parallel computation.

In fact, in this case, it would probably take even less than 1/4 of the time, because you no longer need an iterator — the i in the pseudocode above; you no longer have to increment i each time and compare it to 4 to see if you’re done (Figure 5). This is referred to as data parallelism; multiple instances of data can be operated on at the same time. The inherent concurrency allows a four-fold speed-up, although a given implementation might choose less if fewer than four processors are available.
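
To make the data-parallel version concrete, here is a minimal sketch in C of how the four increments might be farmed out to four POSIX threads. The vector contents, thread count, and function names are illustrative, not something prescribed by the text; the point is simply that the workers never need to coordinate with each other until the final join.

     #include <pthread.h>

     #define N 4

     static int vec[N] = {10, 20, 30, 40};   /* illustrative data */

     /* Each worker increments exactly one entry; since no entry depends on
        any other, the four threads never need to synchronize while running. */
     static void *increment_entry(void *arg)
     {
         int i = *(int *)arg;
         vec[i] = vec[i] + 1;
         return NULL;
     }

     int main(void)
     {
         pthread_t workers[N];
         int index[N];

         for (int i = 0; i < N; i++) {
             index[i] = i;
             pthread_create(&workers[i], NULL, increment_entry, &index[i]);
         }

         /* The only synchronization point is waiting for all workers to finish. */
         for (int i = 0; i < N; i++)
             pthread_join(workers[i], NULL);

         return 0;
     }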

Figure 5: Looping in a single core takes more cycles than multicore.

Two key attributes of this problem make it easy to parallelize: (1) the operation being performed on one entry doesn’t depend on any other entry, and (2) the number of entries is known and fixed.

That second one is important. If you’re trying to figure out how to exploit concurrency in a way that’s static — in other words, you know exactly how the problem will be parallelized at compile time — then the number of loop iterations must be known at compile time. A 'while' loop, or a 'for' loop whose endpoint is calculated rather than constant, cannot be so neatly parallelized because, for any given run, you don’t know how many parallel instances there might be.

Functional parallelism. The other way of splitting things up involves giving different processors different things to do. Let’s take a simple example where we have a number of text files and we want to cycle through them to count the number of characters in each one. We could do this with the following pseudo-program:

     for each file {
        Open the file
        Count the characters
        Close the file
     }

We can take three processors and give each of them a different task. The first processor opens files; the second counts characters; and the third closes files (Figure 6).

Figure 6: Different cores performing different operations.

There is a fundamental difference between this and the prior example of data parallelism. In the vector-increment example, we took a problem that had been solved by a loop and completely eliminated the loop. In this new example, because of the serial nature of the three tasks, if you only had one loop iteration, then there would be no savings at all. It only works if you have a workload involving repeated iterations of this loop.

As illustrated in Figure 7, when the first file is opened, the second and third processors sit idle. After one file is open, then the second processor can count the characters, while the third processor is still idle. Only when the third file is opened do all processors finally kick in, as the third processor closes the first file. This leads to the descriptive term 'pipeline' for this kind of arrangement, and, when executing, it doesn’t really hit its stride until the pipeline fills up. This is also referred to as 'loop distribution' because the duties of one loop are distributed into multiple loops, one on each processor.

This figure also illustrates the fact that using this algorithm on only one file provides no benefit whatsoever.
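
As one possible illustration of this pipeline (not taken from the book), the sketch below distributes the open/count/close loop across three POSIX threads, using ordinary pipes as the FIFOs that hand each file descriptor from one stage to the next. The file names and buffer size are invented for the example.

     #include <fcntl.h>
     #include <pthread.h>
     #include <stdio.h>
     #include <unistd.h>

     /* Hypothetical file list standing in for "a number of text files". */
     static const char *files[] = { "a.txt", "b.txt", "c.txt" };
     #define NFILES 3

     static int open_to_count[2];    /* pipe: opener  -> counter */
     static int count_to_close[2];   /* pipe: counter -> closer  */

     /* Stage 1: open each file and hand its descriptor downstream. */
     static void *opener(void *arg)
     {
         (void)arg;
         for (int i = 0; i < NFILES; i++) {
             int fd = open(files[i], O_RDONLY);
             write(open_to_count[1], &fd, sizeof fd);
         }
         return NULL;
     }

     /* Stage 2: count the characters in each file it receives. */
     static void *counter(void *arg)
     {
         (void)arg;
         char buf[4096];
         for (int i = 0; i < NFILES; i++) {
             int fd;
             read(open_to_count[0], &fd, sizeof fd);
             long count = 0;
             ssize_t n;
             while (fd >= 0 && (n = read(fd, buf, sizeof buf)) > 0)
                 count += n;
             printf("%s: %ld characters\n", files[i], count);
             write(count_to_close[1], &fd, sizeof fd);
         }
         return NULL;
     }

     /* Stage 3: close each file once it has been counted. */
     static void *closer(void *arg)
     {
         (void)arg;
         for (int i = 0; i < NFILES; i++) {
             int fd;
             read(count_to_close[0], &fd, sizeof fd);
             if (fd >= 0)
                 close(fd);
         }
         return NULL;
     }

     int main(void)
     {
         pthread_t t1, t2, t3;
         pipe(open_to_count);
         pipe(count_to_close);

         pthread_create(&t1, NULL, opener,  NULL);
         pthread_create(&t2, NULL, counter, NULL);
         pthread_create(&t3, NULL, closer,  NULL);

         pthread_join(t1, NULL);
         pthread_join(t2, NULL);
         pthread_join(t3, NULL);
         return 0;
     }

Just as in Figure 7, the second and third threads sit idle until work reaches them; with a single file, the three stages simply run one after another and nothing is gained.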

Real-world programs and algorithms typically have both inherent data and functional concurrency. In some situations, you can use both. For example, if you had six processors, you could double the three-processor pipeline to work through the files twice as fast. In other situations, you may have to decide whether to exploit one or the other in your implementation.

Figure 7: The pipeline isn’t full until all cores are busy.

One of the challenges of a pipeline lies in what’s called balancing the pipeline. Execution can only go as fast as the slowest stage. In Figure 7, opening files is shown as taking longer than counting the characters. In that situation, counting faster will not improve performance; it will simply increase the idle time between files.

The ideal situation is to balance the tasks so that every pipeline stage takes the same amount of time; in practice, this is so difficult as to be more or less impossible. It becomes even harder when different iterations take more or less time. For instance, it will presumably take longer to count the characters in a bigger file, so really the times for counting characters above should vary from file to file. Now it’s completely impossible to balance the pipeline perfectly.

Dependencies
One of the keys to the simple examples we’ve shown is the independence of operations. Things get more complicated when one calculation depends on the results of another. And there are a number of ways in which these dependencies crop up. We’ll describe some basic cases here, but a complete theory of dependencies can be quite intricate.

It bears noting here that this discussion is intended to motivate some of the key challenges in parallelizing software for multicore. In general, one should not be expected to manually analyze all of the dependencies in a program in order to parallelize it; tools become important for this. For this reason, the discussion won’t be exhaustive, and will show concept examples rather than focusing on practical ways of dealing with dependencies.

Producers and consumers. Dependencies are easier to understand if you think of a program as consisting of producers and consumers of data (Figure 8). Some part of the program does a calculation that some other part will use: the first part is the producer and the second is the consumer. This happens at very fine-grained instruction levels and at higher levels, especially if you are taking an object-oriented approach — objects are also producers and consumers of data.

Figure 8: Producers and consumers at the fine- and coarse-grained level. Entities are often both producers and consumers.

At its most basic, a dependency means that a consumer of data must wait to consume its data until the producer has produced the data (Figure 9). The concept is straightforward, but the implications vary depending on the language and approach taken. At the instruction level, many compilers have been designed to exploit low-level concurrency, doing things like instruction reordering to make execution more efficient while making sure that no dependencies are violated.

Figure 9: A consumer cannot proceed until it gets its data from the producer.
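
A tiny, hypothetical C fragment makes the point at the instruction level: the second statement below consumes what the first produces, so neither a compiler nor a second core may swap them, while the pair underneath has no such relationship and can be reordered or overlapped freely.

     /* A minimal illustration (not from the original text) of fine-grained
        producer/consumer dependencies; the variable names are hypothetical. */
     int example(int x, int y, int p, int q)
     {
         int a = x * y;   /* producer: creates a                          */
         int b = a + 1;   /* consumer: cannot execute until a is produced */

         int c = p * q;   /* neither of these depends on a or b, so a     */
         int d = p - q;   /* compiler (or a second core) is free to       */
                          /* overlap them with the pair above             */
         return b + c + d;
     }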

It gets more complicated with languages like C that allow pointers. The concept is the same, but compilers have no way of understanding how various pointers relate, and so can’t do any optimization. There are two reasons why this is so: pointer aliasing and pointer arithmetic.

Pointer aliasing is an extremely common occurrence in a C program. If you have a function that takes a pointer to, say, an image as a parameter, that function may name the pointer imagePtr. If a program needs to call that function on behalf of two different images, say leftImage and rightImage, then when the function is called with leftImage as the parameter, leftImage and imagePtr will refer to the same data. When called for rightImage, rightImage and imagePtr will point to the same data (Figure 10).

Figure 10: Different pointers may point to the same locations at different times.

This is referred to as aliasing because a given piece of data may be accessed by variables of different names at different times. There’s no way to know statically what the dependencies are, not only because the names look completely different, but also because they may change as the program progresses. Thorough dynamic analysis is required to understand the relationships between pointers.
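
A small C sketch of the situation just described may help; the Image type and the sharpen() function are hypothetical, but the pointer names follow the text.

     #include <stddef.h>

     typedef struct {
         unsigned char pixels[64];   /* illustrative image payload */
     } Image;

     /* Inside this function, imagePtr is just another name (an alias) for
        whatever image the caller passed in. */
     static void sharpen(Image *imagePtr)
     {
         for (size_t i = 0; i < sizeof imagePtr->pixels; i++)
             imagePtr->pixels[i] += 1;   /* stand-in for real processing */
     }

     int main(void)
     {
         Image leftImage  = {{0}};
         Image rightImage = {{0}};

         sharpen(&leftImage);    /* here imagePtr aliases leftImage  */
         sharpen(&rightImage);   /* here imagePtr aliases rightImage */
         return 0;
     }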

Pointer arithmetic can also be an obvious problem because, even if you know where a pointer starts out, manipulating the actual address being pointed to can result in the pointer pointing pretty much anywhere (including address 0, which any C programmer has done at least once in his or her life). Where it ends up pointing may or may not correlate to a memory location associated with some other pointer (Figure 11).

Figure 11: Pointer arithmetic can cause a pointer to refer to some location in memory that may or may not be pointed to by some other pointer.

For example, when scanning through an array with one pointer to make changes, it may be very hard to understand that some subsequent operation, where a different pointer scans through the same array (possibly using different pointer arithmetic), will read that data (Figure 12). If the second scan consumes data that the first scan was supposed to put into place, then parallelizing those as independent will cause the program to function incorrectly. In many cases, this dependency cannot be identified by static inspection; the only way to tell is to notice at run time that the pointers address the same space.

Figure 12: Two pointers operating on the same array create a dependency that isn’t evident by static inspection.
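
Here is a minimal C sketch of the two-scan situation in Figure 12. The array length and the arithmetic are invented for illustration; the point is that nothing in the source text of the two loops reveals that writer and reader walk the same memory.

     #define LEN 16

     int main(void)
     {
         int data[LEN] = {0};
         int *writer = data;      /* first pointer: produces values */
         int *reader = data;      /* second pointer: consumes them  */

         /* First scan: write each element. */
         for (int i = 0; i < LEN; i++)
             *writer++ = i * i;

         /* Second scan: read what the first scan produced.  Statically,
            'reader' and 'writer' look unrelated; only at run time is it
            clear that they traverse the same array, so these two loops
            cannot safely be parallelized as independent tasks. */
         long sum = 0;
         for (int i = 0; i < LEN; i++)
             sum += *reader++;

         return (int)sum;
     }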

These dependencies are based on a consumer needing to wait until the producer has created the data: writing before reading. The opposite situation also exists: if a producer is about to rewrite a memory location, you want to be sure that all consumers of the old data are done before you overwrite the old data with new data (Figure 13). This is called an 'anti-dependency'. Everything we’ve discussed about dependencies also holds for anti-dependencies, except that this is about waiting to write until all the reads are done: reading before writing.

Figure 13: The second pointer must wait before overwriting data until the first pointer has completed its read, creating an anti-dependency.

Loops and dependencies
Dependencies become more complex when loops are involved — and in programs being targeted for parallelization, loops are almost always involved. We saw above how an embarrassingly parallel loop can be parallelized so that each processor gets one iteration of the loop. Let’s look at the code below, which is slightly different from that example.

     for i=1 to 4 {
        add the (i-1)th value to the ith value
     }

Note that in this and all examples like this, I’m ignoring what happens for the first iteration, since that detail isn’t critical for the discussion.

This creates a subtle change because each loop iteration produces a result that will be consumed in the next loop iteration. So the second loop iteration can’t start until the first iteration has produced its data. This means that the loop iterations can no longer run exactly in parallel: each of these parallel iterations is offset from its predecessor (Figure 14).

Figure 14: Even though iterations are parallelized, each must wait until its needed data is produced by the prior iteration, causing offsets that increase overall computation time above what would be required for independent iterations.
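
Written out in C (with illustrative data, and ignoring the first-iteration boundary case just as the text does), the loop-carry dependency looks like this:

     #define N 4

     int main(void)
     {
         int v[N] = {1, 2, 3, 4};

         /* Each iteration consumes the value produced by the previous one,
            so iteration i cannot begin until iteration i-1 has updated
            v[i-1].  That is what forces the offsets shown in Figure 14. */
         for (int i = 1; i < N; i++)
             v[i] = v[i] + v[i - 1];

         return v[N - 1];
     }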

While the total computation time is still less than required to execute the loop on a single processor, it’s not as fast as if there were no dependencies between the loop iterations. Such dependencies are referred to as 'loop-carry' (or 'loop-carried') dependencies. It gets even more complicated when you have nested loops iterating across multiple iterators, as shown in the code snippet below.

Let’s say you’re traversing a two-dimensional matrix using i to scan along a row and using j to scan down the rows (Figure 15 ).

Figure 15: A 4 X 4 array with i iterating along a row (inner loop) and j iterating down the rows (outer loop).

And let’s assume further that a given cell depends on the new value of the cell directly above it (Figure 16 ):

     for j=1 to 4 {
        for i=1 to 4 {
           add the (i, j-1)th value to the (i, j)th value
        }
     }
       

Figure 16: Each cell gets a new value that depends on the new value in the cell in the prior row.

First of all, there are lots of ways to parallelize this code, depending on how many cores we have. If we were to go as far as possible, we would need 16 cores, since there are 16 cells. Or, with four cores, we could assign one row to each core. If we did the latter, then we couldn’t start the second row until the first cell of the first row was calculated (Figure 17).

Figure 17: If each row gets its own core, then each row must wait until the first cell in the prior row is done before starting.

If we completely parallelized it, then we could start all of the first-row entries at the same time, but the second-row entries would have to wait until their respective first-row entries were done (Figure 18).

Figure 18: An implementation that assigns each cell to its own core.

Note that using so many cores really doesn’t speed anything up: using only four cores would do just as well, since only four cores would be executing at any given time (Figure 19).

Figure 19: Four cores can implement this loop in the same time as 16.

This implementation assigns one column to each core, instead of one row as is done in Figure 17. As a result, the loop can be processed faster because no core has to wait for any other core. There is no way to parallelize this set of nested loops any further because of the dependencies.

Such nested loops give rise to the concept of 'loop distance'. Each iterator gets a loop distance. So in the above example, in particular as shown in Figure 16, where the arrows show the dependency, the loop distance for i is 0, since there is no dependency; the loop distance for j is 1, since the data consumed in one cell depends on the cell directly above it, which is the prior j iteration. As a 'vector', the loop distance for i and j is [0,1].

If we changed the code slightly to make the dependency on j-2 instead of j-1:

     for j=1 to 4 {
        for i=1 to 4 {
           add the (i, j-2)th value to the (i, j)th value
        }
     }
   

Then the loop distance for j is 2, as is shown in Figure 20.

Figure 20: Example showing j loop distance of 2.

This means that the second row doesn’t have to wait for the first row, since it no longer depends on the first row. The third row, however, does have to wait for the first row (Figure 21). Thus we can parallelize further with more cores, if we wish, completing the task in half the time required for the prior example.

Figure 21: With loop distance of 2, two rows can be started in parallel.

While it may seem obscure, the loop distance is an important measure for synchronizing data. It’s not a matter of one core producing data and the other immediately consuming it; the consuming core may have to wait a number of iterations before consuming the data, depending on how things are parallelized. While it’s waiting, the producer continues with its iterations, writing more data. Such data can be, for example, written into some kind of first-in/first-out (FIFO) memory, and the loop distance determines how long that FIFO has to be.

Let’s take the prior example and implement it with only four cores instead of eight, as shown in Figure 22 .

Figure 22: A four-core implementation with loop distance [0,2].

Let’s look at Core 1. When it’s done with cell [1,1], it must move on to cell [1,2]. But cell [1,3] needs the result from [1,1]. Strictly speaking, this is an anti-dependency: the [1,1] result must be kept around until [1,3] reads it. Depending on how we implement things, cell [1,2] might destroy the result.

Now, as shown above, we can really just implement this as an array in each core, keeping all the results separate. But in some multicore systems, the operating system will determine which cores get which threads, and if each cell is spawned as a thread, then things could be assigned differently. For example, the first two cores might exchange the last two cells (Figure 23).

Figure 23: The first two of four cores, with a different assignment of cells.

Now Core 1 has to hand its results to Core 2 (and vice versa, not illustrated to avoid clutter). The solution is for Core 1 to put the result of [1,1] somewhere for safekeeping until [1,3] is ready for it. Then [1,2] can proceed, and Core 2 can pick up the result it needs when it’s ready. But the [1,2] result will also be ready before Core 2 is ready for the [1,1] result. So the [1,2] result can’t just be put in the same place as the [1,1] result or it will overwrite it.

One solution, at the risk of getting into implementation details, is to use some kind of FIFO structure between Core 1 and Core 2 (Figure 24). Because the loop distance for j is 2, the FIFO needs to be at least 2 deep to avoid stalling things. Additionally, by using a FIFO instead of trying to hard-code an array implementation, the solution is robust against any arbitrary thread assignments that the operating system may make.

Figure 24: FIFO used to communicate results between cores. The minimum FIFO size is related to the loop distance.
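
As a rough sketch of what such a FIFO might look like (the names, the depth, and the single-producer/single-consumer restriction are assumptions made for illustration, not a prescribed implementation), a small ring buffer in C can be sized directly from the loop distance:

     #include <stdatomic.h>
     #include <stdbool.h>

     #define FIFO_DEPTH 2                    /* matches the j loop distance of 2  */
     #define FIFO_SLOTS (FIFO_DEPTH + 1)     /* one spare slot to tell full/empty */

     /* A statically allocated fifo_t starts out empty (head == tail == 0). */
     typedef struct {
         int         slots[FIFO_SLOTS];
         atomic_uint head;    /* next slot the consumer will read  */
         atomic_uint tail;    /* next slot the producer will write */
     } fifo_t;

     /* Producer core: returns false (caller retries) if the FIFO is full. */
     static bool fifo_put(fifo_t *f, int value)
     {
         unsigned tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
         unsigned next = (tail + 1) % FIFO_SLOTS;
         if (next == atomic_load_explicit(&f->head, memory_order_acquire))
             return false;                              /* full */
         f->slots[tail] = value;
         atomic_store_explicit(&f->tail, next, memory_order_release);
         return true;
     }

     /* Consumer core: returns false if nothing has been produced yet. */
     static bool fifo_get(fifo_t *f, int *value)
     {
         unsigned head = atomic_load_explicit(&f->head, memory_order_relaxed);
         if (head == atomic_load_explicit(&f->tail, memory_order_acquire))
             return false;                              /* empty */
         *value = f->slots[head];
         atomic_store_explicit(&f->head, (head + 1) % FIFO_SLOTS,
                               memory_order_release);
         return true;
     }

In this sketch, Core 1 would call fifo_put() for each result it produces and Core 2 would call fifo_get() when it reaches the cell that consumes it; because the j loop distance is 2, a depth of 2 is the minimum that lets the producer keep going without stalling.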

FIFOs are sometimes thought to be expensive, depending on how they are implemented. The intent here isn’t to focus on the details of the FIFO, but rather to illustrate its relationship to the loop distance.

Manual determination of loop distance can, frankly, be quite confusing. In fact, the body of a loop may have numerous variables, each with different loop distances. Branches further complicate things. The existence of tools to handle this will be covered in a subsequent chapter. Because of these tools, we will not delve further into the intricacies, but rather leave the discussion here as a motivation of the concept of loop distance as it shows up in tools.

Shared resources
The second major challenge that concurrent tasks present is the fact that different tasks may need to access the same resources at the same time. For the most part, the challenges are exactly the same as those presented by a multi-threaded program on a single-core system. The use of critical sections and locks and their ilk proceeds exactly as before.

However, the implementations of solutions that work for single-core systems may not work for multicore systems. For example, one simple brute-force way to block any other thread from interrupting a critical section of code is to suspend interrupts while within that critical section. While that might work for a single core, it doesn’t work if there is another core that could be accessing (or corrupting) a shared memory location.
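
On a multicore system, a critical section needs something that actually excludes the other cores, such as a lock. The sketch below is one possible illustration (hypothetical names, and it assumes C11 atomics are available): it guards a shared counter with a simple spinlock rather than by masking interrupts.

     #include <stdatomic.h>

     static atomic_flag lock = ATOMIC_FLAG_INIT;   /* simple spinlock      */
     static long shared_counter;                   /* data shared by cores */

     static void critical_update(void)
     {
         /* Spin until this core wins the lock; acquire ordering makes the
            protected data visible once the lock is held. */
         while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
             ;                                     /* busy-wait */

         shared_counter++;                         /* critical section */

         atomic_flag_clear_explicit(&lock, memory_order_release);
     }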

The other new concept that multicore adds is the fact that each core has its own cache, and global data replicated in the cache may at times be out of sync with the latest version. And this gets complex because cache coherency strategies can themselves be complex, and different platforms will have different schemes.

So, while a programmer can ignore the cache on a single-core system, that’s no longer possible for multicore, and, as we’ll see in subsequent chapters, the handling of synchronization may depend on the caching strategy of the platform.
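
As a small preview of that, one common pattern is to publish data through an atomic flag with release/acquire ordering and let the platform’s coherency mechanism make the data visible to the other core; on a non-coherent platform, explicit cache maintenance (not shown) would also be needed. This is only a sketch under the assumption that C11 atomics are available, and the names are illustrative.

     #include <stdatomic.h>
     #include <stdbool.h>

     static int payload;                 /* data produced on one core       */
     static atomic_bool ready;           /* zero-initialized: not yet ready */

     /* Producer core */
     static void publish(int value)
     {
         payload = value;
         atomic_store_explicit(&ready, true, memory_order_release);
     }

     /* Consumer core */
     static int consume(void)
     {
         while (!atomic_load_explicit(&ready, memory_order_acquire))
             ;                           /* wait for the producer    */
         return payload;                 /* sees the published value */
     }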

Summary
All of the challenges of multicore computing arise from concurrency, the fact that different things may happen at the same time. If we’re used to events occurring in a prescribed order, then it can require a bit of mental gymnastics to get used to the idea that two operations in two different parallel threads may happen in any order with respect to each other.

Bryon Moyer is a technology writer and editor who has over 30 years’ experience as an engineer and marketer in Silicon Valley, having worked for MMI, AMD, Cypress, Altera, Actel, Teja Technologies, and Vector Fabrics. He has focused on PLDs/FPGAs, EDA, embedded systems, multicore processing, networking protocols, and software analysis. He has a BSEE from UC Berkeley and an MSEE from Santa Clara University.

Used with permission from Newnes, an imprint of Elsevier, Copyright 2013, this article was excerpted from Real World Multicore Embedded Systems by Bryon Moyer.
