Multicore platforms are becoming prevalent, and someone needs to program them. Initial multicore experiences for most embedded programmers will be with coherent shared memory systems. Compared to single-core systems, these shared memory systems are much more challenging to program correctly.
Nevertheless, with an incremental development-and-test approach to parallelism and a willingness to apply lessons learned by previous parallel programmers, successful systems are being deployed today using existing C/C++ environments.
It's too Hot
The free lunch is over for programmers. Though Moore's law marches on, and the number of economically manufacturable transistors per chip continues increasing, clock frequencies have hit a wall because of power dissipation. It's gotten too hot to handle.
Instead of increasing the clock frequency, designers can use larger transistor budgets to do more work per clock cycle. Within a single processor pipeline, techniques such as instruction-level parallelism, hardware threads, and data-parallel (SIMD) instructions have reached the point of diminishing returns.
It now makes more hardware sense to add multiple processor cores on chip and turn to task-level parallelism. It's left to software engineers to properly exploit these multicore architectures.
Multicore systems (Figure 1, below) are typically characterized by the number and type of cores, memory organization, and interconnection network. From a programming model perspective, it is useful to consider the memory architecture first.
|Figure 1: Multicore Architectures|
Memory architectures can be broadly classified as shared or distributed. In a typical shared memory system, all cores uniformly share the same memory. Cores share information by accessing the same memory locations.
Lightweight threads, defined as multiple instruction streams sharing the same memory space, are a natural abstraction for a shared memory programming model. The programming model is familiar to multithreading programmers of single-core systems. Vendors in both desktop/server and embedded markets offer coherent shared memory systems, so there are a growing number of shared memory platforms available to programmers.
In a typical distributed memory system, memory units are closely coupled to their cores. Each core manages its own memory, and cores communicate information by sending and receiving data between them. Processes running on different cores and sharing data through message passing are a common abstraction for a distributed memory programming model.
In shared memory systems, data communication is implicit; data is shared between threads simply by accessing the same memory location. If the cores use cache memories, their view of main memory must be kept coherent between them.
As the number of cores increases, the cost of maintaining coherence between caches rises quickly, so it is unlikely this architecture will scale effectively to hundreds of cores.
With distributed memory architectures, however, the hardware design scales relatively easily. Since memory is not shared, the programmer must explicitly describe inter-core communication, and interconnection network performance becomes important.
Driven by the advantages of matching multiple execution pipelines to shared memory, it's probable that a hybrid on-chip architecture (Figure 2, below) will emerge as the number of cores per chip increases. This architecture is already in use at the board level to connect clusters of shared memory chips.
|Figure 2: Hybrid Distributed Shared Memory Architecture|
It is likely that most programmers' initial multicore experience will involve some type of shared memory platform. Though the programming model appears straightforward, these systems are notoriously difficult to program correctly.
Why so Hard?
For a fixed number of processors, Amdahl's law states that the maximum speedup is limited by the percentage of a program which cannot be parallelized:

    Speedup = 1 / (Ts + (1 - Ts) / N)

where N is the number of processors and Ts represents the percentage of work which cannot be parallelized.
Consider a system with N = 4 and Ts = 20%. The maximum achievable speedup is 2.5x. Even if the number of processors goes to infinity, the maximum speedup is only 5x, limited entirely by the amount of work which must remain serialized.
So there is significant pressure to maximize the proportion of activity running in parallel. This may involve significant rework of existing code into parallel form or switching to alternative, less familiar algorithms which offer more inherent parallelism.
Expressing parallelism requires using new and unfamiliar APIs, leading programmers to introduce bugs while mastering the interfaces. More significantly, parallelism means multiple execution flows.
In a shared memory model, any execution flow can access any memory location, so there are many possible orderings of memory accesses. It's left to the programmer to ensure that all orderings which can occur will result in correct operation.
When multiple threads access the same memory location, and at least one thread is writing to that location, the resultant data will depend on the interleaved execution order of the threads.
This race condition leads to unpredictable results which may be masked by external effects such as the thread scheduling policy. Proper operation requires forcing a partial execution order by using mutual exclusion locks or critical sections.
These mechanisms ensure that a thread has exclusive access to a memory location to guarantee the proper ordering of reads and writes. Unfortunately, every lock must be explicitly specified by the programmer.
Sequential code is often filled with many order dependencies, which are implicitly satisfied by the serialized execution of the program. Miss any dependency in the parallel scenario and a latent bug has been created.
The improper use of locks can also cause program execution to deadlock. If two threads attempt to lock two memory locations in opposite order, each thread will obtain one lock but stall forever waiting for the opposite one. A chain of data-dependent locking relationships can make this difficult to catch.
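A common defense is to impose a single global lock ordering. The sketch below (hypothetical `move_*` functions and balances, assumed for illustration) shows two operations that touch the same pair of locks; because both acquire `lock_a` before `lock_b` regardless of which datum they touch first, the circular wait cannot form:

```c
#include <pthread.h>

/* Two threads taking the same two locks in opposite orders can
 * deadlock.  A fixed global order -- always lock_a before lock_b --
 * removes the circular wait. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int balance_a = 100, balance_b = 0;

static void move_a_to_b(int amount)
{
    pthread_mutex_lock(&lock_a);           /* global order: a first */
    pthread_mutex_lock(&lock_b);
    balance_a -= amount;
    balance_b += amount;
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

static void move_b_to_a(int amount)
{
    pthread_mutex_lock(&lock_a);           /* same order, NOT b first */
    pthread_mutex_lock(&lock_b);
    balance_b -= amount;
    balance_a += amount;
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}
```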
To be more conservative, a programmer can restrict the order of execution through the liberal use of locks. Besides increasing the chance of deadlock, heavy-handed locking raises the amount of serialization which, when considering the locking overhead, quickly reduces the performance of the parallel program to less than that of the original sequential version.
From a testing perspective, the quality of tests for sequential code can be measured by path coverage. There are a significant number of possible paths, and stimulating paths deep in the code can be difficult.
The number of composite paths for parallel threads increases exponentially, and explicitly controlling those paths may be impossible. Experience from the world of hardware verification suggests that concurrent execution requires extensive time and resources dedicated to testing.
Nevertheless, shared memory systems are being successfully deployed.
Parallelize Incrementally, Test Frequently
With all these potential pitfalls, it seems appropriate to adopt a "first, do no harm" mentality. The greatest challenge is usually not in finding the parallelism, but in exploiting it in a correct and efficient manner.
An initial methodology starts with sequential code and incrementally adds parallelism, testing the changes at each step of the way. The original code base should be functionally correct with a good test suite. The test suite should contain thorough black-box system tests which exercise the code without knowledge of how the functionality is implemented.
It should also include a profiling suite which pinpoints which portions of the code require the most execution time over a range of operating conditions. Considering Amdahl's law, it is crucial to focus first on areas of the code base which account for most of the execution time.
It's important to understand the original implementation. Leverage profiling and analysis tools to understand the ordering dependencies in the code that restrict how parallelism can be applied. Two types of dependencies to consider are read-after-write (RAW) and write-after-read (WAR).
RAW dependencies represent true data dependencies. In this case, a value cannot be read before it is written. This is satisfied in the original code by the sequential statement ordering, but in parallel code, the read must come after the write, and no other write to the same location can occur between them. A write lock can be used to guarantee this ordering.
WAR dependencies are also known as anti-dependencies. In this case, a memory location cannot be written until after the previous value has been read. These dependencies often occur in sequential programs when memory is reused in different parts of a program to save space. They can be overcome by duplicating or buffering storage resources to account for overlapped execution.
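Storage duplication can be sketched concretely. Assume a hypothetical in-place smoothing pass that overwrites its input buffer as it goes (a WAR hazard); giving it separate `src`/`dst` buffers removes the hazard, so different rows can then be handed to different threads in any order:

```c
#define N 256

/* Duplicated storage: the sequential version reused one buffer for
 * input and output (WAR dependency).  With distinct buffers, every
 * element of 'in' is read-only and every element of 'out' is written
 * exactly once, so iterations are independent. */
static int src[N];
static int dst[N];   /* duplicated buffer breaks the WAR hazard */

static void smooth_row(const int *in, int *out, int n)
{
    out[0] = in[0];                  /* pass edges through unchanged */
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;  /* reads only 'in' */
}
```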
On the practical side, it is important to understand the target platform. It makes sense to match the number of runnable threads to the number of execution pipelines.
For example, if running on a quad-core platform where each core supports two hardware threads, eight runnable threads is a good target. Increasing the number of threads far beyond that incurs overhead for thread startup, switching, and shutdown, eventually degrading overall performance instead of accelerating it.
Data layout, which is critical in single-core systems, is even more critical in shared memory architectures. Separate threads should access non-overlapping memory locations as much as possible to minimize contention between individual core caches; otherwise, expect to pay a high price to maintain coherency.
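One common layout tactic is padding per-thread data so neighboring items never share a cache line ("false sharing"). The sketch below assumes a 64-byte line size, which is typical but platform-dependent:

```c
#include <stddef.h>

/* Per-thread counters packed next to each other land in the same
 * cache line, so logically independent writes still bounce the line
 * between core caches.  Padding each counter out to an (assumed)
 * 64-byte line keeps each thread's data on its own line. */
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* keep neighbors off this line */
};

static struct padded_counter counters[8]; /* one per worker thread */
```

The trade-off is wasted space for reduced coherency traffic; profiling should confirm the contention exists before paying for the padding.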
Fortunately, significant work has been done in identifying best practices for decomposing programs into parallel regions. These patterns involve decomposing a problem in ways which maximize parallelism while satisfying data dependencies.
(See Patterns for Parallel Programming by Mattson, Sanders, and Massingill for a comprehensive collection of parallel programming patterns applied at various levels of abstraction. A lot of insight can be gained by leveraging the experiences of others in similar situations.)
Understanding the original program, the target platform, and best practices all influence the refactoring of code from a sequential to a proper parallel implementation. Each incremental change that increases parallelism also has the potential to introduce bugs, so take small steps and test frequently.
Example: Sobel Edge Detection
An implementation of the Sobel edge detection algorithm is used as a simple example to demonstrate different approaches for sequential-to-parallel code transformation.
The Sobel algorithm approximates the two-dimensional gradient of the image at each point. The magnitude of the gradient represents how quickly the image is changing. A large value indicates a likely edge transition and shows up as white on the transformed image, as shown for the standard grayscale Lena (Figure 3, below).
|Figure 3: Grayscale Lena|
The gradient is approximated by first estimating the horizontal and vertical derivatives using two 3×3 convolution kernels:

    Gx = | -1  0  +1 |        Gy = | -1  -2  -1 |
         | -2  0  +2 |             |  0   0   0 |
         | -1  0  +1 |             | +1  +2  +1 |
The magnitude is then approximated by summing the absolute values of the two gradient components:

    |G| = |Gx| + |Gy|
Listing 1, below, is a C-language implementation of the Sobel function.
|Listing 1: Original Sobel Function|
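The article's listing is not reproduced here, but a hypothetical sketch of such a function (my own names and clamping choices, not the original code) applies the two 3×3 kernels at each interior pixel and sums the absolute gradients:

```c
#include <stdlib.h>

/* Illustrative Sobel pass over an 8-bit grayscale image.  For each
 * interior pixel, estimate horizontal (gx) and vertical (gy)
 * derivatives with the 3x3 kernels, then output |gx| + |gy|,
 * clamped to 255.  Border pixels are left untouched. */
static void sobel(const unsigned char *in, unsigned char *out,
                  int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            const unsigned char *p = in + y * width + x;
            int gx = -p[-width - 1] + p[-width + 1]
                     - 2 * p[-1]    + 2 * p[1]
                     - p[width - 1] + p[width + 1];
            int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
                     + p[width - 1] + 2 * p[width]  + p[width + 1];
            int mag = abs(gx) + abs(gy);
            out[y * width + x] = (mag > 255) ? 255 : (unsigned char)mag;
        }
    }
}
```

Note that every output pixel depends only on the input buffer, so with separate input and output buffers the rows are independent, which is what makes this function a good parallelization candidate.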
The Sobel function is sensitive to noise, so it is typical to run the image through a smoothing filter first. A linear 5×5 filter is shown in Listing 2, below.
|Listing 2: Original Filter Function|
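Again the original listing is not reproduced; as a stand-in for a linear 5×5 smoothing filter, a simple 5×5 box average over each interior pixel (hypothetical code, uniform weights assumed) looks like this:

```c
/* Illustrative 5x5 smoothing pass: each output pixel is the mean of
 * its 5x5 neighborhood (uniform weights assumed for simplicity).
 * A two-pixel border is left untouched. */
static void smooth5x5(const unsigned char *in, unsigned char *out,
                      int width, int height)
{
    for (int y = 2; y < height - 2; y++) {
        for (int x = 2; x < width - 2; x++) {
            int sum = 0;
            for (int dy = -2; dy <= 2; dy++)
                for (int dx = -2; dx <= 2; dx++)
                    sum += in[(y + dy) * width + (x + dx)];
            out[y * width + x] = (unsigned char)(sum / 25);
        }
    }
}
```

The 25 reads per pixel, versus roughly 8 for the Sobel kernels, is consistent with the profiling observation below that the smoothing pass dominates the run time.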
The results of running the smoothing filter and the Sobel function sequentially on the Lena image are shown in Figure 3, above. Profiling the two functions shows that the smoothing function takes approximately twice as long as the Sobel function.
Next in Part 2: Multithreading in C
Skip Hovsmith is the Director of Application Engineering in the US for CriticalBlue, working on accelerating embedded software through synthesis of programmable multicore processors for structured ASICs, SoCs, and FPGAs. Prior to CriticalBlue, Skip worked for several startups in formal verification, FPGA design, and enterprise software services, and at National Semiconductor on virtual prototyping, hw/sw co-design, reconfigurable systems, and standard cell product design.
Sutter, Herb. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal, 30(3), March 2005.
Mattson, Timothy G., Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel Programming. Boston: Addison-Wesley, 2005.