When most people perform calculations by hand, they are limited by how fast they can do the calculations, not how fast they can read and write. Early microprocessors were similarly constrained.

In recent decades, microprocessors have grown much faster in speed than in memory. A single microprocessor core can execute hundreds of operations in the time it takes to read or write a value in main memory.

Programs now are often limited by the memory bottleneck, not processor speed. Multi-core processors can exacerbate the problem unless care is taken to conserve memory bandwidth and avoid memory contention.

To conserve bandwidth, pack data more tightly, or move it less frequently between cores. Packing the data tighter is usually straightforward, and benefits sequential execution as well. For example, pack Boolean arrays into one Boolean value per bit, not one value per byte.
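The bit-per-Boolean packing can be sketched as follows (a minimal illustration; the class name and interface are ours, not from the book). It stores eight Boolean values per byte, an eightfold reduction over one value per byte:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Boolean array packed one value per bit instead of one per byte.
class PackedBools {
public:
    explicit PackedBools(std::size_t n) : bits_((n + 7) / 8, 0) {}
    void set(std::size_t i, bool v) {
        if (v) bits_[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
        else   bits_[i / 8] &= static_cast<uint8_t>(~(1u << (i % 8)));
    }
    bool get(std::size_t i) const {
        return (bits_[i / 8] >> (i % 8)) & 1u;
    }
    std::size_t bytes() const { return bits_.size(); }  // footprint in bytes
private:
    std::vector<uint8_t> bits_;
};
```

A 100-element array occupies 13 bytes here rather than 100.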
Use the shortest integer type that can hold values in the required range. When declaring structures in C/C++, declare fields in order of descending size. This strategy tends to minimize the extra padding that the compiler must insert to maintain alignment requirements, as exemplified in Figure 7.15 below.
Figure 7.15. Order Fields by Decreasing Size to Reduce Padding
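Since Figure 7.15 is not reproduced here, the following hedged sketch illustrates the idea; the field names are ours and the byte counts assume a typical 64-bit ABI:

```cpp
#include <cstdint>

// Same three fields, badly and well ordered (sizes assume a typical
// 64-bit ABI where double is 8-byte aligned).
struct BadOrder {          // ascending sizes force padding
    char    flag;          // 1 byte, then ~7 bytes of padding
    double  value;         // 8 bytes
    int32_t count;         // 4 bytes, then tail padding
};                         // typically 24 bytes

struct GoodOrder {         // descending sizes minimize padding
    double  value;         // 8 bytes
    int32_t count;         // 4 bytes
    char    flag;          // 1 byte, then ~3 bytes of tail padding
};                         // typically 16 bytes
```

The reordered structure carries the same data in roughly two thirds of the space on common ABIs.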
Some compilers also support “#pragma pack” directives that pack structures even more tightly, possibly by removing all padding. Such very tight packing may be counterproductive, however, because it causes misaligned loads and stores that may be significantly slower than aligned loads and stores.
Working in the Cache
Moving data less frequently is a more subtle exercise than packing, because mainstream programming languages do not have explicit commands to move data between a core and memory. Data movement arises from the way the cores read and write memory. There are two categories of interactions to consider: those between cores and memory, and those between cores.

Data movement between a core and memory also occurs in single-core processors, so minimizing data movement benefits sequential programs as well. There exist numerous techniques.

For example, a technique called cache-oblivious blocking recursively divides a problem into smaller and smaller subproblems. Eventually the subproblems become so small that they each fit in cache. The Fastest Fourier Transform in the West (Frigo 1997) uses this approach and indeed lives up to its name.
Figure 7.16. Cache-Unfriendly Sieve of Eratosthenes
Another technique for reducing the cache footprint is to reorder steps in the code. Sometimes this is as simple as interchanging loops. Other times it requires more significant restructuring.

The Sieve of Eratosthenes is an elementary programming exercise that demonstrates such restructuring and its benefits. Figure 7.16 above presents the Sieve of Eratosthenes for enumerating prime numbers up to n.

This version has two nested loops: the outer loop finds primes, and the inner loop, inside function Strike, strikes out composite numbers. This version is unfriendly to cache, because the inner loop runs over the full length of array composite, which might be much larger than what fits in cache.
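Figure 7.16 is not reproduced here; the following sketch shows the general shape of such a cache-unfriendly sieve (the book's exact code may differ):

```cpp
#include <vector>

// Strike out every stride-th entry starting at start: the inner loop
// walks the full composite array, which can far exceed the cache.
static void Strike(std::vector<char>& composite,
                   long start, long stride, long limit) {
    for (long i = start; i < limit; i += stride)
        composite[i] = 1;
}

long CacheUnfriendlyCountPrimes(long n) {
    std::vector<char> composite(n + 1, 0);   // O(n) memory footprint
    long count = 0;
    for (long i = 2; i <= n; ++i) {
        if (!composite[i]) {
            ++count;                             // i is prime
            Strike(composite, 2 * i, i, n + 1);  // strike its multiples
        }
    }
    return count;
}
```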
Figure 7.17 below shows how the sieve can be restructured to be cache friendly. Instead of directly representing the conceptual sieve as one big array, it represents it as a small window into the conceptual sieve. The window size is approximately the square root of n bytes.

Figure 7.17. Cache-Friendly Sieve of Eratosthenes

The restructuring requires that the original inner loop be stopped when it reaches the end of a window, and restarted when processing the next window. The array striker stores the indices of these suspended loops, and has an element for each prime up to the square root of n.
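Figure 7.17 is likewise not reproduced here; the sketch below follows the description in the text (names such as factor and striker come from the text, but the code itself is ours and the book's version may differ):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

long CacheFriendlyCountPrimes(long n) {
    if (n < 2) return 0;
    long m = (long)std::sqrt((double)n);
    while ((m + 1) * (m + 1) <= n) ++m;    // guard against sqrt rounding
    std::vector<long> factor, striker;
    std::vector<char> composite(m + 1, 0);
    long count = 0;
    // Find the primes up to sqrt(n) with a simple sieve.
    for (long i = 2; i <= m; ++i) {
        if (!composite[i]) {
            ++count;
            factor.push_back(i);
            striker.push_back(((m / i) + 1) * i); // first multiple beyond m
            for (long j = i * i; j <= m; j += i) composite[j] = 1;
        }
    }
    // Process the rest of the range in windows of about sqrt(n) entries.
    long window = m;
    std::vector<char> w(window);
    for (long base = m + 1; base <= n; base += window) {
        long limit = std::min(base + window, n + 1);  // window is [base, limit)
        std::fill(w.begin(), w.begin() + (limit - base), 0);
        for (std::size_t k = 0; k < factor.size(); ++k) {
            long j = striker[k];
            for (; j < limit; j += factor[k]) w[j - base] = 1;
            striker[k] = j;               // suspend; resume in the next window
        }
        for (long i = base; i < limit; ++i)
            if (!w[i - base]) ++count;
    }
    return count;
}
```

Only the window buffer, factor, and striker are touched in the inner loops, and all three are about √n in size.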
The data structures grow much more slowly than n, and so fit in a 10^6-byte cache even when n approaches values as large as 10^11. Of course, allocating array composite to hold 10^11 bytes is impractical on most machines. The later discussion of multi-threading the sieve describes how to reduce composite to the square root of n bytes instead of n bytes.
The restructuring introduces extra complexity and bookkeeping operations. But because processor speed so greatly outstrips memory speed, the extra bookkeeping pays off dramatically.

Figure 7.18 below shows this performance difference. On this log plot, the cache-friendly code has a fairly straight performance plot, while the cache-unfriendly version's running time steps up from one straight line to another when n reaches approximately 10^6.
Figure 7.18. Performance Difference between Figure 7.16 and Figure 7.17
The step is characteristic of algorithms that transition from running in cache to running out of cache as the problem size increases. The restructured version is five times faster than the original version when n significantly exceeds the cache size, despite the extra processor operations required by the restructuring.

For multi-core programs, working within the cache becomes trickier, because data is not only transferred between a core and memory, but also between cores. As with transfers to and from memory, mainstream programming languages do not make these transfers explicit. The transfers arise implicitly from patterns of reads and writes by different cores. The patterns correspond to two types of data dependencies:
1) Read-write dependency. A core writes a cache line, and then a different core reads it.

2) Write-write dependency. A core writes a cache line, and then a different core writes it.

An interaction that does not cause data movement is two cores repeatedly reading a cache line that is not being written. Thus if multiple cores only read a cache line and do not write it, then no memory bandwidth is consumed. Each core simply keeps its own copy of the cache line.

To minimize memory bus traffic, minimize core interactions by minimizing shared locations. Hence, the same patterns that tend to reduce lock contention also tend to reduce memory traffic, because it is the shared state that requires locks and generates contention. Letting each thread work on its own local copy of the data and merging the data after all threads are done can be a very effective strategy.
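The private-copy-then-merge strategy can be sketched as follows (a hypothetical example of ours, not from the book): each thread builds a local histogram of its share of the data, and the partial results are combined only after the threads join, so there are no locks and no shared writes.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

std::vector<long> ParallelHistogram(const std::vector<int>& data,
                                    int buckets, int nthreads) {
    // One private histogram per thread: no sharing during the parallel phase.
    std::vector<std::vector<long>> local(nthreads,
                                         std::vector<long>(buckets, 0));
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i)
                ++local[t][data[i] % buckets];  // private, uncontended write
        });
    }
    for (auto& w : workers) w.join();
    std::vector<long> total(buckets, 0);        // sequential merge at the end
    for (int t = 0; t < nthreads; ++t)
        for (int b = 0; b < buckets; ++b)
            total[b] += local[t][b];
    return total;
}
```

The merge is O(threads × buckets), usually trivial next to the parallel phase.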
Consider writing a multi-threaded version of the function CacheFriendlySieve from Figure 7.17. A good decomposition for this problem is to fill the array factor sequentially, and then operate on the windows in parallel. The sequential portion takes time O(√n), and hence has minor impact on speedup for large n.
Operating on the windows in parallel requires sharing some data. Looking at the nature of the sharing will guide you on how to write the parallel version, to wit:

1) The array factor is read-only once it is filled. Thus each thread can share the array.

2) The array composite is updated as primes are found. However, the updates are made to separate windows, so they are unlikely to interfere except at window boundaries that fall inside a cache line. Better yet, observe that the values in the window are used only while the window is being processed.
The array composite no longer needs to be shared, and instead each thread can have a private portion that holds only the window of interest. This change benefits the sequential version too, because now the space requirements for the sieve have been reduced from O(n) to O(√n). The reduction in space makes counting primes up to 10^11 possible on even a 32-bit machine.

3) The variable count is updated as primes are found. An atomic increment could be used, but that would introduce memory contention. A better solution, as shown in the example, is to have each thread keep a private partial count, and sum the partial counts at the end.
4) The array striker is updated as the window is processed. Each thread will need its own private copy. The tricky part is that striker induces a loop-carried dependence between windows. For each window, the initial value of striker is the last value it had for the previous window. To break this dependence, the initial values in striker have to be computed from scratch. This computation is not difficult. The purpose of striker[k] is to keep track of the current multiple of factor[k].

5) The variable base is new in the parallel version. It keeps track of the start of the window for which striker is valid. If the value of base differs from the start of the window being processed, it indicates that the thread must recompute striker from scratch. The recomputation sets the initial value of striker[k] to the lowest multiple of factor[k] that is inside or after the window.
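The recomputation that steps 4 and 5 describe reduces to rounding base up to the next multiple of factor[k]. A one-line sketch (the function name is ours, not from Figure 7.19):

```cpp
// Lowest multiple of f that is inside or after a window starting at base,
// i.e. base rounded up to a multiple of f.
long LowestMultipleAtLeast(long f, long base) {
    return ((base + f - 1) / f) * f;
}
```

For example, the lowest multiple of 7 at or after 50 is 56, so a thread starting a window at base 50 would set that factor's striker entry to 56.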
Figure 7.19 below shows the multi-threaded sieve. A further refinement that cuts the work in half would be to look for only odd primes.
Figure 7.19. Parallel Sieve of Eratosthenes
The refinement was omitted from the examples because it obfuscates understanding of the multi-threading issues.

Cache-related Issues
As remarked earlier in the discussion of time-slicing issues, good performance depends on processors fetching most of their data from cache instead of main memory.

For sequential programs, modern caches generally work well without too much thought, though a little tuning helps. In parallel programming, caches open up some much more serious pitfalls.

The smallest unit of memory that two processors interchange is a cache line or cache sector. Two separate caches can share a cache line when they both need to read it, but if the line is written in one cache, and read in another, it must be shipped between caches, even if the locations of interest are disjoint.

Like two people writing in different parts of a log book, the writes are independent, but unless the book can be ripped apart, the writers must pass the book back and forth. In the same way, two hardware threads writing to different locations contend for a cache sector to the point where it becomes a ping-pong game.
Figure 7.20. Cache Line Ping Ponging Caused by False Sharing
Figure 7.20 above illustrates such a ping-pong game. There are two threads, each running on a different core. Each thread increments a different location belonging to the same cache line. But because the locations belong to the same cache line, the cores must pass the sector back and forth across the memory bus.
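One common remedy is to pad or align each thread's location onto its own cache line, as sketched below (our example, not Figure 7.20's code; 64 bytes matches the sector size cited later, and alignas is a C++11 feature that postdates the book):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// One counter per 64-byte cache line, so concurrent increments by
// different threads touch distinct lines and never ping-pong.
struct alignas(64) PaddedCounter {
    long value = 0;
};

void CountInParallel(std::vector<PaddedCounter>& counters, long iters) {
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < counters.size(); ++t)
        workers.emplace_back([&counters, t, iters] {
            for (long i = 0; i < iters; ++i)
                ++counters[t].value;   // private cache line per thread
        });
    for (auto& w : workers) w.join();
}
```

With a plain array of long, adjacent counters would share a line and the increments would contend; the padding trades a little memory for uncontended writes.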
Figure 7.21 below shows how bad the impact can be for a generalization of Figure 7.20. Four single-core processors, each enabled with Hyper-Threading Technology (HT Technology), are used to give the flavor of a hypothetical future eight-core system.

Each hardware thread increments a separate memory location. The ith thread repeatedly increments x[i*stride]. The performance is worse when the locations are adjacent, and improves as they spread out, because the spreading puts the locations into more distinct cache lines.
Performance improves sharply at a stride of 16. This is because the array elements are 4-byte integers. The stride of 16 puts the locations 16 × 4 = 64 bytes apart. The data is for a Pentium 4 based processor with a cache sector size of 64 bytes.

Hence when the locations are 64 bytes apart, each thread is hitting on a separate cache sector, and the locations become private to each thread. The resulting performance is nearly one hundredfold better than when all threads share the same cache line.
Figure 7.21. Performance Impact of False Sharing
Avoiding false sharing may require aligning variables or objects in memory on cache line boundaries. There are a variety of ways to force alignment. Some compilers support alignment pragmas. The Windows compilers have a directive __declspec(align(n)) that can be used to specify n-byte alignment. Dynamic allocation can be aligned by allocating extra pad memory, and then returning a pointer to the next cache line in the block.

Figure 7.22 below shows an example allocator that does this. Function CacheAlignedMalloc uses the word just before the aligned block to store a pointer to the true base of the block, so that function CacheAlignedFree can free the true block.

Notice that if malloc returns an aligned pointer, CacheAlignedMalloc still rounds up to the next cache line, because it needs the first cache line to store the pointer to the true base.
It may not be obvious that there is always enough room before the aligned block to store the pointer. Sufficient room depends upon two assumptions:

1) A cache line is at least as big as a pointer.

2) A malloc request for at least a cache line's worth of bytes returns a pointer aligned on a boundary that is a multiple of sizeof(char*).

These two conditions hold for IA-32 and Itanium-based systems. Indeed, they hold for most architectures because of alignment restrictions specified for malloc by the C standard.
Figure 7.22. Memory Allocator that Allocates Blocks Aligned on Cache Line Boundaries
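Since Figure 7.22 is not reproduced here, the following is a hedged sketch of the scheme the text describes (the book's exact code may differ): over-allocate by one cache line, round up to the next line boundary, and stash the true base pointer in the word just before the aligned block so the matching free can find it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

const std::size_t CACHE_LINE = 64;   // assumed sector size, per the text

void* CacheAlignedMalloc(std::size_t bytes) {
    char* base = static_cast<char*>(std::malloc(bytes + CACHE_LINE));
    if (!base) return nullptr;
    // Next cache-line boundary strictly after base. Because malloc returns
    // pointer-aligned addresses, the gap before p is at least sizeof(char*).
    uintptr_t p = (reinterpret_cast<uintptr_t>(base) + CACHE_LINE)
                  & ~static_cast<uintptr_t>(CACHE_LINE - 1);
    reinterpret_cast<char**>(p)[-1] = base;   // remember the true base
    return reinterpret_cast<void*>(p);
}

void CacheAlignedFree(void* p) {
    std::free(reinterpret_cast<char**>(p)[-1]);  // free the true block
}
```

The two assumptions listed above are exactly what make the `[-1]` slot safe: the gap before the aligned block is a nonzero multiple of sizeof(char*), and a cache line can hold a pointer.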
The topic of false sharing exposes a fundamental tension between efficient use of a single-core processor and efficient use of a multi-core processor. The general rule for efficient execution on a single core is to pack data tightly, so that it has as small a footprint as possible.

But on a multi-core processor, packing shared data can lead to a severe penalty from false sharing. Generally, the solution is to pack data tightly, give each thread its own private copy to work on, and merge results afterwards.

This strategy extends naturally to task stealing. When a thread steals a task, it can clone the shared data structures that might cause cache line ping ponging, and merge the results later.
Memory Consistency

At any given instant in time in a sequential program, memory has a well-defined state. This is called sequential consistency. In parallel programs, it all depends upon the viewpoint.
Two writes to memory by a hardware thread may be seen in a different order by another thread. The reason is that when a hardware thread writes to memory, the written data goes through a path of buffers and caches before reaching main memory. Along this path, a later write may reach main memory sooner than an earlier write.

Similar effects apply to reads. If one read requires a fetch from main memory and a later read hits in cache, the processor may allow the faster read to “pass” the slower read. Likewise, reads and writes might pass each other.

Of course, a processor has to see its own reads and writes in the order it issues them, otherwise programs would break. But the processor does not have to guarantee that other processors see those reads and writes in the original order. Systems that allow this reordering are said to exhibit relaxed consistency.
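In modern C++ the ordering a program needs can be expressed with atomics (the book predates C++11; fences and platform intrinsics were the tools of its era). A sketch of the classic publish pattern: the release store keeps the write to payload from passing the flag, and the acquire load keeps the read of payload from passing it.

```cpp
#include <atomic>
#include <thread>

int payload = 0;
std::atomic<bool> ready(false);

void Producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publishes payload
}

int Consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // spin until published
    return payload;   // the acquire/release pair guarantees this sees 42
}
```

Without the release/acquire pair, a relaxed-consistency machine could let the flag become visible before the payload, and the consumer could read a stale value.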
Because relaxed consistency relates to how hardware threads observe each other's actions, it is not an issue for programs running time sliced on a single hardware thread. Inattention to consistency issues can result in concurrent programs that run correctly on single-threaded hardware, or even hardware running with HT Technology, but fail when run on multi-threaded hardware with disjoint caches.

The hardware is not the only cause of relaxed consistency. Compilers are often free to reorder instructions. The reordering is critical to most major compiler optimizations.

For instance, compilers typically hoist loop-invariant reads out of a loop, so that the read is done once per loop instead of once per loop iteration. Language rules typically grant the compiler license to presume the code is single-threaded, even if it is not.

This is particularly true for older languages such as Fortran, C, and C++ that evolved when parallel processors were esoteric. For recent languages, such as Java and C#, compilers must be more circumspect, but only when the keyword volatile is present.

Unlike hardware reordering, compiler reordering can affect code even when it is running time sliced on a single hardware thread. Thus the programmer must be on the lookout for reordering by the hardware or the compiler.
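A sketch of the hoisting hazard (our example, using C++11 atomics): if `done` below were a plain bool, a compiler could treat the read as loop-invariant, hoist it out of the loop, and spin forever. Declaring the flag atomic (or, in Java and C#, volatile) forbids that transformation.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> done(false);
long work_result = 0;

void WaitForWork() {
    // Each iteration must reload `done`; an atomic read cannot be
    // hoisted out of the loop the way a plain bool read could be.
    while (!done.load()) { }
}

void FinishWork() {
    work_result = 7;     // ordinary write, published by the store below
    done.store(true);
}
```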
To read Part 1, go to “Threads, data races, deadlocks and livelocks.”

To read Part 2, go to “Heavily contended locks and priority inversion.”

To read Part 3, go to “Nonblocking algorithms, ping-ponging, and memory reclamation.”
This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to “More about multicores, multiprocessors and multithreading.”