Common multicore programming problems: Part 4 - Memory, cache issues and consistency
How to conserve memory bandwidth, avoid contention problems and the importance of memory consistency
By Shameem Akhter and Jason Roberts, Intel Corp.
Embedded.com
(11/13/07, 12:24:00 AM EST)
When most people perform calculations by hand, they are limited by how fast they can do the calculations, not how fast they can read and write. Early microprocessors were similarly constrained.

In recent decades, microprocessors have grown much faster in speed than in memory. A single microprocessor core can execute hundreds of operations in the time it takes to read or write a value in main memory.

Programs now are often limited by the memory bottleneck, not processor speed. Multi-core processors can exacerbate the problem unless care is taken to conserve memory bandwidth and avoid memory contention.

Bandwidth
To conserve bandwidth, pack data more tightly, or move it less frequently between cores. Packing the data tighter is usually straightforward, and benefits sequential execution as well. For example, pack Boolean arrays into one Boolean value per bit, not one value per byte.

Use the shortest integer type that can hold values in the required range. When declaring structures in C/C++, declare fields in order of descending size. This strategy tends to minimize the extra padding that the compiler must insert to maintain alignment requirements, as exemplified in Figure 7.15 below.

Figure 7.15. Order Fields by Decreasing Size to Reduce Padding

Some compilers also support "#pragma pack" directives that pack structures even more tightly, possibly by removing all padding. Such very tight packing may be counterproductive, however, because it causes misaligned loads and stores that may be significantly slower than aligned loads and stores.

Working in the Cache
Moving data less frequently is a more subtle exercise than packing, because mainstream programming languages do not have explicit commands to move data between a core and memory. Data movement arises from the way the cores read and write memory. There are two categories of interactions to consider: those between cores and memory, and those between cores.

Data movement between a core and memory also occurs in single-core processors, so minimizing data movement benefits sequential programs as well. There exist numerous techniques.

For example, a technique called cache-oblivious blocking recursively divides a problem into smaller and smaller subproblems. Eventually the subproblems become so small that they each fit in cache. The Fastest Fourier Transform in the West (Frigo 1997) uses this approach and indeed lives up to its name.

Figure 7.16. Cache-Unfriendly Sieve of Eratosthenes

Another technique for reducing the cache footprint is to reorder steps in the code. Sometimes this is as simple as interchanging loops. Other times it requires more significant restructuring.

The Sieve of Eratosthenes is an elementary programming exercise that demonstrates such restructuring and its benefits. Figure 7.16 above presents the Sieve of Eratosthenes for enumerating prime numbers up to n.

This version has two nested loops: the outer loop finds primes, and the inner loop, inside function Strike, strikes out composite numbers. This version is unfriendly to cache, because the inner loop is over the full length of array composite, which might be much larger than what fits in cache.

Figure 7.17 below shows how the sieve can be restructured to be cache friendly. Instead of directly representing the conceptual sieve as one big array, it represents it as a small window into the conceptual sieve. The window size is approximately the square root of n bytes.

Figure 7.17. Cache-Friendly Sieve of Eratosthenes

The restructuring requires that the original inner loop be stopped when it reaches the end of a window, and restarted when processing the next window. The array striker stores the indices of these suspended loops, and has an element for each prime up to the square root of n.

The data structures grow much more slowly than n, and so fit in a 106 byte cache even when n approaches values as large as 1011. Of course, allocating array composite to hold 1011 bytes is impractical on most machines. The later discussion of multi-threading the sieve describes how to reduce composite to the square root of n bytes instead of n bytes.

The restructuring introduces extra complexity and bookkeeping operations. But because processor speed so greatly outstrips memory speed, the extra bookkeeping pays off dramatically.

Figure 7.18 below shows this performance difference. On this log plot, the cache friendly code has a fairly straight performance plot, while the cache unfriendly version's running time steps up from one straight line to another when n reaches approximately 106.

Figure 7.18. Performance Difference between Figure 7.16 and Figure 7.17

The step is characteristic of algorithms that transition from running in cache to running out of cache as the problem size increases. The restructured version is five times faster than the original version when n significantly exceeds the cache size, despite the extra processor operations required by the restructuring.

Memory Contention
For multi-core programs, working within the cache becomes trickier, because data is not only transferred between a core and memory, but also between cores. As with transfers to and from memory, mainstream programming languages do not make these transfers explicit. The transfers arise implicitly from patterns of reads and writes by different cores. The patterns correspond to two types of data dependencies:

1) Read-write dependency. A core writes a cache line, and then a different core reads it.
2) Write-write dependency. A core writes a cache line, and then a different core writes it.

An interaction that does not cause data movement is two cores repeatedly reading a cache line that is not being written. Thus if multiple cores only read a cache line and do not write it, then no memory bandwidth is consumed. Each core simply keeps its own copy of the cache line.

To minimize memory bus traffic, minimize core interactions by minimizing shared locations. Hence, the same patterns that tend to reduce lock contention also tend to reduce memory traffic, because it is the shared state that requires locks and generates contention. Letting each thread work on its own local copy of the data and merging the data after all threads are done can be a very effective strategy.

Consider writing a multi-threaded version of the function CacheFriendlySieve from Figure 7.17. A good decomposition for this problem is to fill the array factor sequentially, and then operate on the windows in parallel. The sequential portion takes time O(), and hence has minor impact on speedup for large n.

Operating on the windows in parallel requires sharing some data. Looking at the nature of the sharing will guide you on how to write the parallel version, to wit:

1) The array factor is read-only once it is filled. Thus each thread can share the array.

2) The array composite is updated as primes are found. However, the updates are made to separate windows, so they are unlikely to interfere except at window boundaries that fall inside a cache line. Better yet, observe that the values in the window are used only while the window is being processed.

The array composite no longer needs to be shared, and instead each thread can have a private portion that holds only the window of interest. This change benefits the sequential version too, because now the space requirements for the sieve have been reduced from O(n) to O(). The reduction in space makes counting primes up to 1011 possible on even a 32-bit machine.

3) The variable count is updated as primes are found. An atomic increment could be used, but that would introduce memory contention. A better solution, as shown in the example, is to give each thread perform a private partial count, and sum the partial counts at the end.

4) The array striker is updated as the window is processed. Each thread will need its own private copy. The tricky part is that striker induces a loop-carried dependence between windows. For each window, the initial value of striker is the last value it had for the previous window. To break this dependence, the initial values in striker have to be computed from scratch. This computation is not difficult. The purpose of striker[k] is to keep track of the current multiple of factor[k].

5) The variable base is new in the parallel version. It keeps track of the start of the window for which striker is valid. If the value of base differs from the start of the window being processed, it indicates that the thread must recompute striker from scratch. The recomputation sets the initial value of striker[k] to the lowest multiple of factor[k] that is inside or after the window.

Figure 7.19 below shows the multi-threaded sieve. A further refinement that cuts the work in half would be to look for only odd primes.

Figure 7.19. Parallel Sieve of Eratosthenes
The refinement was omitted from the examples because it obfuscates understanding of the multi-threading issues.

Cache-related Issues
As remarked earlier in the discussion of time-slicing issues, good performance depends on processors fetching most of their data from cache instead of main memory.

For sequential programs, modern caches generally work well without too much thought, though a little tuning helps. In parallel programming, caches open up some much more serious pitfalls.

False Sharing
The smallest unit of memory that two processors interchange is a cache line or cache sector. Two separate caches can share a cache line when they both need to read it, but if the line is written in one cache, and read in another, it must be shipped between caches, even if the locations of interest are disjoint.

Like two people writing in different parts of a log book, the writes are independent, but unless the book can be ripped apart, the writers must pass the book back and forth. In the same way, two hardware threads writing to different locations contend for a cache sector to the point where it becomes a ping-pong game.

Figure 7.20. Cache Line Ping Ponging Caused by False Sharing

Figure 7.20 above illustrates such a ping-pong game. There are two threads, each running on a different core. Each thread increments a different location belonging to the same cache line. But because the locations belong to the same cache line, the cores must pass the sector back and forth across the memory bus.

Figure 7.21 below shows how bad the impact can be for a generalization of Figure 7.20. Four single-core processors, each enabled with Hyper-Threading Technology (HT Technology), are used to give the flavor of a hypothetical future eight-core system.

Each hardware thread increments a separate memory location. The ith thread repeatedly increments x[i*stride]. The performance is worse when the locations are adjacent, and improves as they spread out, because the spreading puts the locations into more distinct cache lines.

Performance improves sharply at a stride of 16. This is because the array elements are 4-byte integers. The stride of 16 puts the locations 16?? 4 = 64 bytes apart. The data is for a Pentium 4 based processor with a cache sector size of 64 bytes.

Hence when the locations were 64 bytes part, each thread is hitting on a separate cache sector, and the locations become private to each thread. The resulting performance is nearly one hundredfold better than when all threads share the same cache line.

Figure 7.21. Performance Impact of False Sharing

Avoiding false sharing may require aligning variables or objects in memory on cache line boundaries. There are a variety of ways to force alignment. Some compilers support alignment pragmas. The Windows compilers have a directive __declspec(align(n)) that can be used to specify n-byte alignment. Dynamic allocation can be aligned by allocating extra pad memory, and then returning a pointer to the next cache line in the block.

Figure 7.22 below shows an example allocator that does this. Function CacheAlignedMalloc uses the word just before the aligned block to store a pointer to the true base of the block, so that function CacheAlignedFree can free the true block.

Notice that if malloc returns an aligned pointer, CacheAlignedMalloc still rounds up to the next cache line, because it needs the first cache line to store the pointer to the true base.

It may not be obvious that there is always enough room before the aligned block to store the pointer. Sufficient room depends upon two assumptions:

1) A cache line is at least as big as a pointer.
2) A malloc request for at least a cache line's worth of bytes returns a pointer aligned on boundary that is a multiple of sizeof(char*).

These two conditions hold for IA-32 and Itanium-based systems. Indeed, they hold for most architecture because of alignment restrictions specified for malloc by the C standard.

Figure 7.22. Memory Allocator that Allocates Blocks Aligned on Cache Line Boundaries

The topic of false sharing exposes a fundamental tension between efficient use of a single-core processor and efficient use of a multi-core processor. The general rule for efficient execution on a single core is to pack data tightly, so that it has as small a footprint as possible.

But on a multi-core processor, packing shared data can lead to a severe penalty from false sharing. Generally, the solution is to pack data tightly, give each thread its own private copy to work on, and merge results afterwards.

This strategy extends naturally to task stealing. When a thread steals a task, it can clone the shared data structures that might cause cache line ping ponging, and merge the results later.

Memory Consistency
At any given instant in time in a sequential program, memory has a well defined state. This is called sequential consistency. In parallel programs, it all depends upon the viewpoint.

Two writes to memory by a hardware thread may be seen in a different order by another thread. The reason is that when a hardware thread writes to memory, the written data goes through a path of buffers and caches before reaching main memory. Along this path, a later write may reach main memory sooner than an earlier write.

Similar effects apply to reads. If one read requires a fetch from main memory and a later read hits in cache, the processor may allow the faster read to "pass" the slower read. Likewise, reads and writes might pass each other.

Of course, a processor has to see its own reads and writes in the order it issues them, otherwise programs would break. But the processor does not have to guarantee that other processors see those reads and writes in the original order. Systems that allow this reordering are said to exhibit relaxed consistency.

Because relaxed consistency relates to how hardware threads observe each other's actions, it is not an issue for programs running time sliced on a single hardware thread. Inattention to consistency issues can result in concurrent programs that run correctly on single-threaded hardware, or even hardware running with HT Technology, but fail when run on multi-threaded hardware with disjoint caches.

The hardware is not the only cause of relaxed consistency. Compilers are often free to reorder instructions. The reordering is critical to most major compiler optimizations.

For instance, compilers typically hoist loop-invariant reads out of a loop, so that the read is done once per loop instead of once per loop iteration. Language rules typically grant the compiler license to presume the code is single-threaded, even if it is not.

This is particularly true for older languages such as Fortran, C, and C++ that evolved when parallel processors were esoteric. For recent languages, such as Java and C#, compilers must be more circumspect, but only when the keyword volatile is present.

Unlike hardware reordering, compiler reordering can affect code even when it is running time sliced on a single hardware thread. Thus the programmer must be on the lookout for reordering by the hardware or the compiler.

To read Part 1, go to "Threads, data races, deadlocks and live locks."
To read Part 2, go to " Heavily contended locks and priority inversion."
To read Part 3, go to "Nonblocking algorithms, ping-ponging, and memory reclamation."

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to "More about multicores, multiprocessors and multithreading."