Common multicore programming problems: Part 4 - Memory, cache issues and consistency
How to conserve memory bandwidth, avoid contention problems and the importance of memory consistency
By Shameem Akhter and Jason Roberts, Intel Corp.
When most people perform calculations by hand, they are limited by how
fast they can do the calculations, not how fast they can read and
write. Early microprocessors were similarly constrained.
In recent decades, microprocessors have grown much faster in speed
than in memory. A single microprocessor core can execute hundreds of
operations in the time it takes to read or write a value in main
memory.
Programs now are often limited by the memory bottleneck, not
processor speed. Multi-core processors can exacerbate the problem
unless care is taken to conserve memory bandwidth and avoid memory
contention.
Bandwidth
To conserve bandwidth, pack data more tightly, or move it less
frequently between cores. Packing the data tighter is usually
straightforward, and benefits sequential execution as well. For
example, pack Boolean arrays into one Boolean value per bit, not one
value per byte.
Use the shortest integer type that can hold values in the required
range. When declaring structures in C/C++, declare fields in order of
descending size. This strategy tends to minimize the extra padding that
the compiler must insert to maintain alignment requirements, as
exemplified in Figure 7.15 below.
 |
| Figure
7.15. Order Fields by Decreasing Size to Reduce Padding |
Some compilers also support "#pragma pack" directives that pack
structures even more tightly, possibly by removing all padding. Such
very tight packing may be counterproductive, however, because it causes
misaligned loads and stores that may be significantly slower than
aligned loads and stores.
Working in the Cache
Moving data less frequently is a more subtle exercise than packing,
because mainstream programming languages do not have explicit commands
to move data between a core and memory. Data movement arises from the
way the cores read and write memory. There are two categories of
interactions to consider: those between cores and memory, and those
between cores.
Data movement between a core and memory also occurs in single-core
processors, so minimizing data movement benefits sequential programs as
well. There exist numerous techniques.
For example, a technique called cache-oblivious blocking recursively
divides a problem into smaller and smaller subproblems. Eventually the
subproblems become so small that they each fit in cache. The Fastest Fourier
Transform in the West (Frigo 1997) uses this approach and indeed
lives up to its name.
 |
| Figure
7.16. Cache-Unfriendly Sieve of Eratosthenes |
Another technique for reducing the cache footprint is to reorder
steps in the code. Sometimes this is as simple as interchanging loops.
Other times it requires more significant restructuring.
The Sieve of
Eratosthenes is an elementary programming exercise that
demonstrates such restructuring and its benefits. Figure 7.16 above presents the Sieve
of Eratosthenes for enumerating prime numbers up to n.
This version has two nested loops: the outer loop finds primes, and
the inner loop, inside function Strike, strikes out composite numbers.
This version is unfriendly to cache, because the inner loop is over the
full length of array composite, which might be much larger than what
fits in cache.
Figure 7.17 below shows how
the sieve can be restructured to be cache friendly. Instead of directly
representing the conceptual sieve as one big array, it represents it as
a small window into the conceptual sieve. The window size is
approximately the square root of n bytes.
 |
| Figure
7.17. Cache-Friendly Sieve of Eratosthenes |
The restructuring requires that the original inner loop be stopped
when it reaches the end of a window, and restarted when processing the
next window. The array striker stores the indices of these suspended
loops, and has an element for each prime up to the square root of n.
The data structures grow much more slowly than n, and so fit in a
106 byte cache even when n approaches values as large as 1011. Of
course, allocating array composite to hold 1011 bytes is impractical on
most machines. The later discussion of multi-threading the sieve
describes how to reduce composite to the square root of n bytes instead
of n bytes.
The restructuring introduces extra complexity and bookkeeping
operations. But because processor speed so greatly outstrips memory
speed, the extra bookkeeping pays off dramatically.
Figure 7.18 below shows this
performance difference. On this log plot, the cache friendly code has a
fairly straight performance plot, while the cache unfriendly version's
running time steps up from one straight line to another when n reaches
approximately 106.
 |
| Figure
7.18. Performance Difference between Figure 7.16 and Figure 7.17 |
The step is characteristic of algorithms that transition from
running in cache to running out of cache as the problem size increases.
The restructured version is five times faster than the original version
when n significantly exceeds the cache size, despite the extra
processor operations required by the restructuring.