CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Common multicore programming problems: Part 4 - Memory, cache issues and consistency
How to conserve memory bandwidth, avoid contention problems and the importance of memory consistency



Embedded.com
When most people perform calculations by hand, they are limited by how fast they can do the calculations, not how fast they can read and write. Early microprocessors were similarly constrained.

In recent decades, microprocessors have grown much faster in speed than in memory. A single microprocessor core can execute hundreds of operations in the time it takes to read or write a value in main memory.

Programs now are often limited by the memory bottleneck, not processor speed. Multi-core processors can exacerbate the problem unless care is taken to conserve memory bandwidth and avoid memory contention.

Bandwidth
To conserve bandwidth, pack data more tightly, or move it less frequently between cores. Packing the data tighter is usually straightforward, and benefits sequential execution as well. For example, pack Boolean arrays into one Boolean value per bit, not one value per byte.

Use the shortest integer type that can hold values in the required range. When declaring structures in C/C++, declare fields in order of descending size. This strategy tends to minimize the extra padding that the compiler must insert to maintain alignment requirements, as exemplified in Figure 7.15 below.

Figure 7.15. Order Fields by Decreasing Size to Reduce Padding

Some compilers also support "#pragma pack" directives that pack structures even more tightly, possibly by removing all padding. Such very tight packing may be counterproductive, however, because it causes misaligned loads and stores that may be significantly slower than aligned loads and stores.

Working in the Cache
Moving data less frequently is a more subtle exercise than packing, because mainstream programming languages do not have explicit commands to move data between a core and memory. Data movement arises from the way the cores read and write memory. There are two categories of interactions to consider: those between cores and memory, and those between cores.

Data movement between a core and memory also occurs in single-core processors, so minimizing data movement benefits sequential programs as well. There exist numerous techniques.

For example, a technique called cache-oblivious blocking recursively divides a problem into smaller and smaller subproblems. Eventually the subproblems become so small that they each fit in cache. The Fastest Fourier Transform in the West (Frigo 1997) uses this approach and indeed lives up to its name.

Figure 7.16. Cache-Unfriendly Sieve of Eratosthenes

Another technique for reducing the cache footprint is to reorder steps in the code. Sometimes this is as simple as interchanging loops. Other times it requires more significant restructuring.

The Sieve of Eratosthenes is an elementary programming exercise that demonstrates such restructuring and its benefits. Figure 7.16 above presents the Sieve of Eratosthenes for enumerating prime numbers up to n.

This version has two nested loops: the outer loop finds primes, and the inner loop, inside function Strike, strikes out composite numbers. This version is unfriendly to cache, because the inner loop is over the full length of array composite, which might be much larger than what fits in cache.

Figure 7.17 below shows how the sieve can be restructured to be cache friendly. Instead of directly representing the conceptual sieve as one big array, it represents it as a small window into the conceptual sieve. The window size is approximately the square root of n bytes.

Figure 7.17. Cache-Friendly Sieve of Eratosthenes

The restructuring requires that the original inner loop be stopped when it reaches the end of a window, and restarted when processing the next window. The array striker stores the indices of these suspended loops, and has an element for each prime up to the square root of n.

The data structures grow much more slowly than n, and so fit in a 106 byte cache even when n approaches values as large as 1011. Of course, allocating array composite to hold 1011 bytes is impractical on most machines. The later discussion of multi-threading the sieve describes how to reduce composite to the square root of n bytes instead of n bytes.

The restructuring introduces extra complexity and bookkeeping operations. But because processor speed so greatly outstrips memory speed, the extra bookkeeping pays off dramatically.

Figure 7.18 below shows this performance difference. On this log plot, the cache friendly code has a fairly straight performance plot, while the cache unfriendly version's running time steps up from one straight line to another when n reaches approximately 106.

Figure 7.18. Performance Difference between Figure 7.16 and Figure 7.17

The step is characteristic of algorithms that transition from running in cache to running out of cache as the problem size increases. The restructured version is five times faster than the original version when n significantly exceeds the cache size, despite the extra processor operations required by the restructuring.

1 | 2 | 3

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :