CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Common multicore programming problems: Part 4 - Memory, cache issues and consistency
How to conserve memory bandwidth, avoid contention problems and the importance of memory consistency



Embedded.com
Memory Contention
For multi-core programs, working within the cache becomes trickier, because data is not only transferred between a core and memory, but also between cores. As with transfers to and from memory, mainstream programming languages do not make these transfers explicit. The transfers arise implicitly from patterns of reads and writes by different cores. The patterns correspond to two types of data dependencies:

1) Read-write dependency. A core writes a cache line, and then a different core reads it.
2) Write-write dependency. A core writes a cache line, and then a different core writes it.

An interaction that does not cause data movement is two cores repeatedly reading a cache line that is not being written. Thus if multiple cores only read a cache line and do not write it, then no memory bandwidth is consumed. Each core simply keeps its own copy of the cache line.

To minimize memory bus traffic, minimize core interactions by minimizing shared locations. Hence, the same patterns that tend to reduce lock contention also tend to reduce memory traffic, because it is the shared state that requires locks and generates contention. Letting each thread work on its own local copy of the data and merging the data after all threads are done can be a very effective strategy.

Consider writing a multi-threaded version of the function CacheFriendlySieve from Figure 7.17. A good decomposition for this problem is to fill the array factor sequentially, and then operate on the windows in parallel. The sequential portion takes time O(), and hence has minor impact on speedup for large n.

Operating on the windows in parallel requires sharing some data. Looking at the nature of the sharing will guide you on how to write the parallel version, to wit:

1) The array factor is read-only once it is filled. Thus each thread can share the array.

2) The array composite is updated as primes are found. However, the updates are made to separate windows, so they are unlikely to interfere except at window boundaries that fall inside a cache line. Better yet, observe that the values in the window are used only while the window is being processed.

The array composite no longer needs to be shared, and instead each thread can have a private portion that holds only the window of interest. This change benefits the sequential version too, because now the space requirements for the sieve have been reduced from O(n) to O(). The reduction in space makes counting primes up to 1011 possible on even a 32-bit machine.

3) The variable count is updated as primes are found. An atomic increment could be used, but that would introduce memory contention. A better solution, as shown in the example, is to give each thread perform a private partial count, and sum the partial counts at the end.

4) The array striker is updated as the window is processed. Each thread will need its own private copy. The tricky part is that striker induces a loop-carried dependence between windows. For each window, the initial value of striker is the last value it had for the previous window. To break this dependence, the initial values in striker have to be computed from scratch. This computation is not difficult. The purpose of striker[k] is to keep track of the current multiple of factor[k].

5) The variable base is new in the parallel version. It keeps track of the start of the window for which striker is valid. If the value of base differs from the start of the window being processed, it indicates that the thread must recompute striker from scratch. The recomputation sets the initial value of striker[k] to the lowest multiple of factor[k] that is inside or after the window.

Figure 7.19 below shows the multi-threaded sieve. A further refinement that cuts the work in half would be to look for only odd primes.

Figure 7.19. Parallel Sieve of Eratosthenes
The refinement was omitted from the examples because it obfuscates understanding of the multi-threading issues.

1 | 2 | 3

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :