Common multicore programming problems: Part 4 - Memory, cache issues and consistency
How to conserve memory bandwidth, avoid contention problems and the importance of memory consistency
By Shameem Akhter and Jason Roberts, Intel Corp.
Cache-related Issues
As remarked earlier in the discussion of time-slicing issues, good
performance depends on processors fetching most of their data from
cache instead of main memory.
For sequential programs, modern caches generally work well without
too much thought, though a little tuning helps. In parallel
programming, caches open up some much more serious pitfalls.
False Sharing
The smallest unit of memory that two processors interchange is a cache
line or cache sector. Two separate caches can share a cache line when
they both need to read it, but if the line is written in one cache, and
read in another, it must be shipped between caches, even if the
locations of interest are disjoint.
Like two people writing in different parts of a log book, the writes
are independent, but unless the book can be ripped apart, the writers
must pass the book back and forth. In the same way, two hardware
threads writing to different locations contend for a cache sector to
the point where it becomes a ping-pong game.
 |
| Figure
7.20. Cache Line Ping Ponging Caused by False Sharing |
Figure 7.20 above
illustrates such a ping-pong game. There are two threads, each running
on a different core. Each thread increments a different location
belonging to the same cache line. But because the locations belong to
the same cache line, the cores must pass the sector back and forth
across the memory bus.
Figure 7.21 below shows how
bad the impact can be for a generalization of Figure 7.20. Four
single-core processors, each enabled with Hyper-Threading Technology
(HT Technology), are used to give the flavor of a hypothetical future
eight-core system.
Each hardware thread increments a separate memory location. The ith
thread repeatedly increments x[i*stride]. The performance is worse when
the locations are adjacent, and improves as they spread out, because
the spreading puts the locations into more distinct cache lines.
Performance improves sharply at a stride of 16. This is because the
array elements are 4-byte integers. The stride of 16 puts the locations
16?? 4 = 64 bytes apart. The data is for a Pentium 4 based processor
with a cache sector size of 64 bytes.
Hence when the locations were 64 bytes part, each thread is hitting
on a separate cache sector, and the locations become private to each
thread. The resulting performance is nearly one hundredfold better than
when all threads share the same cache line.
 |
| Figure
7.21. Performance Impact of False Sharing |
Avoiding false sharing may require aligning variables or objects in
memory on cache line boundaries. There are a variety of ways to force
alignment. Some compilers support alignment pragmas. The Windows
compilers have a directive __declspec(align(n)) that can be used to
specify n-byte alignment. Dynamic allocation can be aligned by
allocating extra pad memory, and then returning a pointer to the next
cache line in the block.
Figure 7.22 below shows an
example allocator that does this. Function CacheAlignedMalloc uses the
word just before the aligned block to store a pointer to the true base
of the block, so that function CacheAlignedFree can free the true
block.
Notice that if malloc returns an aligned pointer, CacheAlignedMalloc
still rounds up to the next cache line, because it needs the first
cache line to store the pointer to the true base.
It may not be obvious that there is always enough room before the
aligned block to store the pointer. Sufficient room depends upon two
assumptions:
1) A cache line is at least
as big as a pointer.
2) A malloc request for at
least a cache line's worth of bytes returns a pointer aligned on
boundary that is a multiple of sizeof(char*).
These two conditions hold for IA-32 and Itanium-based systems.
Indeed, they hold for most architecture because of alignment
restrictions specified for malloc by the C standard.
 |
| Figure
7.22. Memory Allocator that Allocates Blocks Aligned on Cache Line
Boundaries |
The topic of false sharing exposes a fundamental tension between
efficient use of a single-core processor and efficient use of a
multi-core processor. The general rule for efficient execution on a
single core is to pack data tightly, so that it has as small a
footprint as possible.
But on a multi-core processor, packing shared data can lead to a
severe penalty from false sharing. Generally, the solution is to pack
data tightly, give each thread its own private copy to work on, and
merge results afterwards.
This strategy extends naturally to task stealing. When a thread
steals a task, it can clone the shared data structures that might cause
cache line ping ponging, and merge the results later.
Memory Consistency
At any given instant in time in a sequential program, memory has a well
defined state. This is called sequential consistency. In parallel
programs, it all depends upon the viewpoint.
Two writes to memory by a hardware thread may be seen in a different
order by another thread. The reason is that when a hardware thread
writes to memory, the written data goes through a path of buffers and
caches before reaching main memory. Along this path, a later write may
reach main memory sooner than an earlier write.
Similar effects apply to reads. If one read requires a fetch from
main memory and a later read hits in cache, the processor may allow the
faster read to "pass" the slower read. Likewise, reads and writes might
pass each other.
Of course, a processor has to see its own reads and writes in the
order it issues them, otherwise programs would break. But the processor
does not have to guarantee that other processors see those reads and
writes in the original order. Systems that allow this reordering are
said to exhibit relaxed consistency.
Because relaxed consistency relates to how hardware threads observe
each other's actions, it is not an issue for programs running time
sliced on a single hardware thread. Inattention to consistency issues
can result in concurrent programs that run correctly on single-threaded
hardware, or even hardware running with HT Technology, but fail when
run on multi-threaded hardware with disjoint caches.
The hardware is not the only cause of relaxed consistency. Compilers
are often free to reorder instructions. The reordering is critical to
most major compiler optimizations.
For instance, compilers typically hoist loop-invariant reads out of
a loop, so that the read is done once per loop instead of once per loop
iteration. Language rules typically grant the compiler license to
presume the code is single-threaded, even if it is not.
This is particularly true for older languages such as Fortran, C,
and C++ that evolved when parallel processors were esoteric. For recent
languages, such as Java and C#, compilers must be more circumspect, but
only when the keyword volatile is present.
Unlike hardware reordering, compiler reordering can affect code even
when it is running time sliced on a single hardware thread. Thus the
programmer must be on the lookout for reordering by the hardware or the
compiler.
To read Part 1, go to "Threads, data races, deadlocks and live
locks."
To read Part 2, go to " Heavily contended
locks and priority inversion."
To read Part 3, go to "Nonblocking
algorithms, ping-ponging, and memory reclamation."
This
article was excerpted from Multi-Core
Programming by Shameem Akhter and Jason Roberts.
Copyright © 2006 Intel Corporation. All rights reserved.
Shameem
Akhter is a platform architect at Intel Corporation, focusing on single
socket multi-core architecture and performance analysis. Jason Roberts is a senior software
engineer at Intel, and has worked on a number of different
multi-threaded software products that span a wide range of applications
targeting desktop, handheld, and embedded DSP platforms.
To read more on
Embedded.com about the topics discussed in this article, go to "More
about multicores, multiprocessors and multithreading."