Cache
optimizations
A loop nest is a set of loops, one inside the other,
as illustrated in Figure 5-24 below.
Loop nests occur when we process arrays. A large body of techniques has
been developed for optimizing loop nests. Rewriting a loop nest changes
the order in which array elements are accessed. This can expose new
parallelism opportunities that can be exploited by later stages of the
compiler, and it can also improve cache performance. In this section we
concentrate on the analysis of loop nests for cache performance.
 |
| Figure
5-24. Vector representation of nexted loops |
Let us first consider the examples of Figure 5-24 above in more detail. We
can represent the set of array locations indexed by the loop by an iteration
space. Each point in this space is defined by a vector, each
component of which is one of the loop's iteration variables.
The points in the space represent the iterations executed by the
program. The top example in the figure accesses every element of the
arrays (the a and b matrices have the same access patterns in this
case). The lower example accesses bands in the a and b matrices. We can
represent any particular point in the iteration space as an iteration
vector, such as the vector i3,2
= <3, 2> shown in the figure.
Cashe
miss equations. Ghosh et al. [Gho97]
developed an analytical technique for describing data cache misses in
nested loops. The equations not only describe the set of cache misses
but also help us to optimize the program's access patterns.
Because numerous equations are necessary to describe a program, the
equations are designed to be generated and solved by design tools. Here
we concentrate on the basic principles underlying the equations. After
understanding these principles, we can apply them by hand in small
cases as well as understand what our tools are doing in large cases.
We need to write equations that describe all types of misses,
including compulsory, capacity, and conflict. The equations are written
in terms of reuse vectors—if two iterations i1
and i2 use the same array location, the reuse
vector that defines that reuse is r = i2
' i1 (assuming that i2 comes
after i1).
Assume for the moment that our cache is direct mapped; we need to
know the cache's line size so that we can compute which cache line each
memory reference falls into. Compulsory misses occur in two cases - in
the first access along the reuse vector and when the reuse vector spans
more than one cache line.
Capacity and conflict misses can be captured in a single form of
equation. In the following equation, two references i and j
conflict if they access distinct memory lines that map to the same
cache line:
L(i) = L(j). (EQ 5-1)
If the cache size is Cs and the line size is Ls, we can rewrite this
equation as
mem(i) = mem(j)+nCs+b. (EQ 5-2)
In Equation 5-2 above, n
is an integer that represents an arbitrary offset and b is an integer
in the range " Loff b Ls " 1 " Loff,
where Loff is the offset of the memory reference
in the cache line.
We can use these equations for a variety of purposes. For instance,
if we want to minimize the number of conflicts between array accesses,
we can parameterize the equations in terms of the starting addresses of
the arrays and the number of elements in each row (by padding the
arrays), and then find the values for these parameters that minimize
the number of cache conflicts. Example
5-10 below illustrates the concepts behind cache miss
equations.
Example 5-10 Data
Realignment and Array Padding
Assume we want to optimize the cache behavior of the following code:
for (j = 0; j < M; j++)
for (i = 0; i < N; i++)
a[j][i] = b[j][i] * c;
Let us also assume that the a and b arrays are sized with M at 265
and N at 4 and a 256-line, four-way set-associative cache with four
words per line. Even though this code does not reuse any data elements,
cache conflicts can cause serious performance problems because they
interfere with spatial reuse at the cache line level.
Assume that the starting location for a[] is 1024 and the starting
location for b[] is 4099. Although a[0][0] and b[0][0] do not map to
the same word in the cache, they do map to the same block as shown
below.

As a result, we see the following scenario in execution:
- The access to a[0][0] brings in the first four words of a[].
- The access to b[0][0] replaces a[0] through a[0][3] with b[0][3]
and the contents of the three locations before b[].
- When a[0][1] is accessed, the same cache line is again replaced
with the first four elements of a[].
Once the a[0][1] access brings that line into the cache, it
remains
there for the a[0][2] and a[0][3] accesses since the b[] accesses are
now on the next line. However, the scenario repeats itself at a[0][4]
and every four iterations of the cache. One way to eliminate the cache
conflicts is to move one of the arrays. We do not have to move it far.
If we move b's start to 4100, we eliminate the cache conflicts.
However, that fix won't work in more complex situations. Moving one
array may only introduce cache conflicts with another array. In such
cases, we can use another technique called padding. If we extend each
of the rows of the arrays to have four elements rather than three, with
the padding word placed at the beginning of the row, we eliminate the
cache conflicts. In this case, b[0][0] is located at 4100 by the
padding.
Although padding wastes memory, it substantially improves memory
performance. In complex situations with multiple arrays and
sophisticated access patterns, we have to use a combination of
techniques - relocating arrays and padding them - to be able to
minimize cache conflicts.