CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

The basics of programming embedded processors: Part 7
Trace-Driven Performance Analysis



Embedded.com

Cache optimizations
A loop nest is a set of loops, one inside the other, as illustrated in Figure 5-24 below. Loop nests occur when we process arrays. A large body of techniques has been developed for optimizing loop nests. Rewriting a loop nest changes the order in which array elements are accessed. This can expose new parallelism opportunities that can be exploited by later stages of the compiler, and it can also improve cache performance. In this section we concentrate on the analysis of loop nests for cache performance.

Figure 5-24. Vector representation of nexted loops

Let us first consider the examples of Figure 5-24 above in more detail. We can represent the set of array locations indexed by the loop by an iteration space. Each point in this space is defined by a vector, each component of which is one of the loop's iteration variables.

The points in the space represent the iterations executed by the program. The top example in the figure accesses every element of the arrays (the a and b matrices have the same access patterns in this case). The lower example accesses bands in the a and b matrices. We can represent any particular point in the iteration space as an iteration vector, such as the vector i3,2 = <3, 2> shown in the figure.

Cashe miss equations. Ghosh et al. [Gho97] developed an analytical technique for describing data cache misses in nested loops. The equations not only describe the set of cache misses but also help us to optimize the program's access patterns.

Because numerous equations are necessary to describe a program, the equations are designed to be generated and solved by design tools. Here we concentrate on the basic principles underlying the equations. After understanding these principles, we can apply them by hand in small cases as well as understand what our tools are doing in large cases.

We need to write equations that describe all types of misses, including compulsory, capacity, and conflict. The equations are written in terms of reuse vectors—if two iterations i1 and i2 use the same array location, the reuse vector that defines that reuse is r = i2 ' i1 (assuming that i2 comes after i1).

Assume for the moment that our cache is direct mapped; we need to know the cache's line size so that we can compute which cache line each memory reference falls into. Compulsory misses occur in two cases - in the first access along the reuse vector and when the reuse vector spans more than one cache line.

Capacity and conflict misses can be captured in a single form of equation. In the following equation, two references i and j conflict if they access distinct memory lines that map to the same cache line:

L(i) = L(j). (EQ 5-1)

If the cache size is Cs and the line size is Ls, we can rewrite this equation as

mem(i) = mem(j)+nCs+b. (EQ 5-2)

In Equation 5-2 above, n is an integer that represents an arbitrary offset and b is an integer in the range " Loff b Ls " 1 " Loff, where Loff is the offset of the memory reference in the cache line.

We can use these equations for a variety of purposes. For instance, if we want to minimize the number of conflicts between array accesses, we can parameterize the equations in terms of the starting addresses of the arrays and the number of elements in each row (by padding the arrays), and then find the values for these parameters that minimize the number of cache conflicts. Example 5-10 below  illustrates the concepts behind cache miss equations.

Example 5-10 Data Realignment and Array Padding
Assume we want to optimize the cache behavior of the following code:
for (j = 0; j < M; j++)
for (i = 0; i < N; i++)
a[j][i] = b[j][i] * c;

Let us also assume that the a and b arrays are sized with M at 265 and N at 4 and a 256-line, four-way set-associative cache with four words per line. Even though this code does not reuse any data elements, cache conflicts can cause serious performance problems because they interfere with spatial reuse at the cache line level.

Assume that the starting location for a[] is 1024 and the starting location for b[] is 4099. Although a[0][0] and b[0][0] do not map to the same word in the cache, they do map to the same block as shown below.

As a result, we see the following scenario in execution:

  • The access to a[0][0] brings in the first four words of a[].
  • The access to b[0][0] replaces a[0] through a[0][3] with b[0][3] and the contents of the three locations before b[].
  • When a[0][1] is accessed, the same cache line is again replaced with the first four elements of a[].
Once the a[0][1] access brings that line into the cache, it remains there for the a[0][2] and a[0][3] accesses since the b[] accesses are now on the next line. However, the scenario repeats itself at a[0][4] and every four iterations of the cache. One way to eliminate the cache conflicts is to move one of the arrays. We do not have to move it far. If we move b's start to 4100, we eliminate the cache conflicts.

However, that fix won't work in more complex situations. Moving one array may only introduce cache conflicts with another array. In such cases, we can use another technique called padding. If we extend each of the rows of the arrays to have four elements rather than three, with the padding word placed at the beginning of the row, we eliminate the cache conflicts. In this case, b[0][0] is located at 4100 by the padding.

Although padding wastes memory, it substantially improves memory performance. In complex situations with multiple arrays and sophisticated access patterns, we have to use a combination of techniques - relocating arrays and padding them - to be able to minimize cache conflicts.

1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :