Memory-oriented optimization techniques for dealing with performance bottlenecks: Part 2

Wayne Wolf, Georgia Institute of Technology

May 06, 2013

Editor’s Note: In a two-part article from High Performance Embedded Computing, author Wayne Wolf examines how to overcome system performance bottlenecks with memory-oriented programming and compiler techniques. Part 2 covers buffer management, data transfer and storage management, and main memory optimization.

Buffers are important because they mediate between subsystems. In many embedded applications, the various stages produce and consume data somewhat irregularly. As a result, we need buffers to make sure that all data makes it from the producer to the consumer.

If the buffer is too small, data is lost. If the buffer is too large, then memory is wasted. Overly large buffers not only add to the manufacturing cost of the system, but they also consume both static and dynamic energy.

A traditional approach is to use dynamic memory management—for example, the C library functions malloc() and free(). By allocating buffers of the proper size, we can avoid memory overflows. This approach may be difficult to verify, and memory leaks—memory that is allocated but never freed—may escape into production code.

Just as important for embedded systems, the memory management routines themselves cost a great deal in both performance and energy consumption. Because the execution time of memory management routines is highly data-dependent, these routines make it difficult to meet real-time deadlines. A combination of static and dynamic buffer management techniques can make sure that buffers are properly sized at a much smaller runtime cost.
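As one concrete illustration of combining static and dynamic buffer management, a fixed-capacity ring buffer is sized at compile time but filled and drained at run time, avoiding malloc()/free() entirely. This is a minimal sketch, not from the text; the capacity BUF_SIZE, the type names, and the overflow policy are illustrative assumptions. Because push and pop cost a handful of instructions with no data-dependent loops, their execution time is easy to bound for real-time deadlines.

```c
#include <stdint.h>

#define BUF_SIZE 64              /* assumed capacity; must be a power of two */

typedef struct {
    int32_t  data[BUF_SIZE];
    uint32_t head;               /* total pushes so far */
    uint32_t tail;               /* total pops so far   */
} ring_buffer;

/* Returns 1 on success, 0 if the buffer is full (caller must handle
 * the overflow instead of silently losing data). */
static int rb_push(ring_buffer *rb, int32_t v)
{
    if (rb->head - rb->tail == BUF_SIZE)
        return 0;
    rb->data[rb->head++ & (BUF_SIZE - 1)] = v;   /* mask gives the slot index */
    return 1;
}

/* Returns 1 and stores the oldest value in *v, or 0 if empty. */
static int rb_pop(ring_buffer *rb, int32_t *v)
{
    if (rb->head == rb->tail)
        return 0;
    *v = rb->data[rb->tail++ & (BUF_SIZE - 1)];
    return 1;
}
```

The power-of-two capacity lets the index be computed with a mask rather than a modulo, and the unsigned head/tail counters wrap around safely.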

Several groups at IMEC, including De Greef et al. [5], Franssen et al. [6], and Masselos et al. [10], developed methodologies and algorithms for data transfer and storage management. This line of work tries to restructure code to both minimize the required amount of buffer memory and to improve the performance of accesses to that buffer memory.

Data transfer and storage management makes use of many techniques, including loop transformations and control flow analysis. It requires analyzing the complete application to understand the dependencies between different sections of code, such as different loop nests, and to balance concerns across the entire program.

Panda et al. [14] give some examples that illustrate how loop transformations can be used to improve buffer utilization. Consider the following code.

for (i = 0; i < N; ++i)
  for (j = 0; j < N - L; ++j)
    b[i][j] = 0;
for (i = 0; i < N; ++i)
  for (j = 0; j < N - L; ++j)
    for (k = 0; k < L; ++k)
      b[i][j] += a[i][j + k];

This code causes buffering problems because both loop nests modify the b array. The first loop nest modifies all elements of b before the second loop nest has a chance to use those locations. If we combine the two loop nests, we can reuse the b elements much more easily.

for (i = 0; i < N; ++i)
  for (j = 0; j < N - L; ++j) {
    b[i][j] = 0;
    for (k = 0; k < L; ++k)
      b[i][j] += a[i][j + k];
  }

Moving the first definition of b closer to its next use makes it much easier for lower-level optimizations to take advantage of mechanisms such as prefetching. Panda et al. [14] also showed that loop analysis can be used to help analyze buffers more explicitly.

They added signal copies to the code to make data reuse explicit. The buffer a_buf is L words long and buffers the a array; b_buf is one word long and buffers the b values. These buffers are declared in the program but do not need to exist in the final implementation.

int a_buf[L];
int b_buf;
for (i = 0; i < N; ++i) {
  /* initialize a_buf */
  for (j = 0; j < N - L; ++j) {
    b_buf = 0;
    a_buf[(j + L - 1) % L] = a[i][j + L - 1];
    for (k = 0; k < L; ++k)
      b_buf += a_buf[(j + k) % L];
    b[i][j] = b_buf;
  }
}

Once the copies have been made explicit, analysis methods can be used to determine the lifetimes of values in those buffers.

Cache- and Scratch Pad–Oriented Optimizations
Cache-oriented optimizations take advantage of two different cache effects. Clearly, some optimizations are designed to improve cache hit rates by reducing the number of conflicts caused by memory access patterns. By rearranging data so that important elements do not knock each other out of the cache, we can greatly increase performance as well as reduce energy consumption.

Data can also be rearranged in the cache to take advantage of prefetching. Virtually all caches fetch more than one word at a time. If we can organize data so that a single cache miss pulls in several successive values of interest, then we can greatly reduce the access time for the other elements of the cache line.
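The prefetching effect can be made concrete by counting how often an access sequence crosses a cache-line boundary: sequential data pays one miss per line, while strided data can miss on every access. The helper below is an illustrative sketch, not from the text; the 32-byte line size and the function name are assumptions. It counts line changes along the sequence, which for monotonic access patterns equals the number of distinct lines fetched.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32   /* assumed cache line size */

/* Counts how many times the byte-address sequence moves to a new
 * cache line.  Fewer line changes means more accesses are satisfied
 * by data already pulled in with an earlier miss. */
static size_t lines_touched(const uintptr_t *addrs, size_t n)
{
    size_t count = 0;
    uintptr_t last = (uintptr_t)-1;
    for (size_t i = 0; i < n; i++) {
        uintptr_t line = addrs[i] / LINE_BYTES;
        if (line != last) {
            count++;
            last = line;
        }
    }
    return count;
}
```

For sixteen 4-byte elements laid out contiguously, all accesses fall into two 32-byte lines; the same sixteen elements spaced 64 bytes apart require a separate line fetch each.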

Panda et al. [11] developed algorithms to place data in memory to reduce cache conflicts. Their methodology for scalar variables includes four steps:

1. Build a closeness graph that describes the relationship between accesses.
2. Cluster the variables into cache line-sized units.
3. Build a cluster interference graph that describes how clusters interact in the cache.
4. Use the cluster interference graph to optimize cluster placement.
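Step 2 of the methodology can be sketched as a greedy merge: repeatedly combine the two clusters joined by the heaviest remaining closeness edge, as long as the merged cluster still fits in a cache line. This is an illustrative sketch of the idea, not the published heuristic; the array sizes and weights are assumptions.

```c
#define NV   4   /* number of scalar variables (illustrative) */
#define LINE 2   /* variables per cache line (illustrative)   */

/* w[i][j] is the closeness weight between variables i and j.
 * On return, cluster[i] holds the cluster id of variable i. */
static void cluster_vars(const int w[NV][NV], int cluster[NV])
{
    int size[NV];
    for (int i = 0; i < NV; i++) { cluster[i] = i; size[i] = 1; }
    for (;;) {
        int best = 0, bi = -1, bj = -1;
        /* find the heaviest edge joining two distinct clusters
         * whose union still fits in one line */
        for (int i = 0; i < NV; i++)
            for (int j = i + 1; j < NV; j++)
                if (cluster[i] != cluster[j] &&
                    size[cluster[i]] + size[cluster[j]] <= LINE &&
                    w[i][j] > best) {
                    best = w[i][j]; bi = i; bj = j;
                }
        if (bi < 0)
            break;                      /* no mergeable edge left */
        int from = cluster[bj], to = cluster[bi];
        size[to] += size[from];
        for (int k = 0; k < NV; k++)    /* merge the two clusters */
            if (cluster[k] == from) cluster[k] = to;
    }
}
```

With heavy edges between variables 0–1 and 2–3 and a two-word line, the sketch packs each pair into its own cluster, keeping closely related variables on the same line.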

The closeness graph is built from the access patterns of the variables, and it is a weighted, fully connected undirected graph with one node per variable. The weight of each edge {u, v} is equal to the length of the path between u and v.

We then group the nodes of the graph into clusters of size L or less, where L is the number of words in a cache line, using a greedy heuristic. The cluster interference graph has one node per cluster. The weight of the edge {U, V} is equal to the number of times that clusters U and V are alternately executed. They use the greedy algorithm of Figure 3-21 to place clusters in memory.

Figure 3-21: An algorithm for assigning clusters of scalar variables to memory to minimize conflicts

Panda et al. [11] use a similar technique to place arrays accessed by inner loops, though the method required to evaluate array conflicts is more complex than the simple test used for scalars. We will concentrate on the direct-mapped cache case here.

They first build an interference graph whose nodes are the arrays; the weight of an edge {u, v} is equal to the number of cache conflicts that could arise between the two arrays. To optimize array placement, they need to calculate whether two arrays in the same loop will map into the same cache line. We are given two word addresses, X and Y. The direct-mapped cache has k lines, and each line holds M words. Then X and Y will map onto the same cache line if

⌊X/M⌋ mod k = ⌊Y/M⌋ mod k

which can be rewritten as

nkM − M < X − Y < nkM + M

where n is an integer.
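The same-line test reduces to a one-line predicate. The sketch below assumes word addresses and a direct-mapped cache with k lines of M words each; the function name is an illustrative assumption.

```c
#include <stdint.h>

/* Returns nonzero if word addresses x and y map onto the same line of
 * a direct-mapped cache with k lines of m words each: the line index
 * is floor(addr / m) mod k, and a conflict needs equal indices. */
static int same_cache_line(uint32_t x, uint32_t y, uint32_t k, uint32_t m)
{
    return (x / m) % k == (y / m) % k;
}
```

For a cache with four 4-word lines, addresses 0 and 3 share line 0, address 4 falls on line 1, and address 16 wraps around to line 0 again.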

Figure 3-22 shows the Panda et al. algorithm for assigning addresses to arrays. They use a greedy strategy to move arrays apart to minimize conflicts. The AssignmentCost function uses the above formula to test the placement of an array A against all other arrays with which it may conflict and return the cost of a particular address assignment.

Figure 3-22: Algorithm for assigning addresses to arrays to minimize direct-mapped cache conflicts

Kandemir et al. [8] developed a methodology that combines data and loop transformations to optimize cache performance. They first transform the loop nest such that the destination array (the array on the left side of the assignment) has its innermost index as the only element in one array dimension and that this index is not used in any other dimension.

They then align references to the right side matrices to conform to the access pattern for the left side. They search through various transformations on the right side to find the best alternative.

Panda et al. [13] developed methods for managing scratch pad contents by allocating variables to the scratch pad. They determined that static variables were best managed statically, with allocation decisions made at compile time. They chose to map all scalars to the scratch pad to avoid creating cache conflicts between arrays and scalars.
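For the arrays that compete for the remaining scratch pad space, one plausible static policy (an illustrative sketch, not Panda et al.'s actual formulation) is to rank arrays by profiled accesses per byte and greedily fill the scratch pad capacity. The array count, sizes, and access counts below are assumptions.

```c
#define NARR 3   /* number of candidate arrays (illustrative) */

/* size[] in bytes, acc[] = profiled access counts.  Sets chosen[i] to
 * 1 if array i is mapped to a scratch pad of the given byte capacity.
 * Greedy by access density, compared via cross-multiplication to
 * avoid floating point: acc[i]/size[i] > acc[b]/size[b]
 * iff acc[i]*size[b] > acc[b]*size[i]. */
static void choose_spm(const int size[NARR], const int acc[NARR],
                       int capacity, int chosen[NARR])
{
    int used = 0, done[NARR] = {0};
    for (int i = 0; i < NARR; i++) chosen[i] = 0;
    for (;;) {
        int best = -1;
        for (int i = 0; i < NARR; i++)
            if (!done[i] && used + size[i] <= capacity &&
                (best < 0 || acc[i] * size[best] > acc[best] * size[i]))
                best = i;
        if (best < 0)
            break;                 /* nothing else fits */
        done[best] = chosen[best] = 1;
        used += size[best];
    }
}
```

Because the decision is made entirely at compile time from profile data, the run-time cost is zero, matching the static management the authors favor.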
