
Memory-oriented optimization techniques for dealing with performance bottlenecks: Part 2

Editor’s Note: In this two-part article from High-Performance Embedded Computing, author Wayne Wolf examines how to overcome system performance bottlenecks with memory-oriented programming and compiler techniques. Part 2 covers buffer, data transfer, and storage management, as well as main memory optimizations.

Buffers are important because they mediate between subsystems. In many embedded applications, the various stages produce and consume data somewhat irregularly. As a result, we need buffers to make sure that all data makes it from the producer to the consumer.

If the buffer is too small, data is lost. If the buffer is too large, then memory is wasted. Overly large buffers not only add to the manufacturing cost of the system, but they also consume both static and dynamic energy.

A traditional approach is to use dynamic memory management—for example, the C library functions malloc() and free(). By allocating buffers of the proper size, we can avoid memory overflows. This approach may be difficult to verify, and memory leaks—memory that is allocated but never freed—may escape into production code.

Just as important for embedded systems, the memory management routines themselves cost a great deal in both performance and energy consumption. Because the execution time of memory management routines is highly data-dependent, these routines make it difficult to meet real-time deadlines. A combination of static and dynamic buffer management techniques can make sure that buffers are properly sized at a much smaller runtime cost.
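For illustration (a minimal sketch, not from the original text), a statically allocated ring buffer avoids malloc() and free() entirely and gives constant-time, data-independent insert and remove operations; the depth BUF_SIZE is a placeholder that would come from worst-case producer/consumer rate analysis.

#include <stdint.h>
#include <stddef.h>

/* Assumed worst-case depth, derived from producer/consumer rate analysis. */
#define BUF_SIZE 64

static int32_t buf[BUF_SIZE];
static size_t head, tail, count;   /* statically zero-initialized */

/* Returns 0 on success, -1 if the buffer is full (data would be lost). */
int buf_put(int32_t sample)
{
    if (count == BUF_SIZE)
        return -1;
    buf[head] = sample;
    head = (head + 1) % BUF_SIZE;
    count++;
    return 0;
}

/* Returns 0 on success, -1 if the buffer is empty. */
int buf_get(int32_t *sample)
{
    if (count == 0)
        return -1;
    *sample = buf[tail];
    tail = (tail + 1) % BUF_SIZE;
    count--;
    return 0;
}

Because the buffer is sized at compile time, the only runtime cost is the index arithmetic, which is the same on every call and therefore easy to account for in a deadline analysis.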

Several groups at IMEC, including De Greef et al. [5], Franssen et al. [6], and Masselos et al. [10], developed methodologies and algorithms for data transfer and storage management. This line of work tries to restructure code to both minimize the required amount of buffer memory and to improve the performance of accesses to that buffer memory.

Data transfer and storage management makes use of many techniques, including loop transformations and control flow analysis. It requires analyzing the complete application to understand the dependencies between different sections of code, such as different loop nests, and to balance concerns across the entire program.

Panda et al. [14] give some examples that illustrate how loop transformations can be used to improve buffer utilization. Consider the following code.

for (i = 0; i < N; ++i)
  for (j = 0; j < N - L; ++j)
    b[i][j] = 0;
for (i = 0; i < N; ++i)
  for (j = 0; j < N - L; ++j)
    for (k = 0; k < L; ++k)
      b[i][j] += a[i][j + k];

This code causes buffering problems because both loop nests modify the b array. The first loop nest modifies all elements of b before the second loop nest has a chance to use those locations. If we combine the two loop nests, we can reuse the b elements much more easily.

for (i = 0; i < N; ++i)
  for (j = 0; j < N - L; ++j) {
    b[i][j] = 0;
    for (k = 0; k < L; ++k)
      b[i][j] += a[i][j + k];
  }

Moving the first definition of b closer to its next use makes it much easier for lower-level optimizations to take advantage of mechanisms such as prefetching. Panda et al. [14] also showed that loop analysis can be used to help analyze buffers more explicitly.

They added signal copies to the code to make data reuse explicit. The buffer a_buf is L words long and buffers the a array; b_buf is one word long and buffers the b values. These buffers are declared in the program but do not need to exist in the final implementation.

int a_buf[L];
int b_buf;
for (i = 0; i < N; ++i) {
  /* initialize a_buf with the leading elements of row i of a */
  for (j = 0; j < N - L; ++j) {
    b_buf = 0;
    a_buf[(j + L - 1) % L] = a[i][j + L - 1];
    for (k = 0; k < L; ++k)
      b_buf += a_buf[(j + k) % L];
    b[i][j] = b_buf;
  }
}

Once the copies have been made explicit, analysis methods can be used to determine the lifetimes of values in those buffers.

Cache- and Scratch Pad–Oriented Optimizations
Cache-oriented optimizations take advantage of two different cache effects. Clearly, some optimizations are designed to improve cache hit rates by reducing the number of conflicts caused by memory access patterns. By rearranging data so that important elements do not knock each other out of the cache, we can greatly increase performance as well as reduce energy consumption.

Data can also be rearranged in the cache to take advantage of prefetching. Virtually all caches fetch more than one word at a time. If we can organize data so that a single cache miss pulls in several successive values of interest, then we can greatly reduce the access time for the other elements of the cache line.
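As a simple illustration (not from the original text), grouping values that are used together into adjacent words means that the miss that fetches one of them prefetches the rest of the cache line; the structure and field names below are hypothetical.

/* Illustrative only: scalars that are used together are packed into one
   structure so they occupy adjacent words; the miss that fetches one field
   pulls the others into the same cache line. */
struct filter_state {
    int gain;
    int offset;
    int threshold;
    int last_sample;
};
static struct filter_state state;

int filter_step(int x)
{
    /* All four fields are typically brought in by a single line fill. */
    if (x > state.threshold)
        x = (x - state.offset) * state.gain;
    state.last_sample = x;
    return x;
}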

Panda et al. [11] developed algorithms to place data in memory to reduce cache conflicts. Their methodology for scalar variables includes four steps:

1. Build a closeness graph that describes the relationship between accesses.
2. Cluster the variables into cache line-sized units.
3. Build a cluster interference graph that describes how clusters interact in the cache.
4. Use the cluster interference graph to optimize cluster placement.

The closeness graph is built from the access patterns of the variables, and it is a weighted, fully connected, undirected graph with one node per variable. The weight of each edge {u, v} is equal to the length of the path between u and v.

We then group the nodes of the graph into clusters of size L or less, where L is the number of words in a cache line; Panda et al. use a greedy heuristic for this clustering. The cluster interference graph has one node per cluster. The weight of the edge {U, V} is equal to the number of times that accesses to clusters U and V alternate. They use the greedy algorithm of Figure 3-21 to place clusters in memory.

Figure 3-21: An algorithm for assigning clusters of scalar variables to memory to minimize conflicts
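A minimal sketch of a greedy placement in this spirit (not the exact algorithm of Figure 3-21; the interference weights, cluster count, and cache geometry are assumed inputs) might look like this:

#define NUM_CLUSTERS 8
#define NUM_LINES    4   /* lines in the direct-mapped cache */

extern unsigned iw[NUM_CLUSTERS][NUM_CLUSTERS];  /* cluster interference weights */
int line_of[NUM_CLUSTERS];                       /* cache line chosen for each cluster */

/* Sketch: place each cluster on the cache line that minimizes its weighted
   conflicts with the clusters placed so far. */
void place_clusters(void)
{
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        unsigned best_cost = ~0u;
        int best_line = 0;
        for (int line = 0; line < NUM_LINES; line++) {
            unsigned cost = 0;
            for (int d = 0; d < c; d++)
                if (line_of[d] == line)
                    cost += iw[c][d];
            if (cost < best_cost) {
                best_cost = cost;
                best_line = line;
            }
        }
        line_of[c] = best_line;
    }
}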

Panda et al. [11] use a similar technique to place arrays accessed by inner loops, though the method required to evaluate array conflicts is more complex than the simple test used for scalars. We will concentrate on the direct-mapped cache case here.

They first build an interference graph whose nodes are the arrays; the weight of an edge {u, v} is equal to the number of cache conflicts that could arise between the two arrays. To optimize array placement, they need to calculate whether two arrays in the same loop will map into the same cache line. We are given two addresses, X and Y. The direct-mapped cache has k lines, and a line holds M words. Then X and Y will map onto the same cache line if

⌊X/M⌋ mod k = ⌊Y/M⌋ mod k

which can be rewritten as

⌊X/M⌋ - ⌊Y/M⌋ = nk

where n is an integer.
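In code, this test could be written as the following small helper (a sketch; X and Y are word addresses, and the cache geometry parameters are passed in explicitly):

/* Sketch: do word addresses X and Y fall into the same line of a
   direct-mapped cache with k lines of M words each? */
int same_cache_line(unsigned long X, unsigned long Y,
                    unsigned long M, unsigned long k)
{
    return ((X / M) % k) == ((Y / M) % k);
}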

Figure 3-22 shows the Panda et al. algorithm for assigning addresses to arrays. They use a greedy strategy to move arrays apart to minimize conflicts. The AssignmentCost function uses the above formula to test the placement of an array A against all other arrays with which it may conflict and returns the cost of a particular address assignment.

Figure 3-22: Algorithm for assigning addresses to arrays to minimize direct-mapped cache conflicts

Kandemir et al. [8] developed a methodology that combines data and loop transformations to optimize cache performance. They first transform the loop nest so that the destination array (the array on the left-hand side of the assignment) is indexed in one dimension only by the loop nest's innermost index, and that index is not used in any other dimension.

They then align references to the arrays on the right-hand side of the assignment to conform to the access pattern of the left-hand side, searching through various transformations of the right-hand side to find the best alternative.
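As a simple illustration of the kind of transformation involved (not taken from [8]), interchanging the loops below makes the innermost index sweep the fastest-varying dimension of the destination array, so successive iterations stay within the same cache lines; the arrays and the bound N are assumed.

/* Before: with row-major C arrays, the innermost index i strides down a
   column, touching a new cache line on almost every iteration. */
for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
        c[i][j] = a[i][j] + b[i][j];

/* After interchanging the loops, the innermost index j walks along a row,
   so successive accesses to c, a, and b fall in the same cache lines. */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        c[i][j] = a[i][j] + b[i][j];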

Panda et al. [13] developed methods for managing scratch pad contents by allocating variables to the scratch pad. They determined that static variables were best managed statically, with allocation decisions made at compile time. They chose to map all scalars to the scratch pad to avoid creating cache conflicts between arrays and scalars.

Arrays can be mapped into either the cache or the scratch pad. If two arrays have nonintersecting lifetimes, then their relative allocation is not important. Arrays with intersecting lifetimes may conflict; analysis is required to determine which ones are mapped into the cache versus the scratch pad. Panda et al. define several metrics to analyze conflicts:
  • VAC(u), the variable access count of variable u, counts the number of times that u is accessed during its lifetime.
  • IAC(u), the interference access count of variable u, counts the number of times other variables (v ≠ u) are accessed during the lifetime of u.
  • IF(u), the interference factor of u, is defined as IF(u) = VAC(u) + IAC(u).
Variables with a high IF value are likely to be interfered with in the cache and are good candidates for promotion to the scratch pad. Panda et al. then use these metrics to define a loop-oriented conflict metric, known as the loop conflict factor or LCF, of a variable u:

LCF(u) = Σ_{i=1}^{p} Σ_{a accessed in loop i} k(a)

In this formula, the k(a) function denotes the number of accesses to variable a. The outer summation is over all p loops in which u is accessed, and the inner summation is over all variables accessed in the ith loop. The total conflict factor, TCF, for an array u is

TCF(u) = LCF(u) + IF(u)
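As a rough sketch (not from the original text), these metrics could be computed from per-loop access counts as follows; the acc table, the loop and array counts, and the simplification of ignoring exact lifetimes are all assumptions.

#define NUM_LOOPS  4
#define NUM_ARRAYS 3

extern unsigned acc[NUM_LOOPS][NUM_ARRAYS];  /* acc[i][a]: accesses to array a in loop i */

/* Sketch: TCF(u) = LCF(u) + IF(u), with IF(u) = VAC(u) + IAC(u).
   Lifetimes are ignored for simplicity, so IAC counts all accesses to
   other arrays rather than only those made during u's lifetime. */
unsigned total_conflict_factor(int u)
{
    unsigned vac = 0, iac = 0, lcf = 0;
    for (int i = 0; i < NUM_LOOPS; i++) {
        unsigned loop_total = 0;
        for (int a = 0; a < NUM_ARRAYS; a++) {
            loop_total += acc[i][a];
            if (a == u)
                vac += acc[i][a];
            else
                iac += acc[i][a];
        }
        if (acc[i][u] > 0)      /* LCF sums only over loops that access u */
            lcf += loop_total;
    }
    return lcf + vac + iac;
}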
These metrics are used by algorithms to allocate variables between the scratch pad and main memory/cache. We can formulate the problem as follows: choose the set S of arrays to place in the scratch pad so as to maximize Σ_{u ∈ S} TCF(u), subject to Σ_{u ∈ S} size(u) being no larger than the scratch pad size.
This problem is a generalized knapsack problem in that several arrays with non-overlapping lifetimes can share the same space in the scratch pad. Panda et al.'s algorithm starts by clustering together arrays that could share scratch pad space. It then uses an approximation algorithm that first sorts the items by value per weight, as given by the access density AD (the total conflict factor divided by the array's size).

Arrays are then greedily selected for allocation to the scratch pad, starting with the array with the highest AD value, until the scratch pad is full. Figure 3-23 shows the allocation algorithm of Panda et al. Scalar variables are allocated to the scratch pad, and arrays that are too large to fit into the scratch pad are allocated to main memory.

Figure 3-23: An algorithm for scratch pad allocation
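A minimal sketch of the greedy selection step (not the exact algorithm of Figure 3-23), assuming the arrays have already been sorted by decreasing access density AD = TCF/size, might look like this:

struct array_info {
    unsigned size;    /* words */
    unsigned tcf;     /* total conflict factor */
    int      in_spm;  /* set to 1 if allocated to the scratch pad */
};

/* Sketch: greedily allocate arrays (pre-sorted by decreasing AD) to a
   scratch pad of spm_words words; everything else stays in main
   memory/cache. */
void allocate_scratch_pad(struct array_info *arr, int n, unsigned spm_words)
{
    unsigned used = 0;
    for (int i = 0; i < n; i++) {
        if (used + arr[i].size <= spm_words) {
            arr[i].in_spm = 1;
            used += arr[i].size;
        } else {
            arr[i].in_spm = 0;
        }
    }
}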

A compatibility graph is created to determine arrays with compatible lifetimes. Cliques are then allocated, but the algorithm may allocate either a full clique or a proper subset of the clique to the scratch pad.

Figure 3-24: Performance comparison of scratch pad allocation

Figure 3-24 shows the performance of the scratch pad allocation algorithm of [13] on one benchmark. The experiment compared four configurations: using the on-chip memory only as scratch pad SRAM, using it only as data cache, random allocation into a scratch pad that occupies half the available SRAM, and Panda et al.'s allocation algorithm targeting a scratch pad that occupies half the available SRAM.

Main Memory-Oriented Optimizations
An idealized model of a memory chip is purely functional: present an address to the memory and receive the contents of the desired location after a fixed delay. However, high-performance memory systems break this simple model to provide higher performance and lower energy consumption:

  • Burst access modes access a sequence of memory locations.
  • Paged memories take advantage of the properties of memory components to reduce access times.
  • Banked memories are systems of memory components that allow parallel accesses.

Burst modes are provided by some memory components as well as by memory subsystems such as Rambus. A burst-mode access provides a start address and a length; the length may be provided as a sequence of pulses or as a binary count. The memory accesses the sequence of locations starting with the given location. Burst accesses reduce the number of addresses that must be sent to memory. They also take advantage of internal registers to increase the transfer rate.

Most large memory components support some form of paged addressing. The details may differ from device to device, but the address is split into two sections: page and offset. (These may also be known as row and column.) The page is presented first, followed by the offset. The page is also stored in a register in the memory component so that it can be used for subsequent accesses.

Successive accesses to the same page take much less time than accesses that hop across pages. This is due not only to the time saved by not re-sending the page number to the component, but also to signal propagation within the memory component. Some paging mechanisms are faster when sequential locations are referenced. Other schemes merely require that addresses on the same page be accessed.

A compiler may take advantage of paged addressing by proper placement of data and addressing patterns. Scalar data may take advantage of paged addressing by placing successive references in adjacent locations on the same memory page. Array data accesses may also take advantage of locality.
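As an illustration (not from the original text), the two traversals below touch the same array, but the row-wise version keeps successive accesses on the same memory page while the column-wise version hops across pages on nearly every access; the array shape and the page size are assumptions.

#define ROWS 256
#define COLS 1024
extern int a[ROWS][COLS];   /* assume each 4-KB row maps onto one memory page */

/* Row-wise traversal: successive accesses fall on the same page. */
long sum_rows(void)
{
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal of the same data: nearly every access lands on a
   different page, forcing the page register to be reloaded. */
long sum_cols(void)
{
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}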

As illustrated in Figure 3-25, a banked memory is built from several parallel memory components, which may or may not provide paged access. A banked memory may be built from multiple chips; some single-chip memory components provide multiple memory banks.

Figure 3-25: A banked memory system

Some of the address bits are decoded outside the memory component to select which chip is being addressed. Provided that the bus protocol is properly designed, the memory system can start one access in one bank by presenting an address to that bank, then start a second access to a different bank while waiting for the first access to complete.
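As a sketch of the address decoding involved (the bank count, line size, and bit positions below are illustrative assumptions, not from the text), the low-order line-index bits can select the bank, so that consecutive lines fall in different banks and their accesses can overlap.

#define LINE_BYTES 32u
#define NUM_BANKS   4u

/* Sketch: with interleaving, the line index modulo the number of banks
   selects the bank, so lines 0, 1, 2, 3 go to banks 0, 1, 2, 3 and
   line 4 wraps back to bank 0. */
unsigned bank_of(unsigned long addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_BANKS);
}

/* Address presented to the selected bank: the remaining high-order bits
   plus the offset within the line. */
unsigned long addr_in_bank(unsigned long addr)
{
    return (addr / (LINE_BYTES * NUM_BANKS)) * LINE_BYTES
         + (addr % LINE_BYTES);
}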

Part 1: loop transformations; global-, cache-, and scratchpad optimizations

References
1. Compiler transformations for high performance computing , DF Bacon, SL Graham, OJ Sharp – ACM Computing Surveys (CSUR), 1994

2. The real time specification for Java , G Bollella, J Gosling – IEEE Computer, 2000 

3. Custom memory management methodology: exploration of memory organisation for embedded multimedia system design , Francky Catthoor, Eddy De Greef, Sven Wuytack, et al., Kluwer Academic Publishers

4. Loop parallelization algorithms: from parallelism extraction to code generation ; P Boulet, A Darte, GA Silber, F Vivien – Parallel Computing, 1998 – Elsevier

5. Memory organization for video algorithms on programmable system processors , De Greef, E.; Catthoor, F.; De Man, H. , VLSI in Computers and Processors 1995

6. Control flow optimization for fast system simulation , European Design and Test Conference, 1994, Franssen, F. Nachtergaele, L.; Samsom, H.; Catthoor, F.; De Man, H.

7. Memory aware compilation through accurate timing extraction , P Grun, N Dutt, A Nicolau, Proceedings of the 37th Annual Design Automation Conference

8. Improving cache locality by a combination of loop and data transformations , Kandemir, M.; Ramanujam, J.; Choudhary, A., IEEE Transactions on Computers, 1999

9. Influence of compiler optimization on system power , Mahmut Kandemir, Narayanan Vijaykrishnan, Mary Jane Irwin, Wu Ye, Proceedings of the 37th Annual Design Automation Conference

10. A performance oriented use methodology of power optimizing code transformations , Masselos, K.; Catthoor, F.; Goutis, C.E.; De Man, H., IEEE Workshop on Signal Processing Systems

11. Memory data organization for improved cache performance in embedded processor applications , PR Panda, ND Dutt, A Nicolau , ACM Transactions on Design Automation of Electronic Systems

12. Memory bank customization and assignment in behavioral systems , PR Panda Proceedings of the IEEE/ACM international conference on Computer-aided design

13. On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems , PR Panda, ND Dutt, A Nicolau, ACM Transactions on Design Automation of Electronic Systems

14. Data and memory optimization techniques for embedded systems , P. R. Panda et al., ACM Transactions on Design Automation of Electronic Systems

15. A data locality optimizing algorithm , ME Wolf, MS Lam , ACM Sigplan Notices

16. “High performance compilers for parallel computing”, MJ Wolfe, C Shanklin, L Ortega, in High Performance Compilers for Parallel Computing , Addison-Wesley Longman Publishing Co., Inc.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at the Georgia Institute of Technology's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he also worked at AT&T Bell Laboratories. He has served as editor in chief of ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.

This two-part series is based on material printed from High-Performance Embedded Computing by Wayne Wolf, with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007. For more information about this title and other similar books, visit www.elsevierdirect.com.
