Memory-oriented optimization techniques for dealing with performance bottlenecks: Part 1

(Editor's Note: In a two-part article from High Performance Embedded Computing, author Wayne Wolf examines how to overcome system performance bottlenecks with memory-oriented programming and compiler techniques. Part 1 covers loop transformations and a variety of global-, cache-, and scratchpad-oriented optimizations.)

Memory is a key bottleneck in embedded systems. Many embedded computing applications spend a lot of their time accessing memory. The memory system is a prime determinant of not only performance but also energy consumption.

Memory system optimizations can target any stage of the memory hierarchy. A large variety of techniques have been developed by both the general-purpose and embedded communities to optimize cache performance. More recently, optimization techniques for scratch pad memories have been developed. Optimizations can also target main memories, particularly when they are partitioned.

Memory system optimizations can target either data or instructions. Array optimizations are an important class of data-oriented optimizations. Control flow analysis leads to methods for improving the cache behavior of instructions. Global memory analysis is particularly important in embedded systems.

Many embedded systems are composed of many subsystems that pass data between themselves. The buffers between those subsystems must be carefully sized to avoid both buffer overflows and wasted memory.

Loop Transformations
Some optimizations are applied early during compilation, without detailed knowledge of the target hardware. Such transformations try to expose parallelism that can be used by later stages. Loops are a prime candidate for such transformations since they may provide a great source for data parallelism.

Loop transformations have been studied for decades for use in scientific programs and optimizing compilers. There is not enough space here to do this topic justice; we only introduce a few concepts that illustrate how this body of work can be used.

An ideal loop can be executed entirely in parallel. The code on the left side of Figure 3-15 has a loop body in which all the arrays are indexed only by i. Therefore, no loop iteration depends on any other iteration. As a result, all the loop bodies could be executed in parallel, in any order, without affecting the result. In the code on the right side of the figure, the ith iteration depends on the result of the (i – 1)th iteration.

Figure 3-15 Parallelism and constraints in loops

In this case, we cannot finish iteration i until iteration i – 1 has finished, so these loop bodies must be executed in the order in which they are enumerated by the loop. Data dependencies that flow from one loop iteration to another are known as loop-carried dependencies.
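As a concrete sketch of the two cases, assuming simple one-dimensional float arrays (the array names and bounds here are illustrative, not the figure's):

   #define N 1024

   /* Independent iterations: each c[i] depends only on a[i] and b[i], so the
      iterations can run in parallel or in any order. */
   void independent(float a[N], float b[N], float c[N])
   {
      for (int i = 0; i < N; i++)
         c[i] = a[i] * b[i];
   }

   /* Loop-carried dependency: iteration i reads c[i - 1], the value produced
      by iteration i - 1, so the iterations must run in order. */
   void carried(float a[N], float c[N])
   {
      for (int i = 1; i < N; i++)
         c[i] = c[i - 1] + a[i];
   }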

The compiler must schedule operations so that they do not violate the data dependencies—that is, the code does not try to use a value before it has been computed. In general, many possible schedules satisfy the data dependencies, provided that we have a way to enumerate those dependencies. While a single loop may provide some opportunities, a loop nest offers many possibilities for parallelism that require detailed analysis.

As shown in Figure 3-16 , a loop nest is a set of loops, one inside the other. A perfect loop nest has no conditionals inside the nest. An imperfect loop nest has conditionals that cause some statements in the nest to not be executed in some cases.

Figure 3-16: Loop nests
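To make the distinction concrete, here is a minimal sketch (these nests are illustrative, not the ones in the figure):

   #define N 64

   /* Perfect loop nest: the only statements are in the innermost loop body. */
   void perfect_nest(int x[N][N])
   {
      for (int i = 0; i < N; i++)
         for (int j = 0; j < N; j++)
            x[i][j] = i + j;
   }

   /* Imperfect loop nest: a statement between the loops and a conditional in
      the body mean that some statements are not executed in some cases. */
   void imperfect_nest(int x[N][N])
   {
      for (int i = 0; i < N; i++) {
         x[i][0] = 0;            /* statement outside the inner loop */
         for (int j = 1; j < N; j++)
            if (i != j)          /* conditional inside the nest */
               x[i][j] = i - j;
      }
   }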

Loops and loop nests have many data dependencies—each operation in each iteration generates its own data dependencies. However, we also know that the loop provides structure and regularity that we can exploit. Some possible transformations on loops and loop nests include:

  • Loop permutation changes the order of the loops in the nest.
  • Index rewriting changes the way the loop indexes are expressed.
  • Loop unrolling creates several copies of a loop body and modifies the loop indexes appropriately.
  • Loop splitting takes a loop with multiple operations and creates a separate loop for each operation; loop fusion performs the opposite transformation. (Unrolling and splitting are sketched in the code after this list.)
  • Loop tiling splits a loop into a nest of loops, with each inner loop working on a small block of data.
  • Loop padding adds data elements to an array to change how the array maps into the memory system structure.
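As an illustration of two of these transformations, here is a hedged sketch of loop unrolling (by a factor of 2) and loop splitting applied to a simple vector loop; the arrays and the unrolling factor are arbitrary choices, not taken from the text:

   #define N 1024               /* assume N is even for the unrolled version */

   /* Original loop: two independent operations per iteration. */
   void original(float a[N], float b[N], float c[N], float d[N])
   {
      for (int i = 0; i < N; i++) {
         c[i] = a[i] + b[i];
         d[i] = a[i] * b[i];
      }
   }

   /* Loop unrolling: two copies of the body per iteration, index advanced by 2. */
   void unrolled(float a[N], float b[N], float c[N], float d[N])
   {
      for (int i = 0; i < N; i += 2) {
         c[i] = a[i] + b[i];
         d[i] = a[i] * b[i];
         c[i + 1] = a[i + 1] + b[i + 1];
         d[i + 1] = a[i + 1] * b[i + 1];
      }
   }

   /* Loop splitting: one loop per operation; loop fusion is the reverse step. */
   void split(float a[N], float b[N], float c[N], float d[N])
   {
      for (int i = 0; i < N; i++)
         c[i] = a[i] + b[i];
      for (int i = 0; i < N; i++)
         d[i] = a[i] * b[i];
   }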

The polytope model [1, 4] is commonly used to represent and manipulate the data dependencies in loop nests. Figure 3-17 shows a loop nest. We can represent the loop nest as a matrix: each column in the matrix represents the iteration bounds of a loop.

Figure 3-17: Polytope representation of loop nests

The inner loop body modifies the values of c, creating a data dependency from c[i][j] to c[i][j + 1]. We represent each data element in the array as a node in a two-dimensional space, with the loop iteration variables forming the axes of the space. The nodes define a polytope of triangular shape in that space.

We add edges between the nodes to describe the loop-carried dependency between c[i][j] and c[i][j + 1]. The points in the polytope fully describe all the loop-carried dependencies, but we have not yet reduced the complexity of that representation. We can use a distance vector that describes the direction and distance of a set of dependencies. In this case, all the data dependencies are of the form [i j] = [1 0]. Any set of loops that performs this computation must satisfy these loop-carried dependencies, but many schedules are, in general, possible. We can apply an ordering vector that describes the order in which the loops visit the nodes.

We can use matrix algebra to manipulate the form of the loops. For example, if we want to interchange the loops, putting j in the outer loop and i in the inner loop, we can multiply the loop nest matrix by an interchange matrix, as follows.
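As a sketch of this manipulation—assuming the common formulation in which a unimodular transformation matrix acts on the iteration and dependence vectors, which may differ from the exact matrices used in the figure—a loop interchange corresponds to multiplying by the 2 × 2 permutation matrix:

$$
T = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad
T \begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} j \\ i \end{bmatrix}, \qquad
T \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.
$$

Applying T to each iteration vector swaps the roles of i and j, and applying it to the dependence distance [1 0] gives [0 1] for the interchanged nest.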

This gives the loop nest
   for (j = 0; j < N; j++)
      for (i = j + 1; i < N; i++)
         c[i + 1][j] = a[i][j] * b[j];

Not all loop transformations are legal. Wolf and Lam [15] showed that a loop transformation is legal if all the transformed dependence vectors are lexicographically positive—that is, if they do not point backward in the iteration space.
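As a small illustration of this test—the function below is not from the text—a dependence vector is lexicographically positive when its first nonzero component is positive:

   #include <stdbool.h>

   /* Returns true if the dependence vector d of length n is lexicographically
      positive, i.e., its first nonzero component is positive. A transformed
      dependence vector that fails this test points backward in the iteration
      space, so the transformation that produced it is illegal. */
   bool lexicographically_positive(const int *d, int n)
   {
      for (int k = 0; k < n; k++) {
         if (d[k] > 0) return true;
         if (d[k] < 0) return false;
      }
      return false;   /* the all-zero vector is not strictly positive */
   }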

Not all important loop transformations can be manipulated by matrix algebra. Loop tiling, for example, splits a large array into several smaller arrays to change the order in which the array elements are traversed. This transformation cannot be represented by matrix manipulations, but once a loop is tiled, the new loop nest can be analyzed using matrix methods.
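A minimal sketch of tiling applied to the loops over a two-dimensional array; the block size B and the array are illustrative, and N is assumed to be a multiple of B:

   #define N 1024
   #define B 32       /* tile (block) size, chosen so a block fits in the cache */

   /* Tiled traversal: the outer loops step over B-by-B blocks, and the inner
      loops finish one block before moving to the next, improving locality. */
   void tiled_scale(float x[N][N], float s)
   {
      for (int ii = 0; ii < N; ii += B)
         for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
               for (int j = jj; j < jj + B; j++)
                  x[i][j] *= s;
   }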

Loop permutation and loop fusion [16] can be used to reduce the time required to access matrix elements. Loop permutation reduces latency when it is used to make the order of array accesses match the order in which the elements are laid out in the underlying data structure. C stores multidimensional arrays in row-major format, so the innermost loop should step along a row. Figure 3-18 shows an example of loop permutation.

Figure 3-18: An example of loop permutation
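A minimal sketch of the idea, using an illustrative N × N float matrix (not necessarily the code in the figure):

   #define N 1024

   /* Column-order traversal: consecutive inner-loop iterations touch elements
      an entire row apart, so each access can miss in the cache. */
   void column_order(float a[N][N], float s)
   {
      for (int j = 0; j < N; j++)
         for (int i = 0; i < N; i++)
            a[i][j] *= s;
   }

   /* After loop permutation: the inner loop walks along a row, matching C's
      row-major layout, so consecutive accesses fall in the same cache line. */
   void row_order(float a[N][N], float s)
   {
      for (int i = 0; i < N; i++)
         for (int j = 0; j < N; j++)
            a[i][j] *= s;
   }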

Loop fusion (see the example in Figure 3-19) allows references to the same array in different loops to be combined and reused. The layout of array elements can also be modified to change the way that they map into the cache or a parallel memory system.

For example, transposing a matrix is an alternative to loop permutation. Loops can also be padded to change how the data elements fall into cache lines. The wasted memory may be more than made up for by improved cache performance.
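A small sketch of array padding, with illustrative sizes: when a row length is a power of two, elements in the same column of successive rows can map to the same cache sets, and a few wasted elements per row change that mapping.

   #define N   1024          /* power-of-two row length: prone to conflict misses */
   #define PAD 8             /* a few extra, unused elements per row */

   /* Unpadded: a[i][j] and a[i + 1][j] are exactly N floats apart, so a column
      traversal can hit the same cache set on every access. */
   float a[N][N];

   /* Padded: rows are N + PAD floats long; the PAD wasted elements shift how
      successive rows map into the cache and break the conflict pattern. */
   float a_padded[N][N + PAD];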

Figure 3-19: An example of loop fusion
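A minimal sketch of loop fusion on illustrative arrays (not the figure's code):

   #define N 1024

   /* Before fusion: a[i] is fetched once in each of the two loops. */
   void unfused(float a[N], float b[N], float c[N])
   {
      for (int i = 0; i < N; i++)
         b[i] = a[i] + 1.0f;
      for (int i = 0; i < N; i++)
         c[i] = a[i] * 2.0f;
   }

   /* After fusion: both statements share one loop, so a[i] is reused while it
      is still in a register or the cache. */
   void fused(float a[N], float b[N], float c[N])
   {
      for (int i = 0; i < N; i++) {
         b[i] = a[i] + 1.0f;
         c[i] = a[i] * 2.0f;
      }
   }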

Given that the memory system is a major contributor to system power consumption, we would expect that loop transformations can either hurt or help the energy consumption of a program. Kandemir et al. [9] studied the effects of compiler transformations on energy consumption by simulating different versions of several benchmark programs with SimplePower.

Figure 3-20 summarizes their results. They experimented with different types of transformations on different benchmarks and measured the energy consumption of unoptimized and optimized code, testing several cache configurations for each program implementation.


Figure 3-20: Simulation measurements of the effects of compiler transformations on energy consumption

One interesting result of these experiments is that, with the exception of loop unrolling, most optimizations increase the amount of energy consumed in the CPU core. Given the Kandemir et al. technology parameters, the increase in core energy consumption was more than offset by the reduced energy consumption in the memory system, but a different technology could result in net energy losses for such transformations.

Any optimization strategy must balance the energy consumption of the memory system and the core. These experiments also show that increasing the cache size and associativity does come at the cost of increased static and dynamic energy consumption in the cache.

Once again, the losses were more than offset by gains in the rest of the memory system for these technology parameters, but a different technology could shift the balance.

Global Optimizations
Compilers apply a great many transformations. The order of transformations is important—some transformations enable other transformations, and some make it impossible to apply other transformations later. Optimization strategies have been developed both for general-purpose compilers and for embedded systems.

The Real-Time Specification for Java (RTSJ) [2] is a standard version of Java that allows programmers to consider the temporal properties of their Java programs to determine whether they can meet deadlines. The designers of the standard identified three major features of Java that limit programmers' ability to determine a program's real-time properties: scheduling, memory management, and synchronization.

Java did not provide detailed specifications about scheduling. RTSJ requires a fixed-priority preemptive scheduler with at least 28 unique priorities. A RealtimeThread class defines threads that can be controlled by this scheduler. Java uses garbage collection to simplify memory management for the general-purpose programmer, but at the price of reduced predictability of memory systems. To improve the predictability of memory management, RTSJ allows programs to allocate objects outside the heap.

A MemoryArea class allows programs to represent a memory area that is not garbage collected. It supports three types of objects: physical memory, which allows for the modeling of non-RAM memory components; immortal memory, which lives for the execution duration of the program; and scoped memory, which allows the program to manage memory objects using the syntactic scope of the object as an aid.

The RTSJ does not enforce priority-based synchronization, but it does provide additional synchronization mechanisms. The system queues all threads that are waiting for a resource so that they acquire the resource in priority order. Synchronization must implement a protocol that limits priority inversion. RTSJ also provides a facility to handle asynchronous events, such as a hardware interrupt.

Bacon et al. [1] proposed a flow for optimizing compilers. Without going into all the details, we should consider some of their important steps:

  • Procedure restructuring inlines functions, eliminates tail recursion, and so on.
  • High-level data flow optimization reduces the strength of operations, moves loop-invariant code, and so on (see the sketch after this list).
  • Partial evaluation simplifies algebra, evaluates constants, and so on.
  • Loop preparation peels loops and so on.
  • Loop reordering interchanges, skews, and otherwise transforms loop nests.
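As a hedged illustration of the high-level data flow optimization step, the fragment below is invented for this example and is not taken from Bacon et al.:

   /* Before: the multiplication 4 * i and the loop-invariant product
      scale * bias are recomputed on every iteration. */
   void before(int *out, int scale, int bias, int n)
   {
      for (int i = 0; i < n; i++)
         out[i] = 4 * i + scale * bias;
   }

   /* After: strength reduction replaces 4 * i with a running addition, and
      loop-invariant code motion hoists scale * bias out of the loop. */
   void after(int *out, int scale, int bias, int n)
   {
      int offset = scale * bias;   /* hoisted loop-invariant expression */
      int four_i = 0;
      for (int i = 0; i < n; i++) {
         out[i] = four_i + offset;
         four_i += 4;              /* add replaces the multiply */
      }
   }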

A variety of lower-level transformations and code-generation steps complete the compilation process.

Catthoor et al. [3] developed a methodology for streaming systems such as multimedia. This methodology is designed to manage global data flow and achieve an efficient implementation. The steps in the methodology include:

  • Memory-oriented data flow analysis and model extraction, which analyzes loops to identify the memory requirements of the various phases of the program.
  • Global data flow transformations to improve memory utilization.
  • Global loop and control flow optimizations to eliminate system-level buffers and improve data locality.
  • Data reuse decisions for memory hierarchy, which use caches to reduce energy consumption and improve performance.
  • Memory organization, which designs the memory system and its ports.
  • In-place optimization, which uses low-level techniques to reduce storage requirements.

Part 2: Buffer, data transfer, and storage management

References
1. "Compiler transformations for high performance computing," D. F. Bacon, S. L. Graham, and O. J. Sharp, ACM Computing Surveys (CSUR), 1994.

2. "The real-time specification for Java," G. Bollella and J. Gosling, IEEE Computer, 2000.

3. Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design, F. Catthoor, E. de Greef, and S. Suytack, Kluwer Academic Publishers.

4. "Loop parallelization algorithms: from parallelism extraction to code generation," P. Boulet, A. Darte, G.-A. Silber, and F. Vivien, Parallel Computing, Elsevier, 1998.

5. "Memory organization for video algorithms on programmable system processors," E. De Greef, F. Catthoor, and H. De Man, VLSI in Computers and Processors, 1995.

6. "Control flow optimization for fast system simulation," F. Franssen, L. Nachtergaele, H. Samsom, F. Catthoor, and H. De Man, European Design and Test Conference, 1994.

7. "Memory aware compilation through accurate timing extraction," P. Grun, N. Dutt, and A. Nicolau, Proceedings of the 37th Annual Design Automation Conference.

8. "Improving cache locality by a combination of loop and data transformations," M. Kandemir, J. Ramanujam, and A. Choudhary, IEEE Transactions on Computers, 1999.

9. "Influence of compiler optimization on system power," M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye, Proceedings of the 37th Annual Design Automation Conference.

10. "A performance oriented use methodology of power optimizing code transformations," K. Masselos, F. Catthoor, C. E. Goutis, and H. De Man, IEEE Workshop on Signal Processing Systems.

11. "Memory data organization for improved cache performance in embedded processor applications," P. R. Panda, N. D. Dutt, and A. Nicolau, ACM Transactions on Design Automation of Electronic Systems.

12. "Memory bank customization and assignment in behavioral systems," P. R. Panda, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design.

13. "On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems," P. R. Panda, N. D. Dutt, and A. Nicolau, ACM Transactions on Design Automation of Electronic Systems.

14. "Data and memory optimization techniques for embedded systems," P. R. Panda et al., ACM Transactions on Design Automation of Electronic Systems.

15. "A data locality optimizing algorithm," M. E. Wolf and M. S. Lam, ACM SIGPLAN Notices.

16. High Performance Compilers for Parallel Computing, M. J. Wolfe, C. Shanklin, and L. Ortega, Addison-Wesley Longman Publishing Co., Inc.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Institute of Technology's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.

This two-part series is based on material printed from High-Performance Embedded Computing by Wayne Wolf, with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007. For more information about this title and other similar books, visit www.elsevierdirect.com.
