The basics of programming embedded processors: Part 7 -

The basics of programming embedded processors: Part 7

A program trace (or more succinctly, a trace )is a record of the execution path of a program that has been measuredduring program execution. If we can measure a trace from a runningprogram, we avoid the need to analyze the program's CDFG to determinethe execution path. Traces are typically large. Since we usually needto look at the entire execution history of a program, traces can begigabytes in size.

However, because disks to hold the traces are inexpensive andworkstations to analyze them are fast and cheap, traces are anappealing way to analyze programs. We have to consider two problems:how to generate a trace and how to use it to determine programperformance.

A trace can be captured by using hardware methods. For example, alogic analyzer can be attached to the microprocessor bus to capturememory bus traffic. There are two major limitations to this approach.First, the size of hardware trace buffers is usually limited.

The hardware must be able to store instructions at execution speed;if we use high-speed RAM for the buffer, we are limited to perhapsseveral million instructions. This means that we probably cannotcapture a trace of the entire program but only a section of itsexecution. Second, if the CPU has an on-chip cache, we will not be ableto see internal memory references to the program or the data.

Alternatively, a microprocessor emulator can be used to trace the PCover time. Although this technique allows us to see what is going oninside the chip and not just behavior on the memory bus, it is too slowto generate long traces. Since the microprocessor must be stopped forprobing and probing is very slow compared to normal execution,capturing long traces would require exorbitant amounts of time.

Some CPUs have hardware facilities for automatically generatingtrace information. For example, the Pentium family microprocessorsgenerate a special bus cycle, a branch trace message, that shows thesource and/or destination address of a branch [Col97]. If we record only traces,we can reconstruct the instructions executed within the basic blockswhile greatly reducing the amount of memory required to hold the trace.

There are three major methods for generating a trace by software: PCsampling, instrumentation instructions, and simulation .The PC sampling technique uses the operating system or some otherfacility to periodically interrupt the program and save away thecurrent PC value.

This method has the same limitations of any sampling method – if thesampling period is too slow, we may miss important behavior; inparticular, the wrong sampling interval can be a multiple of someperiodic behavior in the program that could be totally missed.Alternatively, instructions can be added to the program that will writeout information to the trace buffer.

The instructions required depend on the information to be captured.The simplest case is to add code to the start of each basic block thatremembers the program execution going through that point. Traces areoften used to analyze cache or superscalar behavior, in which case theinstrumentation code may need to save additional information about thestate of the CPU.

Software-based trace generation methods do add execution overheadtime to the program, but this usually does not affect the measurement.Simulation methods interpret the instructions to simulate the CPU, andthey can generate the required trace information at the same time.

Obtaining a representative trace requires some knowledge of what theprogram does. Someone needs to supply inputs that properly exercise theprogram. Program users should have knowledge of the types of datatypically presented to the program.

However, because they may not know which data inputs will causeworst-case behavior, some collaboration between the program designersand users may be necessary. The techniques to be described later inthis series can be used to measure how thoroughly a set of inputscovers the program's control and data flow, giving you an idea of therepresentativeness of your traces.

Trace Analysis
Once we have a trace, several things can be done with it. At minimum,the trace tells us the execution path of the program. By assigning theproper execution times to the instructions in the trace (taking intoaccount interactions between instructions, etc.), we can determine thetotal execution time from the trace. In the case of a trace that doesnot record the complete execution of a program, we have to infer thetrace through the missing pieces.

However, if the sampling interval is properly chosen, the missingprogram fragments can be easily determined. Traces are often used toanalyze cache behavior, allowing us to calculate memory access timesthroughout the code. This information can be used to select the propersize of the cache to be used with a CPU (either by selecting the properCPU model with an internal cache or by adding an external cache). Itcan also be used to optimize the program to improve its cache behavior.

A trace is generated on a particular CPU. We may want to use thetrace to analyze the performance of the program on a different CPU. Forexample, when selecting which microprocessor to use, we may not want topurchase a complete evaluation and compilation system for eachpotential CPU just to measure the performance of each on our programs.

There are limits to what we can do in this case, but the trace canstill be useful. We can use the trace to determine the execution paththrough the program, although we have to be careful that the compilersused for the two CPUs perform similar optimizations.

We can approximate the execution time on the new CPU, particularlyif the instruction sets of the CPUs are relatively simple. Since manyRISC processors have similar basic instructions with similar executiontimes (and because many compilers useonly a small fraction of the CPU's instruction set ), we canoften obtain reasonable estimates of execution time on the new CPU byexamining the trace from another CPU.

Loops are important targets for optimization because programs withloops tend to spend a lot of time executing those loops. There arethree important techniques in optimizing loops: code motion, inductionvariable elimination, and strength reduction.

Code motion lets us move unnecessary code out of a loop. If acomputation's result does not depend on operations performed in theloop body, then we can safely move it out of the loop. Code motionopportunities can arise because programmers may find some computationsclearer and more concise when put in the loop body, even though theyare not strictly dependent on the loop iterations. A simple example ofcode motion is also common. Consider the following loop:

for (i = 0; i < N*M; i++) {
z[i] = a[i] + b[i];

The code motion opportunity becomes more obvious when we draw theloop's CDFG as shown in Figure 5-23below . The loop bound computation is performed on everyiteration during the loop test, even though the result never changes.We can avoid N × M ' 1 unnecessary executions of this statementby moving it before the loop, as shown in the figure.

Figure5-23. Code motion in a loop

An induction variable is a variable whose value isderived from the loop iteration variable's value. The compiler oftenintroduces induction variables to help it implement the loop. Properlytransformed, we may be able to eliminate some variables and applystrength reduction to others.

A nested loop is a good example of the use of induction variables. Asimple nested loop appears below.

for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
z[i][j] = b[i][j];

The compiler uses induction variables to help it address the arrays.Let us rewrite the loop in C using induction variables and pointers.(Later, we use a common induction variable for the two arrays, eventhough the compiler would probably introduce separate inductionvariables and then merge them.)

for (i = 0; i < N; i++)
for (j = 0; j < M; j++) {
zbinduct = i*M + j;
*(zptr + zbinduct) = *(bptr + zbinduct);

In the above code, zptr and bptr are pointers to the heads of the zand b arrays and zbinduct is the shared induction variable. However, wedo not need to compute zbinduct afresh each time. Since we are steppingthrough the arrays sequentially, we can simply add the update value tothe induction variable as follows:

zbinduct = 0; 
for (i = 0; i < N; i++) {
for (j = 0; j < M; j++) {
*(zptr + zbinduct) = *(bptr + zbinduct);

This is a form of strength reduction since we have eliminated themultiplication from the induction variable computation.

Strength reduction helps us reduce the cost of a loop iteration.Consider the following assignment:

y = x * 2;

In integer arithmetic, we can use a left shift rather than amultiplication by 2 (as long as we properly keep track of overflows).If the shift is faster than the multiply, we probably want to performthe substitution. This optimization can often be used with inductionvariables because loops are often indexed with simple expressions.Strength reduction can often be performed with simple substitutionrules since there are relatively few interactions between the possiblesubstitutions.

A loop nest is a set of loops, one inside the other,as illustrated in Figure 5-24 below. Loop nests occur when we process arrays. A large body of techniques hasbeen developed for optimizing loop nests. Rewriting a loop nest changesthe order in which array elements are accessed. This can expose newparallelism opportunities that can be exploited by later stages of thecompiler, and it can also improve cache performance. In this section weconcentrate on the analysis of loop nests for cache performance.

Figure5-24. Vector representation of nexted loops

Let us first consider the examples of Figure 5-24 above in more detail. Wecan represent the set of array locations indexed by the loop by an iterationspace . Each point in this space is defined by a vector, eachcomponent of which is one of the loop's iteration variables.

The points in the space represent the iterations executed by theprogram. The top example in the figure accesses every element of thearrays (the a and b matrices have the same access patterns in thiscase). The lower example accesses bands in the a and b matrices. We canrepresent any particular point in the iteration space as an iterationvector , such as the vector i 3 ,2 = <3, 2> shown in the figure.

Cashemiss equations. Ghosh et al. [Gho97] developed an analytical technique for describing data cache misses innested loops. The equations not only describe the set of cache missesbut also help us to optimize the program's access patterns.

Because numerous equations are necessary to describe a program, theequations are designed to be generated and solved by design tools. Herewe concentrate on the basic principles underlying the equations. Afterunderstanding these principles, we can apply them by hand in smallcases as well as understand what our tools are doing in large cases.

We need to write equations that describe all types of misses,including compulsory, capacity, and conflict. The equations are writtenin terms of reuse vectors —if two iterations i 1 and i 2 use the same array location, the reusevector that defines that reuse is r = i 2 ' i 1 (assuming that i 2 comesafter i 1 ).

Assume for the moment that our cache is direct mapped; we need toknow the cache's line size so that we can compute which cache line eachmemory reference falls into. Compulsory misses occur in two cases – inthe first access along the reuse vector and when the reuse vector spansmore than one cache line.

Capacity and conflict misses can be captured in a single form ofequation. In the following equation, two references i and j conflict if they access distinct memory lines that map to the samecache line:

L(i) = L(j) . (EQ 5-1)

If the cache size is Cs and the line size is Ls, we can rewrite thisequation as

mem(i) = mem(j)+nCs+b . (EQ 5-2)

In Equation 5-2 above , n is an integer that represents an arbitrary offset and b is an integerin the range ” Loff b Ls ” 1 ” Loff ,where Loff is the offset of the memory referencein the cache line.

We can use these equations for a variety of purposes. For instance,if we want to minimize the number of conflicts between array accesses,we can parameterize the equations in terms of the starting addresses ofthe arrays and the number of elements in each row (by padding thearrays), and then find the values for these parameters that minimizethe number of cache conflicts. Example5-10 below  illustrates the concepts behind cache missequations.

Example 5-10 DataRealignment and Array Padding
Assume we want to optimize the cache behavior of the following code:

for (j = 0; j < M; j++)
for (i = 0; i < N; i++)
a[j][i] = b[j][i] * c;

Let us also assume that the a and b arrays are sized with M at 265and N at 4 and a 256-line, four-way set-associative cache with fourwords per line. Even though this code does not reuse any data elements,cache conflicts can cause serious performance problems because theyinterfere with spatial reuse at the cache line level.

Assume that the starting location for a[] is 1024 and the startinglocation for b[] is 4099. Although a[0][0] and b[0][0] do not map tothe same word in the cache, they do map to the same block as shownbelow .

As a result, we see the following scenario in execution:

  • The access to a[0][0] brings in the first four words of a[].
  • The access to b[0][0] replaces a[0] through a[0][3] with b[0][3]and the contents of the three locations before b[].
  • When a[0][1] is accessed, the same cache line is again replacedwith the first four elements of a[].

Once the a[0][1] access brings that line into the cache, itremainsthere for the a[0][2] and a[0][3] accesses since the b[] accesses arenow on the next line. However, the scenario repeats itself at a[0][4]and every four iterations of the cache. One way to eliminate the cacheconflicts is to move one of the arrays. We do not have to move it far.If we move b's start to 4100, we eliminate the cache conflicts.

However, that fix won't work in more complex situations. Moving onearray may only introduce cache conflicts with another array. In suchcases, we can use another technique called padding. If we extend eachof the rows of the arrays to have four elements rather than three, withthe padding word placed at the beginning of the row, we eliminate thecache conflicts. In this case, b[0][0] is located at 4100 by thepadding.

Although padding wastes memory, it substantially improves memoryperformance. In complex situations with multiple arrays andsophisticated access patterns, we have to use a combination oftechniques – relocating arrays and padding them – to be able tominimize cache conflicts.

Optimizing for execution speed
If you need to improve the speed of your program, you can attack theproblem at several levels of abstraction. First, make sure that thecode really needs to be accelerated. If you are dealing with a largeprogram, the part of the program using the most time may not beobvious.

Profiling the program will help you find hot spots. Many systems arebuilt from multiple programs running as communicating processes; a slowprogram may not actually limit the overall system performance.

You may be able to redesign your algorithm to improve efficiency.Examining asymptotic performance is often a good guide to efficiency.Doing fewer operations is usually the key to performance. In a fewcases, however, brute force may provide a better implementation. Aseemingly simple high-level language statement may in fact hide a verylong sequence of operations that slows down the algorithm.

Using dynamically allocated memory is one example, since managingthe heap takes time but is hidden from the programmer. For example, asophisticated algorithm that uses dynamic storage may be slower inpractice than an algorithm that performs more operations on staticallyallocated memory.

Finally, you can look at the implementation of the program itself. Afew hints on program implementation are summarized below.

  • Try to use registers efficiently. Group accesses to a valuetogether so that the value can be brought into a register and keptthere.
  • Make use of page mode accesses in the memory system wheneverpossible. Page mode reads and writes eliminate one step in the memoryaccess. You can increase use of page mode by rearranging your variablesso that more can be referenced contiguously.
  • Analyze cache behavior to find major cache conflicts. Restructurethe code to eliminate as many of these as you can as follows:

– For instruction conflicts, if the offending code segment issmall, try to rewrite the segment to make it as small as possible sothat it better fits into the cache. Writing in assembly language may benecessary. For conflicts across larger spans of code, try moving theinstructions or padding with NOPs.
– For scalar data conflicts, move the data values to differentlocations to reduce conflicts.
– For array data conflicts, consider either moving the arrays orchanging your array access patterns to reduce conflicts.

Next in Part 8:  Analysis andoptimization of energy, power amd program design
To read Part 1, go to “Programdesign and analysis.”
To read Part 2 , go to ” Modelsof program, assemblers and linkers.”
To read Part 3, go to “BasicCompilation Techniques”
To read Part 4, go to “The creation ofprocedures”
To read Part 5, go to  “Registerallocation and scheduling”
To read Part 6, go to “Analysis andoptimization of execution time”

Used with the permission of thepublisher, Newnes/Elsevier, this series of six articles is based oncopyrighted material from “Computersas Components: Principles of Embedded Computer System Design” byWayne Wolf. The book can be purchased on line.

Wayne Wolf  is currently the Georgia Research Alliance EminentScholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair inEmbedded Computer Systems at Georgia Tech's School of Electrical andComputer Engineering (ECE). Previously a professor of electricalengineering at Princeton University, he worked at AT&T BellLaboratories. He has served as editor in chief of the ACM Transactionson Embedded Computing and of DesignAutomation for Embedded Systems.

[Gho97] Somnath Ghosh,Margaret Martonosi and Sharad Malik “Cachemiss equations: an analytical representation of cache misses.”Proceedings of the 11th ACM International Conference on Supercomputing,ACM Press, 1997
[Col97] Robet Collins,”In-circuit emulation,” Dr. Dobb's Journal, September, 1997. pp.111-113.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.