However, because disks to hold the traces are inexpensive and workstations to analyze them are fast and cheap, traces are an appealing way to analyze programs. We have to consider two problems: how to generate a trace and how to use it to determine program performance.
A trace can be captured by using hardware methods. For example, a
logic analyzer can be attached to the microprocessor bus to capture
memory bus traffic. There are two major limitations to this approach.
First, the size of hardware trace buffers is usually limited.
The hardware must be able to store instructions at execution speed; if we use high-speed RAM for the buffer, we are limited to perhaps several million instructions. This means that we probably cannot capture a trace of the entire program but only a section of its execution. Second, if the CPU has an on-chip cache, we will not be able to see internal memory references to the program or the data.
Alternatively, a microprocessor emulator can be used to trace the PC over time. Although this technique allows us to see what is going on inside the chip and not just behavior on the memory bus, it is too slow to generate long traces. Since the microprocessor must be stopped for probing and probing is very slow compared to normal execution, capturing long traces would require exorbitant amounts of time.
Some CPUs have hardware facilities for automatically generating trace information. For example, the Pentium family microprocessors generate a special bus cycle, a branch trace message, that shows the source and/or destination address of a branch [Col97]. If we record only traces, we can reconstruct the instructions executed within the basic blocks while greatly reducing the amount of memory required to hold the trace.
There are three major methods for generating a trace by software: PC
sampling, instrumentation instructions, and simulation.
The PC sampling technique uses the operating system or some other
facility to periodically interrupt the program and save away the
current PC value.
This method has the same limitations of any sampling method - if the
sampling period is too slow, we may miss important behavior; in
particular, the wrong sampling interval can be a multiple of some
periodic behavior in the program that could be totally missed.
Alternatively, instructions can be added to the program that will write
out information to the trace buffer.
The instructions required depend on the information to be captured.
The simplest case is to add code to the start of each basic block that
remembers the program execution going through that point. Traces are
often used to analyze cache or superscalar behavior, in which case the
instrumentation code may need to save additional information about the
state of the CPU.
Software-based trace generation methods do add execution overhead time to the program, but this usually does not affect the measurement. Simulation methods interpret the instructions to simulate the CPU, and they can generate the required trace information at the same time.
Obtaining a representative trace requires some knowledge of what the
program does. Someone needs to supply inputs that properly exercise the
program. Program users should have knowledge of the types of data
typically presented to the program.
However, because they may not know which data inputs will cause
worst-case behavior, some collaboration between the program designers
and users may be necessary. The techniques to be described later in
this series can be used to measure how thoroughly a set of inputs
covers the program's control and data flow, giving you an idea of the
representativeness of your traces.
Trace Analysis
Once we have a trace, several things can be done with it. At minimum,
the trace tells us the execution path of the program. By assigning the
proper execution times to the instructions in the trace (taking into
account interactions between instructions, etc.), we can determine the
total execution time from the trace. In the case of a trace that does
not record the complete execution of a program, we have to infer the
trace through the missing pieces.
However, if the sampling interval is properly chosen, the missing program fragments can be easily determined. Traces are often used to analyze cache behavior, allowing us to calculate memory access times throughout the code. This information can be used to select the proper size of the cache to be used with a CPU (either by selecting the proper CPU model with an internal cache or by adding an external cache). It can also be used to optimize the program to improve its cache behavior.
A trace is generated on a particular CPU. We may want to use the
trace to analyze the performance of the program on a different CPU. For
example, when selecting which microprocessor to use, we may not want to
purchase a complete evaluation and compilation system for each
potential CPU just to measure the performance of each on our programs.
There are limits to what we can do in this case, but the trace can
still be useful. We can use the trace to determine the execution path
through the program, although we have to be careful that the compilers
used for the two CPUs perform similar optimizations.
We can approximate the execution time on the new CPU, particularly if the instruction sets of the CPUs are relatively simple. Since many RISC processors have similar basic instructions with similar execution times (and because many compilers use only a small fraction of the CPU's instruction set), we can often obtain reasonable estimates of execution time on the new CPU by examining the trace from another CPU.
Loop
Optimizations
Loops are important targets for optimization because programs with
loops tend to spend a lot of time executing those loops. There are
three important techniques in optimizing loops: code motion, induction
variable elimination, and strength reduction.
Code motion lets us move unnecessary code out of a loop. If a computation's result does not depend on operations performed in the loop body, then we can safely move it out of the loop. Code motion opportunities can arise because programmers may find some computations clearer and more concise when put in the loop body, even though they are not strictly dependent on the loop iterations. A simple example of code motion is also common. Consider the following loop:
for (i = 0; i < N*M; i++) {
z[i] = a[i] + b[i];
}
The code motion opportunity becomes more obvious when we draw the loop's CDFG as shown in Figure 5-23 below. The loop bound computation is performed on every iteration during the loop test, even though the result never changes. We can avoid N × M ' 1 unnecessary executions of this statement by moving it before the loop, as shown in the figure.
![]() |
| Figure 5-23. Code motion in a loop |
An induction variable is a variable whose value is derived from the loop iteration variable's value. The compiler often introduces induction variables to help it implement the loop. Properly transformed, we may be able to eliminate some variables and apply strength reduction to others.
A nested loop is a good example of the use of induction variables. A simple nested loop appears below.
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
z[i][j] = b[i][j];
The compiler uses induction variables to help it address the arrays. Let us rewrite the loop in C using induction variables and pointers. (Later, we use a common induction variable for the two arrays, even though the compiler would probably introduce separate induction variables and then merge them.)
for (i = 0; i < N; i++)
for (j = 0; j < M; j++) {
zbinduct = i*M + j;
*(zptr + zbinduct) = *(bptr + zbinduct);
}
In the above code, zptr and bptr are pointers to the heads of the z and b arrays and zbinduct is the shared induction variable. However, we do not need to compute zbinduct afresh each time. Since we are stepping through the arrays sequentially, we can simply add the update value to the induction variable as follows:
zbinduct = 0;
for (i = 0; i < N; i++) {
for (j = 0; j < M; j++) {
*(zptr + zbinduct) = *(bptr + zbinduct);
zbinduct++;
}
}
This is a form of strength reduction since we have eliminated the multiplication from the induction variable computation.
Strength reduction helps us reduce the cost of a loop iteration. Consider the following assignment:
y = x * 2;
In integer arithmetic, we can use a left shift rather than a
multiplication by 2 (as long as we properly keep track of overflows).
If the shift is faster than the multiply, we probably want to perform
the substitution. This optimization can often be used with induction
variables because loops are often indexed with simple expressions.
Strength reduction can often be performed with simple substitution
rules since there are relatively few interactions between the possible
substitutions.
Cache
optimizations
A loop nest is a set of loops, one inside the other,
as illustrated in Figure 5-24 below.
Loop nests occur when we process arrays. A large body of techniques has
been developed for optimizing loop nests. Rewriting a loop nest changes
the order in which array elements are accessed. This can expose new
parallelism opportunities that can be exploited by later stages of the
compiler, and it can also improve cache performance. In this section we
concentrate on the analysis of loop nests for cache performance.
![]() |
| Figure 5-24. Vector representation of nexted loops |
Let us first consider the examples of Figure 5-24 above in more detail. We
can represent the set of array locations indexed by the loop by an iteration
space. Each point in this space is defined by a vector, each
component of which is one of the loop's iteration variables.
The points in the space represent the iterations executed by the program. The top example in the figure accesses every element of the arrays (the a and b matrices have the same access patterns in this case). The lower example accesses bands in the a and b matrices. We can represent any particular point in the iteration space as an iteration vector, such as the vector i3,2 = <3, 2> shown in the figure.
Cashe
miss equations. Ghosh et al. [Gho97]
developed an analytical technique for describing data cache misses in
nested loops. The equations not only describe the set of cache misses
but also help us to optimize the program's access patterns.
Because numerous equations are necessary to describe a program, the equations are designed to be generated and solved by design tools. Here we concentrate on the basic principles underlying the equations. After understanding these principles, we can apply them by hand in small cases as well as understand what our tools are doing in large cases.
We need to write equations that describe all types of misses,
including compulsory, capacity, and conflict. The equations are written
in terms of reuse vectors—if two iterations i1
and i2 use the same array location, the reuse
vector that defines that reuse is r = i2
' i1 (assuming that i2 comes
after i1).
Assume for the moment that our cache is direct mapped; we need to
know the cache's line size so that we can compute which cache line each
memory reference falls into. Compulsory misses occur in two cases - in
the first access along the reuse vector and when the reuse vector spans
more than one cache line.
Capacity and conflict misses can be captured in a single form of equation. In the following equation, two references i and j conflict if they access distinct memory lines that map to the same cache line:
L(i) = L(j). (EQ 5-1)
If the cache size is Cs and the line size is Ls, we can rewrite this equation as
mem(i) = mem(j)+nCs+b. (EQ 5-2)
In Equation 5-2 above, n is an integer that represents an arbitrary offset and b is an integer in the range " Loff b Ls " 1 " Loff, where Loff is the offset of the memory reference in the cache line.
We can use these equations for a variety of purposes. For instance, if we want to minimize the number of conflicts between array accesses, we can parameterize the equations in terms of the starting addresses of the arrays and the number of elements in each row (by padding the arrays), and then find the values for these parameters that minimize the number of cache conflicts. Example 5-10 below illustrates the concepts behind cache miss equations.
Example 5-10 Data Realignment and Array Paddingfor (j = 0; j < M; j++)
for (i = 0; i < N; i++)
a[j][i] = b[j][i] * c;
Let us also assume that the a and b arrays are sized with M at 265 and N at 4 and a 256-line, four-way set-associative cache with four words per line. Even though this code does not reuse any data elements, cache conflicts can cause serious performance problems because they interfere with spatial reuse at the cache line level.
Assume that the starting location for a[] is 1024 and the starting location for b[] is 4099. Although a[0][0] and b[0][0] do not map to the same word in the cache, they do map to the same block as shown below.

As a result, we see the following scenario in execution:
However, that fix won't work in more complex situations. Moving one
array may only introduce cache conflicts with another array. In such
cases, we can use another technique called padding. If we extend each
of the rows of the arrays to have four elements rather than three, with
the padding word placed at the beginning of the row, we eliminate the
cache conflicts. In this case, b[0][0] is located at 4100 by the
padding.
Although padding wastes memory, it substantially improves memory
performance. In complex situations with multiple arrays and
sophisticated access patterns, we have to use a combination of
techniques - relocating arrays and padding them - to be able to
minimize cache conflicts.
Optimizing for execution speed
If you need to improve the speed of your program, you can attack the
problem at several levels of abstraction. First, make sure that the
code really needs to be accelerated. If you are dealing with a large
program, the part of the program using the most time may not be
obvious.
Profiling the program will help you find hot spots. Many systems are built from multiple programs running as communicating processes; a slow program may not actually limit the overall system performance.
You may be able to redesign your algorithm to improve efficiency.
Examining asymptotic performance is often a good guide to efficiency.
Doing fewer operations is usually the key to performance. In a few
cases, however, brute force may provide a better implementation. A
seemingly simple high-level language statement may in fact hide a very
long sequence of operations that slows down the algorithm.
Using dynamically allocated memory is one example, since managing
the heap takes time but is hidden from the programmer. For example, a
sophisticated algorithm that uses dynamic storage may be slower in
practice than an algorithm that performs more operations on statically
allocated memory.
Finally, you can look at the implementation of the program itself. A few hints on program implementation are summarized below.
- For instruction conflicts, if the offending code segment is small, try to rewrite the segment to make it as small as possible so that it better fits into the cache. Writing in assembly language may be necessary. For conflicts across larger spans of code, try moving the instructions or padding with NOPs.
- For scalar data conflicts, move the data values to different locations to reduce conflicts.
- For array data conflicts, consider either moving the arrays or changing your array access patterns to reduce conflicts.
Next in Part 8: Analysis and
optimization of energy, power amd program design
To read Part 1, go to "Program
design and analysis."
To read Part 2 , go to " Models
of program, assemblers and linkers."
To read Part 3, go to "Basic
Compilation Techniques"
To read Part 4, go to "The creation of
procedures"
To read Part 5, go to "Register
allocation and scheduling"
To read Part 6, go to "Analysis and
optimization of execution time"
Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on copyrighted material from "Computers as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book can be purchased on line.
Wayne Wolf is currently the Georgia Research Alliance Eminent
Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in
Embedded Computer Systems at Georgia Tech's School of Electrical and
Computer Engineering (ECE). Previously a professor of electrical
engineering at Princeton University, he worked at AT&T Bell
Laboratories. He has served as editor in chief of the ACM Transactions
on Embedded Computing and of Design
Automation for Embedded Systems.