Trace Analysis
Once we have a trace, several things can be done with it. At minimum,
the trace tells us the execution path of the program. By assigning the
proper execution times to the instructions in the trace (taking into
account interactions between instructions, etc.), we can determine the
total execution time from the trace. In the case of a trace that does
not record the complete execution of a program, we have to infer the
trace through the missing pieces.
However, if the sampling interval is properly chosen, the missing
program fragments can be easily determined. Traces are often used to
analyze cache behavior, allowing us to calculate memory access times
throughout the code. This information can be used to select the proper
size of the cache to be used with a CPU (either by selecting the proper
CPU model with an internal cache or by adding an external cache). It
can also be used to optimize the program to improve its cache behavior.
A trace is generated on a particular CPU. We may want to use the
trace to analyze the performance of the program on a different CPU. For
example, when selecting which microprocessor to use, we may not want to
purchase a complete evaluation and compilation system for each
potential CPU just to measure the performance of each on our programs.
There are limits to what we can do in this case, but the trace can
still be useful. We can use the trace to determine the execution path
through the program, although we have to be careful that the compilers
used for the two CPUs perform similar optimizations.
We can approximate the execution time on the new CPU, particularly
if the instruction sets of the CPUs are relatively simple. Since many
RISC processors have similar basic instructions with similar execution
times (and because many compilers use
only a small fraction of the CPU's instruction set), we can
often obtain reasonable estimates of execution time on the new CPU by
examining the trace from another CPU.
Loop
Optimizations
Loops are important targets for optimization because programs with
loops tend to spend a lot of time executing those loops. There are
three important techniques in optimizing loops: code motion, induction
variable elimination, and strength reduction.
Code motion lets us move unnecessary code out of a loop. If a
computation's result does not depend on operations performed in the
loop body, then we can safely move it out of the loop. Code motion
opportunities can arise because programmers may find some computations
clearer and more concise when put in the loop body, even though they
are not strictly dependent on the loop iterations. A simple example of
code motion is also common. Consider the following loop:
for (i = 0; i < N*M; i++) {
z[i] = a[i] + b[i];
}
The code motion opportunity becomes more obvious when we draw the
loop's CDFG as shown in Figure 5-23
below. The loop bound computation is performed on every
iteration during the loop test, even though the result never changes.
We can avoid N × M ' 1 unnecessary executions of this statement
by moving it before the loop, as shown in the figure.
 |
| Figure
5-23. Code motion in a loop |
An induction variable is a variable whose value is
derived from the loop iteration variable's value. The compiler often
introduces induction variables to help it implement the loop. Properly
transformed, we may be able to eliminate some variables and apply
strength reduction to others.
A nested loop is a good example of the use of induction variables. A
simple nested loop appears below.
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
z[i][j] = b[i][j];
The compiler uses induction variables to help it address the arrays.
Let us rewrite the loop in C using induction variables and pointers.
(Later, we use a common induction variable for the two arrays, even
though the compiler would probably introduce separate induction
variables and then merge them.)
for (i = 0; i < N; i++)
for (j = 0; j < M; j++) {
zbinduct = i*M + j;
*(zptr + zbinduct) = *(bptr + zbinduct);
}
In the above code, zptr and bptr are pointers to the heads of the z
and b arrays and zbinduct is the shared induction variable. However, we
do not need to compute zbinduct afresh each time. Since we are stepping
through the arrays sequentially, we can simply add the update value to
the induction variable as follows:
zbinduct = 0;
for (i = 0; i < N; i++) {
for (j = 0; j < M; j++) {
*(zptr + zbinduct) = *(bptr + zbinduct);
zbinduct++;
}
}
This is a form of strength reduction since we have eliminated the
multiplication from the induction variable computation.
Strength reduction helps us reduce the cost of a loop iteration.
Consider the following assignment:
y = x * 2;
In integer arithmetic, we can use a left shift rather than a
multiplication by 2 (as long as we properly keep track of overflows).
If the shift is faster than the multiply, we probably want to perform
the substitution. This optimization can often be used with induction
variables because loops are often indexed with simple expressions.
Strength reduction can often be performed with simple substitution
rules since there are relatively few interactions between the possible
substitutions.