CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

The basics of programming embedded processors: Part 7
Trace-Driven Performance Analysis



Embedded.com

Trace Analysis
Once we have a trace, several things can be done with it. At minimum, the trace tells us the execution path of the program. By assigning the proper execution times to the instructions in the trace (taking into account interactions between instructions, etc.), we can determine the total execution time from the trace. In the case of a trace that does not record the complete execution of a program, we have to infer the trace through the missing pieces.

However, if the sampling interval is properly chosen, the missing program fragments can be easily determined. Traces are often used to analyze cache behavior, allowing us to calculate memory access times throughout the code. This information can be used to select the proper size of the cache to be used with a CPU (either by selecting the proper CPU model with an internal cache or by adding an external cache). It can also be used to optimize the program to improve its cache behavior.

A trace is generated on a particular CPU. We may want to use the trace to analyze the performance of the program on a different CPU. For example, when selecting which microprocessor to use, we may not want to purchase a complete evaluation and compilation system for each potential CPU just to measure the performance of each on our programs.

There are limits to what we can do in this case, but the trace can still be useful. We can use the trace to determine the execution path through the program, although we have to be careful that the compilers used for the two CPUs perform similar optimizations.

We can approximate the execution time on the new CPU, particularly if the instruction sets of the CPUs are relatively simple. Since many RISC processors have similar basic instructions with similar execution times (and because many compilers use only a small fraction of the CPU's instruction set), we can often obtain reasonable estimates of execution time on the new CPU by examining the trace from another CPU.

Loop Optimizations
Loops are important targets for optimization because programs with loops tend to spend a lot of time executing those loops. There are three important techniques in optimizing loops: code motion, induction variable elimination, and strength reduction.

Code motion lets us move unnecessary code out of a loop. If a computation's result does not depend on operations performed in the loop body, then we can safely move it out of the loop. Code motion opportunities can arise because programmers may find some computations clearer and more concise when put in the loop body, even though they are not strictly dependent on the loop iterations. A simple example of code motion is also common. Consider the following loop:

for (i = 0; i < N*M; i++) {
z[i] = a[i] + b[i];
}

The code motion opportunity becomes more obvious when we draw the loop's CDFG as shown in Figure 5-23 below. The loop bound computation is performed on every iteration during the loop test, even though the result never changes. We can avoid N × M ' 1 unnecessary executions of this statement by moving it before the loop, as shown in the figure.

Figure 5-23. Code motion in a loop

An induction variable is a variable whose value is derived from the loop iteration variable's value. The compiler often introduces induction variables to help it implement the loop. Properly transformed, we may be able to eliminate some variables and apply strength reduction to others.

A nested loop is a good example of the use of induction variables. A simple nested loop appears below.

for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
z[i][j] = b[i][j];

The compiler uses induction variables to help it address the arrays. Let us rewrite the loop in C using induction variables and pointers. (Later, we use a common induction variable for the two arrays, even though the compiler would probably introduce separate induction variables and then merge them.)

for (i = 0; i < N; i++)
for (j = 0; j < M; j++) {
zbinduct = i*M + j;
*(zptr + zbinduct) = *(bptr + zbinduct);
}

In the above code, zptr and bptr are pointers to the heads of the z and b arrays and zbinduct is the shared induction variable. However, we do not need to compute zbinduct afresh each time. Since we are stepping through the arrays sequentially, we can simply add the update value to the induction variable as follows:

zbinduct = 0;
for (i = 0; i < N; i++) {
for (j = 0; j < M; j++) {
*(zptr + zbinduct) = *(bptr + zbinduct);
zbinduct++;
}
}

This is a form of strength reduction since we have eliminated the multiplication from the induction variable computation.

Strength reduction helps us reduce the cost of a loop iteration. Consider the following assignment:

y = x * 2;

In integer arithmetic, we can use a left shift rather than a multiplication by 2 (as long as we properly keep track of overflows). If the shift is faster than the multiply, we probably want to perform the substitution. This optimization can often be used with induction variables because loops are often indexed with simple expressions. Strength reduction can often be performed with simple substitution rules since there are relatively few interactions between the possible substitutions.

1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :