The basics of programming embedded processors: Part 6
Analysis and Optimization of Execution Time
By Wayne Wolf
Embedded.com
(08/29/07, 12:01:00 AM EDT)

Because embedded systems must perform functions in real time, we often need to know how fast a program runs. The techniques we use to analyze program execution time are also helpful in analyzing properties such as power consumption.

In this part in the series, we study how to analyze programs to estimate their run times. We also examine how to optimize programs to improve their execution times; of course, optimization relies on analysis.

Note the word "estimate" in the last paragraph. While we might hope that the execution time of programs could be precisely determined, this is in fact difficult to do in practice. A few of the reasons for such difficulty are summarized below.

Measuring Execution Speed
We can measure program performance in several ways: Some microprocessor manufacturers supply simulators for their CPUs: The simulator runs on a workstation or PC, takes as input an executable for the microprocessor along with input data, and simulates the execution of that program.

Some of these simulators go beyond functional simulation to measure the execution time of the program. Simulation is clearly slower than executing the program on the actual microprocessor, but it also provides much greater visibility during execution. Be careful - some microprocessor performance simulators are not 100% accurate, and simulation of I/O-intensive code may be difficult.

We are interested in the following three different types of performance measures on programs:

First, we look at the fundamentals of program performance in more detail. We then consider trace-driven performance based on executing the program and observing its behavior.

Elements of Program Performance
The key to evaluating execution time is breaking the performance problem into parts. Program execution time [Sha89] can be seen as

execution time = program path + instruction timing

The path is the sequence of instructions executed by the program (or its equivalent in the high-level language representation of the program). The instruction timing is determined based on the sequence of instructions traced by the program path, which takes into account data dependencies, pipeline behavior, and caching. Luckily, these two problems can be solved relatively independently.

Although we can trace the execution path of a program through its highlevel language specification, it is hard to get accurate estimates of total execution time from a high-level language program. This is because there is not, as seen in earlier, a direct correspondence between program statements and instructions. The number of memory locations and variables must be estimated, and results may be either saved for reuse or recomputed on the fly, among other effects. These problems become more challenging as the compiler puts more and more effort into optimizing the program. However, some aspects of program performance can be estimated by looking directly at the C program. For example, if a program contains a loop with a large, fixed iteration bound or if one branch of a conditional is much longer than another, we can get at least a rough idea that these are more time-consuming segments of the program.

Of course, a precise estimate of performance also relies on the instructions to be executed, since different instructions take different amounts of time. (In addition, to make life even more difficult, the execution time of one instruction can depend on the instructions executed before and after it.) Example 5-8 below illustrates data-dependent program paths.

Example 5-8 Data-Dependent Paths in if Statements
Here is a pair of nested if statements:

if (a || b) { /* test 1 */
if (c) /* test 2 */
{ x = r * s + t; /* assignment 1 */ }
else { y = r + s; /* assignment 2 */ }
z = r + s + u; /* assignment 3 */
} else {
if (c) /* test 3 */
{ y = r ' t; /* assignment 4 */ }
}

The conditional tests and assignments are labeled within each if statement to make it easier to identify paths. What execution paths may be exercised? One way to enumerate all the paths is to create a truth table"like structure. The paths are controlled by the variables in the if conditions, namely, a, b, and c. For any given combination of values of those variables, we can trace through the program to see which branch is taken at each if and which assignments are performed. For example, when a = 1, b = 0, and c = 1, then test 1 is true and test 2 is true. This means we first perform assignment 1 and then assignment 3.

Results for all the controlling variable values follow:

a         b     c     Path
0         0     0     test 1 false, test 3 false: no assignments
0         0     1     test 1 false, test 3 true: assignment 4
0         1     0     test 1 true, test 2 false: assignments 2, 3
0         1     1     test 1 true, test 2 true: assignments 1, 3
1         0     0     test 1 true, test 2 false: assignments 2, 3
1         0     1     test 1 true, test 2 true: assignments 1, 3
1         1     0     test 1 true, test 2 false: assignments 2, 3
1         1     1     test 1 true, test 2 true: assignments 1, 3

Notice that there are only four distinct cases: no assignment, assignment 4, assignments 2 and 3, or assignments 1 and 3. These correspond to the possible paths through the nested ifs; the table adds value by telling us which variable values exercise each of these paths.

Enumerating the paths through a fixed-iteration for loop is seemingly simple. In the code below,

for (i = 0; i < N; i++)
a[i] = b[i]*c[i];

the assignment in the loop is performed exactly N times. However, we can't forget the code executed to set up the loop and to test the iteration variable. Example 5-9 below illustrates how to determine the path through a loop.

Example 5-9 Paths in a loop
The loop code for the FIR filter of Application Example 2-1 is reproduced below:

for (i = 0, f = 0; i < N; i++)
f = f + c[i] * x[i];

By examining the control/data flow graph (CDFG) for the code we can more easily determine how many times various statements are executed. The CDFG makes it clear that the loop initiation block is executed once, the test is executed N + 1 times, and the body and loop variable update are each executed N times.

To measure the longest path length, we must find the longest path through the optimized CDFG since the compiler may change the structure of the control and data flow to optimize the program's implementation. It is important to keep in mind that choosing the longest path through a CDFG as measured by the number of nodes or edges touched may not correspond to the longest execution time.

Since the execution time of a node in the CDFG will vary greatly depending on the instructions represented by that node, we must keep in mind that the longest path through the CDFG depends on the execution times of the nodes. In general, it is good policy to choose several of what we estimate are the longest paths through the program and measure the lengths of all of them in sufficient detail to be sure that we have in fact captured the longest path.

Instruction timing
Once we know the execution path of the program, we have to measure the execution time of the instructions executed along that path. The simplest estimate is to assume that every instruction takes the same number of clock cycles, which means we need only count the instructions and multiply by the per-instruction execution time to obtain the program's total execution time. However, even ignoring cache effects, this technique is simplistic for the reasons summarized below.

We can handle the first two problems more easily than the third. We can look up instruction execution time in a table; the table will be indexed by opcode and possibly by other parameter values such as the registers used.

To handle interdependent execution times, we can add columns to the table to consider the effects of nearby instructions. Since these effects are generally limited by the size of the CPU pipeline, we know that we need to consider a relatively small window of instructions to handle such effects.

Handling variations due to operand values is difficult to do without actually executing the program using a variety of data values, given the large number of factors that can affect value-dependent instruction timing. Luckily, these effects are often small. Even in floating-point programs, most of the operations are typically additions and multiplications whose execution times have small variances.

Cashing effects
Thus far we have not considered the effect of the cache. Because the access time for main memory can be 10 times larger than the cache access time, caching can have huge effects on instruction execution time by changing both the instruction and data access times. Caching performance inherently depends on the program's execution path since the cache's contents depend on the history of accesses. Because of the complexity of this phenomenon, we consider it in more detail below using both trace-driven and static and dynamic analysis techniques.

Next in Part 7:  Trace-driven performance analysis
To read Part 1, go to "Program design and analysis."
To read Part 2 , go to " Models of program, assemblers and linkers."
To read Part 3, go to "Basic Compilation Techniques"
To read Part 4, go to "The creation of procedures"
To read Part 5, go to  "Register allocation and scheduling"

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on copyrighted material from "Computers as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book can be purchased on line.

Wayne Wolf  is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems.

References:
[Sha89] 
Alan C. Shaw, "Reasoning about time in higher-level language software," IEEE Transactions on Software Engineering, , 15, July, 1989

Reader Response


I suggest another method to determine the execution time of a function. You need a spare I/O pin on the processor. At the beginning of the function, set the I/O pin high. At the end of the function, set it low. If you connect this pin to a scope, the pulse width measures how long it takes for the function to execute. The period of the signal measures how often the function is executed. The scope may be able to record minimum and maximum values of pulse width and period.

- Robert Drummey
Software Engineer
Harris Corp
Collegeville, PA