Because embedded systems must perform functions in real time, we
often need to know how fast a program runs. The techniques we use to
analyze program execution time are also helpful in analyzing properties
such as power consumption.
In this part in the series, we study how to
analyze programs to estimate their run times. We also examine how to
optimize programs to improve their execution times; of course,
optimization relies on analysis.
Note the word "estimate"
in the last paragraph. While we might hope that the execution time of
programs could be precisely determined, this is in fact difficult to do
in practice. A few of the reasons for such difficulty are summarized
below.
- The execution time of a program often varies with the input data
values because those values select different execution paths in the
program. For example, loops may be executed a varying number of times,
and different branches may execute blocks of varying complexity.
- The cache has a major effect on program performance, and once
again, the cache's behavior depends in part on the data values input to
the program.
- Execution times may vary even at the instruction level.
Floating-point operations are the most sensitive to data values, but
the normal integer execution pipeline can also introduce data-dependent
variations. In general, the execution time of an instruction in a
pipeline depends not only on that instruction but on the instructions
around it in the pipeline.
Measuring Execution Speed
We can measure program performance in several ways: Some
microprocessor manufacturers supply simulators for their CPUs: The
simulator runs on a workstation or PC, takes as input an executable for
the microprocessor along with input data, and simulates the execution
of that program.
Some of these simulators go beyond functional simulation to measure
the execution time of the program. Simulation is clearly slower than
executing the program on the actual microprocessor, but it also
provides much greater visibility during execution. Be careful - some
microprocessor performance simulators are not 100% accurate, and
simulation of I/O-intensive code may be difficult.
- A timer connected to the microprocessor bus can be used to
measure performance of executing sections of code. The code to be
measured would reset and start the timer at its start and stop the
timer at the end of execution. The length of the program that can be
measured is limited by the accuracy of the timer.
- A logic analyzer can be connected to the microprocessor bus to
measure the start and stop times of a code segment. This technique
relies on the code being able to produce identifiable events on the bus
to identify the start and stop of execution. The length of code that
can be measured is limited by the size of the logic analyzer's buffer.
We are interested in the following three different types of
performance measures on programs:
- average-case execution time: This is the typical execution time
we would expect for typical data. Clearly, the first challenge is
defining typical inputs.
- worst-case execution time: The longest time that the program can
spend on any input sequence is clearly important for systems that must
meet deadlines. In some cases, the input set that causes the worst-case
execution time is obvious, but in many cases it is not.
- best-case execution time: This measure can be important in
multirate realtime systems.
First, we look at the fundamentals of program performance in more
detail. We then consider trace-driven performance based on executing
the program and observing its behavior.
Elements of Program Performance
The key to evaluating execution time is breaking the performance
problem into parts. Program execution time [Sha89] can be seen as
execution time = program path +
instruction timing
The path is the sequence of instructions executed by the program (or
its equivalent in the high-level language representation of the
program). The instruction timing is determined based on the sequence of
instructions traced by the program path, which takes into account data
dependencies, pipeline behavior, and caching. Luckily, these two
problems can be solved relatively independently.
Although we can trace the execution path of a program through its
highlevel language specification, it is hard to get accurate estimates
of total execution time from a high-level language program. This is
because there is not, as seen in earlier, a direct correspondence
between program statements and instructions. The number of memory
locations and variables must be estimated, and results may be either
saved for reuse or recomputed on the fly, among other effects. These
problems become more challenging as the compiler puts more and more
effort into optimizing the program. However, some aspects of program
performance can be estimated by looking directly at the C program. For
example, if a program contains a loop with a large, fixed iteration
bound or if one branch of a conditional is much longer than another, we
can get at least a rough idea that these are more time-consuming
segments of the program.
Of course, a precise estimate of performance also relies on the
instructions to be executed, since different instructions take
different amounts of time. (In addition, to make life even more
difficult, the execution time of one instruction can depend on the
instructions executed before and after it.) Example 5-8 below illustrates
data-dependent program paths.
Example 5-8 Data-Dependent
Paths in if Statements
Here is a pair of nested if statements:
if (a || b) { /* test 1 */
if (c) /* test 2 */
{ x = r * s + t; /* assignment 1 */ }
else { y = r + s; /* assignment 2 */ }
z = r + s + u; /* assignment 3 */
} else {
if (c) /* test 3 */
{ y = r ' t; /* assignment 4 */ }
}
The conditional tests and assignments
are labeled within each if statement to make it easier to identify
paths. What execution paths may be exercised? One way to enumerate all
the paths is to create a truth table"like structure. The paths are
controlled by the variables in the if conditions, namely, a, b, and c.
For any given combination of values of those variables, we can trace
through the program to see which branch is taken at each if and which
assignments are performed. For example, when a = 1, b = 0, and c = 1,
then test 1 is true and test 2 is true. This means we first perform
assignment 1 and then assignment 3.
Results for all the controlling variable values follow:
a b
c Path
0 0 0
test 1 false, test 3 false: no assignments
0 0 1
test 1 false, test 3 true: assignment 4
0 1 0
test 1 true, test 2 false: assignments 2, 3
0 1 1
test 1 true, test 2 true: assignments 1, 3
1 0 0
test 1 true, test 2 false: assignments 2, 3
1 0 1
test 1 true, test 2 true: assignments 1, 3
1 1 0
test 1 true, test 2 false: assignments 2, 3
1 1 1
test 1 true, test 2 true: assignments 1, 3
Notice that there are only four distinct cases: no assignment,
assignment 4, assignments 2 and 3, or assignments 1 and 3. These
correspond to the possible paths through the nested ifs; the table adds
value by telling us which variable values exercise each of these paths.
Enumerating the paths through a fixed-iteration for loop is
seemingly simple. In the code below,
for (i = 0; i
< N; i++)
a[i] = b[i]*c[i];
the assignment in the loop is performed exactly N times. However, we
can't forget the code executed to set up the loop and to test the
iteration variable. Example 5-9 below
illustrates how to determine the path through a loop.
Example 5-9 Paths
in a loop
The loop code for the FIR filter of Application Example 2-1 is
reproduced below:
for (i = 0, f = 0; i
< N; i++)
f = f + c[i] * x[i];
By examining the control/data flow graph (CDFG) for the code we can
more easily determine how many times various statements are executed.
The CDFG makes it clear that the loop initiation block is executed
once, the test is executed N + 1 times, and the body and loop variable
update are each executed N times.
To measure the longest path length, we must find the longest path
through the optimized CDFG since the compiler may change the structure
of the control and data flow to optimize the program's implementation.
It is important to keep in mind that choosing the longest path through
a CDFG as measured by the number of nodes or edges touched may not
correspond to the longest execution time.
Since the execution time of a node in the CDFG will vary greatly
depending on the instructions represented by that node, we must keep in
mind that the longest path through the CDFG depends on the execution
times of the nodes. In general, it is good policy to choose several of
what we estimate are the longest paths through the program and measure
the lengths of all of them in sufficient detail to be sure that we have
in fact captured the longest path.
Instruction timing
Once we know the execution path of the program, we have to measure the
execution time of the instructions executed along that path. The
simplest estimate is to assume that every instruction takes the same
number of clock cycles, which means we need only count the instructions
and multiply by the per-instruction execution time to obtain the
program's total execution time. However, even ignoring cache effects,
this technique is simplistic for the reasons summarized below.
- Not all instructions take the same amount of time. Although RISC
architectures tend to provide uniform instruction execution times in
order to keep the CPU's pipeline full, even many RISC architectures
take different amounts of time to execute certain instructions.
Multiple load-store instructions are examples of longer-executing
instructions in the ARM architecture. Floating-point instructions show
especially wide variations in execution time - while basic multiply and
add operations are fast, some transcendental functions can take
thousands of cycles to execute.
- Execution times of instructions are not independent. The
execution time of one instruction depends on the instructions around
it. For example, many CPUs use register bypassing to speed up
instruction sequences when the result of one instruction is used in the
next instruction. As a result, the execution time of an instruction may
depend on whether its destination register is used as a source for the
next operation (or vice versa).
- The execution time of an instruction may depend on operand
values. This is clearly true of floating-point instructions in which a
different number of iterations may be required to calculate the result.
Other specialized instructions can, for example, perform a
data-dependent number of integer operations.
We can handle the first two problems more easily than the third. We
can look up instruction execution time in a table; the table will be
indexed by opcode and possibly by other parameter values such as the
registers used.
To handle interdependent execution times, we can add columns to the
table to consider the effects of nearby instructions. Since these
effects are generally limited by the size of the CPU pipeline, we know
that we need to consider a relatively small window of instructions to
handle such effects.
Handling variations due to operand values is difficult to do without
actually executing the program using a variety of data values, given
the large number of factors that can affect value-dependent instruction
timing. Luckily, these effects are often small. Even in floating-point
programs, most of the operations are typically additions and
multiplications whose execution times have small variances.
Cashing effects
Thus far we have not considered the effect of the cache. Because the
access time for main memory can be 10 times larger than the cache
access time, caching can have huge effects on instruction execution
time by changing both the instruction and data access times. Caching
performance inherently depends on the program's execution path since
the cache's contents depend on the history of accesses. Because of the
complexity of this phenomenon, we consider it in more detail below
using both trace-driven and static and dynamic analysis techniques.
Next in
Part 7:
Trace-driven
performance analysis
To read Part 1, go to "
Program
design and analysis."
To read Part 2 , go to "
Models
of program, assemblers and linkers."
To read Part 3, go to "
Basic
Compilation Techniques"
To read Part 4, go to "
The creation of
procedures"
To read Part 5, go to "
Register
allocation and scheduling"
Used with the permission
of the publisher, Newnes/Elsevier,
this series of six articles is based on copyrighted material from "Computers
as Components: Principles of Embedded Computer System Design" by
Wayne Wolf. The book
can be purchased on line.
Wayne Wolf is currently the Georgia
Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer,
Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's
School of Electrical and Computer Engineering (ECE). Previously a
professor of
electrical engineering at Princeton University, he worked at AT&T
Bell Laboratories. He has served as editor in chief of the ACM Transactions
on Embedded Computing and
of Design
Automation for Embedded Systems.
References:
[Sha89] Alan
C. Shaw, "Reasoning
about time in higher-level language software," IEEE
Transactions on
Software Engineering, , 15, July, 1989
Reader Response
I suggest another method to determine the execution time of a function. You need a spare I/O pin on the processor. At the beginning of the function, set the I/O pin high. At the end of the function, set it low. If you connect this pin to a scope, the pulse width measures how long it takes for the function to execute. The period of the signal measures how often the function is executed. The scope may be able to record minimum and maximum values of pulse width and period.
- Robert Drummey
Software Engineer
Harris Corp
Collegeville, PA