
The basics of programming embedded processors: Part 6

Because embedded systems must perform functions in real time, we often need to know how fast a program runs. The techniques we use to analyze program execution time are also helpful in analyzing properties such as power consumption.

In this part in the series, we study how to analyze programs to estimate their run times. We also examine how to optimize programs to improve their execution times; of course, optimization relies on analysis.

Note the word "estimate" in the last paragraph. While we might hope that the execution time of programs could be precisely determined, this is in fact difficult to do in practice. A few of the reasons for such difficulty are summarized below.

  • The execution time of a program often varies with the input data values because those values select different execution paths in the program. For example, loops may be executed a varying number of times, and different branches may execute blocks of varying complexity.
  • The cache has a major effect on program performance, and once again, the cache's behavior depends in part on the data values input to the program.
  • Execution times may vary even at the instruction level. Floating-point operations are the most sensitive to data values, but the normal integer execution pipeline can also introduce data-dependent variations. In general, the execution time of an instruction in a pipeline depends not only on that instruction but on the instructions around it in the pipeline.

Measuring Execution Speed
We can measure program performance in several ways:

  • Some microprocessor manufacturers supply simulators for their CPUs. The simulator runs on a workstation or PC, takes as input an executable for the microprocessor along with input data, and simulates the execution of that program. Some of these simulators go beyond functional simulation to measure the execution time of the program. Simulation is clearly slower than executing the program on the actual microprocessor, but it also provides much greater visibility during execution. Be careful: some microprocessor performance simulators are not 100% accurate, and simulation of I/O-intensive code may be difficult.

  • A timer connected to the microprocessor bus can be used to measure performance of executing sections of code. The code to be measured would reset and start the timer at its start and stop the timer at the end of execution. The length of the program that can be measured is limited by the accuracy of the timer.
  • A logic analyzer can be connected to the microprocessor bus to measure the start and stop times of a code segment. This technique relies on the code being able to produce identifiable events on the bus to identify the start and stop of execution. The length of code that can be measured is limited by the size of the logic analyzer's buffer.

We are interested in the following three types of performance measures on programs:

  • average-case execution time: This is the typical execution time we would expect for typical data. Clearly, the first challenge is defining typical inputs.
  • worst-case execution time: The longest time that the program can spend on any input sequence is clearly important for systems that must meet deadlines. In some cases, the input set that causes the worst-case execution time is obvious, but in many cases it is not.
  • best-case execution time: This measure can be important in multirate real-time systems.

First, we look at the fundamentals of program performance in more detail. We then consider trace-driven performance based on executing the program and observing its behavior.

Elements of Program Performance
The key to evaluating execution time is breaking the performance problem into parts. Program execution time [Sha89] can be seen as

execution time = program path + instruction timing

The path is the sequence of instructions executed by the program (or its equivalent in the high-level language representation of the program). The instruction timing is determined based on the sequence of instructions traced by the program path, which takes into account data dependencies, pipeline behavior, and caching. Luckily, these two problems can be solved relatively independently.

Although we can trace the execution path of a program through its high-level language specification, it is hard to get accurate estimates of total execution time from a high-level language program. This is because, as we saw earlier, there is not a direct correspondence between program statements and instructions. The number of memory locations and variables must be estimated, and results may be either saved for reuse or recomputed on the fly, among other effects. These problems become more challenging as the compiler puts more effort into optimizing the program. However, some aspects of program performance can be estimated by looking directly at the C program. For example, if a program contains a loop with a large, fixed iteration bound or if one branch of a conditional is much longer than another, we can get at least a rough idea that these are the more time-consuming segments of the program.

Of course, a precise estimate of performance also relies on the instructions to be executed, since different instructions take different amounts of time. (In addition, to make life even more difficult, the execution time of one instruction can depend on the instructions executed before and after it.) Example 5-8 below illustrates data-dependent program paths.

Example 5-8 Data-Dependent Paths in if Statements
Here is a pair of nested if statements:

if (a || b) { /* test 1 */
    if (c) /* test 2 */
        { x = r * s + t; /* assignment 1 */ }
    else
        { y = r + s; /* assignment 2 */ }
    z = r + s + u; /* assignment 3 */
} else {
    if (c) /* test 3 */
        { y = r - t; /* assignment 4 */ }
}

The conditional tests and assignments are labeled within each if statement to make it easier to identify paths. What execution paths may be exercised? One way to enumerate all the paths is to create a truth-table-like structure. The paths are controlled by the variables in the if conditions, namely a, b, and c. For any given combination of values of those variables, we can trace through the program to see which branch is taken at each if and which assignments are performed. For example, when a = 1, b = 0, and c = 1, then test 1 is true and test 2 is true. This means we first perform assignment 1 and then assignment 3.

Results for all the controlling variable values follow:

a    b    c    Path
0    0    0    test 1 false, test 3 false: no assignments
0    0    1    test 1 false, test 3 true: assignment 4
0    1    0    test 1 true, test 2 false: assignments 2, 3
0    1    1    test 1 true, test 2 true: assignments 1, 3
1    0    0    test 1 true, test 2 false: assignments 2, 3
1    0    1    test 1 true, test 2 true: assignments 1, 3
1    1    0    test 1 true, test 2 false: assignments 2, 3
1    1    1    test 1 true, test 2 true: assignments 1, 3

Notice that there are only four distinct cases: no assignment, assignment 4, assignments 2 and 3, or assignments 1 and 3. These correspond to the possible paths through the nested ifs; the table adds value by telling us which variable values exercise each of these paths.

Enumerating the paths through a fixed-iteration for loop is seemingly simple. In the code below,

for (i = 0; i < N; i++)
    a[i] = b[i] * c[i];

the assignment in the loop is performed exactly N times. However, we can't forget the code executed to set up the loop and to test the iteration variable. Example 5-9 below illustrates how to determine the path through a loop.

Example 5-9 Paths in a Loop
The loop code for the FIR filter of Application Example 2-1 is reproduced below:

for (i = 0, f = 0; i < N; i++)
    f = f + c[i] * x[i];

By examining the control/data flow graph (CDFG) for the code we can more easily determine how many times various statements are executed. The CDFG makes it clear that the loop initiation block is executed once, the test is executed N + 1 times, and the body and loop variable update are each executed N times.

To measure the longest path length, we must find the longest path through the optimized CDFG, since the compiler may change the structure of the control and data flow to optimize the program's implementation. It is important to keep in mind that choosing the longest path through a CDFG as measured by the number of nodes or edges touched may not correspond to the longest execution time.

Since the execution time of a node in the CDFG will vary greatly depending on the instructions represented by that node, we must keep in mind that the longest path through the CDFG depends on the execution times of the nodes. In general, it is good policy to choose several of what we estimate are the longest paths through the program and measure the lengths of all of them in sufficient detail to be sure that we have in fact captured the longest path.

Instruction timing
Once we know the execution path of the program, we have to measure the execution time of the instructions executed along that path. The simplest estimate is to assume that every instruction takes the same number of clock cycles, which means we need only count the instructions and multiply by the per-instruction execution time to obtain the program's total execution time. However, even ignoring cache effects, this technique is simplistic for the reasons summarized below.

  • Not all instructions take the same amount of time. Although RISC architectures tend to provide uniform instruction execution times in order to keep the CPU's pipeline full, even many RISC architectures take different amounts of time to execute certain instructions. Multiple load-store instructions are examples of longer-executing instructions in the ARM architecture. Floating-point instructions show especially wide variations in execution time: while basic multiply and add operations are fast, some transcendental functions can take thousands of cycles to execute.
  • Execution times of instructions are not independent. The execution time of one instruction depends on the instructions around it. For example, many CPUs use register bypassing to speed up instruction sequences when the result of one instruction is used in the next instruction. As a result, the execution time of an instruction may depend on whether its destination register is used as a source for the next operation (or vice versa).
  • The execution time of an instruction may depend on operand values. This is clearly true of floating-point instructions in which a different number of iterations may be required to calculate the result. Other specialized instructions can, for example, perform a data-dependent number of integer operations.

We can handle the first two problems more easily than the third. We can look up instruction execution time in a table; the table will be indexed by opcode and possibly by other parameter values such as the registers used.

To handle interdependent execution times, we can add columns to the table to consider the effects of nearby instructions. Since these effects are generally limited by the size of the CPU pipeline, we know that we need to consider a relatively small window of instructions to handle such effects.

Handling variations due to operand values is difficult to do without actually executing the program using a variety of data values, given the large number of factors that can affect value-dependent instruction timing. Luckily, these effects are often small. Even in floating-point programs, most of the operations are typically additions and multiplications whose execution times have small variances.

Caching effects
Thus far we have not considered the effect of the cache. Because the access time for main memory can be 10 times larger than the cache access time, caching can have huge effects on instruction execution time by changing both the instruction and data access times. Caching performance inherently depends on the program's execution path since the cache's contents depend on the history of accesses. Because of the complexity of this phenomenon, we consider it in more detail below using both trace-driven and static and dynamic analysis techniques.

Next in Part 7: Trace-driven performance analysis
To read Part 1, go to "Program design and analysis."
To read Part 2, go to "Models of programs, assemblers and linkers."
To read Part 3, go to "Basic Compilation Techniques."
To read Part 4, go to "The creation of procedures."
To read Part 5, go to "Register allocation and scheduling."

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on copyrighted material from "Computers as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book can be purchased online.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.

References:
[Sha89] Alan C. Shaw, "Reasoning about time in higher-level language software," IEEE Transactions on Software Engineering, vol. 15, no. 7, July 1989.

Reader Response


I suggest another method to determine the execution time of a function. You need a spare I/O pin on the processor. At the beginning of the function, set the I/O pin high. At the end of the function, set it low. If you connect this pin to a scope, the pulse width measures how long it takes for the function to execute. The period of the signal measures how often the function is executed. The scope may be able to record minimum and maximum values of pulse width and period.

– Robert Drummey
Software Engineer
Harris Corp
Collegeville, PA

