The fundamentals of software performance analysis: Part 1 – Path analysis

Editor’s Note: In the first of a two-part series from High-Performance Embedded Computing, author Wayne Wolf discusses how best to measure worst-case, best-case, and average-case execution time when evaluating software performance. Part 1: Path analysis

Just as we need to know the speed at which hardware modules execute to design a hardware system, we need to analyze the performance of programs to build complex software systems. Unfortunately, software performance analysis is much more difficult than hardware timing analysis.

Synchronous hardware design restricts the structure of digital logic so that we can accurately bound delay with reasonable effort. Software designers, in contrast, insist on the full power of Turing machines, which makes it much more difficult to find tight bounds on execution time.

We generally measure program execution time from program initiation at presentation of some inputs to termination at the delivery of the last outputs. Several different measures of software performance are of interest.

  • Worst-case execution time (WCET)—the longest execution time for any possible combination of inputs
  • Best-case execution time (BCET)—the shortest execution time for any possible combination of inputs
  • Average-case execution time for typical inputs

Average-case performance is used for very different purposes than worst-case and best-case execution times. Average performance is used to tune software (and perhaps the underlying hardware platform); average performance data is particularly useful when it is recorded not just for the entire program but for individual program units. Average-case behavior can be used to find hot spots due to poor algorithms, bad coding, poor choice of instructions, or other causes.

Average performance is typically evaluated using CPU simulators. Worst-case and best-case execution times, in contrast, are used during schedulability analysis. Many scheduling algorithms and analysis methods guarantee schedulability based on knowledge of execution time.

Many scheduling algorithms assume that the execution time of a program is constant, which is not realistic given the data-dependent behavior of most interesting programs. WCET is often used as a substitute for exact execution time. However, short execution times can cause scheduling problems that do not occur when all programs run to the worst-case limits, so a more helpful strategy is to measure both worst-case and best-case execution times.
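To make these measures concrete, here is a minimal measurement harness (a hypothetical Python sketch, not part of the original text; the function under test and its inputs are invented). It records observed best, worst, and average times over a small test set; because the test set is finite, the observed maximum is only a lower bound on the true WCET, which is exactly why measurement alone is not enough.

import time

def measure_execution_times(func, test_inputs):
    # Time func on each test input; with a finite test set this yields
    # estimates only: max(times) <= true WCET and min(times) >= true BCET.
    times = []
    for args in test_inputs:
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    return {
        "best_observed": min(times),
        "worst_observed": max(times),
        "average": sum(times) / len(times),
    }

# Hypothetical program under test with data-dependent running time.
def find_first_negative(values):
    for i, v in enumerate(values):
        if v < 0:
            return i
    return -1

print(measure_execution_times(
    find_first_negative,
    [([1, 2, 3, -4],), ([-1, 2, 3, 4],), ([1, 2, 3, 4],)]))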

We cannot bound execution times through simulation because interesting programs have far too many input combinations to enumerate. We need to analyze the program to determine its timing behavior. When analyzing hardware to determine its maximum clock rate, we rely on the fact that combinational logic is generally acyclic to simplify analysis. When we analyze programs, we must handle arbitrary control flows. As illustrated in Figure 3-26, performance analysis algorithms must handle not only branches but also loops and other types of cyclic control flow.

Figure 3-26: Execution paths through a program

The basic approach to worst-case execution time analysis still followed today was suggested by Shaw [Sha89] and extended by Park and Shaw [Par91]. They broke WCET analysis into two phases:

  1. Analyze the program to find the worst-case path through it. This problem is sometimes called path analysis.
  2. Measure the execution time along the worst-case path. Let us call this problem path timing.

The initial step in this methodology was to efficiently identify the worst-case execution path through the program. Park and Shaw developed a performance model of the 68000 that gave them good results. Unfortunately, modern high-performance microprocessors are much more complex, and the timing of instruction sequences is much harder to determine. Research over the past 20 years has improved both the path identification process and the processor performance models.

Path analysis and path timing interact to some extent for complex machines. If the execution time of an instruction depends on the previous instructions, as is the case for both caches and pipelines, then we cannot easily find long paths without a detailed analysis of the timing of path segments. We can handle this by finding candidate long paths with abstract timing models and then refining the analysis using detailed timing models on a smaller set of paths.
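One way to realize this refinement is sketched below, under the simplifying assumption that a small set of candidate paths has already been extracted (real analyses work on the flow graph rather than explicit path lists, and the timing models here are toys): rank paths with a cheap abstract model, then re-time only the most promising ones with a detailed model.

def refine_wcet_estimate(candidate_paths, abstract_cost, detailed_cost, k=3):
    # Phase 1: rank candidate paths with a cheap abstract timing model.
    ranked = sorted(candidate_paths, key=abstract_cost, reverse=True)
    # Phase 2: apply the expensive, state-aware model (pipeline/cache)
    # only to the top-k candidates and keep the largest result.
    # This is only safe if the abstract model cannot under-rank the
    # true worst-case path (for example, by being conservative).
    return max(detailed_cost(path) for path in ranked[:k])

# Toy models: the abstract cost counts instructions, the "detailed"
# model charges extra for multiplies.
paths = [
    ["add", "add", "mul", "mul"],
    ["add", "add", "add"],
    ["mul"],
]
abstract = len
detailed = lambda p: sum(10 if op == "mul" else 1 for op in p)
print(refine_wcet_estimate(paths, abstract, detailed, k=2))   # prints 22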

Performance Models
Determining the performance of a program fragment generally requires reducing high-level language code to instructions. Compilers perform many complex, interacting transformations on high-level language programs that make it difficult to directly determine the execution time of high-level language statements.

Given a sequence of instructions and a CPU, we want to determine the WCET of that sequence on the processor. The simplest case would be if all instructions took the same amount of time to execute; we could then simply count instructions and multiply by the time per instruction. However, that ideal world has never existed. Even early computers were designed to take varying amounts of time to execute different instructions; the PDP-8, for example, exhibited a large variation between its fastest and slowest instructions.

The performance model Park and Shaw [Par91] developed for the 68000 was, by today’s standards, relatively simple. Although the 68000 had a large instruction set, it had only a simple pipeline and no cache. The manufacturer provided a table that gave the execution time of the various opcodes. The execution time for a code fragment could be accurately determined by looking up the execution times of its instructions and summing them.
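In code, such a table-driven model is little more than a lookup and a sum. The sketch below uses invented cycle counts (not actual 68000 figures) to show the idea.

# Illustrative per-opcode cycle counts; the real values came from the
# manufacturer's timing tables.
CYCLE_TABLE = {"move": 4, "add": 8, "mul": 70, "beq": 10}

def fragment_cycles(opcodes):
    # Valid only when each instruction's time is independent of processor
    # state (no cache, trivial pipeline), as on 68000-class machines.
    return sum(CYCLE_TABLE[op] for op in opcodes)

print(fragment_cycles(["move", "add", "mul", "beq"]))   # 92 cycles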

Today’s processors do not admit this simple approach because the execution time of an instruction depends on the processor state in various ways. The pipeline is one important source of variation. The execution time of an instruction depends in part on the behavior of other instructions in the pipeline. Two instructions that compete for the same resource will execute more slowly when executed together than if they were executed at widely separated times.

The memory system is an even larger source of variation. Caches can introduce an order of magnitude or more of variation in memory access time, depending on whether an access is a hit or a miss. Parallel memory systems can add further variation, as can DRAM refresh, page modes, and other details of the types of memory components used.

Wilhelm [Wil04] used the term timing accident to describe the cause of an increase in an instruction’s execution time in a program, and timing penalty to describe the amount of the increase. Timing accidents can come from many different mechanisms at any stage in program execution. Wilhelm points out that the interference between these mechanisms makes for nonintuitive results, in which the best case for one part of the system may lead to a longer total execution time.

Path Analysis
The goal of abstract program flow analysis is to bound the set of feasible paths. Because path analysis is equivalent to the halting problem, it cannot find the exact set of paths and may include some paths that are not feasible. Puschner and Koza [Pus89; Pus93] analyzed the execution time of structured programs.

They syntactically analyzed the program and assigned an execution time for each element in the parse tree. The execution time of the program was computed as the sum of the execution times of the statements. They added a MAX COUNT construct to their programming language to allow programmers to specify the maximum number of times that a loop will be executed.
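A minimal version of such a syntax-directed bound might look like the following sketch (the node shapes, times, and the max_count field are illustrative, not Puschner and Koza's actual notation): sequences sum, conditionals take the more expensive branch, and loops multiply the body's bound by the programmer-supplied iteration limit.

def wcet_bound(node):
    kind = node["kind"]
    if kind == "basic":                 # straight-line code: fixed time
        return node["time"]
    if kind == "seq":                   # sequence: sum of the parts
        return sum(wcet_bound(part) for part in node["parts"])
    if kind == "if":                    # conditional: test plus worse branch
        return node["test_time"] + max(wcet_bound(node["then"]),
                                       wcet_bound(node["else"]))
    if kind == "loop":                  # loop: programmer-supplied MAX COUNT
        return node["max_count"] * (node["test_time"] + wcet_bound(node["body"]))
    raise ValueError("unknown node kind: " + kind)

# A loop that runs at most 10 times around a conditional body.
tree = {"kind": "loop", "max_count": 10, "test_time": 1,
        "body": {"kind": "if", "test_time": 2,
                 "then": {"kind": "basic", "time": 5},
                 "else": {"kind": "basic", "time": 3}}}
print(wcet_bound(tree))                 # 10 * (1 + 2 + 5) = 80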

Today, many WCET methodologies use integer linear programming (ILP) to implicitly solve for paths. A set of constraints describes the structure of the program and some aspects of its behavior. An ILP solver finds the values for variables that identify the longest path through the program; this is done without enumerating all the paths.

Li and Malik [Li97c] modeled a program with a system of constraints: structural constraints describe conditionals and other control flow. Finiteness and start constraints help bound loop iterations; tightening constraints come either from the user or from an analysis of infeasible paths. (Figure 3-31 in Part 2 of this series illustrates the use of structural constraints.)

Each edge in the CDFG is represented by a variable whose value is equal to the number of times the program control flow passes through that edge. Conservation of flow tells us several facts: i = o, since the conditional is exited the same number of times it is entered; i = a + b and o = r + s for similar reasons. This implies that a + b = r + s. A constraint solver can use all the constraints collected for the program to solve for a bound on the number of times each edge is executed.

Figure 3-27 shows the program flow constraints for a while statement. In this case, we know that i + b = o + t. Since C defines for loops in terms of while loops [Ker78], we can use this construct to handle for loops as well. User constraints can easily be added to an integer linear program. The user could, for example, bound the number of iterations of a while loop by writing an inequality involving the while loop’s b variable.

Figure 3-27: Program flow constraints in a while loop

The objective function for the integer linear program is to maximize the program’s total execution time, expressed in terms of the flow through the network. Li and Malik [Li97c] used standard solvers to solve these systems of constraints. They evaluated their technique by analyzing several benchmark programs to find their worst-case inputs and the related execution time bounds.
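The sketch below shows an implicit-path-enumeration formulation in this spirit. It uses the PuLP ILP library (an assumption; any ILP solver would do), and for brevity it models per-block execution counts rather than the per-edge flow variables described above. The block times and the loop bound of 10 are invented: the program is a while loop whose body is an if-then-else.

from pulp import LpMaximize, LpProblem, LpVariable, lpSum, value

# Illustrative per-block execution times (cycles).
time_of = {"entry": 2, "loop_test": 1, "if_test": 2,
           "then": 5, "else": 3, "exit": 1}

x = {b: LpVariable("x_" + b, lowBound=0, cat="Integer") for b in time_of}

prob = LpProblem("wcet", LpMaximize)
prob += lpSum(time_of[b] * x[b] for b in time_of)             # total execution time

# Structural (flow conservation) constraints.
prob += x["entry"] == 1                                       # program entered once
prob += x["exit"] == 1                                        # ... and exited once
prob += x["loop_test"] == x["entry"] + x["then"] + x["else"]  # flow into loop test
prob += x["loop_test"] == x["if_test"] + x["exit"]            # flow out of loop test
prob += x["if_test"] == x["then"] + x["else"]                 # conditional in the body

# User constraint: the loop body executes at most 10 times.
prob += x["if_test"] <= 10

prob.solve()
print(int(value(prob.objective)))    # WCET bound for this sketch: 84 cycles

The solver charges the more expensive then branch on every iteration; that choice is exactly the implicit longest path, found without ever enumerating paths explicitly.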

Figure 3-28 shows the results of their experiments, in which they compared their analysis method to both hand-analyzed WCET bounds and measured execution times of worst-case inputs on an Intel i960 CPU.

Figure 3-28: Experimental evaluation of the accuracy of ILP-based WCET analysis

Li et al. [Li95] added instruction cache behavior to the model. They did this by breaking the program into cache-line-sized units. In the basic model, a block in the CDFG represents straight-line code. The extended model handles direct-mapped instruction caches. Each straight-line code segment is broken into one or more units known as l-blocks that correspond to cache lines.

Each l-block has two execution times, one for a cache hit and one for a cache miss. Two l-blocks conflict if the execution of one will cause the other to be displaced from the cache; two l-blocks cannot conflict if they do not map onto the same cache line. A cache conflict graph is constructed for every cache line that has two or more conflicting l-blocks to keep track of the state of the cache lines.

As shown in Figure 3-29, the graph has a start node s and an end node e, as well as a node for every l-block mapped into the cache line. The graph has an edge from one node to another if a path exists from the first basic block to the second that does not require passing through other l-blocks mapped into the same cache line; in other words, an edge represents a direct conflict caused when one l-block knocks another out of the cache. Constraints from the cache conflict graph are added to the solution.
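The sketch below shows the bookkeeping behind this construction under invented addresses and cache geometry: basic blocks are split into l-blocks at cache line boundaries, and two l-blocks conflict only when they come from different basic blocks yet map to the same line of a direct-mapped cache.

LINE_SIZE = 16          # bytes per cache line (illustrative)
NUM_LINES = 4           # direct-mapped cache with 4 lines (illustrative)

def split_into_l_blocks(basic_blocks):
    # basic_blocks: list of (name, start_address, size_in_bytes)
    result = []
    for name, start, size in basic_blocks:
        addr = start
        while addr < start + size:
            line = (addr // LINE_SIZE) % NUM_LINES
            end = min((addr // LINE_SIZE + 1) * LINE_SIZE, start + size)
            result.append((name, addr, end - addr, line))   # one l-block
            addr = end
    return result

def conflicting_pairs(l_blocks):
    # l-blocks from different basic blocks conflict only if they map to
    # the same cache line; each such line gets a cache conflict graph.
    return [(a, b) for i, a in enumerate(l_blocks) for b in l_blocks[i + 1:]
            if a[3] == b[3] and a[0] != b[0]]

blocks = [("B1", 0, 24), ("B2", 64, 16)]    # B2 maps onto B1's first line
lbs = split_into_l_blocks(blocks)
print(lbs)
print(conflicting_pairs(lbs))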

Figure 3-29: A cache conflict graph for two cache blocks

The programmer may have knowledge of the program’s behavior that is difficult for a timing analysis tool to reproduce. Many timing analysis methodologies allow user constraints to be incorporated into their analyses. User constraints can provide tighter bounds, assuming of course that the constraints accurately reflect program behavior. Exactly how these user constraints are incorporated into the analysis depends on the analytical technique used.

Part 2: Path Timing

This two-part series is based on material printed from High-Performance Embedded Computing by Wayne Wolf, with permission from Morgan Kaufmann, a division of Elsevier, copyright 2007. For more information about this title and other similar books, visit www.elsevierdirect.com.

References


Ker78 Brian W. Kernighan and Dennis M. Ritchie, The C Programming Language, Prentice-Hall, 1978.

Li95 Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe, “Performance estimation of embedded software with instruction cache modeling,” Proceedings of the 1995 IEEE/ACM International Conference on Computer-Aided Design, IEEE Computer Society Press, 1995, pp. 380-387.

Li97c Yau-Tsun Steven Li and Sharad Malik, “Performance analysis of embedded software using implicit path enumeration,” IEEE Transactions on CAD/ICAS, 16(12), December 1997, pp. 1477-1487.

Par91 Chang Yun Park and Alan C. Shaw, “Experiments with a program timing tool based on source-level timing schema,” IEEE Computer, 24(5), May 1991, pp. 48-57.

Pus89 Peter Puschner and Christian Koza, “Calculating the maximum execution time of real-time programs,” Journal of Real-Time Systems, 1(2), 1989, pp. 159-176.

Pus93 Peter Puschner and Anton Schedl, “A tool for the computation of worst case task execution times,” Proceedings, 5th Euromicro Workshop on Real-Time Systems, 1993, pp. 224-229.

Sha89 Alan C. Shaw, “Reasoning about time in higher-level language software,” IEEE Transactions on Software Engineering, 15(7), July 1989, pp. 875-889.

Wil04 Reinhard Wilhelm, “Why AI + ILP is good for WCET, but MC is not, nor ILP alone,” Proceedings, VMCAI 2004, Lecture Notes in Computer Science vol. 2937, Springer Verlag, 2004, pp. 309-322.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Institute of Technology's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he also worked at AT&T Bell Laboratories. He has served as editor-in-chief of ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.
