The fundamentals of software performance analysis: Part 1 - Path analysis
Editor’s Note: In the first in a two part series from High Performance Embedded Computing, author Wayne Wolf discusses the best means by which to measure worst-case, best-case, and average-case execution time in evaluating software performance. Part 1: Path analysis
Just as we need to know the speed at which hardware modules execute to design a hardware system, we need to analyze the performance of programs to build complex software systems. Unfortunately, software performance analysis is much more difficult than hardware timing analysis.
Synchronous hardware design restricts the structure of digital logic so that we can accurately bound delay with reasonable effort. Software designers, in contrast, insist on the full power of Turing machines, which makes it much more difficult to find tight bounds on execution time.
We generally measure program execution time from program initiation at presentation of some inputs to termination at the delivery of the last outputs. Several different measures of software performance are of interest.
- Worst-case execution time (WCET)—the longest execution time for any possible combination of inputs
- Best-case execution time (BCET)—the shortest execution time for any possible combination of inputs
- Average-case execution time for typical inputs
Average-case performance is used for very different purposes than are worstcase and best-case execution times. Average performance is used to tune software (and perhaps the underlying hardware platform); average performance data is particularly useful when it is recorded not just for the entire program but for program units. Average-case behavior can be used to find hot spots due to poor algorithms, bad coding, poor choice of instructions, or other causes.
Average performance is typically evaluated using CPU simulators.. Worst-case and best-case execution times, in contrast, are used during schedulability analysis. Many scheduling algorithms and analysis methods guarantee schedulability based on knowledge of execution time.
Many scheduling algorithms assume that the execution time of a program is constant, which is not realistic given the data-dependent behavior of most interesting programs. WCET is often used as a substitute for exact execution time. However, short execution times can cause scheduling problems that do not occur when all programs run to the worst-case limits, so a more helpful strategy is to measure both worst-case and best-case execution times.
We cannot bound execution times through simulation because interesting programs have far too many input combinations to enumerate. We need to analyze the program to determine its timing behavior. When analyzing hardware to determine its maximum clock rate, we rely on the fact that combinational logic is generally acyclic to simplify analysis. When we analyze programs, we must handle arbitrary control flows. As illustrated in Figure 3-26, performance analysis algorithms must handle not only branches but also loops and other types of cyclic control flow.
The basic approach to worst-case execution time analysis still followed today was suggested by Shaw [Sha89] and extended by Park and Shaw [Par91]. They broke WCET analysis into two phases:
- Analyze the program to find the worst-case path through a program. This problem is sometimes called path analysis.
- Measure the execution time along the worst-case path. Let us call this problem path timing.
The initial step in this methodology was to efficiently identify the worst-case execution path through the program. Park and Shaw developed a performance model of a 68000 that gave them good results. Unfortunately, modern high-performance microprocessors are much more complex, and the timing of instruction sequences is much more difficult. Research over the past 20 years has improved both the path identification process and the processor performance models.
Path analysis and path timing interact to some extent for complex machines. If the execution time of an instruction depends on the previous instructions, as is the case for both caches and pipelines, then we cannot easily find long paths without a detailed analysis of the timing of path segments. We can handle this by finding candidate long paths with abstract timing models and then refining the analysis using detailed timing models on a smaller set of paths.
Determining the performance of a program fragment generally requires reducing high-level language code to instructions. Compilers perform many complex, interacting transformations on high-level language programs that make it difficult to directly determine the execution time of high-level language statements.
Given a sequence of instructions and a CPU, we want to determine the WCET of that sequence on the processor. The simplest case would be if all instructions took the same amount of time to execute; we could then simply count instructions and multiply by the execution rate. However, that ideal world has never existed. Even the early computers were designed to take varying amounts of time to execute different instructions; the PDP-8, for example, exhibited large variations between the fastest and slowest instruction.
The performance model Park and Shaw [Par91] developed for the 68000, by today’s standard, was relatively simple. Although it had a large instruction set, it had only a simple pipeline and no cache. The manufacturer provided a table that gave the execution time of the various opcodes. The execution time for a code fragment could be accurately determined by looking up various instructions’ execution times.
Today’s processors do not admit this simple approach because the execution time of a processor depends on the processor state in various ways. The pipeline is one important source of variation. The execution time of an instruction depends in part on the behavior of other instructions in the pipeline. Two instructions that compete for the same resource will execute more slowly when executed together than if they were executed at widely separated times.
The memory system is an even larger source of variation. Caches can introduce an order of magnitude, or more variation in memory accesses, depending on whether the access is a hit or a miss. Parallel memory systems can add further variation, as can DRAM refreshing, page modes, and other details of the types of memory components used.
Wilhelm [Wil04] used the term timing accident to describe the cause for an increase in the instruction time in a program, and timing penalty to describe the amount of increase. Timing accident can come from many different mechanisms at any stage in program execution. Wilhelm points out that the interference between these mechanisms makes for nonintuitive results, in which the best case for one part of the system may lead to longer total execution time.