
The fundamentals of software performance analysis: Part 2 – Path timing

Editor’s Note: In the second installment of a two-part series from High-Performance Embedded Computing, author Wayne Wolf discusses the best means by which to measure worst-case, best-case, and average-case execution time in evaluating software performance. Part 1: Path analysis

Several techniques are used to analyze path timing at different levels of abstraction. Abstract interpretation analyzes the behavior of the program symbolically in order to understand its execution states more accurately. Data flow analysis provides a more detailed view of how the program behaves. The most concrete technique is simulation, which determines in detail how the processor responds to the program.

Bounding the number of iterations taken by a loop is particularly important for WCET analysis, given that most programs spend much of their execution time in loops. Some loops, such as this FIR filter, are easy to analyze:

for (i = 0; i < N; i++)
  y[i] = c[i] * x[i];

But loops with conditionals create problems.

for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    if (i != j)
      c[i][j] = a[i]*b[i][j];
    else
      c[i][j] = 0;

The algorithm of Healy et al. [Hea99a] bounds loop iterations in four phases:

  1. It first uses an iterative algorithm to identify branches that affect the number of iterations.
  2. It then identifies the loop iteration on which a loop-dependent branch changes direction.
  3. It determines when the branches found in step 1 are reached.
  4. Finally, it uses these results to calculate the iteration bound.

Figure 3-30 shows an example of loop iteration bounding from Healy et al. [Hea99a]. The loop can be exited by a complex condition.

Figure 3-30: An example of loop iteration bounding
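The figure itself is not reproduced here, but a hypothetical C loop consistent with its description might look like the following (the variable names and the data-dependent flag cond are illustrative assumptions, not taken from the figure):

/* Hypothetical reconstruction of the Figure 3-30 loop: j advances by
   3 on each iteration; the loop may exit once j exceeds 75, provided
   a data-dependent condition also holds, and must exit once j
   exceeds 300. */
j = 0;
do {
    j += 3;
    /* ... loop body ... */
} while (!((j > 75 && cond) || j > 300));

Tracing this sketch matches the bounds derived below: the earliest possible exit is iteration 26 (j == 78) and the latest is iteration 101 (j == 303).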

In the first step, the algorithm identifies the four control flow boxes with jumps as the points at which loop-dependent branches can change direction. It then determines that the first conditional branch, j > 75, can first be taken on iteration 26 (since j is incremented by 3 on each iteration). The condition j > 300 is taken on iteration 101, as is the condition j < 100. This bounds the number of iterations to a minimum of 26 and a maximum of 101.

Vivancos et al. [Viv01] proposed a parametric timing analysis method to determine the timing behavior of programs with indefinite loops. They iteratively test the worst-case execution time of the program, starting with zero iterations and increasing the number of iterations by one at each step. The testing terminates when the WCET stabilizes.

Figure 3-31 shows a simple conditional, its associated CDFG, and the flow variables assigned to the edges in the CDFG.

Figure 3-31: Program flow constraint in an if statement
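Although the figure is not reproduced here, the flow constraints for an if statement take a standard form in this style of analysis. As a sketch, using illustrative variable names that are not taken from the figure, let $x_1$ count executions of the edge entering the conditional, $x_2$ and $x_3$ the then and else edges, and $x_4$ the edge leaving the join. Conservation of flow then requires

$x_1 = x_2 + x_3, \qquad x_4 = x_2 + x_3$

and an ILP-based analysis maximizes total execution time subject to such constraints, weighting each block's execution count by its worst-case time.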

Ermedahl et al. [Erm05] use a clustering technique to determine which parts of a program must be handled as a unit. They annotate the control flow graph of the program with flow facts, which include the defining scope, a context specifier, and a constraint expression. The constraints may involve execution count variables and constants. As illustrated in Figure 3-32, they cluster facts together to find the maximum scopes over which flow facts apply, such as in nested loops.


Figure 3-32: Example code (top) and flow facts and clusters (bottom)
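As a hypothetical illustration of why clustering matters (this example and its annotation are not taken from the paper), consider a triangular nested loop. No bound on the inner loop alone is tight; the useful flow fact spans both loops and must therefore be handled as a unit:

for (i = 0; i < N; i++)
  for (j = 0; j <= i; j++)
    c[i][j] = a[i] * b[j];   /* flow fact: the body executes
                                N*(N+1)/2 times in total, not N*N */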

Healy et al. [Hea99b] use static cache simulation to categorize the behavior of each instruction in the instruction cache. Given the CDFG for the program, they use an iterative algorithm to determine which program lines may be in the cache at the entry and exit of each basic block. A hypothetical illustration follows the two category lists below.

Worst-case categories include the following:

  • Always miss—Not guaranteed to be in the cache when referenced.
  • Always hit—Guaranteed to be in the cache when it is referenced.
  • First miss—Not guaranteed to be in the cache on the first loop iteration but guaranteed to be in the cache on subsequent iterations.
  • First hit—Guaranteed to be in the cache on the first iteration but not guaranteed on subsequent iterations.

Best-case categories include the following:

  • Always miss—Guaranteed not to be in the cache when first referenced.
  • Always hit—May be in the cache every time it is referenced.
  • First miss—Guaranteed not to be in the cache on the first loop iteration but may be on later iterations.
  • First hit—Instruction may be in the cache on the first loop iteration but is guaranteed not to be in the cache for later iterations.
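As a concrete illustration (the loop and the cache assumptions here are illustrative, not from the paper), assume a direct-mapped instruction cache, initially empty, that is large enough to hold the following loop:

for (i = 0; i < N; i++)     /* worst case: loop-control and body      */
  s += a[i];                /* instructions are "first miss": they    */
                            /* miss on iteration 1, then always hit   */

The analysis can therefore charge one miss per instruction rather than one miss per iteration, which substantially tightens the worst-case bound.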

To analyze the behavior of the instructions in the pipeline, Healy et al. used a table that describes, for each type of instruction, the worst-case and best-case numbers of cycles required at every stage of the pipeline.

They analyzed paths within a single loop iteration or a single invocation of a function, either of which can consist of multiple basic blocks. Their algorithm for determining the worst-case timing along a path is shown in Figure 3-33.

Figure 3-33: Algorithm for worst-case path analysis
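The figure is not reproduced here, but the following is a minimal sketch of the shape such an analysis takes: enumerate the basic-block paths through one loop iteration, time each path using the per-instruction cache categories and the pipeline table, and keep the maximum. All names here are hypothetical placeholders, not the routines of Figure 3-33.

#include <stdio.h>

struct path { long nominal_cycles; };   /* stand-in for a real path record */

/* Placeholder timing model: a real implementation would walk the
   path's instructions, applying the cache categories and the
   pipeline cycle table. */
static long path_time(const struct path *p) {
    return p->nominal_cycles;
}

/* Worst-case time of one loop iteration: the longest of its paths. */
static long worst_case_iteration(const struct path *paths, int npaths) {
    long wcet = 0;
    for (int p = 0; p < npaths; p++) {
        long t = path_time(&paths[p]);
        if (t > wcet)
            wcet = t;
    }
    return wcet;
}

int main(void) {
    struct path paths[] = { {12}, {19}, {15} };   /* three paths, in cycles */
    printf("WCET of one iteration: %ld cycles\n",
           worst_case_iteration(paths, 3));
    return 0;
}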

Abstract interpretation executes a program using symbolic values rather than specific numeric values [The99]. This technique allows us to generalize information about program execution across sets of variable values in a reasonable amount of computation time. Abstract interpretation first determines a concrete semantics for the program that describes a particular aspect of the program's behavior, such as cache or pipeline behavior.

Based on the concrete semantics, an abstract semantics is built that operates on the program's abstract domain. An element of the abstract domain of the program represents a set of members of the concrete domain. An abstract state represents a set of concrete states; the abstract representation may be conservative in that it includes concrete states that do not actually occur.

The concrete semantics for the cache state represents how memory blocks map into cache lines. The abstract cache states map a cache line to a set of memory blocks. Theiling et al. described three types of analysis on this abstract state: must analysis, may analysis, and persistence analysis.

Must analysis looks at the upper bounds of the ages of the memory blocks. When combining two cache states, a memory block stays in the abstract cache only if it is in both predecessors, and it receives the larger of their ages. This operation is similar to set intersection.

May analysis looks at the lower bounds on the ages of the memory blocks. A block is in the abstract cache if it is in at least one predecessor, and it receives the youngest of the cache ages. Persistence analysis does not know whether the first access will be a hit or a miss but does guarantee that succeeding accesses will be hits.
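As a minimal sketch of the two join operations just described (the representation, in which an age of WAYS or more means a block is treated as uncached, is an assumption here, not the notation of Theiling et al.), consider a fully associative LRU cache whose abstract state records one age bound per memory block:

#define WAYS 4                    /* cache associativity               */
#define NOT_CACHED WAYS           /* age >= WAYS: block may be evicted */

/* Must join: a block survives only if it is cached in both
   predecessor states, and it keeps the larger (older) age bound,
   like a set intersection. */
void must_join(int out[], const int a[], const int b[], int nblocks) {
    for (int i = 0; i < nblocks; i++)
        out[i] = (a[i] < WAYS && b[i] < WAYS)
               ? (a[i] > b[i] ? a[i] : b[i])
               : NOT_CACHED;
}

/* May join: a block survives if it is cached in at least one
   predecessor, and it keeps the smaller (younger) age bound,
   like a set union. */
void may_join(int out[], const int a[], const int b[], int nblocks) {
    for (int i = 0; i < nblocks; i++)
        out[i] = (a[i] < b[i]) ? a[i] : b[i];
}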

Wilhelm [Wil04] argues for a modular approach to timing analysis in which abstract interpretation is used to analyze processor behavior and integer linear programming is used to analyze program path behavior. Abstract interpretation helps to exclude timing accidents that cannot occur and provides the worst-case execution times of basic blocks within their execution contexts. ILP determines an upper bound on execution time and the associated path.

Some aspects of program performance require detailed understanding of the processor state. So long as these effects can be isolated to small segments of the program, simulation is a reasonable and valuable tool for detailed WCET analysis. Engblom and Ermedahl [Eng99a] used simulation to analyze pipeline effects. They break the execution time of a code sequence into the individual execution times plus the timing effects for subsequences. Timing effects capture the interactions between instructions.
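As a sketch of this decomposition in symbols (the notation is illustrative, not necessarily the paper's): for a sequence of instructions $I_1 I_2 \cdots I_n$ with stand-alone execution times $t_1, \ldots, t_n$,

$T(I_1 \cdots I_n) = \sum_{i=1}^{n} t_i + \sum_{s} \delta_s$

where each timing effect $\delta_s$ corrects for the pipeline interaction within a subsequence $s$. The pairwise effects $\delta_{i,i+1}$ are typically negative, reflecting the overlap of adjacent instructions in the pipeline, while longer subsequences can contribute positive effects.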

Engblom and Ermedahl use a cycle-accurate simulator to determine the execution time of a sequence of instructions. They do not use static analysis to determine whether combinations of instructions should have timing effects, because a CPU can exhibit a large number of interinstruction dependencies that could depend on data values and may not be well documented.

Instead, Engblom and Ermedahl use the simulator to discover timing effects by executing successively longer code sequences. The length of a sequence that is tested is bounded by the length of time, typically equal to the length of the pipeline, over which the CPU can exhibit inter-instruction pipeline effects. Healy et al. [Hea99b] use tables of structural and data hazards to analyze pipeline interactions in instruction sequences.

Engblom and Jonsson [Eng02] developed a model for single-issue, in-order pipelines. Given a pipeline with n stages, instruction i takes a certain number of clock cycles in each stage, denoted as

$c_i^1, c_i^2, \ldots, c_i^n$

depending on the resources it needs. A sequence of m instructions in the pipeline obeys two constraints:

$t_i^{j+1} \geq t_i^j + c_i^j$          (EQ-11)

where $t_i^j$ is the time at which instruction i enters pipeline stage j; and

$t_{i+1}^j \geq t_i^{j+1}$          (EQ-12)

which constrains the next instruction from entering a stage until the previous instruction has left it. Branch instructions generate dependencies between the stage b at which the branch is decided and instruction fetching in stage 1:

$t_{i+1}^1 \geq t_i^b + c_i^b$          (EQ-13)

Data dependencies generate additional constraints:

$t_i^j \geq t_k^l + c_k^l$          (EQ-14)

in which the data dependency is between instructions i and k: instruction k must finish stage l before instruction i can proceed into stage j. Longest-path algorithms can be used to solve these systems of constraints.
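Because every constraint lower-bounds a later event in terms of earlier ones, the constraint graph is acyclic, and for the two basic constraints a single forward pass computes the longest-path solution. The following is a minimal sketch under that simplification (the array shapes, and the omission of branch and data-dependency edges, are assumptions):

#include <stdio.h>

#define M 4   /* instructions in the sequence */
#define N 5   /* pipeline stages              */

/* c[i][j]: cycles instruction i needs in stage j.
   t[i][j]: computed time at which instruction i enters stage j. */
static void schedule(const long c[M][N], long t[M][N]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            long e = 0;
            if (j > 0)                        /* EQ-11 */
                e = t[i][j-1] + c[i][j-1];
            if (i > 0) {                      /* EQ-12 */
                long leave = (j + 1 < N) ? t[i-1][j+1]
                                         : t[i-1][j] + c[i-1][j];
                if (leave > e)
                    e = leave;
            }
            t[i][j] = e;
            /* EQ-13 and EQ-14 would add further maxima here */
        }
}

int main(void) {
    const long c[M][N] = {   /* second instruction needs 2 cycles in stage 2 */
        {1,1,1,1,1}, {1,1,2,1,1}, {1,1,1,1,1}, {1,1,1,1,1}
    };
    long t[M][N];
    schedule(c, t);
    printf("sequence finishes at cycle %ld\n", t[M-1][N-1] + c[M-1][N-1]);
    return 0;
}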

Engblom and Jonsson used this formulation to determine which types of pipelines have long timing effects. They defined a crossing critical path and showed that pipelines with this property do not have long timing effects that increase execution time.

They showed that a single in-order pipeline has the crossing critical path property if each constraint introduced by branching and data dependencies either occurs between adjacent instructions or is subsumed by the basic constraints of (EQ-11) and (EQ-12).

Pipelines with forks do not in general exhibit the crossing critical path property, and such pipelines are widely used in high-performance processors. For example, most processors with a separate floating-point pipeline do not have the crossing critical path property.

Part 1: Path analysis

This two-part series is based on material printed from High-Performance Embedded Computing by Wayne Wolf, with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007. For more information about this title and other similar books, visit www.elsevierdirect.com.

References:

Eng02 Jakob Engblom and Bengt Jonsson, “Processor pipelines and their properties for static WCET analysis,” Proceedings of the Second Embedded Software Conference, Lecture Notes in Computer Science, vol. 2491, Springer Verlag, 2002.

Eng99a Jakob Engblom and Andreas Ermedahl, “Pipeline timing analysis using a trace-driven simulator,” in Proceedings, Sixth International Conference on Real-Time Computing Systems and Applications (RTCSA '99), IEEE, 1999, pp. 88-95.

Erm05 Andreas Ermedahl, Friedhelm Stappert, and Jakob Engblom, “Clustered worst-case execution-time calculation,” IEEE Transactions on Computers, 54(9), 2005, pp. 1104-1122.

Hea99b Christopher A. Healy, Robert D. Arnold, Frank Mueller, David B. Whalley, and Marion G. Harmon, “Bounding pipeline and instruction cache performance,” IEEE Transactions on Computers, 48(1), January 1999, pp. 53-70.

The99 Henrik Theiling, Christian Ferdinand, and Reinhard Wilhelm, “Fast and precise WCET prediction by separated cache and path analyses,” Real-Time Systems, 5(1), 1999, pp. 1-22.

Viv01 Emilio Vivancos, Christopher Healy, Frank Mueller, and David Whalley, “Parametric timing analysis,” Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems, ACM Press, 2001, pp. 88-93.

Wil04 Reinhard Wilhelm, “Why AI + ILP is good for WCET, but MC is not, nor ILP alone,” Proceedings, VMCAI 2004, Lecture Notes in Computer Science, vol. 2937, Springer Verlag, 2004, pp. 309-322.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Institute of Technology's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.
