Tutorial: Techniques for measuring execution time and real-time performance - Part 1 - Embedded.com

Tutorial: Techniques for measuring execution time and real-time performance – Part 1

Many embedded systems require hard or soft real-time execution that mustmeetrigid timing constraints. Further complicating the issue is that for avariety of reasons, most of these same embedded systems have verylimited processing power; it is not uncommon for them to be using an8-bit or 16- bit processor operating at 10 MHz or less.

Real-time systems theory advocates the use of an appropriatescheduling algorithm and performing a schedulabilityanalysis prior to building the system. Adherence to thistheoryalone does not lead to working embedded systems, and thus use of thistheory is often dismissed by practitioners.

Practitioners, on the other hand, spend days – if not weeks – oftestingand debugging hard-to-find anddifficult-to- replicate problems because their system is not performingto specifications. Often, these problems are related to the system'stiming, because functional testing was done using good tools, and thesystem usually produces a correct response.

There exists a balance between theory and practice, where properdesign of real-time code enables the real-time analysis of it.Systematic techniques for measuring execution time can then be usedalongside the guidelines provided by real-time systems theory to helpan engineer design, analyze, and if necessary quickly fix timingproblems in real-time embedded systems.

This series of two articles discusses techniques for measuring andoptimizing real-time code, and analyzing performance by correlating themeasurements with the real-time specifications through use of real-timesystems theory. Since this paper is directed towards practitioners,simple rules of thumb that encapsulate the knowledge of complextheories and proofs are presented.

Several other activities of the development process can benefit fromestimating and measuring execution time using the methods describedhere. This includes debugging hard-to-find timing errors that result inhiccups in the system, estimating processing needs of software, anddetermining the hardware needs when enhancing functionality of anexisting system or reusing code in subsequent generations of embeddedsystems.

Overview of Measurement Techniques
Many different methods exist to measure execution time, but there is nosingle best technique. Rather, each technique is a compromise betweenmultiple attributes, such as resolution, accuracy, granularity, anddifficulty. A summary of the key attributes follows:

Resolution isa representation of the limitations of the timing hardware. Forexample, a stop watch measures with a 0.01 sec resolution, while alogic analyzer might be able to measure with a resolution of 50 nsec.

Accuracy is the closeness of the measured value using a given method ofmeasuring, as compared to the actual time if a perfect measurement wasobtained. If a particular measurement is repeated several times, thereis usually some amount of error in the measurements. Thus, measurementscould yield answers of the form x +/- y. In this case, y is theaccuracy of the measurement x.

Granularity is the part of the code that can be measured, and usually specified ina subjective manner. For example, coarse granularity (also calledcoarse-grain) methods would generally measure execution time on aper-process, per-procedure, or per-function basis.

In contrast, a method that has fine granularity (also calledfine-grain) can be used to measure execution time of a loop, small codesegment, or even a single instruction. Important to note is that somefine-grain techniques can also be used to perform coarse-grainmeasurements, although the effort in doing so could be much greaterthan using a coarse-grain method.

Difficulty subjectively defines the effort to obtain measurements. A method thatrequires the user to simply run the code and it produces an instantanswer or a table of results is considered easy. A method that requiresusage of instrumentation such as a logic analyzer and filtering of datato obtain the answers is considered hard.

Typically, software-only methods are easier, but yield onlycoarse-grain results. Hardware-assisted methods are hard, but they canprovide fine-grain results with high accuracy.

Table1: Summary of methods to measure execution time

Table 1, above summarizesthe methods and attributes of each method as presented in this paperNote that in many cases the attributes are approximated or subjective,not exact values; however comparing the attributes of different methodsshould provide sufficient information to help choose the bestmeasurement for a particular need.

The method of choice can also depend on the hardware features andinstrumentation tools available. For example, some methods requirespecial hardware features like a digital output port, while othertechniques require a specific software application or measurementinstrumentation to be available. In some cases, the hardware or toolsneeded can be quite expensive and the cost and lack of availability canprevent using a particular method.

On the other hand, having access to the right tools cansignificantly decrease the amount of effort needed to obtain neededmeasurements, and thus obtaining the tools most suited to a project'sneeds could be a worthwhile investment.

The design of the software can also have a major impact on theability to obtain measurements of execution time, but it is notclassified as an attribute, as there is no way to quantify or qualifyevery possible variation.

In particular, the execution time of software designed in an ad-hocmanner (also known as “spaghetti code“)is very difficult to measure because the starting and stopping pointsof the code are not easy to identify, and if there are multiple andinconsistent entry or exit points to the same piece of code, thenobtaining accurate measurements is near impossible.

On the other hand, software designed so that it is “analyzable”clearly has a single entry and exit point for any part of it that needsto be measured, and those entry and exit points are definedconsistently for all code segments that have similar functionality.

Selecting a Method
To select which measurement method to use, first consider the reasonfor measuring execution time. The most common reasons for measuringexecution time are to refine estimates, optimize code, analyzereal-time performance, and to debug timing errors.

Refining estimates is usually done during the design phase or earlyin the implementation phase. The estimates might be used to selectwhich processor to use, or to obtain ballpark figures on how manyiterations of a particular function can be executed per second.

Coarse-grain measurements can provide some of these answers fairlyquickly. Sometimes, the measurements can even be made on the hostprocessor, with an approximated scale factor for the target processor(such as “embedded processor X is about 18 times slower than hostprocessor Y”.)

Optimizing code could use coarse-grain methods or fine-grainmethods, depending on what is being optimized. If optimization is at aglobal scale, such as deciding whether it would be faster to use arraysor linked lists in a particular application, then a coarse-graintechnique to measure execution time of complete functions is usuallysufficient.

On the other hand, for localized optimizations, such as those thatare specific to a target processor and occur during the late stages ofdevelopment or when trying to fine-tune an application, a fine-graintechnique that can measure execution time of a single line of code isusually needed.

Analyzing real-time performance can use a coarse-grain technique,but often only fine-grain techniques can provide the necessaryaccuracy. The accuracy needs to be at least five to ten times fasterthan the period of the fastest task.

Thus, if the fastest task in the system has a period of 10 msec,then a measurement technique that provides an accuracy of at least 1 to2 msec for functions is needed to provide fairly good answers. Moreaccuracy is better, especially if the Central Processing Unit (CPU) iseither overloaded or operating at almost 100% utilization. In thesecases, a technique with microsecond accuracy is needed.

Debugging timing errors usually needs a fine-grain method withmaximum resolution. It is often necessary to measure not only usercode, but also real-time operating system (RTOS) code, and to detectany anomalies that might be occurring, such as missed deadlines ortasks not executing at the desired rate.

Measurement Methods
Of the measurement techniques that were summarized in Table 1, a few methods are quitestraightforward, but most are only applicable to UNIX-based systems,such as most embedded versions of Linux.

The software analyzer method is an all-encompassing description ofsome features provided by commercial RTOS and tools. The techniquesdescribed towards the end of this tutorial are the ones based onhardware,and can be used independent of the RTOS. These can provide the mostaccurate results, but also involve the most complexity.

Stop-watch. A stop watch is only suitable for non-interactive programs, preferablyrunning on single-tasking systems. It can be used to measure time ofthings like numerical code which may take minutes or hours to execute,and when measurements only need to be approximations (e.g. to nearestsecond).

The method simply involves using the chronograph feature of adigital wrist-watch (or other equivalent timing device). When theprogram starts, start the watch. When the program ends, stop the watch,and read the time.

Date command. Thedate command is useful when using a UNIX-based system or any other RTOSthat has a command that displays the current date and time.

The date command is used like a stopwatch, except it uses thebuilt-in clock of the computer instead of an external stopwatch. Thismethod is more accurate than a stop-watch, but has the same granularityof only being able to accurately measure non-interactive processes.

A typical way to use the command is to wrap the program that isbeing measured in a shell script or alias with the following commands:

date > output
program >> output
date >> output

As with the stop-watch method, this will only provide an estimate ofhow long the full program takes to execute. It does not take intoconsideration preemption, interrupts, or I/O. Most accurate answers areobtained on non-preemptive systems. This method is useful if the outputserves as a log, so that the start and end time of each execution islogged into the file. A sample use is for long simulations that run inthe background overnight, and it can provide information to knowprecisely when it ended.

Time command (UNIX). Thetime command is useful when using a UNIX-based system. Other RTOSesmightprovide a similar command. Execution time measurement is activated byprefixing time to a command line. This command not only measures thetime between beginning and end of the program, but it also computes theexecution time used by the specific program, taking into considerationpreemption, I/O, and other activities that cause the process to give-upthe CPU.

The output depends on which version of the time command is beingused. In some cases, the time command is part of the shell. In othercases, it can be found in /usr/bin/time. In each case the output is thesame information, just the format is different. For example:

% time program
8.400u 0.040s 0:18.40 56.1%

Interpreting the output, the first item (with a u appended, u=CPU ), is theexecution time of program, shown here as 8.4 sec. This is the amount oftime the CPU was actually executing the program. Any time spentpreempted, blocked for I/ O, or performing RTOS functions is excluded.

The second item (with s appended,s=system ), is the execution time used by the RTOS while runningthe program. This includes execution time for items such as devicedrivers, interrupt handlers, or other system calls directly associatedwith the program. The example shows that 0.04 sec of execution time wasfor system functions.

The third item is the total time that the program was executing inthe system, whether it be running or blocked or waiting on the readyqueue. In this case, it was 18.4 sec. This time is the about the sametime that would be reported using the date method above.

The fourth item is the average percentage of CPU time used when thetask was ready or running. The value primarily depends on the load ofthe system, and has little meaning as far as measuring execution time.

Prof and Gprof (UNIX) . The previous methods canonly be used to measure a complete program. Many times, it is necessaryto measure execution time at a finer granularity.

One method to measure execution on a per function basis is to usethe prof or gprof profiling mechanisms available in UNIX. Profilingmeans to obtain a set of timing measurements for all (or a large part)of the code. The granularity of a profile depends on the method. Inthis case, both prof and gprof measure execution time with thegranularity of a function. The resolution is usually that of the systemclock, meaning on the order of 10 msec.

Both prof and gprof do similar things, exceptthat gprof gives much moredetailed results than prof. The measured time properly takes intoaccount preemption, such that if a process is preempted, the clockstops until the process starts to execute again. This profilingmechanism, however, does slow down execution of the program by anon-negligible amount.

So the execution time measured when using prof or gprof will be greater than the realexecution time of the program when it is not being profiled. Despitethis inaccuracy, the method can be useful to identify which functionsin the program are using the most execution time, to identify whereoptimizations might need to be made the most.

To use prof , compile withthe “p option then run program as follows (other compiler options canbe used too, this is just an example).

% gcc “p -o program program.c
% program

When the program terminates, the file mon.out is automaticallycreated. It is a binary file that contains the timing data by functionfor the program. To view the timing data, type the following:

% prof program

A more detailed profile report can be obtained using gprof, by compiling with the “pgoption as follows:

% gcc -pg -o program program.c
% program

Running the program creates the file gmon.out, which can be viewedas follows:

% gprof program

For information that describes the format of the statistics and whateach entry means, look at the online UNIX manuals for prof and gproffor the specific operating system version being used.

Clock(). Although the prof/gprof method provides moredetailed information then the first few methods presented, it is oftennecessary to measure execution time with finer granularity than afunction.

Suppose prof was used andit shows that 90% of the time is spent in one subroutine. Thatsubroutine becomes the primary target for optimization. But if theroutine includes several loops, the next step is then to identify themost time-consuming parts within that subroutine.

A possible approach is to use the clock() function, as provided by many operating systems, including UNIX. Inthis case, however, the program must be instrumented such that theclock is read at the beginning and end of the code segment( s) beingmeasured.

Instrumenting the code means adding lines of code explicitly toperform the timing measurements. Such lines of code are temporary, andare removed once the desired data has been collected.

This method is useful for fine-grain measurements, such as a codesegment or loop, but it is not as convenient as prof/ gprof to obtainmeasurements of multiple functions or processes at once. Here is anexample of a program that uses clock() .

    #include
    clock_tstart,finish;
    double total;
    start = clock();
    do stuff;
    finish = clock();total = (double) (finish – start) / (double) CLK_TCK
    printf(“Total =%fn”,total);

There are several issues that must be taken into account when using clock() . The issues stem from thefact that there is no standard implementation of this function, thus itcan produce different results for different operating systems.

For example, it can provide a value in microseconds, seconds, orclock ticks. The reference manual for the particular operating systemshould be referenced prior to using theclock() function.

Depending on the system, clock() mightbehave differently if the system is preemptive. In some cases, if thetask is preempted, the value returned by clock() will include the time spentby the other task too. In other cases, it will only include time usedby its own process.

The clock() function iscertainly more useful when the implementation properly deals withpreemption. But even if it does not, see descriptions in the followingsections on how to deal with preemption.

It is also important to note the resolution. Even though clock() might report time inmicroseconds, the resolution is usually the same as the system clock,which can be computed as 1/sysconf (3). On many UNIX systems, this is10 msec or longer. Calling the function sysconf() with the argument '3'returns the value of the system clock.

If more resolution than 10 msec is needed, then one of twoapproaches can be used:

1) Create a loop around whatneeds to be measured, that executes 10, 100, or 1000 times or more.Measure execution time to the nearest 10 msec. Then divide that time bythe number of times the loop executed. If the loop executed 1000 timesusing a 10 msec clock, you obtain a resolution of 10 µsec for theloop.

2) Use a hardware-basedmethod.

The advantage of the loop method is that it does not require anyspecial hardware. The disadvantage is that it forces a change in thecode; the change might affect the functionality, and could even causethe program to crash.

At the very least, the code slows down by the number of iterationsperformed just to get a reading, and thus real-time performance islost. If this is not acceptable, then one of the other methods must beused.

Software Analyzer. The termsoftware analyzer is used as an all-encompassing phrase for softwaretools provided by a variety of RTOS and tool vendors designedspecifically for measuring execution time. Examples include TimeTrace [6], and WindView [7].

It is beyond the scope of this tutorial to describe how to use anysuch tools, or to even recommend one tool over the other. Rather, thissection provides a general discussion to aid in understandingcapabilities of these tools.

The first step in using a software analyzer is to determine theresolution and granularity. The resolution should be one of thespecifications of the product. It can also be determined experimentallyby slowly increasing execution time of a code segment, then monitoringthe measured value by the smallest time increment. That typically isthe resolution.

If the software analyzer is based on the system clock, then theresolution will likely be on the order of a millisecond. If theanalyzer is based on some other hardware-based method, such as using anonboard timer/counter chip, then the resolution might be in themicroseconds range.

The granularity is another important item to identify. Some softwareanalyzers will be like prof/gprof, and only be able to provideinformation on a per-function or per-process basis. As with prof/gprof,such analyzers are good if coarse-grain measurements are satisfactory,but not very useful when optimizing localized code segments or trackingdown timing or synchronization errors.

A good software analyzer will not only provide information onper-function or per-process basis, but it will also contain a means formeasuring execution time for smaller segments, such as a loop, block ofcode, or even a single statement. The ability to measure execution timeof interrupt handlers and the RTOS overhead are also a bonus.

Some software analyzers provide a timing trace to show preciselywhat process is executing at what time. Such a timing trace could behelpful to an expert when debugging timing and synchronization errors,but they do not offer data in a convenient format to analyze real-timeperformance.

Also, if the timing trace is not correlated to the source code, thenit is not possible to identify what part of code is responsible forextended periods of execution when such an event is detected in thetiming trace. Instead, a state mode that provides tabular data that canbe analyzed or download is needed.

Another issue to consider when using software analyzers are theresources used. Some analyzers add overhead, and thus slow down code.Most analyzers require lots of memory to log data, making the toolineffective when an embedded system's memory is already fullyallocated. In such cases, the hardware-based methods described belowcan instead be used.

Timer/Counter Chip. Mostembedded computers have timer/counter chips that are user programmable. If such a chip is available,then it can be used to obtain fine-grain measurements of code segments.The method presented here, however, is not very useful for coarse-grainmeasurements, such as total execution time used by a function orprocess.

This method is similar to using the clock() method describedearlier, in that the starting and stopping points of the code beingmeasured are instrumented directly into the code. At the beginning ofthe code segment, the current countdown (or count-up) value of thetimer/counter is read. At the end of the code, the value is read again.The difference between these two values represents how many timer tickshave elapsed.

It is then necessary to determine the value of a timer tick. Thevalue of the timer tick is typically a multiple of the microprocessorclock speed. It could be fixed, or user-programmable.

For example, an 8 MHz microcontroller has a cycle time of 125 nsec.A timer-chip on this microcontroller has the timer tick userprogrammable as 1x, 4x, 16x, or 256x, depending on the bit-patternwritten to one of the timer's control registers. Suppose 16x is chosen.This means the timer-tick is 16 times 125 nsec, or 2 µsec.

This yields a mechanism with a resolution of 2 µsec, andusually an accuracy of twice the resolution, meaning 4 µsec. Withthis accuracy, it is possible to measure execution time of rather smallcode segments.

If an RTOS is being used, there is a possibility that the RTOS hasalready configured the timer/counter chip. In such a case, either use asecond timer/counter chip if one is available, or use the same chip asthe RTOS, but only read it. Do not change the timer configuration inany way, as that can cause the RTOS to crash.

A question arises as to where do the answers go? In the clock()example earlier, a print statement displayed results. But on a systemin which this timer/counter method is used, there is a good possibilitythat a video display is not available.

If a small display is available (even a simple 4-digit 7-segment LCDdisplay), then values can be shown on the display. An alternative is tosend the data out on an output port, and collect it using a chartrecorder or logic analyzer. A third possibility is to store the data inmemory at a known location, then to peek into that memory using adebugging tool or a processor's built-in monitor.

One issue that needs to be considered is overflow. If the timer is16 bits, and its resolution is programmed to be 2 µsec, then itwill reset and start over every 130 msec. As a rule of thumb, themethod should be restricted to measuring code segments that are at most10% of this maximum range, meaning up to about 13 msec for a 16-bittimer with 2 µsec resolution.

In such a case, if the measurement is continuing on a periodicbasis, approximately 1 in 10 readings will be wrong, as it coincideswith the timer overflowing. That reading needs to be spotted anddiscarded. This is quite easy to do as long as the code segment takesabout the same amount of time every time, in which case the datareading that is discarded is the one that does not make sense.

Another issue occurs in a preemptive environment or when interruptsare present. If the code segment being measured can be preempted, thenfalse data readings will be provided every time such a preemptionoccurs within the code segment.

Several possibilities exist. One is to disable interrupts whenever ameasurement begins, then re-enable interrupts when the measurementends. This could affect real-time performance by causing priorityinversion and cause the application to not meet the specifications.

But often it is acceptable to do during the testing phase in orderto get the measurements of various code segments. A second alternativeis to discard readings that are much longer than the average reading,as they represent measurements that include preemption. Anytimereadings are discarded, care must be taken to not accidently keep anincorrect reading and discard a valid reading.

As a general rule, any discarding of data must always be done withgreat care. Only discard a value if there is a reasonable explanation.If there is concern that a good value might accidentally be discarded,and such a mistake cannot be tolerated, then use a different methodthat is not subject to the overflow of the timer chip, or more suitedto account for preemption.

Logic Analyzers
A logic analyzer is one of the best tools for accurately measuringexecution time with microsecond resolution, especially when accuratetiming is essential. The drawback is that the it requires specializedhardware and more effort than some of the previous techniques describedabove.

There are two approaches to using a logic analyzer. One approach isto hook up the probes to the CPU pins. Connecting the logic analyzer toa CPU emulator or using a bus analyzer has the same effect. While thismethod is least obtrusive on the real-time code, it is also the mostdifficult, as it requires reverse engineering the code to correlatelogic analyzer measurements with the source code.

Some logic analyzers provide processor disassembly support, but thatonly provides correlation to the assembly code, and not necessarily tothe source code. This approach is not advocated, as it is verydifficult and does not yield answers that are any better than the otherapproach described next.

However, a variation of this approach is to monitor only a singlememory location, in which case this becomes the same as the otherapproach.

The other approach is to send strategic signals to an output port,which are read by the logic analyzer as events. The code isinstrumented to send signals at the start and end of each code segment.

The instrumentation is encapsulated within a macro, so thatredefining the macro to an empty statement disables the instrumentationwithout the need to change any part of the application code. Thisapproach is compatible both with large applications that use commercialRTOS and smaller systems base

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.