Tutorial: Techniques for measuring execution time and real time performance – Part 2

Knowing the execution time of various parts of the code is helpful foroptimizing code, but it is not sufficient to analyze real-timeperformance. Rather, measuring execution time is a necessary steptoward fully understanding the timing of an embedded system.

In this article time is used as the basis for strategicallymeasuring different parts of the code, to identify whether or not thereare any timing problems in the system.

Utilization of Tasks
One of the more important values to quantify is the worstcaseutilization of each task. Such utilization is used as the basis formost real-time scheduling analysis. While this section does containsome mathematical analysis a simplified description is provided so thatusing the math is straightforward. The worst-case utilization of a task(U i ) is computed as the ratio betweenthe task's worst-case execution time (C i ) and its period (T i ).That is,

U i = C i /T i .

For periodic tasks, T i is specified by the designer.Described here is a technique for validating that the system is indeedexecuting a task at the proper period. If the system is specified asfrequency or rate (e.g. “execute task at 40 Hz”) then T i =1/frequency (40 Hz means T i =25 msec). For aperiodic tasks whichare tasks that execute in response to randomly occurring events orsignals, T i is the minimum interarrival time,which means the smallest amount of time between two successiveoccurrences of the event.

This is a feature of the application that needs to be specified ifany type of real-time guarantees are to be provided. C i is a value that can be measured using one of the methods describedpreviously. For purposes of discussion the remainder of this sectionassumes the logic analyzer method described in Part 1 is used.

To measure execution time of a task the task must have an analyzabledesign. This means that it has a definitive starting and stopping pointeach cycle. Preferably there is also minimal blocking within the task.Any need to wait for input or delay providing output should be done atthe beginning or end of the cycle. An example of a simple yet goodmodel of a task is shown in Figure 1below .

Figure1: Basic structure of periodic tasks and aperiodic servers that ourmethod assumes for purposes of automated profiling and analysis.

The instrumentation points shown in the figure indicate the bestplace for placing the MEZ_START and MEZ_STOP macros. Everytime the task is executed data points will be logged on the logicanalyzer. Of course, doing this for every task and looking at theresults of the logic analyzercan be very tedious. Thus it is recommended to automate as much of theprocess as possible.

Instead of instrumenting specific code segments the MEZ_START and MEZ_STOP macros are used toinstrument a framework, rather than any of the application code.

Figure 1 illustrates themodel of a real-time task thatenables instrumenting a framework. The task is defined by a collectionof functions, including the cycle() and sync() functions [6]. Thecorresponding framework code in C is also shown. MEZ_START is placed at the beginningof each iteration of the task before a task's input is read and itscycle routine is called.

MEZ_STOP is placed after thecycle routine is called and after outputs have been written. The task'sID is passed as the id parameter to the macros.

The code is then executed. Assuming a preemptive environment withthreetasks, Figure 2 below shows asample event log as collected by the logic analyzer. On its own, theinformation is quite cryptic. In this section the discussion focussesfirst on interpreting this data, then on how the data can be used asinput to more complex and accurate analysis.

Figure2: Sample event log collected by a logic analyzer. The real-timebehavior (Note that unused fields from the logic analyzer output arenot included)

The markings on the right-hand side show how to interpret the data.A MEZ_START operation isindicated by a code 5x, where x is the id of the task and the MEZ_STOP operation is indicated by6x. If a preemption occurs the start and stop markers will be nested asshown starting at line 8 where Task C gets preempted.

For tasks that are not preempted it is straightforward to read theexecution time using the relative time. For tasks that are preempted,for example task C starting a line 8, the execution time is thedifference between the absolute stop marker and the start marker minusthe execution time used by other tasks during that time period. In theexample the task C stop marker is at line 17. The execution time oftask C is thus computed, as(55.081″28.677)”(3.0669+5.0944+3.3741+3.0464) = 11.8222 msec.

It is important to note that the measured execution times includemost RTOS overhead but notall of it. As a result, the measured times could contain someinaccuracy. In fact the lower the priority of the task, the lessaccuracy it has because the overhead will affect the execution time ofthe lowest priority tasks more. Further explanation is given later.

If most or all the tasks are periodic and the system is executed,for several seconds or minutes, it is possible to capture manyinstances of execution time for each task. This enables obtaining moremeasurements of each task just in case the execution time of any taskis not constant.

The number of measurements, however, quickly becomes too cumbersometo view manually. The following equation provides an estimate of howmany lines will be in the logged by the logic analyzer assumingsufficient memory in the analyzer:


where E is the number ofevents to log per second N i is the number of interrupt handlersand N t is the number of tasks. A b is the average number of times that each task calls a function thatprovides event log information. If only logging beginning and end ofthe task then A b =2 .Other events can be logged such as beginning and end of interprocesscommunication, in which case A b is increased accordingly.

Since A b isan approximation, E is at bestan approximation. For example, an application with a 1 msec systemclock interrupt handler four tasks with periods of 5 msec, 10 msec, 30msec, and 100 msec, and on average two calls to functions that generateevent logs would require approximately 4K of trace buffer on the logicanalyzer for each second of profiling that is desired.

With a 1 MByte trace buffer 256 sec (or about 4.5 minutes) worth ofprofiling can be performed. With more tasks or higher frequency tasksthe length of the profile would decrease accordingly.

Extracting data from an event log to produce accurate results is notmeant to be done manually. Rather the event log is uploaded from thelogic analyzer to a host workstation and a program written to extractthe data automatically then compute the execution time of each task.

Since the program to convert from the logic-analyzer system to alist of execution times for each task is specific to several factorsincluding the data format produced by the logic analyzer the format ofthe application code and the definitions of the instrumentation macrosit is not possible to provide a general program to perform thefunctions. Writing the program to automatically extract data is also agood exercise for truly understanding the behavior of a real-timesystem.

After automatically processing an entire buffer output similar tothat shown in Figure 3 below isdesirable.

Figure3: Desired measured execution time information for performing aschedulability analysis.

The output shows the results of extracting data for eight tasks. TheC ref column is the estimated C i that was used in analysis. Cmax is C i as measured with the logicanalyzer. C avg is the average case executiontime. C avg is useful when considering softreal-time systems when it is okay to occasionally miss deadlines. Forsuch a system to not be overloaded, the average case must beschedulable even if in the worst case the CPU is temporarilyoverloaded.

It is important to note that C max is not guaranteed to be theabsolute worst-case execution that a task will ever encounter. Rather,it is simply the worst-case execution time for the time periodcollected on the logic analyzer which in this case is 10.18 sec ofdata. Thus, when measuring code it is important to try to execute thecode in a state that produces the worst case execution time.

At the same time it is important to keep in mind that this method isnot foolproof and errors are possible. Real guarantees of worst-casetimings would only occur by using formal verification techniques butthose types of techniques are not yet available for practice asresearch has not yet yielded good solutions. Alternately a timing errordetection mechanism as described in [5] could be used as a basis foridentifying over long periods of time (e.g. days or weeks) whether ornot C ref is ever exceeded.

The column R-Deadline is a count of how many real-time deadlineswere missed. In this example, tasks 3 and tasks 5 through 8 all misseddeadlines. This implies the CPU was overloaded.

The Effect of Operating SystemOverhead
Extracting measured execution time from a logic analyzer log as shownin Figure 2 is fairlystraightforward. However, in preemptive environments the effect of RTOSoverhead for context switching and scheduling must be considered ifvery accurate measurements are needed. This section provides a detailedanalysis of the overhead and its effect on measured execution times.

When reading measured execution time from the log in Figure 2 it is easy to assume thatthe overhead is included in the measurements and evenly distributedacross each task. The problem is it is not possible to evenlydistribute the overhead.

Specifically, the execution time of task i, C i ,is calculated as t end “t start “t preempt where tend is the time the MEZ_STOP macro executed, tstart is the time the MEZ_START macro executed and t preempt is computed as the amount oftime that another task with higher priority executed (i.e. for theexample in the previous section this would be the value (51.975″29.114).

These measurements, however, already include the RTOS, overhead, butnot in a consistent manner. Whenever a high, priority task preempts alow-priority task, the preemption, overhead ( thr ) is added to the execution time ofthe lower, priority task, assuming that the code was instrumented tooutput, to the logic analyzer immediately when it begins executing, atask's period, and again immediately when the task's, period ends.

This is contrary to what is desired: the overhead should beassociated with the higher priority tasks. If the overhead isassociated with the higher priority task then it can be modelled as 2 thr for each task. [2]

This is best demonstrated by example. Consider the schedule shown inFigure 4 below . The goal is toaccurately measure C 1 , C 2 and C 3 ,knowing that there is overhead during each context switch. The eventlog includes the start and stop timestamps at the instants indicated bythe triangles.

Figure4: Schedule showing the effect of overhead on measured execution times.

From this diagram the execution times should be computed asfollowing (where C x,y means task x, cycle y):

However, these are not the results that would be extracted from theevent log. Rather the results would include the overhead, thr ,and result in the following measurements:

Note the inconsistency of the overhead becoming part of themeasurements. If thr is zero or very small then it isnegligible, and the correct values are read. However, consider the caseof a typical RTOS on a 16-bit processor where thr is about 50 µsec. Already the measurement for C 3,3 is wrong by 200 µsec.

Suppose the system consisted of 7 or 8 tasks. The overhead caneasily contribute to errors on the order of milliseconds: suppose therewas a task with low priority that was preempted 40 times. That wouldmean its measured execution time is C k +80 thr which is much higherthan its true execution time.

On the other hand, suppose that the start task event was logged before the overhead and the end task event was logged after the overhead. Thenthe following measurements are obtained:

The overhead for each task is now constant such that themeasurements conform to the overhead model presented in [2].Furthermore the overhead of the idle task, C idle ,is now properly reflected and not inflated by the overhead. The timemeasured for the idle task represents the amount of available executiontime. Since overhead is CPU time used by the RTOS, it is not availableCPU time.

Unfortunately, profiling data with these event log points is notpractical to implement. Specifically, to achieve this type of event logoutput which task is being swapped in needs to be known before it iseven selected as part of the overhead is selecting the task.

On the ending side, when a period ends it is feasible to postponeoutputting the event log until after the overhead. This adds a overheadon each event log as a stack needs to be maintained to keep track ofthe ID of the previously running task to output to the logic analyzer.

The alternative is to recognize the overhead exists and adjust themeasured execution times are extracted from the data log. Assuming thatthe overhead is constant when extracting the data adjust the executiontime of each task by keeping track of the depth of preemption. Theextractor adjusts the execution time such that 2 thr isspread uniformly across every cycle for every task rather thannon-uniformly.

This translates to adding 2 thr to the measured execution timefor any task that is never preempted and subtracting 2(n p ” 1) thr for any task that is preempted where np is the number of preemptions that occur between the start andstop event tags. With this adjustment measured data for each task willbe more accurate.

There still remains an approximation, as there is no guarantee, thatthe RTOS overhead is constant. For example, many real-time schedulersmaintain linked lists of ready tasks sorted by priority. The more tasksin the queue the longer it takes to insert a new task into the readyqueue and thus the more overhead. Nevertheless assuming a constantoverhead thr produces results that are quiteaccurate ascompared to not taking into account the effect of overhead.

Measuring Operating System Overhead
To perform the manual adjustments to measured execution to account foroverhead as described in the previous section it is necessary to knowthe overhead. This section presents a technique for measuring thr and intr .

thr is the overhead of context switching from one taskto another (including scheduling) and intr is the overhead incurred due tobeing interrupted by an interrupt handler.

An easy way to measure overhead is to build an application of a fewperiodic tasks that do nothing more than toggle bits on an 8-bitdigital output port. This is the same port used for collecting datausing the logic analyzer. In this case, however, set the logic analyzerto collect all data and to display timing diagrams instead of statevalues.

Suppose there are two tasks each at a different priority. Theslowest task (in this case connected to channel 1 of the logicanalyzer) would get preempted by the higher task (connected to channel2) yielding a timing diagram as shown in Figure 5 below .

Figure5: Measuring the context switch overhead.

The RTOS overhead can be approximated as


The reason t 3 is also subtracted from t 1 “t 2 is that during the longpulse of t1 some code of t 1 was executed: thecode to pulse on channel 1 is during the low cycle that is when channel1 is 0. In this case t 3 should be measured as the length ofa 0 pulse instead of the length of a 1 pulse.

The context switch overhead includes time for the RTOS to performscheduling. Quite often the time for scheduling is a function of thenumber of tasks on the ready queue. To observe overhead with multipletasks create eight tasks each toggling a different bit on the digitaloutput port then show the timing of all eight tasks on the logicanalyzer.

Anytime one task preempts another the same calculation as shown in(2) can be performed. A better approximation of overhead with largernumber of tasks can be obtained. As with measuring execution time thereis no guarantee that exact answers will be obtained. But in general aclose approximation is sufficient to either identify timing problemsand to use in the real-time schedulability tests.

Measuring the overhead of an interrupt handler can be done in thesame way. The interrupt handler, however, only needs to toggle a bitonce to show that it has executed. The start of the pulse is then thestart time of the interrupt handler and the end of the pulse is the endtime. If interrupt latency needs to be measured then the actualhardware IRQ line should be connected to one of the logic analyzerchannels. The difference in time between when the IRQ line becomesactive and the pulse of the interrupt handler is seen is the interruptlatency.

Detecting Timing Errors
An important piece of information that can be extracted from an eventlog is the presence of timing errors. The most common timing error is amissed deadline. Each missed deadline corresponds to a problem withmeeting timing constraints in the system and thus identifying themenables the designer to rapidly pinpoint problems.

MissedDeadlines. The event log captures the start and end times ofeach cycle of a periodic task. However, the event log has no knowledgeof the deadlines for each task, hence it cannot identify misseddeadlines. The best solution in this case is to have the RTOS detectthe timing errors. My article published in January 1997 inCommunications of the ACM [5] provides details for efficiently buildingsuch a mechanism into an RTOS. With such a mechanism it is easy to logan event on the logic analyzer whenever a missed deadline is detected.

Unfortunately, most commercial RTOS do not provide missed deadlinedetection. In this case, it is possible to deduce the deadlines formost (and sometimes all) the tasks from the event log, assuming thatthe deadline is always the start of the next period.

To find all the deadlines for Taskk ,search the event log for an instance where the start of a cycle for Task k preempts a lower priority task.This indicates a real start of a period, rather than a delayed startdue to a high priority task using the CPU. Using the time of this eventas a baseline, continually add (or subtract) the task's period to thistime, to obtain every deadline for the task. This method fails only ifa task never preempts a lower priority task, which only occurs if theCPU is overloaded during its steady state.

When a task misses its deadline, it continues to use CPU time thatbelongs to itself during the next cycle, or to another task. A properlybuilt real-time system has a backup plan, so that when a misseddeadline is detected, some kind of action is taken to ensure that itdoes not begin a sequence of every deadline being detected.

One advocated method is to complete execution for the current cycle,but intentionally skip the next cycle. This prevents the task fromtrying to catch up thus overloading the CPU even more than it alreadyis overloaded. If this approach is taken in the real-time system, thenparsing the event log must take into account that following a misseddeadline, the next deadline is skipped. A more in depth discussion ondealing with missed deadlines in the real-time system is given in [5].

Measured Period. Depending on the RTOS and hardware, it is possible that code is notexecuting at the rate or period as specified by the user, due tolimitations in the software or hardware. As an example of an errorcaused by limitations of the RTOS, if a task is supposed to executeevery 30 msec, but it is detected to be executing every 40 msec orevery 20 msec, then obviously there is a problem in the system.

This type of error is very difficult to observe through any methodother than a timing analysis. If the task is running to fast, it mightuse critical CPU resources, and cause other tasks to fail to meettiming constraints. If the period is slower than expected, theperformance of the system might be degrade; in a control system thiswould show up as reduced accuracy of the application. In a signalprocessing application, this can show up as missed packets or droppedframes.

There could be many reasons for the period changing. For example,suppose a task needs to execute 400 times per second, which is a periodof 2.5 msec. The RTOS uses a system clock set to 1 msec. The RTOS mightround up the request to 3.0 msec, or truncate to 2.0 msec, without anyindication to the system designer.

As an example of hardware limitations creating timing errors.Sometimes it is not possible to provide the exact timing base requestedby the designer. For example, the definition of  “one second”might be approximated. Since most timers are based on the CPU's clockfrequency, and prescalar factors are generally a multiple of 2 n ,it is not always possible to setup a system clock to exactly thedesired rate.

For example, rather than 1 msec, an RTOS might be forced to programthe timer with a value that gives the closest value to 1 msec, but notexactly 1 msec. Suppose it is 998 µsec. This means that when thedesigner believes one second has passed, it is really 998 msec.Although this may not seem like much, it amounts to 172 sec (almost 3minutes) per day! Suppose the software relies on the system clock toshow the time-of-day, customers will get upset because their clock isgaining 3 minutes per day.

Using the logic analyzer to timestamp events will expose suchproblems, and allow the designers to adjust their design as necessaryif accurate real-time is needed. If the RTOS's timer is used, then itis possible that the timestamps in the event log are also skewed bythis amount.

The event log does not directly store information about a task'speriod. Rather, the period of a task needs to be deduced, in a similarmanner as deducing deadline times. Two real starting points for thetask must be found in the event log, using the same procedure asfinding a real starting point for detecting missed deadlines.

The difference between these two points, divided by the number ofiterations of the task between tho

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.