# Tutorial: Techniques for measuring execution time and real time performance - Part 2

November 14, 2006

Knowing the execution time of various parts of the code is helpful for optimizing code, but it is not sufficient to analyze real-time performance. Rather, measuring execution time is a necessary step toward fully understanding the timing of an embedded system.

In this article time is used as the basis for strategically measuring different parts of the code, to identify whether or not there are any timing problems in the system.

One of the more important values to quantify is the worstcase utilization of each task. Such utilization is used as the basis for most real-time scheduling analysis. While this section does contain some mathematical analysis a simplified description is provided so that using the math is straightforward. The worst-case utilization of a task (Ui) is computed as the ratio between the task's worst-case execution time (Ci) and its period (Ti). That is,

Ui = Ci/Ti.

For periodic tasks, Ti is specified by the designer. Described here is a technique for validating that the system is indeed executing a task at the proper period. If the system is specified as frequency or rate (e.g. "execute task at 40 Hz") then Ti= 1/frequency (40 Hz means Ti=25 msec). For aperiodic tasks which are tasks that execute in response to randomly occurring events or signals, Ti is the minimum interarrival time, which means the smallest amount of time between two successive occurrences of the event.

This is a feature of the application that needs to be specified if any type of real-time guarantees are to be provided. Ci is a value that can be measured using one of the methods described previously. For purposes of discussion the remainder of this section assumes the logic analyzer method described in Part 1 is used.

To measure execution time of a task the task must have an analyzable design. This means that it has a definitive starting and stopping point each cycle. Preferably there is also minimal blocking within the task. Any need to wait for input or delay providing output should be done at the beginning or end of the cycle. An example of a simple yet good model of a task is shown in Figure 1 below.

 Figure 1: Basic structure of periodic tasks and aperiodic servers that our method assumes for purposes of automated profiling and analysis.

The instrumentation points shown in the figure indicate the best place for placing the MEZ_START and MEZ_STOP macros. Every time the task is executed data points will be logged on the logic analyzer. Of course, doing this for every task and looking at the results of the logic analyzer can be very tedious. Thus it is recommended to automate as much of the process as possible.

Instead of instrumenting specific code segments the MEZ_START and MEZ_STOP macros are used to instrument a framework, rather than any of the application code.

Figure 1 illustrates the model of a real-time task that enables instrumenting a framework. The task is defined by a collection of functions, including the cycle() and sync() functions [6]. The corresponding framework code in C is also shown. MEZ_START is placed at the beginning of each iteration of the task before a task's input is read and its cycle routine is called.

MEZ_STOP is placed after the cycle routine is called and after outputs have been written. The task's ID is passed as the id parameter to the macros.

The code is then executed. Assuming a preemptive environment with three tasks, Figure 2 below shows a sample event log as collected by the logic analyzer. On its own, the information is quite cryptic. In this section the discussion focusses first on interpreting this data, then on how the data can be used as input to more complex and accurate analysis.

 Figure 2: Sample event log collected by a logic analyzer. The real-time behavior (Note that unused fields from the logic analyzer output are not included)

The markings on the right-hand side show how to interpret the data. A MEZ_START operation is indicated by a code 5x, where x is the id of the task and the MEZ_STOP operation is indicated by 6x. If a preemption occurs the start and stop markers will be nested as shown starting at line 8 where Task C gets preempted.

For tasks that are not preempted it is straightforward to read the execution time using the relative time. For tasks that are preempted, for example task C starting a line 8, the execution time is the difference between the absolute stop marker and the start marker minus the execution time used by other tasks during that time period. In the example the task C stop marker is at line 17. The execution time of task C is thus computed, as (55.081"28.677)"(3.0669+5.0944+3.3741+3.0464) = 11.8222 msec.

It is important to note that the measured execution times include most RTOS overhead but not all of it. As a result, the measured times could contain some inaccuracy. In fact the lower the priority of the task, the less accuracy it has because the overhead will affect the execution time of the lowest priority tasks more. Further explanation is given later.

If most or all the tasks are periodic and the system is executed, for several seconds or minutes, it is possible to capture many instances of execution time for each task. This enables obtaining more measurements of each task just in case the execution time of any task is not constant.

The number of measurements, however, quickly becomes too cumbersome to view manually. The following equation provides an estimate of how many lines will be in the logged by the logic analyzer assuming sufficient memory in the analyzer:

(1)

where E is the number of events to log per second Ni is the number of interrupt handlers and Nt is the number of tasks. Ab is the average number of times that each task calls a function that provides event log information. If only logging beginning and end of the task then Ab=2. Other events can be logged such as beginning and end of interprocess communication, in which case Ab is increased accordingly.

Since Ab is an approximation, E is at best an approximation. For example, an application with a 1 msec system clock interrupt handler four tasks with periods of 5 msec, 10 msec, 30 msec, and 100 msec, and on average two calls to functions that generate event logs would require approximately 4K of trace buffer on the logic analyzer for each second of profiling that is desired.

With a 1 MByte trace buffer 256 sec (or about 4.5 minutes) worth of profiling can be performed. With more tasks or higher frequency tasks the length of the profile would decrease accordingly.

Extracting data from an event log to produce accurate results is not meant to be done manually. Rather the event log is uploaded from the logic analyzer to a host workstation and a program written to extract the data automatically then compute the execution time of each task.

Since the program to convert from the logic-analyzer system to a list of execution times for each task is specific to several factors including the data format produced by the logic analyzer the format of the application code and the definitions of the instrumentation macros it is not possible to provide a general program to perform the functions. Writing the program to automatically extract data is also a good exercise for truly understanding the behavior of a real-time system.

After automatically processing an entire buffer output similar to that shown in Figure 3 below is desirable.

 Figure 3: Desired measured execution time information for performing a schedulability analysis.

The output shows the results of extracting data for eight tasks. The Cref column is the estimated Ci that was used in analysis. Cmax is Ci as measured with the logic analyzer. Cavg is the average case execution time. Cavg is useful when considering soft real-time systems when it is okay to occasionally miss deadlines. For such a system to not be overloaded, the average case must be schedulable even if in the worst case the CPU is temporarily overloaded.

It is important to note that Cmax is not guaranteed to be the absolute worst-case execution that a task will ever encounter. Rather, it is simply the worst-case execution time for the time period collected on the logic analyzer which in this case is 10.18 sec of data. Thus, when measuring code it is important to try to execute the code in a state that produces the worst case execution time.

At the same time it is important to keep in mind that this method is not foolproof and errors are possible. Real guarantees of worst-case timings would only occur by using formal verification techniques but those types of techniques are not yet available for practice as research has not yet yielded good solutions. Alternately a timing error detection mechanism as described in [5] could be used as a basis for identifying over long periods of time (e.g. days or weeks) whether or not Cref is ever exceeded.

The Effect of Operating System Overhead
Extracting measured execution time from a logic analyzer log as shown in Figure 2 is fairly straightforward. However, in preemptive environments the effect of RTOS overhead for context switching and scheduling must be considered if very accurate measurements are needed. This section provides a detailed analysis of the overhead and its effect on measured execution times.

When reading measured execution time from the log in Figure 2 it is easy to assume that the overhead is included in the measurements and evenly distributed across each task. The problem is it is not possible to evenly distribute the overhead.

Specifically, the execution time of task i, Ci, is calculated as tend"tstart"tpreempt where tend is the time the MEZ_STOP macro executed, tstart is the time the MEZ_START macro executed and tpreempt is computed as the amount of time that another task with higher priority executed (i.e. for the example in the previous section this would be the value (51.975" 29.114).

These measurements, however, already include the RTOS, overhead, but not in a consistent manner. Whenever a high, priority task preempts a low-priority task, the preemption, overhead (thr) is added to the execution time of the lower, priority task, assuming that the code was instrumented to output, to the logic analyzer immediately when it begins executing, a task's period, and again immediately when the task's, period ends.

This is contrary to what is desired: the overhead should be associated with the higher priority tasks. If the overhead is associated with the higher priority task then it can be modelled as 2thr for each task. [2]

This is best demonstrated by example. Consider the schedule shown in Figure 4 below. The goal is to accurately measure C1, C2 and C3, knowing that there is overhead during each context switch. The event log includes the start and stop timestamps at the instants indicated by the triangles.

 Figure 4: Schedule showing the effect of overhead on measured execution times.

From this diagram the execution times should be computed as following (where Cx,y means task x, cycle y):

However, these are not the results that would be extracted from the event log. Rather the results would include the overhead, thr, and result in the following measurements:

Note the inconsistency of the overhead becoming part of the measurements. If thr is zero or very small then it is negligible, and the correct values are read. However, consider the case of a typical RTOS on a 16-bit processor where thr is about 50 µsec. Already the measurement for C3,3 is wrong by 200 µsec.

Suppose the system consisted of 7 or 8 tasks. The overhead can easily contribute to errors on the order of milliseconds: suppose there was a task with low priority that was preempted 40 times. That would mean its measured execution time is Ck+80 thr which is much higher than its true execution time.

On the other hand, suppose that the start task event was logged before the overhead and the end task event was logged after the overhead. Then the following measurements are obtained:

The overhead for each task is now constant such that the measurements conform to the overhead model presented in [2]. Furthermore the overhead of the idle task, Cidle, is now properly reflected and not inflated by the overhead. The time measured for the idle task represents the amount of available execution time. Since overhead is CPU time used by the RTOS, it is not available CPU time.

Unfortunately, profiling data with these event log points is not practical to implement. Specifically, to achieve this type of event log output which task is being swapped in needs to be known before it is even selected as part of the overhead is selecting the task.

On the ending side, when a period ends it is feasible to postpone outputting the event log until after the overhead. This adds a overhead on each event log as a stack needs to be maintained to keep track of the ID of the previously running task to output to the logic analyzer.

The alternative is to recognize the overhead exists and adjust the measured execution times are extracted from the data log. Assuming that the overhead is constant when extracting the data adjust the execution time of each task by keeping track of the depth of preemption. The extractor adjusts the execution time such that 2 thr is spread uniformly across every cycle for every task rather than non-uniformly.

This translates to adding 2 thr to the measured execution time for any task that is never preempted and subtracting 2(np" 1) thr for any task that is preempted where np is the number of preemptions that occur between the start and stop event tags. With this adjustment measured data for each task will be more accurate.

There still remains an approximation, as there is no guarantee, that the RTOS overhead is constant. For example, many real-time schedulers maintain linked lists of ready tasks sorted by priority. The more tasks in the queue the longer it takes to insert a new task into the ready queue and thus the more overhead. Nevertheless assuming a constant overhead thr produces results that are quite accurate as compared to not taking into account the effect of overhead.

To perform the manual adjustments to measured execution to account for overhead as described in the previous section it is necessary to know the overhead. This section presents a technique for measuring thr and intr.

thr is the overhead of context switching from one task to another (including scheduling) and intr is the overhead incurred due to being interrupted by an interrupt handler.

An easy way to measure overhead is to build an application of a few periodic tasks that do nothing more than toggle bits on an 8-bit digital output port. This is the same port used for collecting data using the logic analyzer. In this case, however, set the logic analyzer to collect all data and to display timing diagrams instead of state values.

Suppose there are two tasks each at a different priority. The slowest task (in this case connected to channel 1 of the logic analyzer) would get preempted by the higher task (connected to channel 2) yielding a timing diagram as shown in Figure 5 below.

 Figure 5: Measuring the context switch overhead.

The RTOS overhead can be approximated as

(2)

The reason t3 is also subtracted from t1"t2 is that during the long pulse of t1 some code of t1 was executed: the code to pulse on channel 1 is during the low cycle that is when channel 1 is 0. In this case t3 should be measured as the length of a 0 pulse instead of the length of a 1 pulse.

The context switch overhead includes time for the RTOS to perform scheduling. Quite often the time for scheduling is a function of the number of tasks on the ready queue. To observe overhead with multiple tasks create eight tasks each toggling a different bit on the digital output port then show the timing of all eight tasks on the logic analyzer.

Anytime one task preempts another the same calculation as shown in (2) can be performed. A better approximation of overhead with larger number of tasks can be obtained. As with measuring execution time there is no guarantee that exact answers will be obtained. But in general a close approximation is sufficient to either identify timing problems and to use in the real-time schedulability tests.

Measuring the overhead of an interrupt handler can be done in the same way. The interrupt handler, however, only needs to toggle a bit once to show that it has executed. The start of the pulse is then the start time of the interrupt handler and the end of the pulse is the end time. If interrupt latency needs to be measured then the actual hardware IRQ line should be connected to one of the logic analyzer channels. The difference in time between when the IRQ line becomes active and the pulse of the interrupt handler is seen is the interrupt latency.

Detecting Timing Errors
An important piece of information that can be extracted from an event log is the presence of timing errors. The most common timing error is a missed deadline. Each missed deadline corresponds to a problem with meeting timing constraints in the system and thus identifying them enables the designer to rapidly pinpoint problems.

Missed Deadlines. The event log captures the start and end times of each cycle of a periodic task. However, the event log has no knowledge of the deadlines for each task, hence it cannot identify missed deadlines. The best solution in this case is to have the RTOS detect the timing errors. My article published in January 1997 in Communications of the ACM [5] provides details for efficiently building such a mechanism into an RTOS. With such a mechanism it is easy to log an event on the logic analyzer whenever a missed deadline is detected.

Unfortunately, most commercial RTOS do not provide missed deadline detection. In this case, it is possible to deduce the deadlines for most (and sometimes all) the tasks from the event log, assuming that the deadline is always the start of the next period.

When a task misses its deadline, it continues to use CPU time that belongs to itself during the next cycle, or to another task. A properly built real-time system has a backup plan, so that when a missed deadline is detected, some kind of action is taken to ensure that it does not begin a sequence of every deadline being detected.

One advocated method is to complete execution for the current cycle, but intentionally skip the next cycle. This prevents the task from trying to catch up thus overloading the CPU even more than it already is overloaded. If this approach is taken in the real-time system, then parsing the event log must take into account that following a missed deadline, the next deadline is skipped. A more in depth discussion on dealing with missed deadlines in the real-time system is given in [5].

Measured Period. Depending on the RTOS and hardware, it is possible that code is not executing at the rate or period as specified by the user, due to limitations in the software or hardware. As an example of an error caused by limitations of the RTOS, if a task is supposed to execute every 30 msec, but it is detected to be executing every 40 msec or every 20 msec, then obviously there is a problem in the system.

This type of error is very difficult to observe through any method other than a timing analysis. If the task is running to fast, it might use critical CPU resources, and cause other tasks to fail to meet timing constraints. If the period is slower than expected, the performance of the system might be degrade; in a control system this would show up as reduced accuracy of the application. In a signal processing application, this can show up as missed packets or dropped frames.

There could be many reasons for the period changing. For example, suppose a task needs to execute 400 times per second, which is a period of 2.5 msec. The RTOS uses a system clock set to 1 msec. The RTOS might round up the request to 3.0 msec, or truncate to 2.0 msec, without any indication to the system designer.

As an example of hardware limitations creating timing errors. Sometimes it is not possible to provide the exact timing base requested by the designer. For example, the definition of  "one second" might be approximated. Since most timers are based on the CPU's clock frequency, and prescalar factors are generally a multiple of 2n, it is not always possible to setup a system clock to exactly the desired rate.

For example, rather than 1 msec, an RTOS might be forced to program the timer with a value that gives the closest value to 1 msec, but not exactly 1 msec. Suppose it is 998 µsec. This means that when the designer believes one second has passed, it is really 998 msec. Although this may not seem like much, it amounts to 172 sec (almost 3 minutes) per day! Suppose the software relies on the system clock to show the time-of-day, customers will get upset because their clock is gaining 3 minutes per day.

Using the logic analyzer to timestamp events will expose such problems, and allow the designers to adjust their design as necessary if accurate real-time is needed. If the RTOS's timer is used, then it is possible that the timestamps in the event log are also skewed by this amount.

The event log does not directly store information about a task's period. Rather, the period of a task needs to be deduced, in a similar manner as deducing deadline times. Two real starting points for the task must be found in the event log, using the same procedure as finding a real starting point for detecting missed deadlines.

The difference between these two points, divided by the number of iterations of the task between those two points, will yield an accurate result for the period. From this information, every starting point of the task can then be identified, and the amount of jitter of the starting times can then also be computed.

The goal of a program that automatically extracts timing information from the logic analyzer is to provide an output as shown in Figure 3. This output has all the information needed to perform accurate real-time scheduling analysis, as described next.

Real-Time Scheduling Analysis
Real-time scheduling analysis involves a mathematical verification of whether or not there are potential timing problems in the system. The equations appear very complex, but they are really quite simple to implement when all the data is available. In this section, a simplified view of real-time scheduling analysis when using a fixed priority scheduler is presented.

First, lets present the seemingly complex equation to analyze this, assuming a fixed priority, highest-priority first scheduling algorithm. The theory states that a specific task set is schedulable if and only if:

(3)

This model is a result of integrating the results of several different research efforts. The derivation of the equation given in [3], but it is not necessary to understand the derivation to make use of the above equation. The following description greatly simplifies the equation.

First, lets assume no interrupts. That is, there is only periodic tasks are in the system. This means nintr=0, and the equation reduces to the following:

(4)

If there are interrupts, then they are simply considered as high-priority periodic tasks for analysis purposes. The net effect is a slightly more conservative analysis, because the overhead of a task switch (thr) is used instead of the overhead of an interrupt switch (intr). Almost always, thr > intr.

The minimum function is that the test is supposed to be performed at many different check-points. However, a simplification is to simply use the end-point of the logged data. Assume that the output that was shown in Figure 3 is to be simply set the check-point to 10 sec.

When the check-point is much greater than the period of the slowest task, the analytical difference is negligible: if the system is very close to the threshold of being schedulable and not schedulable, the analysis might mistakenly say it is not schedulable, when in reality it is schedulable. Most of the time, however, a correct answer will be given.

Setting the check-point to a fixed value of 10 sec means t=10 in Equation (4) above, and the minimum function is eliminated. The equation is now

(5)

The first line states that the equation should first be tried with one task, then two tasks, and so on. Always start with the highest priority task, then when trying the equation with the next task, select the next highest priority task. In the output shown in Figure 3, it is assumed that the tasks are ordered from highest to lowest priority.

Using the data shown in Figure 3, the analysis can be performed either with the estimated worst-case execution time Cref  or the actual measured time Cmax. The value Tj is the period Tref. The value 2 thr is a measurement of the overhead, which is obtained ahead of time using the techniques shown earlier. Lets assume that 2 thr was measured to be 100 µsec.

The summation means that the computation is performed for each task, up to the number of tasks being tested. So to start, assume only the first task. Lets use the estimated times as given by the column Cmax. Substituting real values into Equation (5), above gives the following:

(6)

The symbol is the ceiling function, which means compute x/y, and always round up. Thus 6/2 is 3, but 6.1/2 is 4. Therefore [10/0.001] is 10,000. The left side of the equation computes to 0.156, which is certainly less than the right hand side value of 1. That means the system is always schedulable if there was only the single task in the system. Repeating with the first two tasks, the equation becomes:

(7)

The left side now adds up to 0.455. Therefore if only the first two tasks in the system are present, all deadlines will be met. This can be continued for 3 tasks (0.156+0.299+0.322= 0.777), 4 tasks (0.156+0.299+0.322+0.302=1.079), and so on. At some point (in this case when the first four tasks are considered), the left hand side adds up to more than 1.0 meaning that missed deadlines are expected. This is confirmed by the presence of missed deadlines as shown in Figure 3.

So how does this help? Look at the Cref column instead. Repeat the calculations for just the first four tasks. It comes to 0.958. So according to the worst-case estimated execution times (i.e. Cref), the task set is schedulable. But when it was measured, there were missed deadlines. Comparing Cref to Cmax shows clearly that the measured execution time was greater than anticipated, since Cmax > Cref for three of the tasks.

Once this information is known, improving the system can be done systematically, rather than through many iterations of trial-and-error.

Lets suppose the goal is to fix the system so that the first four tasks (PID 0, 1, 2, and 3) are guaranteed to meet their deadlines. The first step is to revise the estimates Cref using the value Cmax. This provides much more accurate estimates of the worst-case execution time.

To make these tasks schedulable, it is necessary for the left side of the equation (computed as 1.079 above) to be reduced down to a value below 1.0. This can be accomplished by either decreasing execution time C or increasing period T for one or more of the tasks:

1) Decreasing execution time means that the task will need to be optimized so that it uses less resources. As a general rule, the more CPU time a task uses, the easier it will be to optimize it. It is also easier in many cases to reduce the execution time of two or three tasks by just a small amount, rather than one task by a larger amount.

2) Increasing the period means a task will execute less often. Whether or not this is acceptable is dependent on both the application and the hardware. Review the reason that a particular period was selected for a task (remembering that period = 1/rate). If it is okay to increase the period, then that is a much easier solution than optimizing the code to reduce execution time.

By simply repeating the math, precisely which tasks need to be optimized, and by how much, can be determined. Furthermore, a fine-tuning of the periods of each task can also be performed. The computed Cref to make the system works provides a goal to the person optimizing code, so that when the goal is achieved, the system will work and the optimization effort can end.

Note that the analysis is performed using worst-case execution time. Quite often, the tasks use less than the worst-case, as indicated by Cavg. So it is possible that on average everything can execute, but in the worst-case it will not. Such a system is called a soft real-time system. Using the above analysis techniques, it is possible to determine which tasks will always execute (i.e. which are hard real-time), and which tasks might miss occasional deadlines.

If interrupts are present, then use Equation (1), which takes into consideration interrupt handling as well. Each interrupt is simply treated as a high-priority task. The period Tref is set to the minimum interarrival time, and Cref is the worst-case execution time of a single invocation of the interrupt handler. If a task set is scheduled using a different scheduling algorithm, then there likely exists an analytical solution that can be used in practice. [3]

To read Part 1 in this series of articles, go to "The importance of schedulability analysis."

David B. Stewart is Director of Software Engineering at InHand Electronics.

References
[1]
D. Katcher, H. Arakawa and J. Strosnider, "Engineering and Analysis of Fixed Priority Schedulers", IEEE Transactions on Software Engineering, Vol. 19, No. 9, Sep. 1993.

[2] A. Secka, "Automatic Debugging of a Real-Time System Using Analysis and Prediction of Various Scheduling Algorithm Implementations," M.S. Thesis, Dept. of Electrical and Computer Engineering, University of Maryland, Supervisor D. Stewart, Nov. 2000.

[3] M. Steenstrup, M.A. Arbib, and E.G. Manes, "Port Automata and the Algebra of Concurrent Processes," J. Computer and System Sciences, Vol. 27, No. 1, pp. 29-50, August 1983.

[4] D.B. Stewart and P.K. Khosla, "Mechanisms for Detecting and Handling Timing Errors," Comm. the ACM, Vol. 40, No. 1, pp. 87"94, January 1997.

[5] D.B. Stewart, "Designing Software Components for Real-Time Applications," in Proc. of Embedded Systems Conference, San Francisco, CA, Class 507/527, Apr. 2001.

[6] TimeTrace, TimeSys Corp.

[7] WindView, Wind River Systems.

• 01.15.2019

• 01.02.2019

• 01.16.2019