Statistical Profiling: An Analysis

Performance is critical in most real-time systems. As any programmer knows, finding the causes of poor performance can be a difficult and time-consuming task. Often, the problem is easy to fix, if and when the root cause is discovered. Statistical profiling is an important tool for improving the performance of any program. It is important in general programming as it measures the execution pattern of your application on a procedure-by-procedure basis by taking samples of the execution state. These samples are used to calculate where the CPU is spending its time. This information is exactly what you need to tune your time-critical system for maximum performance.

In many real-time operating systems, such as VxWorks, each task has an associated stack. The stack is used as a temporary storage area for both local program variables and program execution state. Each time a new function is entered, the processor allocates a new section of the stack to hold the information that function needs. This section is called the function's stack frame. It contains the function's temporary variables and several key pieces of information, including saved processor registers and the return address of the parent routine. A register in the processor, called the frame pointer (FP), always points to the currently executing function's stack frame. When a subroutine is called, the old frame pointer is saved on the stack, a new stack frame is created, and the frame pointer is updated. Thus, at any instant in time, the entire function call history is present on the stack and can be determined by “walking” down the chain of frame pointers stored on the stack.

Statistical profiling works by sampling this execution state. By examining the program counter, the profiler can determine which function in your program was executing at the time of the sample. By examining the chain of stack frames, the function sequence that resulted in the current execution state can also be determined.

The sampling clock must be relatively independent of the executing program so that it can generate “random” samples that can be compiled into statistics to show the program's execution patterns. This compilation is performed by a low-priority task on the target. The steady-state CPU usage for a “typical” application profile is only about 3-5%, which is by default excluded from the profile display. On-target analysis has the huge benefit of avoiding all network traffic by the tool itself. Thus, networking applications can be accurately profiled.

As an example, consider this portion of a network server's execution tree as analyzed by Real-Time Innovations (RTI) product, ProfileScope:

         ->ProfileShowTree "_TcpWrite",0,0.5     Profiling task 0x3b20e0: net4        Current     sum %  direct %     10.43    .12   _TcpWrite     10.31   0.01   .  _write     10.27    .18   .  .  _iosWrite     10.06    .16   .  .  .  _send      9.35    .64   .  .  .  .  _sosend      7.92    .27   .  .  .  .  .  _tcp_usrreq      7.25    .88   .  .  .  .  .  .  _tcp_output      4.97    .49   .  .  .  .  .  .  .  _ip_output      3.87   0.06   .  .  .  .  .  .  .  .  _eiOutput      3.80    .51   .  .  .  .  .  .  .  .  .  _ether_output      2.32    .31   .  .  .  .  .  .  .  .  .  .  _eiTxStartup      1.12   1.12   .  .  .  .  .  .  .  .  .  .  .  _bcopy       .69    .69   .  .  .  .  .  .  .  _cksum       .63    .63   .  .  .  .  .  _bcopy

The example shows the server's TCP write activity. The profile clearly shows the overhead incurred at each layer: the I/O system, the socket layer, TCP, IP, and the Ethernet driver. Both cumulative (including all subroutines) and direct CPU usage of each routine are captured. You can easily see how much time was required to copy data (bcopy) and calculate checksums (cksum). Note that operating-system code is profiled along with user code, a critical requirement for system performance tuning.

The sum column indicates how many times that routine was active, either on the stack or directly. The direct column indicates how many times the program counter was directly executing that routine (the direct values don't always add up to match the sum, because any call chain using less than .5% of the CPU is not normally displayed). From this output, many important facts can be learned. For instance, you can see the overhead incurred in each layer of the network protocol, the total percentage of time spent writing to the socket, and the number of times the data was copied.

A profiler may keep both short and long-term statistics. The long-term statistics simply reflect the activity of the CPU for the entire time profiling has been active. The short-term statistics offer a dynamic record of the recent CPU activity. This dynamic record reflects the activity of the CPU over the last few seconds.

Dynamic records are used to analyze load changes as a program executes, with the output plotted in real-time. Many profilers use a bar chart format. Figure 1 contains an example of a dynamic record, showing a plot with RTI's data monitoring tool, StethoScope, with vertical bars and a time history.

There are several tools that allow real-time programmers to collect performance measurements. In addition to function-level statistical profiling you can use operating system monitors and coverage analysis. All these tools are useful for tuning your application code for maximum performance. While they are very different, their output is quite complementary.

Figure 1: Dynamic Execution Profiling
Since the profiler keeps dynamic CPU usage statistics, it can be used to analyze changing programs. Here, a network server handles a varying connection load. Note that each function making up the load can be independently measured.

Other Types of Performance Tools

There are many tools that allow real-time programmers to collect different performance measurements such as operating system monitors. These types of tools give information quite different from statistical profiling. For example, if your application is thrashing (spending most of its time context switching), a profiler will simply report that most of the CPU is being spent in the kernel. An OS monitor will show you which tasks are switching, and display the events (e.g. interrupts) that are causing the switches. The monitor is clearly more useful in this situation.

On the other hand, if you have one task hogging the CPU, an OS monitor will only tell you that one task is running most of the time. Statistical profiling, however, will provide a detailed function-by-function analysis, breaking down the individual routines within the task that are burning the CPU. It will give you a direct map of what that task is doing, which routines are being called, what routines they call, and point out exactly where the inefficiencies are.

Statistical profiling won't reliably determine if every line of code has been executed at least once; that requires a coverage analysis tool like Tornado's CodeTest/Coverage. A profiler measures how much CPU each function in the program is using over a reasonable time period. This is the information needed to speed up the overall execution of the system.

Although it cannot evaluate absolute coverage, a profiler can analyze the general execution tree of the system to give you an idea how the various routines are being executed. In some cases, very complete maps can be generated. Further, some profilers, like ProfileScope, have the advantage that they require no code modification and can thus analyze already running code as well as operating-system routines.

When To Use a Statistical Profiler

The most important applications for a profiler in a real-time system are:

  • Identifying excessive usage of the CPU, so you can optimize the code, or re-code to not utilize CPU-intensive functions.
  • Timing the execution of a set of routines, e.g. to determine what routines contribute to the time of response to some event.
  • Analyzing execution trees.

Example Usage of a Statistical Profiler

Examples of each of the following applications, (see Analysis) show the speed with which a statistical profiler can acquire the data needed to analyze the software.

In Application 1, the profiler must examine a large number of routines (often the entire system). The routines in question are high-probability events; it's not much use optimizing routines whose total usage of the CPU (including all subroutine calls) is less than about 1%. For this application, accuracy is not that important. The analysis assumes you need a two-digit accurate measurement of a routine that uses 1% of the CPU. The case of a routine that uses only 0.1% of the CPU is also examined for completeness. Asking for two-digit accuracy for an event with a probability of 0.001 (1/1000 of the CPU) requires a tool capable of measuring 1 part in 100,000. Building an execution tree this detailed is roughly equivalent to building a complete map of every mile a car has driven over its entire 100,000 mile lifespan.

Application 2 addresses the same problem as Application 1, but handles situations where it is important to know the exact number of times each procedure being timed has been executed. This is not usually a problem as much code is periodic. If the code is not periodic, a simple counter can be used. Given the execution frequency, the CPU percentage can be directly translated into elapsed time.

Application 3 requires mostly that the profiler catch most of the “leaves” of the execution tree. The leaves may be extremely rare events, but high accuracy in percentage usage by those leaves is not as important. The analysis assumes you want to catch every event that used at least one ten-thousandth of the CPU to one significant figure.

Profiling Boosts RTOS Performance

Performance is critical in many real-time systems. Getting the most out of your CPU cycles is often the key to top performance. Profiling and tuning code is one of the most important ways to increase performance.

Statistical profiling is simple to use, minimally intrusive, and flexible. As we have seen here, it is also capable of accurate measurements. In just seconds, the profiler collects enough samples for accuracy of less than 1%. Collecting for a minute or two yields much higher accuracy.

Statistical profilers, such as Tornado's ProfileScope, provide a detailed function-by-function analysis, breaking down the individual routines within the task that are burning the CPU. It creates a direct map of what the CPU is doing, what routines are being called, what routines they call, and points out exactly where the inefficiencies are. This information is exactly what's needed to tune time-critical systems for maximum performance.

The accuracy of a statistical profiler's measurement is determined by the number of samples the profiler collects. We seek to understand how long we must let the profiler collect data to insure accuracy that will accomplish our purpose. For a given collection rate, this means we need to calculate the number of required samples.

The detailed statistics are beyond the scope of this paper. We use a well-known result: for a confidence coefficient g of a confidence interval e, you need at least N samples of a binary process, where N is conservatively estimated by:

where p is the probability of the binary event, q = 1-p, and Z is the inverse of the Normal distribution integral function. The binary event is the event that a sample contains the routine in question. What this means is that to be g sure that you are within an error of e, you need to collect N samples.

Application 1: Optimizing Code
Optimizing code so your program will run faster is by far the most common application of profiling. To accomplish this, you must identify routines that are using large percentages of the CPU so you know where to concentrate your efforts to speed the code.

Most profiling applications are well served with 95% confidence to the nearest half a percent or so. What this really means is that you are 95% sure that the numbers you are looking at are within plus or minus .25% (a range of .5%) of the true values. That level of accuracy is available almost immediately. In particular, for a routine using 1% of the CPU:

p = .01, q = .99, g = .95, e = .005, Z[…] = 1.96

This works out to only 1521 samples, or about 14 seconds of data collection at ProfileScope's default 107Hz.

A common, very harsh statistical test is to be 95% sure of two-digit accuracy. For this level of confidence in a measurement of a routine using 1% of the processor, you would have:

p = .01, q = .99, g = .95, e = .001, Z[…] = 1.96

Substituting, we find that we need 38032 samples, or about six minutes worth at Profile's default 107Hz. If we increase the sample rate to 500Hz (analyzing 500 samples a second is about as fast as is reasonable for most applications), this takes only about one minute. Thus, even this level of accuracy is completely reasonable for a typical statistical profile.

To achieve this level of accuracy for a more rare event, say one that uses only 0.1% of the processor, more samples are required. Using p = .001 and e = .0001, this accuracy requires 384160 samples, or about an hour at 100Hz, 12 minutes at 500Hz. This would require some patience, but as noted above, this is an exceedingly fine-grained dirt-on-the-car measurement.

Application 2: Timing Routines
Suppose you have a device driver, and you want to measure the time of execution of every routine in it, accurate to 1us.

We'll assume you put the driver into a tight loop, and in that loop, it executes every 0.001 sec. (1000 times a second). Suppose a routine within the driver takes 10us. In one second, the CPU will spend 10us x 1000 = 10ms in this routine, so p = .01. To get 1us accuracy, you will need to know the CPU usage to 1ms (.001 sec) accuracy for the 1000 loops. As above, this means p = .01, e = .001. It will take six minutes to collect at the default 107Hz, one minute at 500Hz.

If the driver doesn't execute as often, you can use selective sampling. ProfileScope supports selective sampling through a simple on/off API. Restricting the sampling to only your desired code sections allows a higher effective sampling rate. This can dramatically shorten the time required to get accurate results. Of course, this intense sampling may slow down the system somewhat, but it may provide the data needed.

Application 3: Analyzing Execution Trees
Finally, building an execution tree that catches every routine using one two-thousandths of the CPU with 95% probability would only take one minute at 107Hz. Building an extremely accurate tree by catching every event that used at least one ten-thousandths of the CPU to one significant figure, (p = .0001, e = .00005) requires 153664 samples, or about 24 minutes at 107Hz, five minutes at 500Hz.

About the Author
After serving as manager of the Stanford University Aerospace Robotics Laboratory for three years, Stan Schneider founded Real-Time Innovations in 1992 to develop and market productivity tools for the real-time marketplace. He is a recognized expert in real-time software systems and architectures, with experience in data acquisition systems for automotive impact testing; software lead for PC and communications products; and management of, and extensive consulting experience in, real-time digital signal processing and computer systems.
He authored many innovative tools for the VxWorks environment from Wind River Systems, including the RTILib package, as well as the StethoScope and the original ControlShell system from RTI. He also chairs the Wind River User's group.
Mr. Schneider completed his Ph.D. in Electrical Engineering and Computer Science at Stanford University. He holds a B.S. in Applied Mathematics (Summa Cum Laude) and an M.S. in Computer Engineering from the University of Michigan.

References
Applied Statistics, by Neter, Wasserman, & Whitmore; Allyn & Bacon, 1978. (See section 11.2)
Probability, Random Variables, and Stochastic Processes; by Papoulis; McGraw-Hill, 1984. (See section 5-3)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.