Previously, we looked at how Percepio’s Tracealyzer can help developers evaluate the performance of their embedded systems, from examining a driver implementation and uncovering performance issues in a Linux interrupt handler to knowing when to use printk. In this second part of the case study, we look at evaluating userspace performance and understanding the impact of compiler options on performance.
Evaluating userspace performance
The vast majority of embedded Linux software developers write userspace applications. Since these applications are specific to a certain domain and are highly complex, application developers need an easy mechanism to validate the functionality of their applications and measure performance.
Tracepoints, which are application-specific instrumentation points provided by the LTTng userspace tracing library to capture user-specified data as events, serve this purpose. Tracepoints can be created in two ways: the first, called tracef, is a very simple way to capture all data as a single event. The second allows a developer to create custom events. While the latter mechanism requires significantly more code, it also provides maximum flexibility for collecting data and displaying it in Tracealyzer.
I used the tracef approach here, and also captured the kernel trace alongside the userspace events, for two reasons. First, the kernel trace can often explain the timeline of userspace events: if there is a long delay between two events in our application, kernel events should give visibility into what caused it. Second, Tracealyzer, as of version 4.4.2, requires some data in the kernel trace to correctly display UST events, although it doesn’t need to be the full kernel trace.
Tracealyzer can also measure the performance of a userspace application. This example mimics a function that takes a fixed amount of time using the Linux usleep function, with one tracepoint added before the function invocation and another after it, so that the time the function takes to complete can be measured.
In a real-world scenario, a developer would identify locations in the application where execution time needs to be measured and add the tracef invocations there. For example, a function may have multiple candidate implementations, and the execution timing is examined to find the fastest algorithm; or a specific function may be complex and need its execution characterized. After compiling the application, launching an LTTng session on the target, and capturing and downloading the resulting trace data, the trace can be examined in Tracealyzer.
The execution time of the usleep function calls can be defined as the time between each pair of “start” and “stop” events. The best way to do this in Tracealyzer is to create intervals for the custom user events.
Once the custom interval definition is saved, it appears in the Intervals and State Machines window, and Tracealyzer highlights the intervals in the trace view, allowing an interval timing graph to be generated.
Not surprisingly, all interval executions seem to have lasted approximately 25 ms but clicking on one of the data points reveals valuable timing information about the interval.
Tracealyzer shows statistics on the length of each interval, which corresponds to the execution time of the function of interest. A second box shows statistics on the time between executions of the function, that is, the time from a stop event to the next start event. A third box shows how often each interval occurs, measured from one start event to the next.
The interval plot can be used to identify any anomalous timing of the function of interest, and the Selection Details view can be used to gather high-level timing statistics.
The majority of embedded software developers will be developing userspace applications on their Linux-based embedded system. Tracealyzer, in conjunction with LTTng tracepoints, can be an invaluable tool to determine how well an application is performing, identify any anomalous behavior, and provide high-level timing statistics. It can then be used to troubleshoot timing issues further and improve the performance of the application.
Understanding the impact of compiler options on performance
Compiler options can affect the performance of the most innocuous calculations, even a basic sine function. Using Tracealyzer helps developers understand how these options can affect the performance of userspace applications that are conducting more complicated calculations.
I demonstrated this by calculating 1000 points of a sine wave at a frequency of 100 Hz, sampled at 1 kHz. With “standard” compiler options, the trace shows a discontinuity when viewed in Tracealyzer against system time. Adding a printf call that writes each sample to a file would show no discontinuity, since that simply records the values with no notion of time; when the calculated values are written to a trace file, however, the system time is recorded with each value, and any stall in the computation becomes visible.
Changing the compiler options can help reduce this discontinuity. A standard CPU core, such as an Arm processor with no floating-point extensions enabled, is designed to execute integer operations efficiently, so the compiler lowers each floating-point operation into a series of integer-based instructions. This results in a substantially larger number of instructions for the CPU to execute, and a greater opportunity for another task in the system to pre-empt the sine wave computations. The trace view in Tracealyzer confirms this: the process responsible for the sine wave calculations does get pre-empted by other processes, being suspended for 900 microseconds in the worst case.
Specifying the “-mfloat-abi=hard” option tells the compiler to use a set of instructions designed specifically for floating-point operations. However, this didn’t produce any real difference in the outcome: the discontinuity was still present, and the trace view showed that the sine wave process was suspended for the same length of time, 900 microseconds.
This is because these floating-point instructions belong to specific architecture extensions implemented by a separate, heavily optimized floating-point unit (FPU) on the processor itself, and the target FPU must be named with the compiler’s “-mfpu” option. On its own, “-mfloat-abi=hard” leaves the compiler to select a default extension.
Adding the “-mfpu=neon” option instructs the compiler to enable a specific set of extensions for this particular FPU (NEON). With most of the floating-point calculations running on a separate coprocessor, there was little opportunity for other processes to pre-empt the sine wave generation, resulting in the much smaller discontinuity.
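The three build variants discussed above might look like the following for an Arm cross toolchain; the compiler names are illustrative, and the exact “-mfpu” value depends on the FPU present in the target SoC.

```shell
# Soft-float: floating point lowered to integer instruction sequences
arm-linux-gnueabi-gcc -O2 -o sine sine.c -lm

# Hard-float ABI, compiler's default FPU extension
arm-linux-gnueabihf-gcc -O2 -mfloat-abi=hard -o sine sine.c -lm

# Hard-float ABI targeting the NEON FPU explicitly
arm-linux-gnueabihf-gcc -O2 -mfloat-abi=hard -mfpu=neon -o sine sine.c -lm
```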
Tracealyzer can also visualize the amount of time each sine calculation takes by using custom intervals. Instead of passing the computed sine values to tracef, “start” and “stop” user events can bracket each calculation.
Compiling and running the application without the “-mfloat-abi=hard” compiler option shows that while most executions of the function take on the order of tens of microseconds, there are a few outliers: in one case the function took approximately 200 microseconds to execute, and in another it took approximately 1.05 milliseconds!
Adding the hard-float ABI option (but not the NEON extensions) shows outliers in the range of 100 to 200 microseconds and one invocation that took almost 1.1 milliseconds.
Opening a trace collected when the application was compiled with both the hard-float ABI and the NEON extensions shows that the longest execution time is now a little under 240 microseconds, demonstrating how the NEON extensions considerably reduce the worst-case delay between calculation points.
Using the LTTng library and Tracealyzer, developers can see how compiler options impact the performance of userspace applications that perform floating-point calculations. Usually this sort of analysis is done after the fact, when the application is complete but its observed performance is deemed unacceptable, and at that stage it takes a lot of time. We have shown that by using a visual trace diagnostics tool like Tracealyzer during the development phase to verify software timing, problems can be spotted and addressed earlier; in the experience of seasoned developers, this avoids hidden bugs and saves time and cost later in the project.
Mohammed Billoo is founder of MAB Labs, an embedded software engineering services provider. He has over 12 years of experience architecting, designing, implementing, and testing embedded software, with a core focus on embedded Linux. This includes custom board bring-up, writing custom device drivers, and writing application code. Mohammed also contributes to the Linux kernel and is an active participant in numerous open-source efforts. He is also an adjunct professor of electrical engineering at The Cooper Union for the Advancement of Science and Art, where he teaches courses in digital logic design, computer architecture, and advanced computer architecture. Mohammed received both his Bachelor’s and Master’s of electrical engineering from the same institution.