In this first of a two-part article, I document my experiences testing the Linux support in Percepio’s Tracealyzer on a Yocto-based Linux distribution. In doing so, I highlight the ways in which a visual trace diagnostics tool such as this can help developers evaluate the performance of their embedded systems, from analyzing drivers and interrupt handlers to examining userspace applications and compiler options.
First, let’s look at evaluating a Linux driver implementation. It is important to note that Tracealyzer for Linux builds on LTTng, an open-source tracer and profiler, parsing its output to generate visualizations and detailed statistics that let developers evaluate the performance of the kernel.
The device for which the driver is being evaluated has three main interfaces. The I2C interface controls the device, the SPI interface is used to stream data back to the Linux device, and the GPIO is an interrupt line to indicate when there is data ready to be consumed. When the GPIO line is asserted, the driver sends an I2C command to instruct the device to begin streaming, which will be done over the SPI interface. The driver will instruct the DMA controller (DMAC) in the embedded Linux system to manage the data transfer between the SPI bus and system RAM to ensure that the CPU is able to manage other tasks. Finally, application code exists on the Linux device to retrieve the streamed data from RAM and store it in non-volatile memory.
Tracealyzer is used to validate two important metrics. First, that the amount of time from the GPIO being asserted to when the I2C command is issued is kept to a minimum. Second, that the Linux kernel is providing the driver with sufficient execution cycles to allow it to periodically manage any issues encountered by the DMAC. The goal is to guarantee that minimal data is lost in the streaming process, and Tracealyzer is used to help ensure this guarantee.
Device development kit
Tracealyzer uses files generated by LTTng, so the embedded Linux platform is configured to support LTTng by creating a custom layer in the device’s Yocto Project-based board support package (BSP). The resulting Linux image is loaded onto an SD card and booted on the development kit. The device driver is loaded, and the resulting trace data is stored on the card for offline analysis. Tracealyzer is then started on the host to view and analyze the collected traces.
For investigating the kernel space, the “trace view”, the “actor instances”, the “CPU load graph”, and the “context switch intensity” views are the most appropriate. The trace view (figure 2) shows the swapper (idle task) used in the kernel occupying the most execution resources during the entire capture, which is expected. When the driver is running and transferring data from the device to the development board, the specific kernel thread is given a greater chunk of the execution resources on the platform to perform the transfer.
The next important view is the actor instances. Selecting “execution time” in the drop-down menu generates a graph indicating the amount of time occupied by any particular “actor” (defined as any execution element, such as a thread or process). Selecting one kernel thread shows a few spikes in its execution time at 350, 450, and 550 microseconds. To understand whether these spikes are of any real concern, we would need to know the timing requirements of the system or compare them against an evaluation of how the system operates under “normal” conditions.
Selecting another kernel thread from the trace shows a relatively large spike in the execution time of one of the kernel threads. The “CPU load” view depicts the CPU utilization of different actors. We can see that the spike in the execution time of the highlighted kernel thread doesn’t result in an unusually high utilization of the CPU (the maximum CPU utilization is a little over 1%). Thus, that spike is no cause for concern.
The final view is the “context switch intensity” view, which shows whether the kernel driver is performant and not causing the kernel to thrash.
This does not show any significant context switches for any particular kernel thread, relative to others. If there were a performance issue in the driver, then there may be significant context switches of the kernel thread. This could be due to the kernel scheduler giving execution time to the kernel thread, then moving to another thread after some time, but then switching immediately back to the thread if it has demanded execution resources. Again, the determination of whether approximately 20 context switches are acceptable is dependent either on the system requirements or on measurements performed when the system is behaving normally.
These views also provide a quick way to get an overview of the trace and locate “hotspots” or anomalies of interest for further study. This is one of the main benefits of Tracealyzer, as such anomalies can otherwise be difficult to find in long traces with many thousands or even millions of events.
Uncovering performance issues in a Linux interrupt handler
Most Linux device drivers use interrupts. The first thing to remember about interrupts is that it’s important to limit the number of operations assigned to the interrupt handler, since the kernel is in a sensitive state, and so is the entire processor. To give an example, while the interrupt handler is running, all interrupts are masked – that is to say no other interrupt can fire; if an interrupt handler is executing for a long time, other interrupts may be missed. So, it’s essential to stick to a bare minimum when addressing the interrupt, such as register operations and moving data around in processor memory. Managing other operations such as the data transfer should be delegated to tasklets or other kernel mechanisms. In the driver example, the device specification states that an interrupt is set to fire every 80 milliseconds, so this defines the maximum time the interrupt handler can take to execute.
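The split described above, a minimal top half plus deferred work, is commonly expressed in the kernel with a threaded IRQ handler. A rough, non-compilable sketch for a device like the one in this article, using hypothetical helper names (`dev_i2c_start_stream`, `dev_spi_kick_dma`) that stand in for the real driver code:

```
/* Sketch only: hypothetical helpers, kernel headers and error
 * handling omitted for brevity. */
static irqreturn_t dev_irq_top(int irq, void *data)
{
    /* Top half: do the bare minimum while the interrupt context
     * is active, then hand off to the threaded handler. */
    return IRQ_WAKE_THREAD;
}

static irqreturn_t dev_irq_thread(int irq, void *data)
{
    struct dev_priv *priv = data;

    /* Bottom half: runs in a kernel thread and may sleep,
     * so bus traffic is safe here. */
    dev_i2c_start_stream(priv);   /* I2C command to start streaming;
                                     also acknowledges/de-asserts the
                                     GPIO interrupt line */
    dev_spi_kick_dma(priv);       /* hand the SPI transfer to the DMAC */
    return IRQ_HANDLED;
}

/* In probe():
 *   request_threaded_irq(irq, dev_irq_top, dev_irq_thread,
 *                        IRQF_TRIGGER_RISING, "dev", priv);
 */
```

The key design point is that everything slow (I2C, SPI, DMA setup) lives in the threaded handler, keeping the top half well under the 80-millisecond budget set by the device specification.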
Knowing when to printk
Tracealyzer can be used to ensure that the interrupt handler is doing as little as possible. For instance, it all but eliminates the need to use printk statements, compare timestamps in kernel logs, and wade through endless LTTng traces to evaluate performance. Instead, it provides a clear and detailed view of the interrupt handler.
In the data acquisition device, a GPIO signal informs the driver that data is ready to be collected from the device and a printk call in the driver simply records in the kernel log that an interrupt was received, with the return value IRQ_HANDLED informing the kernel that the interrupt was processed.
Before loading the kernel module and connecting the device, LTTng is started on the target device and instructed to capture nothing but interrupt events. After transferring the LTTng traces to the host machine and starting Tracealyzer, the actor instance graph shows only the interrupt handler and the relevant information, for greater focus. At a glance, we can see how often the interrupt handler fires (in this example, roughly the expected 80 milliseconds between instances). However, digging deeper via the “selection details” window and opening the “execution time” option reveals that the interrupt handler took on average 3.3 milliseconds to execute.
When the printk call is removed, with the actor instance graph displayed and the view still set to “execution time”, Tracealyzer shows a significant drop in processing time, from 3.3 milliseconds with the printk to 14 microseconds without it.
This difference clearly illustrates the considerable amount of processing involved in a printk call, which must consider a wide range of different cases when printing to the kernel log. To achieve optimum performance, the obvious conclusion therefore is to try and avoid using printk, or any of its derivatives, in an interrupt handler.
Tracealyzer also shows that the execution time tends to vary erratically. While the microsecond differences are admittedly small, it is nevertheless important to understand why such disparities occur.
While the middle portion of the capture shows that the interrupt handler is fired every 80 milliseconds as expected, the number of recorded calls is much higher at the beginning and end. Things get really erratic towards the end of the capture, where execution was called after 325 milliseconds in one instance.
This may be explained by the fact that the interrupt handler never instructed the device to stop firing its interrupt. Since the interrupt is always asserted, the Linux scheduler keeps giving execution resources back to the interrupt handler; this adverse phenomenon is commonly known as “thrashing”. The printk statement had masked the bug.
Interestingly, the periodicity of the interrupt handler stabilizes again after a while even though the device was not instructed to stop calling the interrupt. A closer look at the device specification reveals that a fail-safe mechanism exists on the device, which automatically de-asserts the interrupt after some time if it hasn’t been acknowledged over the I2C bus. Since the time between execution of the interrupt handler with printk is approximately 80 milliseconds in the present example, the execution time runs into the de-assertion phase, masking the fact that the code does not adequately de-assert the interrupt.
Without using Tracealyzer, this bug is effectively hidden and would not have been spotted or manifested itself until shortly before release when the extraneous printk calls were removed. At that late stage, the necessary driver modifications would have incurred significant delays and costs.
Evaluating Linux system performance
Tracealyzer can also be used to tune a Linux system to maximize performance; we’ll illustrate this using a Linux system such as Nvidia’s Jetson Nano and a userspace application such as iperf.
Figure 7 shows the basic setup of this experiment.
Adjusting the CPU affinity of the iperf server on the Jetson Nano reveals how that parameter affects the overall throughput between client and server. CPU affinity denotes the particular CPU core to which an execution context is pinned. Usually, affinity is set on a per-application basis. If the CPU affinity of the interrupt and its handler matches the CPU affinity of the packet-receiving process, then packet loss should be minimized, because no time is wasted moving data between cores – at least that’s the idea.
This analysis and possible optimization first requires determining the affinity of the Ethernet interface (eth0) on the Jetson Nano. This will indicate which processor core handles interrupts from the eth0 interface. By not allowing Linux to choose the processor core and instead pinning the iperf server execution to a specific core, I aim to evaluate the impact on throughput. As CPU0 is responsible for handling interrupts from the eth0 interface, I pin the iperf server to CPU3.
An iperf test is run from the host machine, and then the LTTng session is stopped and destroyed to avoid having large traces with extraneous events.
This gives some interesting results in the capture when iperf is pinned to CPU3. First, Tracealyzer shows there are four instances of the iperf process running, even though Linux listed only a single instance. But the instance of iperf that corresponds to the PID reported by Linux executes only twice: once at the beginning of the iperf measurement and once at the end.
Even though iperf is pinned to CPU3, there are other instances of iperf executing on different CPU cores. This is actually not that uncommon, since applications can implement their own logic for selecting the appropriate CPU for execution. It appears that iperf has such logic implemented as well.
The trace also shows a flurry of execution instances of the eth0 interrupt handler executing at around the same time as the iperf experiment, showing a correlation between when the eth0 interrupt handler is executing and when the iperf instance is executing.
Tracealyzer shows it takes 55 microseconds from completion of the eth0 interrupt handler to when the iperf instance begins execution. With the system loaded with 20 processes across all the CPUs, a new iperf measurement shows that the throughput remains unchanged.
Opening Tracealyzer with a capture taken with the CPU cores artificially stressed shows the execution sequence of the interrupt handler and the iperf instance, with the time between completion of the eth0 interrupt handler execution and the start of the iperf execution approximately 40 microseconds.
While the number itself isn’t important (although it is interesting that the time is actually less under load), it is the same order of magnitude when the system is under load versus when it is not – 40 microseconds versus 55. This is a tremendous feature of the Linux kernel, where even with a userspace application seemingly hogging all four cores of the system, it still ensures that other userspace applications are not starved for CPU resources and that inter-core communication is not affected.
Analyzing the interactions between the different execution elements under normal and strenuous conditions shows a neat feature of the Linux kernel where a best effort is made to provide all processes with a fair share of CPU resources.
The second part of this article will look at evaluating userspace performance and understanding the impact of compiler options on performance.
Mohammed Billoo is founder of MAB Labs, an embedded software engineering services provider. He has over 12 years of experience architecting, designing, implementing, and testing embedded software, with a core focus on embedded Linux. This includes custom board bring-up, writing custom device drivers, and writing application code. Mohammed also contributes to the Linux kernel and is an active participant in numerous open-source efforts. He is also an adjunct professor of electrical engineering at The Cooper Union for the Advancement of Science and Art, where he teaches courses in digital logic design, computer architecture, and advanced computer architecture. Mohammed received both his Bachelor’s and Master’s of electrical engineering from the same institution.