Performance analysis of Linux-based embedded systems: Part 2


This “Product How-To” article focuses on how to use a certain product in an embedded system and is written by a company representative.

In addition to having access to the right set of tools (covered in Part 1), it has been our experience that any performance analysis or profiling exercise needs two critical pieces of information from the start:

1. What is my expected system behaviour? In other words, how do I expect the system to behave under normal conditions? In a structured project environment, this translates to a very clearly-defined set of requirements at a system level as well as, possibly, at an individual component or application level.

2. What is my problem statement? Simplistically, this can be one of two possibilities:

a) My system is not behaving according to expectations.
b) My system is behaving as expected, but I want to know what “makes it tick”.

I want to be able to answer questions such as: “Where are my CPU cycles being spent?”, “How much memory am I really using?” This information can be used to understand any inefficiencies in my algorithm or problem areas. This information may also be used to accurately predict how the system will scale to support higher workloads.

Once Items 1 and 2 above have been met, you have effectively determined “where you are” and “where you want to be”. For the purposes of this article we will focus on scenarios in which the system is not behaving according to specifications, rather than on measurement of a working system.

From experience, it is critical to apply a structured method at the start of any performance analysis since any activity with an inappropriate tool can be a complete waste of time. Performance can be broadly affected by issues in three distinct areas: CPU occupancy, memory usage and IO.

As a first step, it is absolutely essential to determine which area your problem is coming from since the tools mainly focus on one of these three areas to provide any kind of detailed data. Hence, the first step is always to use general tools that provide a high level view of all three areas simultaneously.

Once this has been done, the developer can delve deeper into a specific area using tools with an increasing level of detail and, potentially, increasing invasiveness.

We advise against assuming which category the investigated problem falls under and skipping the first high-level analysis. Such assumptions have proven counter-productive on numerous occasions.

When doing performance analysis on a working system to understand what makes it tick, it is important to take into account a number of things. Avoid any over-kill.

For example, if only a simple CPU performance measurement of a working system is required, it may be sufficient to use a non-invasive high-level analysis tool such as ps. The depth of analysis should be determined “a priori” by all interested parties.

Viewing things at 10,000 feet
As stated earlier in this article, the starting point of any analysis should be a set of system-level measurements meant to provide an indication of the system state, most notably:

1) CPU occupancy, total and per logical core
2) Memory usage, snapshot and evolution over time
3) IO, CPU IO waits
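
As a rough illustration of where the first and third of these numbers come from, the sketch below derives per-core busy and IO-wait percentages from two samples of /proc/stat, which is the same kernel interface that tools such as top and sar read. The function names are our own, and the choice to sum only the first eight counters (user through steal) is an assumption made to avoid counting guest time twice.

```python
def parse_proc_stat(text):
    """Return {cpu_name: [jiffy counters]} for the 'cpu' lines of /proc/stat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("cpu"):
            counters[parts[0]] = [int(v) for v in parts[1:]]
    return counters


def occupancy(before, after):
    """Return {cpu_name: (busy_percent, iowait_percent)} between two samples."""
    result = {}
    for name, new in after.items():
        old = before[name]
        # Counter order: user nice system idle iowait irq softirq steal ...
        delta = [n - o for n, o in zip(new[:8], old[:8])]
        total = sum(delta)
        if total == 0:
            continue  # no time elapsed between the two samples
        idle, iowait = delta[3], delta[4]
        busy = total - idle - iowait
        result[name] = (100.0 * busy / total, 100.0 * iowait / total)
    return result


def sample(path="/proc/stat"):
    """Read and parse the current counters."""
    with open(path) as f:
        return parse_proc_stat(f.read())
```

Calling sample(), sleeping for the measurement interval and calling sample() again gives the two inputs for occupancy(). The "cpu" entry is the system-wide total; "cpu0", "cpu1" and so on are the logical cores.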

For the purpose of this article, we assume that the analysis deals with a single problem area at a time, and that the first task is to work out which area that is. Scenarios in which a system has, for example, both CPU occupancy and memory usage problems are not covered here.

Applying this methodology – that is, performing a high-level analysis that includes CPU, I/O and memory performance – we can see in Figure 11 below that our CPU usage is approximately 90%.

Figure 11. top View (Fully-Loaded Single Core System)

Our main problem here is CPU occupancy: the CPU is almost fully loaded, with most cycles being spent in user space. Our next step should be to examine more closely the applications running on the system.

In Figure 12 below, we can see that our overall CPU occupancy is just under 50%. However, we are using 99% of one core and virtually nothing of the second available core. We should examine our threading model (see “Optimizing a Complete System” and “CPU Bottlenecks” later in this article).

Figure 12. top View (Half-Loaded Dual Core System)
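
The half-loaded dual-core case is easy to miss if only the overall figure is monitored. A minimal sketch of a check for it is shown here; the function names and the 95% and 10% thresholds are illustrative assumptions rather than recommended values.

```python
def overall_busy(per_core_busy):
    """Average occupancy across all cores, as a system-wide view would show it."""
    return sum(per_core_busy.values()) / len(per_core_busy)


def find_saturated_cores(per_core_busy, saturated=95.0, idle=10.0):
    """Return the saturated cores if at least one other core is nearly idle.

    An empty list means the load is either low or reasonably balanced.
    The thresholds are hypothetical and should be tuned per system.
    """
    hot = [core for core, pct in per_core_busy.items() if pct >= saturated]
    cold = [core for core, pct in per_core_busy.items() if pct <= idle]
    return hot if hot and cold else []
```

With one core at 99% and the other at 1%, overall_busy() reports 50% while find_saturated_cores() points at the loaded core, which is the hint to look at the threading model.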

Figure 13. sar System-Wide Increased Memory Usage View

Comparing Figure 7 from Part 1 with Figure 13 above, we see that our memory usage is increasing over time. Further measurements may indicate that we have a memory leak that is affecting system behaviour (see “Investigating a Memory Issue” later in this article).

Figure 14. sar IO Wait CPU Usage View

Using ps, as shown in Figure 15 below, we can see that we have a number of applications running concurrently on the system and that our VoIPapp is by far the biggest CPU user. We should examine our VoIPapp in more detail (see “CPU Bottlenecks” later in this article).

Figure 15. ps View (Loaded System)

From Figure 14 above, we can see that the CPU is spending an inordinate amount of processing time waiting on IO. We should investigate the reason for the high number of IO waits (see “IO Bottleneck Issues” later in this article).

Optionally, we can use iostat to assess the loading of the block devices in the system to quickly determine if they are a factor in the bottleneck. For instance, in Figure 16 below, it is apparent that during the file copy, the bottleneck is the block device, which is highly loaded.

Figure 16. iostat View (Loaded System)
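
For readers who want to see how a device utilization figure of this kind can be derived, here is a small sketch based on /proc/diskstats. It relies on the documented layout of that file, in which the tenth statistic after the device name is the number of milliseconds the device has spent doing I/O. The function names are our own.

```python
def parse_diskstats(text):
    """Return {device: milliseconds spent doing I/O} from /proc/diskstats text."""
    busy_ms = {}
    for line in text.splitlines():
        parts = line.split()
        # parts[0:3] are major, minor and device name; parts[12] is the
        # cumulative time (ms) during which the device had I/O in flight.
        if len(parts) >= 14:
            busy_ms[parts[2]] = int(parts[12])
    return busy_ms


def utilisation(before, after, interval_s):
    """Percentage of the interval during which each device was busy."""
    interval_ms = interval_s * 1000.0
    return {
        dev: min(100.0, 100.0 * (after[dev] - before[dev]) / interval_ms)
        for dev in after
        if dev in before
    }
```

Two readings of /proc/diskstats taken interval_s seconds apart are parsed and passed to utilisation(). A value close to 100% for a device during the file copy would point to that device as the bottleneck.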

Optimizing a complete system
Our first pass analysis has led us to believe we should look at optimizing at a system level, that is, there are no particular outstanding CPU bottlenecks, IO over-subscribers or code blocks using inordinate amounts of memory.

In embedded systems, the amount of available resources is typically fixed. So for the purpose of this paper, we will not take into account possible system improvements such as adding more memory or adding an additional disk for more block devices.

Looking at memory usage with free, continuous swapping is a sign of a performance issue. If at all possible, the main memory-using applications should be identified (see “Investigating a Memory Issue” later in this article) and analyzed, through code analysis, for ways to reduce their memory usage.

When looking for CPU-intensive applications using top in a multi-core environment, one key item to pay attention to is the CPU occupancy breakdown per available core that top provides.

Identifying the main CPU user on a single core and making that application multi-threaded to share the load across cores is a key step in any multi-core system optimization. In the case of multiple heavy CPU-intensive applications, the Operating System scheduler will have already distributed the load over multiple cores.

When iotop and/or sar show high I/O utilization and bottlenecks, optimizing the applications for more efficient use of the device (transfer sizes, for instance) is most likely the only option in an embedded system where adding devices is not possible.
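
To make the point about transfer sizes concrete, the following sketch copies data between two binary streams and counts how many write calls are issued for a given block size. The function name is hypothetical, and the example only counts requests; it does not measure device throughput.

```python
def copy_in_blocks(src, dst, block_size):
    """Copy src to dst in block_size chunks; return the number of writes issued.

    src and dst are binary file objects. Opening real files with buffering=0
    makes each call here correspond to one read or write request.
    """
    writes = 0
    while True:
        block = src.read(block_size)
        if not block:
            break  # end of input
        dst.write(block)
        writes += 1
    return writes
```

Copying 1 MiB in 4 KiB blocks issues 256 writes, whereas 64 KiB blocks need only 16. Whether fewer, larger requests actually help depends on the device and file system, so the effect should be confirmed with measurements on the target.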

Investigating a memory issue
The first-pass analysis has led to identifying a potential memory issue. We should use free or sar to monitor memory usage at selected intervals to see if there is a consistent increase in system memory usage.
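
If free or sar is not convenient to script on the target, the same figures can be read from /proc/meminfo. The sketch below is one way to do that. The helper names are our own, and computing "used" as MemTotal minus MemAvailable is an assumption that requires a kernel recent enough to report MemAvailable.

```python
def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo text (values are mostly in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def used_kb(meminfo):
    """Memory in use, taken here as MemTotal - MemAvailable (an assumption)."""
    return meminfo["MemTotal"] - meminfo["MemAvailable"]


def swap_used_kb(meminfo):
    """Swap space currently in use."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]
```

Recording used_kb() and swap_used_kb() at a fixed interval produces the time series needed to judge whether usage is consistently increasing.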

While taking these measurements, also note the swap usage to determine whether swapping is causing a bottleneck. Use top and sort by virtual memory usage to determine which application is using the most memory, whether its memory usage is increasing (a memory leak), and whether any applications are using a lot of swapped memory.
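
The per-application figures that top displays are also available in /proc/<pid>/status, which can be handy when logging them over time. The following is a sketch under the assumption that the kernel reports the VmSize, VmRSS and VmSwap fields; the function names are our own.

```python
import os


def parse_status(text):
    """Extract the name and memory fields (in kB) from /proc/<pid>/status text."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            info["Name"] = value.strip()
        elif key in ("VmSize", "VmRSS", "VmSwap"):
            info[key] = int(value.split()[0])
    return info


def memory_by_process(proc_root="/proc"):
    """Return [(pid, info)] sorted by virtual memory size, largest first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                info = parse_status(f.read())
        except OSError:
            continue  # the process exited or is not readable
        if "VmSize" in info:  # kernel threads have no Vm* fields
            rows.append((int(entry), info))
    return sorted(rows, key=lambda row: row[1]["VmSize"], reverse=True)
```

Comparing successive results for the same PID shows whether an application's VmSize keeps growing and how much of it has been swapped out.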

In the case of a memory leak, once we have determined the application that is leaking memory, we should use valgrind to search for memory leak locations.

Because memory leaks are identified over time from multiple measurements, the system must be sufficiently well understood to know when it has reached a stable state in which memory usage is not expected to change. Without this information, a developer may misinterpret normal system operation as a memory leak.
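
One simple way to express this is to fit a straight line through the memory samples and to discard the samples taken before the system reached its stable state. The sketch below does this with an ordinary least-squares slope. The function name and the warmup parameter are our own, and choosing how many samples to discard still requires knowledge of the system.

```python
def growth_kb_per_sample(samples, warmup=0):
    """Least-squares slope of the memory samples taken after the warm-up period.

    samples: memory usage in kB, taken at a fixed interval.
    warmup:  number of initial samples to ignore (start-up allocations).
    """
    steady = samples[warmup:]
    n = len(steady)
    if n < 2:
        raise ValueError("need at least two samples after warm-up")
    mean_x = (n - 1) / 2.0
    mean_y = sum(steady) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(steady))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

For the series 100, 300, 500, 500, 500, 500 the slope is positive if every sample is used, but zero once the first two start-up samples are excluded. This is the misinterpretation the paragraph above warns about.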

Although it may be impossible for an embedded developer to increase main memory to alleviate excessive swapping or disk thrashing to improve performance, it may be desirable for a developer to lock all memory used by an application into main memory so that it does not get swapped out.
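
On Linux, locking memory is done with the mlockall() system call. An application written in C would call it directly; the sketch below calls the same libc function through ctypes purely for illustration. The constant values are those used on most Linux architectures (an assumption, since a few architectures define them differently), and the optional libc parameter is our own addition to make the function easy to exercise.

```python
import ctypes
import os

# Values from <sys/mman.h> on most Linux architectures (assumption: a few
# architectures, such as PowerPC, use different values).
MCL_CURRENT = 1  # lock all pages currently mapped
MCL_FUTURE = 2   # lock all pages mapped in the future


def lock_all_memory(libc=None):
    """Ask the kernel to keep all of this process's memory out of swap."""
    if libc is None:
        libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The call fails unless the process has the CAP_IPC_LOCK capability or a sufficiently large RLIMIT_MEMLOCK limit. Locked memory is also unavailable to the rest of the system, so this should be reserved for applications that genuinely cannot tolerate paging delays.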

While a large amount of swap space is in use, a developer may note (using gprof or LTT) that the wall-clock time required to access various regions of memory is greater than during periods when swap usage is low.

IO Bottleneck issues
IO bottleneck identification within a full system is arguably the most difficult issue to track down. In networking scenarios, where the network device is the bottleneck, this is not clearly identified without the use of external equipment to generate the appropriate network conditions in and out of the system. However, discussion on the performance for networking IO is beyond the scope of this article.

Based on the tools at our disposal, one IO area where we can get sufficient information for analysis is in the area of block devices and more specifically, disk IO.

As described in the “Viewing things at 10,000 feet” section earlier, sar and/or iotop provide us with data indicating that the CPU is waiting on a block device which is 100% loaded. For further investigation, we should use iotop to get a “per process” breakdown of IO usage to determine which process is the main device user.
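
Where iotop is not available on the target, a comparable per-process breakdown can be approximated from /proc/<pid>/io, provided the kernel was built with task I/O accounting. The sketch below ranks processes by the bytes they caused to be read from or written to storage between two samples. The function names are our own.

```python
def parse_io(text):
    """Return {field: value} from /proc/<pid>/io text."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            fields[key.strip()] = int(value)
    return fields


def top_io_users(before, after):
    """Rank processes by storage bytes transferred between two samples.

    before/after: {pid: parsed /proc/<pid>/io contents}.
    Returns [(pid, bytes)] sorted with the heaviest user first.
    """
    usage = []
    for pid, new in after.items():
        old = before.get(pid)
        if old is None:
            continue  # process started after the first sample
        moved = (new["read_bytes"] - old["read_bytes"]) + (
            new["write_bytes"] - old["write_bytes"]
        )
        usage.append((pid, moved))
    return sorted(usage, key=lambda item: item[1], reverse=True)
```

Reading /proc/<pid>/io for processes owned by other users normally requires root privileges.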

Once the top process has been identified, further investigation is possible through the use of VTune to analyze sections of the application that are contributing to bus/disk utilization.

CPU bottlenecks
As stated in “Viewing things at 10,000 feet” earlier, we can use top or ps to sort applications by CPU usage to identify primary CPU users. Then, using VTune on the selected application, we can drill down to module, function and instruction-level code to determine where the hot spots are.
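
The ranking that top and ps produce is based on the utime and stime counters in /proc/<pid>/stat (fields 14 and 15). A minimal sketch of that first step is shown here; it covers only the identification of the primary CPU users, not the VTune drill-down, and the function names are our own.

```python
def cpu_ticks(stat_text):
    """Return (comm, utime + stime) in clock ticks from /proc/<pid>/stat text."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    comm = stat_text[open_paren + 1:close_paren]
    rest = stat_text[close_paren + 1:].split()
    # rest[0] is field 3 (state), so utime (field 14) is rest[11]
    # and stime (field 15) is rest[12].
    return comm, int(rest[11]) + int(rest[12])


def rank_by_cpu(before, after):
    """Rank processes by CPU ticks consumed between two samples.

    before/after: {pid: (comm, ticks)} as returned by cpu_ticks().
    """
    deltas = [
        (after[pid][0], after[pid][1] - before[pid][1])
        for pid in after
        if pid in before
    ]
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

Dividing a tick count by os.sysconf("SC_CLK_TCK") converts it to seconds of CPU time.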

Careful analysis of the code (and possibly the assembly code) to understand the bottlenecks should follow, so that algorithms or code can be updated accordingly. Once this is done, the procedure is repeated to further refine the code.

Analysis Flow
For the purpose of clarity and to summarize what we have discussed so far, the following is one possible methodology represented as a flow diagram. This is by no means the only possible method. There are infinite variations, but we hope it can be a good indicator of one way to proceed.

Figure 17. Analysis Flow
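
The first decision in the methodology described in this article, namely which of the three areas to investigate, can also be written down as a small function. This is only a sketch of that triage step and not a transcription of the diagram: the function name, the order of the checks and every threshold are hypothetical and would need to be set from the system's own requirements.

```python
def triage(cpu_busy_pct, iowait_pct, mem_growth_kb, swap_used_kb,
           cpu_limit=85.0, iowait_limit=20.0,
           growth_limit_kb=0.0, swap_limit_kb=0.0):
    """Pick the area to investigate first from high-level measurements.

    Returns "io", "memory", "cpu", or "system" when no single area stands out.
    All limits are illustrative placeholders, not recommended values.
    """
    if iowait_pct >= iowait_limit:
        return "io"        # see "IO Bottleneck issues"
    if mem_growth_kb > growth_limit_kb or swap_used_kb > swap_limit_kb:
        return "memory"    # see "Investigating a memory issue"
    if cpu_busy_pct >= cpu_limit:
        return "cpu"       # see "CPU bottlenecks"
    return "system"        # see "Optimizing a complete system"
```

In keeping with the single-problem assumption made earlier, the function reports only the first area that exceeds its limit.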


Throughout this series of two articles, we have discussed many of the available tools for performance analysis on Intel architecture and Linux. The tools discussed are by no means exhaustive and, as shown in Table 2 below, there are alternative tools available.

Table 2. Alternative Tools for Performance Analysis

By combining these tools with some basic performance analysis methodologies, we hope that we have provided the newcomer with sufficient information to feel comfortable starting a performance analysis task. For veteran developers and testers, we hope this series has been informative and helps you understand the approach we've taken here and the tools available for your use.

To read Part 1, go to “Available Tools.”

Mark Gray is a software development engineer with five years experience, currently working at Intel Corp. on Real-Time embedded systems for Telephony. His email address is
Julien Carreo is a software architect and senior software developer at Intel with nine years of experience specializing in embedded Real-time applications on Linux for various markets.
