With the advent of new computing architectures such as multicore processors and Intel's Atom in embedded systems, developers who want to use the Linux operating system face a dilemma: embedded applications demand strong performance, yet Linux, a general-purpose operating system, is only now becoming established as an embedded option.
The two trends combined pose an interesting question: “How do I get the most out of my embedded application running on an Intel platform and a general-purpose operating system?” During embedded application development, there comes a time when some level of performance analysis and profiling is required, either to fix an issue or to improve on current performance.
Whether it is memory usage and leaks, CPU usage or optimal cache usage, analysis and profiling would be almost impossible without the right tool set. This paper seeks to help developers understand the more common tools available and select the most appropriate tools for their specific performance analysis needs.
In this first part of a two-part series, we summarize some of the performance tools available to Linux developers, covering both open-source, platform-independent tools and tools specific to the Intel architecture.
In Part 2 of this series, we will present a set of standard performance profiling and analysis goals and scenarios that demonstrate which tool, or combination of tools, to select for each scenario.
In some scenarios, the depth of analysis is also a determining factor in selecting the tool required. With increasingly deeper levels of investigation, we need to change tools to get the increased level of detail and focus from them.
This is similar to using a microscope with different magnification lenses. We start from the smallest magnification and gradually increase magnification as we focus on a specific area.
This series will not cover the specifics of addressing each issue once it has been clearly identified, since each of these methodologies would warrant a separate article.
Tools Available for Linux Performance Analysis
Listed below are some of the many tools available for performance analysis on systems running Linux. The list is by no means exhaustive, since every developer has his or her own preferences. At the end of this two-part series are links to locations on the Web where further information is available.
top/ps . The top and ps commands are freely available on all Linux distributions and are generally installed by default. The ps command provides an instantaneous snapshot of system activity on a per-thread basis, whereas the top command provides mostly the same information as ps updated at defined intervals, which can be as small as hundredths of a second.
They are frequently overlooked as tools for understanding process performance at a system level. For example, most users tend to use the ps -ef command only to check which processes are currently executing.
However, ps can also print useful information such as resident set size or number of page faults for a process. A thorough examination of the ps man pages reveals these options. Likewise, top can also display all this information in various formats while updating it in real-time. The top command window also displays summary information at the top of the window on a per-CPU basis.
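For example, the following sketch prints per-thread CPU placement, resident set size, and page-fault counts (the column names are from the procps implementation of ps; verify them against your distribution's man page):

```shell
# -L lists every thread; -o selects columns: psr = the CPU the thread
# last ran on, rss = resident set size (KiB), min_flt/maj_flt = minor
# and major page-fault counts. Guarded so it is a no-op where ps is absent.
if command -v ps >/dev/null 2>&1; then
    ps -eLo pid,tid,psr,pcpu,rss,min_flt,maj_flt,comm | head -n 5
fi
```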
In Figure 1 below, we can see top showing information for all threads of a process on a multi-core machine. Using this more detailed view, we can see total activity on each CPU, all threads of the process “app”, and the CPU on which each thread is scheduled at that instant (the P column). We can also see memory usage for the process, including resident set size (RES) and total virtual memory use (VIRT).
|Figure 1. Top View (Idle System)|
In Figure 2 below , we can see similar information using ps. We can see the CPU usage on a per-thread basis with 1/10 % accuracy. This is the cumulative CPU percentage since the spawning of the thread. As can be seen, top and ps provide a good general overview of system performance and the performance of each process running on the system.
|Figure 2. ps View (Idle System)|
free . The free application is freely available on all Linux distributions and is generally installed by default. As shown in Figure 3 below, similar information can be found using top or sar, but free is a convenient command for viewing a snapshot of system memory usage, and it can be used to identify memory leaks (the allocation of memory blocks without ever freeing them) or disk thrashing due to excessive swapping.
|Figure 3. free View (Idle System)|
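As a sketch, repeated samples make a slow leak visible over time (the -s and -c options are from the procps version of free and may be absent on minimal systems, hence the fallback):

```shell
# -m reports in MiB; -s 2 -c 2 takes two samples two seconds apart.
# A "used" figure that only ever grows across samples of a long-running
# system can point to a leak.
if command -v free >/dev/null 2>&1; then
    free -m -s 2 -c 2 2>/dev/null || free -m
fi
```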
oProfile . The oProfile utility (Figure 4, below) is a system-wide profiler and performance-monitoring tool that covers user space as well as kernel space (the kernel itself can be included in the profiling). The profiler introduces minimal overhead and as such can be seen as relatively unobtrusive.
However, it does require that code be compiled with the debug (-g) flag. Although active since 2002 and stable on a majority of platforms, oProfile still describes itself as an alpha-quality open-source tool. The tool is released under the GPL and can, in fact, be found in post-2.6 kernels by default.
The tool works by collecting data via a kernel module from various CPU counters, then exposing that information to user space via a pseudo file system, in the same way that ps collects data via the /proc file system.
|Figure 4. opreport from oProfile|
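A typical session might look like the following sketch, using oProfile's legacy opcontrol interface (it requires root and the oprofile package, so the whole block is skipped when either is missing):

```shell
if command -v opcontrol >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    opcontrol --init          # load the oprofile kernel module
    opcontrol --no-vmlinux    # profile without kernel symbol resolution
    opcontrol --start         # begin collecting samples system-wide
    sleep 5                   # ...run your workload here instead...
    opcontrol --shutdown      # stop profiling and flush samples to disk
    opreport --symbols        # per-symbol report, as shown in Figure 4
fi
```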
gprof . The GNU profiler, gprof, is an application-level profiler (Figure 5, below). The tool is open source, licensed under the GPL, and is available as standard on most Linux distributions.
Compiling the code using gcc with the -pg flag instruments the code, producing an executable that measures the wall-clock execution time of functions to a hundredth of a second and exports this information to a file. This file can then be parsed by the gprof application, giving a flat-profile representation of the performance data and a call graph.
|Figure 5. gProf view|
The profiler collects data at sampling intervals in the same way as many of the tools described in this article. Therefore, the timing figures may show some statistical inaccuracy when a function's run time is close to the sampling interval. Running your application for longer periods reduces these inaccuracies.
As can be seen from the output in Figure 5, gprof can help locate hot spots at function granularity. However, it also allows you to compile this information at a finer granularity using the -l flag. As an unexpected side benefit, gprof can suggest function and file orderings within your binary to improve performance.
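The workflow can be sketched end to end with a toy program (the file name app.c and the hot function spin are invented for illustration, and the block is skipped where gcc or gprof is absent):

```shell
# Build with -pg, run to produce gmon.out, then read the profile.
if command -v gcc >/dev/null 2>&1 && command -v gprof >/dev/null 2>&1; then
    cat > app.c <<'EOF'
#include <stdio.h>

/* deliberately busy function so it dominates the flat profile */
long spin(long n)
{
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}

int main(void)
{
    printf("%ld\n", spin(50000000L));
    return 0;
}
EOF
    gcc -pg -O0 -o app app.c               # -pg inserts the profiling hooks
    ./app                                  # running it writes gmon.out
    gprof -b ./app gmon.out | head -n 10   # flat profile, as in Figure 5
    gprof -b -l ./app gmon.out >/dev/null  # line-level granularity via -l
fi
```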
valgrind . valgrind is an instrumentation framework that can be used primarily for detecting memory-related errors and threading problems, but is also extendable. It is an open source tool licensed under GPL2. The tool can detect errors such as memory leaks and incorrect freeing of memory.
The valgrind tool detects these errors automatically and dynamically as the code is executing. In some cases it can produce false positives. However, the developers of valgrind claim that it produces correct results 99% of the time and any errors can be suppressed.
Although it is a very useful tool, it can be extremely intrusive, as the code runs much slower than its true execution speed (by a factor of 50 in some cases) and needs to be compiled with the gcc -g flag. It is also recommended that the code be compiled with optimization disabled, using the gcc -O0 flag. An example of the execution of a small binary through valgrind can be seen below.
|Figure 6. valgrind Example|
For real-time applications that wait on I/O, valgrind can be so obtrusive as to make the checking unreliable. However, valgrind can be a highly useful tool when used in conjunction with a unit-test and/or nightly-build strategy. A clean run of valgrind in a nightly build allows the developer to keep track of any newly introduced latent memory errors.
Like many of the tools presented here, valgrind is not limited to the purpose that most developers have in mind. For example, valgrind can also check for cache misses and branch mispredictions. The reader is strongly encouraged to read the relevant documentation and play around with this and all tools to fully appreciate their power.
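Typical invocations look like the following (./app stands in for your own debug build; the block is a no-op when valgrind or the binary is missing):

```shell
if command -v valgrind >/dev/null 2>&1 && [ -x ./app ]; then
    valgrind --leak-check=full ./app   # memory errors and a leak summary
    # cache simulation; add --branch-sim=yes to model branch mispredictions
    valgrind --tool=cachegrind ./app
fi
```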
sar . The system activity reporter (sar) is a lightweight open-source tool, licensed under the GPL, that is used for collecting system-wide performance measures (Figure 7a, below). The tool is generally installed by default on Linux; however, it may sometimes need to be installed via the sysstat package.
|Figure 7a. sar System-wide CPU Usage View|
Like top and ps, sar collects data from operating system counters via the /proc file system. It provides performance data at system-level granularity, reporting on a wide variety of metrics such as CPU usage, disk I/O, memory, network I/O, and IRQs. The tool can update these values at intervals as small as 1 second.
sar can only provide information at system-level granularity and is used only to provide snapshots and overviews of overall system performance. Spurious or unexpected measurements from sar can be a first indication of performance issues of the system as a whole or of a single process or group of processes (Figure 7b below ). It can be configured to run in the background, constantly providing a readily accessible database of system performance at any second during the day.
|Figure 7b. sar System-wide Memory usage view|
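For instance, the following guarded commands take a few one-second samples of CPU and memory activity (sar comes from the sysstat package, so the block does nothing where it is not installed):

```shell
if command -v sar >/dev/null 2>&1; then
    sar -u 1 3    # CPU utilisation: 3 samples at 1-second intervals
    sar -r 1 3    # memory usage, comparable to Figure 7b
fi
```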
LTT . The Linux Trace Toolkit (LTT) consists of a kernel patch and tool chain that give the user the ability to trace events on the system. These events can be kernel events (such as context switches or system calls) or any application-level event. It is GPL-licensed and has minimal impact on the run-time performance of traced applications.
It can be used to isolate performance problems on parallel and real-time systems and analyze application timing. Any code that the user would like to be analyzed needs to be recompiled to be instrumented by LTT.
Alternatively, LTTng (Next Generation) is also available, which adds features such as a GUI Trace Viewer (Figure 8 below ).
|Figure 8. Sample LTTng Viewer Screenshot|
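As a sketch, a minimal kernel-tracing session with the modern lttng-tools command line (which post-dates the viewer shown in Figure 8, and needs root) looks like this:

```shell
if command -v lttng >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    lttng create demo-session                  # new tracing session
    lttng enable-event --kernel sched_switch   # trace context switches
    lttng start
    sleep 2                                    # workload runs here
    lttng stop
    lttng view | head -n 5                     # first few recorded events
    lttng destroy demo-session
fi
```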
iostat . The iostat command is used for monitoring system input/output block-device loading. With multiple block devices in the system, it can be useful for determining which device is currently the bottleneck. iostat provides a per-device view of the number of transfers per second on each device, as well as read and write rates. Shown in Figure 9 below is an example of iostat's extended, device-only output during a large file copy. Note the temporary increase in device activity while the file was being copied.
|Figure 9. Sample iostat View (File Copy Example)|
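The figure's output corresponds roughly to the following guarded invocation (-x requests extended statistics, -d restricts the report to devices; iostat is also part of the sysstat package):

```shell
if command -v iostat >/dev/null 2>&1; then
    iostat -x -d 1 3    # extended per-device stats: 3 one-second samples
fi
```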
iotop . iotop is a Python* program with a top-like user interface (Figure 10, below ) that can be used to associate processes with I/O. It requires Python version 2.5 or greater and a Linux kernel version 2.6.20 or later with the TASK_DELAY_ACCT and TASK_IO_ACCOUNTING options enabled.
|Figure 10. Sample iotop View|
Therefore, a potential recompilation of the kernel may be required if these options have not been enabled by default. iotop is licensed under GPL. iotop provides data regarding the amount of Disk IO occurring within the system on a per process basis. This allows the user to determine which applications are using the disk(s) the most.
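A quick way to confirm those prerequisites and take a single snapshot is sketched below (the /boot/config path is distribution-dependent, and iotop itself must run as root):

```shell
# Check the two kernel options iotop depends on, if the config is readable.
CFG="/boot/config-$(uname -r)"
if [ -r "$CFG" ]; then
    grep -E 'CONFIG_TASK_DELAY_ACCT|CONFIG_TASK_IO_ACCOUNTING' "$CFG" \
        || echo "required options not found in $CFG"
fi
# One batch-mode snapshot (-b -n 1) of only the processes doing I/O (-o).
if command -v iotop >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    iotop -b -n 1 -o
fi
```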
|Table 1: Linux Performance Tools Summary|
Intel Thread Checker . The Intel Thread Checker is a plug-in for the VTune debugging environment. It can be used to locate “hard to find” threading errors such as race conditions and deadlocks.
Intel VTune . VTune from Intel is a proprietary system-level profiler and performance analysis tool for Intel architecture. It introduces minimal overhead and therefore can be perceived as relatively unobtrusive.
VTune works by collecting data via a kernel module from various CPU counters. This information is collected when an interrupt is generated. The granularity of the data can run from a process level down to an instruction level and is accessible through a highly-usable and configurable GUI.
When fully configured for your application and operating system it can identify performance issues at several levels of granularity from system-level to microarchitecture-level. As a tool for developers, it is extremely valuable since it has a global view at all granularities. OS performance counters can also be monitored and correlated to instruction-level hotspots.
By using this correlation, we can answer questions such as “When the memory use in our system begins to ramp, what happens to our application's CPU usage?” If the source code of your test application is hooked into VTune, you can also drill down from the application level into threads and down to individual functions.
It is impossible to outline here all the features of VTune, or indeed of many of the tools described in this article; the interested reader is directed to the references at the end of this series.
To read Part 2, go to Performance profiling/analysis methods & techniques.
Mark Gray is a software development engineer with five years experience, currently working at Intel Corp. on Real-Time embedded systems for Telephony. His email address is firstname.lastname@example.org
Julien Carreo is a software architect and senior software developer at Intel with nine years of experience specializing in embedded Real-time applications on Linux for various markets.