Software performance engineering for embedded systems Part 3 – Collecting data and using it effectively
Editor's note: In the final part of a series of three articles excerpted from his soon-to-be-published “Software Engineering for Embedded Systems”, Freescale’s Robert Oshana describes the various techniques for collecting performance data, assessing it, and then generating benchmarks that will allow you to monitor your system for effective operation.
Collecting performance data can be difficult if not planned correctly. There are many factors that may affect the measurement of data including:
Representative time periods
Averages for typical behavior
One effective means to produce accurate, well-timed data for analysis is to design probes into the software being developed. It is much easier to define measurement requirements and probe points while the software designers are defining the system architecture. Collecting and analyzing data tends to incur less processing overhead if the probes are integrated as the design evolves.
Principles of analysis
Connie Smith and Lloyd Williams describe a set of analysis principles that are helpful for developing and managing a performance program. The principles that are most relevant for embedded systems development are described below.
Performance control: Using this principle, specific, quantitative, measurable performance objectives for the key system performance scenarios are defined. Avoid vague or qualitative performance objectives, as these are difficult to measure and it's hard to know when you have met the goal.
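To make the idea of a quantitative objective concrete, here is a minimal sketch that checks measured latencies against an objective of the form "pct percent of samples complete within limit". The latency values, the 90th percentile, and the 1000 microsecond limit are invented for illustration; `percentile` and `meets_objective` are hypothetical helper names.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def meets_objective(latencies_us, pct, limit_us):
    """True if pct% of the measured latencies are within limit_us."""
    return percentile(latencies_us, pct) <= limit_us

# Invented measurements in microseconds, with two outliers.
latencies_us = [410, 395, 402, 870, 399, 405, 1210, 398, 401, 404]

print(meets_objective(latencies_us, 90, 1000))   # 90% within 1000 us?
print(meets_objective(latencies_us, 100, 1000))  # every sample within 1000 us?
```

An objective stated this way can be checked automatically on every build, which a goal such as "the system should feel responsive" cannot.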
Instrumenting: This principle states that engineers should strive to instrument systems as they build them to enable measurement and analysis of workload scenarios, resource requirements, and performance objective compliance.
Centering: This principle focuses on identifying the frequently used functions (called “dominant workload functions”) and minimizing their processing. The focus should be on those parts of the system with the greatest impact. Generally the 80/20 rule applies to this principle. Figure 20 shows a high level software architecture diagram for an eNodeB application. Software layers 1, 2, and 3 are highlighted in areas where dominant workload functions reside. These areas are where we focus most of our attention.
Figure 20. A high level model of an eNodeB Layer 1/2/3 software architecture identifying the high MIPS functions that need special performance attention
Locality: The locality principle can be used to achieve significant performance improvements by creating actions, functions, and results that are close to physical resources. Some common forms of locality include cache optimizations, where “closeness” can relate to spatial and temporal locality. Other forms include effectual (purpose or intent of a computation) as well as degree. The following chapter on performance optimization will cover this principle in more detail.
Shared resources: Embedded systems are all about the allocation of scarce resources, which include the CPU, memory, and peripherals. Using this principle, resources are shared when possible. When exclusive access to a resource is required, the goal should be to minimize the sum of the holding time and the scheduling time. Resources in embedded systems are limited and software processes compete for their use. Sharing is possible and expected, but the sharing must be managed. Semaphores are a common technique used to manage these scarce resources, but these must be used cautiously.
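One way to reduce holding time is to do as much work as possible before taking the lock. The sketch below is a minimal illustration using Python's `threading.Lock` as a stand-in for an RTOS mutex or binary semaphore; the `ReadingTotals` class and the channel names are hypothetical.

```python
import threading

class ReadingTotals:
    """Hypothetical shared table of per-channel totals, protected by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._totals = {}

    def add_batch(self, readings):
        # Expensive part: aggregate the batch privately, without the lock.
        partial = {}
        for channel, value in readings:
            partial[channel] = partial.get(channel, 0) + value
        # Cheap part: hold the lock only for the short merge.
        with self._lock:
            for channel, value in partial.items():
                self._totals[channel] = self._totals.get(channel, 0) + value

    def snapshot(self):
        with self._lock:
            return dict(self._totals)

# Usage: four workers contend for the same table.
totals = ReadingTotals()
batch = [("adc0", 1), ("adc1", 2)] * 1000
workers = [threading.Thread(target=totals.add_batch, args=(batch,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
result = totals.snapshot()
print(result)
```

Each worker touches the shared table for two updates rather than two thousand, so the time other tasks spend waiting on the lock shrinks accordingly.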
Parallel processing: With the parallel processing principle, the goal is to execute processing in parallel only when the processing speedup offsets communication overhead and resource contention delays. This principle is all about understanding and applying Amdahl’s law.
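Amdahl's law gives the speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. The sketch below computes it, and also adds a simple overhead term to show how communication and contention costs can cancel the gain. The overhead term is an assumption made for illustration and is not part of Amdahl's law; all numbers are invented.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup when `parallel_fraction` of the work is spread
    evenly across `n_cores` and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

def speedup_with_overhead(parallel_fraction, n_cores, overhead_fraction):
    """Assumed extension (not Amdahl's law itself): add communication and
    contention cost, expressed as a fraction of the original run time."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores + overhead_fraction)

# 90% parallel work on 4 cores: a little over 3x, not 4x.
print(round(amdahl_speedup(0.9, 4), 2))

# 50% parallel work, 4 cores, overhead equal to 40% of the original run time:
# the result is below 1.0, so the parallel version is slower than the serial one.
print(round(speedup_with_overhead(0.5, 4, 0.4), 2))
```

The second case is the situation the principle warns about: parallelizing is only worthwhile when the speedup outweighs the added overhead.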
Spread the load: Whenever possible, spread the load by processing conflicting loads at different times or in different places. The goal is to address resource contention delay and reduce key system delays by reducing the number of processes that need the resource at any given time. This principle can be used to help partition a software application across a number of processing elements in an SoC. For example, Figure 21 shows an SoC containing several processing elements: programmable DSP cores, a baseband hardware accelerator, and a microcoded network accelerator.
The application can be partitioned according to the “spread the load” principle by allocating the lower MIPS functions that require custom software functionality to the DSP cores, allocating the low complexity software functions with high MIPS requirements to the baseband hardware accelerator, and allocating the Ethernet packet processing to the microcoded network accelerator.
Figure 21. An embedded SoC containing DSP cores, a baseband accelerator, and a network accelerator used to partition a real-time application
Guidelines for using principles of analysis:
*Apply the principles to software components that are critical to performance
*Use performance models or benchmarking to quantify the effect of improvements on the overall performance to ensure that improvements are consistent
*Apply the principles until you comply with well defined performance objectives
*Confirm that performance objectives are realistic and that it is cost effective to achieve them
*Create a customized list of examples of each of the principles that is specific to your application domain. Publicize this list so others in your domain may benefit
*Document and explain performance improvements using the principles so others on the development team can also gain knowledge in these areas
Performance patterns and anti-patterns
A software pattern is a common solution to a problem that occurs in many different contexts - that is, a general reusable solution to a commonly occurring problem within a given context in software design. A software pattern is not a completely finished design. Think of it as a template for how to solve a problem that can be used in many different situations. Patterns are considered formalized best practices that programmers must implement themselves in the application. Software patterns draw on best practices in industry.
Performance patterns are at a higher level of abstraction than design patterns. Here are a few performance patterns proposed by Smith and Williams:
Fast path: This performance pattern is used to reduce the amount of processing for dominant workloads. The classic example of this is the default withdrawal at an ATM. Rather than forcing the user through a number of selections for this common function, just give them an opportunity to go right to the common request.
First things first: This performance pattern focuses on prioritizing processing tasks to ensure that important tasks are completed and less important tasks are omitted if necessary. Embedded scheduling techniques include both static and dynamic priorities. The proper scheduling technique depends on the application. Embedded systems generally demonstrate bursty behavior that could lead to temporary overload. These overload conditions must be managed properly.
One common approach is to use Rate Monotonic Analysis (RMA) and Rate Monotonic Scheduling (RMS) when multiple tasks must be completed. The goal is to gracefully degrade and gracefully improve under overload conditions. Rate-monotonic scheduling is a static-priority scheduling algorithm used in real-time systems (usually supported in an RTOS). The static priorities are assigned on the basis of the cycle duration of the job: the shorter the cycle duration, the higher the job's priority. The RTOS that incorporates this algorithm is generally preemptive and provides deterministic guarantees on response times. Rate monotonic analysis is used in conjunction with those systems to provide scheduling guarantees for embedded applications.
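A common first step in rate monotonic analysis is the Liu and Layland utilization bound: a set of n independent periodic tasks is guaranteed schedulable under RMS if the total utilization does not exceed n(2^(1/n) - 1). The sketch below applies that test and derives the priority order from the periods. The task sets are invented, each task is a hypothetical (execution time, period) pair in the same time unit, and deadlines are assumed equal to periods.

```python
def rms_bound(n):
    """Liu and Layland utilization bound for n periodic tasks."""
    return n * (2 ** (1 / n) - 1)

def utilization(tasks):
    """Total CPU utilization of (execution_time, period) pairs."""
    return sum(c / t for c, t in tasks)

def passes_rms_bound(tasks):
    """Sufficient test only: True guarantees schedulability under RMS.
    False is inconclusive unless utilization exceeds 1.0."""
    return utilization(tasks) <= rms_bound(len(tasks))

def rms_priority_order(tasks):
    """Highest priority first: the shortest period gets the highest priority."""
    return sorted(tasks, key=lambda task: task[1])

light = [(3, 20), (1, 4), (2, 8)]      # utilization 0.65
heavy = [(2, 4), (3, 8), (2, 10)]      # utilization 1.075, over 100%

print(round(rms_bound(3), 3))
print(passes_rms_bound(light), passes_rms_bound(heavy))
print(rms_priority_order(light))
```

A task set that fails the bound but stays at or below 100% utilization may still be schedulable; confirming that requires an exact response-time analysis, which this sketch does not attempt.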
Slender cyclic function: This pattern is used for processing that must occur at regular intervals. This type of processing is common in embedded real-time systems (sensor readings, for example) and can be applied to cyclic or periodic functions. The main problem arises when there are concurrent sources of an event or when other processing needs to happen. The key step is to identify the functions that execute repeatedly at regular intervals and minimize their processing requirements. The goal is to reduce queuing delays in the processing chain. For example, Figure 22 shows two different tradeoffs for addressing the slender cyclic function in an embedded system. The single sample approach can reduce latency for incoming data samples but has the disadvantage of being interrupt-intensive, which increases processing overhead. For certain applications, like motor control and noise cancellation, this approach might be the best. For other applications, like cell phones and cellular infrastructure, a buffered approach may be better. This approach leads to increased latency due to the buffering but is computationally more efficient.
Figure 22. Two different approaches to processing input samples with tradeoffs of latency and throughput
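The tradeoff between the two approaches can be approximated with a simple cost model. The sketch below assumes a fixed interrupt overhead paid once per block and a fixed amount of work per sample; the model, the cycle counts, and the 48 kHz sample rate are all assumptions chosen for illustration, not measurements.

```python
def cycles_per_sample(work_cycles, interrupt_overhead_cycles, block_size):
    """Average cost per sample when one interrupt services a whole block."""
    return work_cycles + interrupt_overhead_cycles / block_size

def buffering_latency_us(block_size, sample_rate_hz):
    """Time the first sample of a block waits for the block to fill."""
    return (block_size - 1) * 1_000_000 / sample_rate_hz

WORK = 50          # assumed cycles of processing per sample
OVERHEAD = 200     # assumed cycles to enter and exit the interrupt handler
RATE_HZ = 48_000   # assumed sample rate

for block_size in (1, 64):
    cost = cycles_per_sample(WORK, OVERHEAD, block_size)
    wait = buffering_latency_us(block_size, RATE_HZ)
    print(f"block={block_size:3d} cycles/sample={cost:8.3f} added latency={wait:8.1f} us")
```

Under these assumptions, single-sample processing costs 250 cycles per sample with no buffering delay, while a 64-sample block costs about 53 cycles per sample but adds roughly 1.3 ms of latency.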
Anti-patterns are defined as patterns that may be commonly used but are ineffective and/or counterproductive in practice. Anti-patterns are common mistakes during software development. They are different from a simple oversight or mistake as they represent some repeated pattern of action or a process that initially appears to be beneficial, but ultimately produces more bad consequences than beneficial results, and an alternative solution exists that is clearly documented, proven in actual practice and repeatable (but not being used).
One well known anti-pattern is referred to as the “God” class. Often a God class (GC) is created by accident as functionalities are incrementally added to a central software component over the course of its evolution. This component ends up being the dumping ground for many miscellaneous things. One symptom is the use of far too many global variables to store state information. The God class is found in designs where one class monopolizes the processing, and other classes primarily encapsulate data. The consequences of this anti-pattern include a component with a large number of methods, attributes, or both; a single controller class that performs all or most of the work; maintainability issues; difficult reuse; and performance (memory) issues.
Traffic Jam: This type of anti-pattern can occur when there is more traffic than resources, or when traffic is close to the resource limit (e.g. the highway between Austin and Dallas). Transient behavior produces wide variability in response times, and the system takes a long time to return to normal operation. The solution could be to spread the load or defer some of the load (alternative routes or flex time might help). The developer must know the limits of scalability of the system before building it and plan for handling overload situations smoothly.
Another example of an anti-pattern can be found in the performance engineering area. The “One lane bridge” is an anti-pattern that requires all processing/data to go through one path (or one lane bridge), which decreases performance. This is solved by providing additional paths. Figure 23 is an example of this. For 'cars' that have a special access pass, or in this case a toll tag, an extra path is provided so that these cars do not have to wait to pay in the 'cash' lines. In embedded software we can provide the same solution. Figure 24 is another example: software that provides a separate path around the Linux kernel so that packets of data that do not need to go through the kernel can be routed around it and directly to user space. This can increase performance by up to 7x, depending on the type of bypass technology being used.
Figure 23. A real life example of a “fast path” for improved performance and latency
Figure 24. Data flow showing how a fast path architecture (left) “short circuits” designated data flows and increases performance