Software performance engineering for embedded systems Part 3 – Collecting data and using it effectively - Embedded.com


Editor's note: In the final part of a series of three articles excerpted from his soon-to-be-published “Software Engineering for Embedded Systems”, Freescale’s Robert Oshana describes techniques for collecting performance data, assessing it, and generating benchmarks that will allow you to monitor your system for effective operation.

Collecting performance data can be difficult if not planned correctly. Many factors may affect the measurement of data, including:
*System perturbation
*Capture ratios
*System overhead
*Measurement timing
*Reproducible results
*Representative time periods
*Averages for typical behavior
*Workload generation

One effective means to produce accurate, well-timed data for analysis is to design probes into the software being developed. It is much easier to define measurement requirements and probe points while the software designers are defining the system architecture. Collecting and analyzing data tends to incur less processing overhead if the probes are integrated as the design evolves.

Principles of analysis
Connie Smith and Lloyd Williams describe a set of analysis principles that are helpful for developing and managing a performance program [4]. The principles that are most relevant for embedded systems development are described below.

Performance control:  Using this principle, specific, quantitative, measurable performance objectives for the key system performance scenarios are defined. Avoid vague or qualitative performance objectives, as these are difficult to measure and it's hard to know when you have met the goal.

Instrumenting:
This principle states that engineers should strive to instrument systems as they build them to enable measurement and analysis of workload scenarios, resource requirements, and performance objective compliance.

Centering:
This principle focuses on identifying the frequently used functions, called “dominant workload functions”, and minimizing their processing. The focus should be on those parts of the system with the greatest impact. Generally the 80/20 rule applies to this principle. Figure 20 shows a high-level software architecture diagram for an eNodeB application. Software layers 1, 2, and 3 are highlighted in the areas where dominant workload functions reside. These areas are where we focus most of our attention.



Figure 20. A high level model of an eNodeB Layer 1/2/3 software architecture identifying the high MIPS functions that need special performance attention

Locality:  The locality principle can be used to achieve significant performance improvements by creating actions, functions, and results that are close to physical resources. Some common forms of locality include cache optimizations, where “closeness” can relate to spatial and temporal locality. Other forms include effectual (purpose or intent of a computation) as well as degree. The following chapter on performance optimization will cover this principle in more detail.

Shared resources:
 Embedded systems are all about the allocation of scarce resources, which include the CPU, memory, and peripherals. Using this principle, resources are shared when possible. When exclusive access to a resource is required, the goal should be to minimize the sum of the holding time and the scheduling time. Resources in embedded systems are limited and software processes compete for their use. Sharing is possible and expected, but the sharing must be managed. Semaphores are a common technique used to manage these scarce resources, but these must be used cautiously.

Parallel processing:
With the parallel processing principle, the goal is to execute processing in parallel only when the processing speedup offsets communication overhead and resource contention delays. This principle is all about understanding and applying Amdahl’s law.

Spread the load:  Whenever possible, spread the load by processing conflicting loads at different times or in different places. The goal is to address resource contention delay and reduce key system delays by reducing the number of processes that need the resource at any given time. This principle can be used to help partition a software application across a number of processing elements in an SoC. For example, Figure 21 shows an SoC containing several processing elements: programmable DSP cores, a baseband hardware accelerator, and a microcoded network accelerator.

The application can be partitioned according to the “spread the load” principle by allocating the lower-MIPS functions requiring custom software functionality onto the DSP cores, allocating the low-complexity software functions with high MIPS requirements onto the baseband hardware accelerator, and spreading the Ethernet packet processing onto the microcoded network accelerator.



Figure 21. An embedded SoC containing DSP cores, a baseband accelerator, and a network accelerator used to partition a real-time application

Guidelines for using principles of analysis:
*Apply the principles to software components that are critical to performance
*Use performance models or benchmarking to quantify the effect of improvements on the overall performance to ensure that improvements are consistent
*Apply the principles until you comply with well-defined performance objectives
*Confirm that performance objectives are realistic and that it is cost effective to achieve them
*Create a customized list of examples of each of the principles that is specific to your application domain. Publicize this list so others in your domain may benefit
*Document and explain performance improvements using the principles so others on the development team can also gain knowledge in these areas

Performance patterns and anti-patterns
A software pattern is a common solution to a problem that occurs in many different contexts – that is, a general reusable solution to a commonly occurring problem within a given context in software design. A software pattern is not a completely finished design. Think of it as a template for how to solve a problem that can be used in many different situations. Patterns are considered formalized best practices that the programmer must implement in the application; they draw on best practices in industry.
Performance patterns are at a higher level of abstraction than design patterns. Here are a few performance patterns proposed by Smith and Williams [4]:

Fast path:
This performance pattern is used to reduce the amount of processing for dominant workloads. The classic example is the default withdrawal at an ATM. Rather than forcing the user through a number of selections for this common function, give them the opportunity to go straight to the common request.

First things first: This performance pattern focuses on prioritizing processing tasks to ensure that important tasks are completed and least important tasks are omitted if necessary. Embedded scheduling techniques include both static and dynamic priorities. The proper scheduling techniques depend on the application. Embedded systems generally demonstrate bursty behavior that could lead to temporary overload. These overload conditions must be managed properly.

One common approach is to use Rate Monotonic Analysis (RMA) and Rate Monotonic Scheduling (RMS) when multiple tasks must be completed. The goal is to degrade gracefully, and recover gracefully, under overload conditions. Rate-monotonic scheduling is a static-priority scheduling algorithm used in real-time systems (usually supported by an RTOS). Priorities are assigned on the basis of the cycle duration of the job: the shorter the cycle duration, the higher the job's priority. An RTOS that incorporates this algorithm is generally preemptive and provides deterministic guarantees on response times. Rate monotonic analysis is used in conjunction with these systems to provide scheduling guarantees for embedded applications.

Slender cyclic function:  This pattern is used for processing that must occur at regular intervals. This type of processing is common in embedded real-time systems (e.g. sensor readings) and can be applied to cyclic or periodic functions. The main problem arises when there are concurrent sources of an event or when other processing needs to happen. The key step is to identify the functions that execute repeatedly at regular intervals and minimize their processing requirements. The goal is to reduce queuing delays in the processing chain. For example, Figure 22 shows two different tradeoffs that address the slender cyclic function in an embedded system. The single-sample approach reduces latency for incoming data samples but has the disadvantage of being interrupt-intensive, which can increase the processing overhead of handling the interrupts. For certain applications, like motor control and noise cancellation, this approach might be best. For other applications, like cell phones and cellular infrastructure, a buffered approach may be better. The buffered approach increases latency because of the buffering but is computationally more efficient.



Figure 22. Two different approaches to processing input samples with tradeoffs of latency and throughput

Anti-patterns
Anti-patterns are defined as patterns that may be commonly used but are ineffective and/or counterproductive in practice. Anti-patterns are common mistakes during software development. They are different from a simple oversight or mistake as they represent some repeated pattern of action or a process that initially appears to be beneficial, but ultimately produces more bad consequences than beneficial results, and an alternative solution exists that is clearly documented, proven in actual practice and repeatable (but not being used).

One well-known anti-pattern is referred to as the “God” class. Often a God class (GC) is created by accident as functionalities are incrementally added to a central software component over the course of its evolution. This component ends up being the dumping ground for many miscellaneous things. One symptom is using far too many global variables to store state information. The God class is found in designs where one class monopolizes the processing and other classes primarily encapsulate data. The consequences of this anti-pattern include a component with a large number of methods, attributes, or both; a single controller class that performs all or most of the work; maintainability issues; difficult reuse; and performance (memory) issues.

Traffic Jam:  This type of anti-pattern can occur when there is more traffic than resources can handle, or when traffic is close to that limit (e.g. the highway between Austin and Dallas). Transient behavior produces wide variability in response times, and the system takes a long time to return to normal operation. The solution could be to spread the load or defer some of it (alternative routes or flex time might help). The developer must know the limits of scalability of the system before building it and plan for handling overload situations smoothly.

Another example of an anti-pattern in the performance engineering area is the “one-lane bridge”, which forces all processing/data through a single path (the one-lane bridge), decreasing performance. This is solved by providing additional paths. Figure 23 is an example: cars that have a special access pass, in this case a toll tag, get an extra path so that they do not have to wait in the 'cash' lines. In embedded software we can provide the same kind of solution. Figure 24 is another example, showing software that provides a separate path around the Linux kernel, so that packets of data that do not need to go through the kernel can be routed directly to user space. This can increase performance by up to 7X, depending on the type of bypass technology being used.



Figure 23. A real life example of a “fast path” for improved performance and latency



Figure 24. Data flow showing how a fast path architecture (left) “short circuits” designated data flows and increases performance



Software performance optimization

Bart Smaalders [5] does a good job of summarizing some common mistakes in software performance optimization:

Fixing Performance at the End of the Project:
Failure to formulate performance goals or benchmarks, and waiting until late in the project to measure and address performance issues, will almost guarantee project delays or failure.

Measuring and Comparing the Wrong Things:
A common mistake is to benchmark the wrong thing and be surprised later. Don’t ignore competitive realities.

Good benchmarks
Smaalders also defines what a good benchmark is:
*Repeatable, so comparison experiments can be conducted relatively easily and with a reasonable degree of precision.
*Observable. If poor performance is observed, the developer has some breadcrumbs to start looking for the cause. A complex benchmark should not deliver a single number, which gives the developer no additional information as to where performance problems might be. The Embedded Microprocessor Benchmark Consortium (EEMBC) does a good job of providing not just benchmark results but also the compiler options used, the version of the software, etc. This additional data is useful when comparing benchmarks.
*Portable. Comparisons must be performed against your competitors and even your own previous releases. Maintaining a history of the performance of previous releases is a valuable tool for understanding your own development process.
*Easily understood. All relevant stakeholders should be able to understand the comparisons in a brief presentation.
*Realistic. Measurements need to reflect customer-experienced realities and use cases.
*Runnable. All developers must be able to quickly ascertain the effects of their changes. If it takes days to get performance results, it won’t happen very often.

Mistakes to avoid

Avoid selecting benchmarks that don’t really represent your customer, because your team will end up optimizing for the wrong behavior. Resist the temptation to optimize for the benchmark itself, as this is only a short-term “feel good” and will not translate to reality.

Algorithmic Antipathy:
Algorithm selection requires realistic benchmarks and workloads to help make intelligent decisions based on real data rather than intuition or other guesswork. The best time to do performance analysis work is in the earlier phases of the project; this is usually the opposite of what actually happens. Clever compilation options and C-level optimizations are ineffective when dealing with O(n^2) algorithms, especially for large values of n. Poor algorithm selection is a primary cause of poor software system performance. Figure 25 shows a comparison of a DFT algorithm with complexity O(n^2) and an FFT algorithm with complexity O(n log n). As the figure shows, the performance advantage of the FFT grows as the number of data points in the transform increases.



Figure 25. A DFT algorithm versus FFT showing algorithm complexity has a big impact on performance

Reusing Software: Software reuse is a noble goal, but the development staff must be cognizant of violating the assumptions made during development of the software being reused. If the software is not designed or optimized for the new use cases in which it will be used, then there will be surprises later in the development process.

Iterating Because That’s What Computers Do Well:  If your embedded application is doing unneeded or unappreciated work, for example computing statistics too frequently, then eliminating such waste is a lucrative area for performance work. Keep in mind that what matters most is the end state of the program, not the exact series of steps used to get there. Often a shortcut is available that will allow us to reach the goal more quickly. Smaalders describes this as shortening the race course rather than speeding up the car: with a few exceptions, such as correctly used memory prefetch instructions, the only way to go faster in software is to do less.

Premature and Excessive Optimization:  Software that is carefully tuned and optimized is fine, but if hand-unrolled loops, register declarations, inline functions, assembly-language inner loops, and other optimizations contribute only a small overall improvement to system performance (usually because they are not in a critical path of the software use case), then they are a waste of time and not worth the effort. It's important to understand where the hot spots are before focusing the tuning effort. Sometimes premature optimization can actually hurt performance on benchmarks by increasing the instruction cache footprint enough to cause misses and pipeline stalls, or by confusing the register allocator in the compiler.

As Smaalders describes, low-level cycle shaving has its place, but only at the end of the performance effort, not during initial code development. Donald Knuth is quoted as saying “Premature optimization is the root of all evil.” Excessive optimization is just as bad: there is a diminishing return associated with optimization, and the developer needs to understand when it's time to stop (e.g. when is the goal met and when can we ship?). Figure 26 shows an example of this. The algorithm benchmark starts at 521 cycles with “out of box” C code and gets progressively better as the algorithm is further optimized using intrinsics, hand assembly for inner loops, and full assembly. Developers need to understand where the curve starts to flatten and further gains become harder to obtain.


Figure 26. There are diminishing returns to performance optimization

Focusing on What You Can See Rather Than on the Problem: Each line of code at the top level of the application causes, in general, large amounts of work elsewhere farther down in the software stack. As a result, inefficiencies at the top layer have a large multiplier magnifying their impact, making the top of the stack a good place to look for possible speed-ups.

Software Layering: Software developers use layering to provide various levels of abstraction in their software. This can be useful at times, but there are consequences. Improper abstraction can increase the stack data cache footprint, TLB (translation look-aside buffer) misses, and function call overhead. Too much data hiding may also lead to an excessive number of arguments for function calls as well as the potential creation of new structures to hold additional arguments. This problem is exacerbated when it is not fixed and the software is deployed to the field: once there are multiple users of a new layer of software, modifications become more difficult and the performance trade-offs tend to accumulate over time.

Excessive Numbers of Threads:  Software threads are familiar to embedded developers. A common mistake is using a different thread for each unit of pending work. Although this can be a simple programming model to implement, it can lead to performance problems if taken to an extreme. The goal is to limit the number of threads to a reasonable number (e.g. the number of CPUs) and to use some of the programming guidelines mentioned in the chapter on Multicore Software for Embedded Systems.

Asymmetric Hardware Utilization: Embedded CPUs are much faster than the memory systems connected to them. Embedded processor designs these days use multiple levels of caches to hide the latency of memory accesses, and multilevel TLBs are becoming common in embedded systems as well. These caches and TLBs use varying degrees of associativity to spread the application load across the caches, but this technique is often accidentally thwarted by other performance optimizations. Iteration and analysis are important to understand these potential side effects on system performance.

Not Optimizing for the Common Case: We spoke about this earlier. It's important to identify the performance use cases and focus the optimization efforts on these important performance drivers.

Read Part 1, “What is SPE?”
Read Part 2, “The importance of performance measurements”

Rob Oshana, author of the soon-to-be-published “Software engineering for embedded systems,” by Elsevier, is director of software R&D, Networking Systems Group, Freescale Semiconductor.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012. For more information about “Software engineering for embedded systems,” and other similar books, go here.

References

1. A Maturity Model for Application Performance Management Process Evolution: A model for evolving an organization's application performance management process, Shyam Kumar Doddavula, Nidhi Timari, and Amit Gawande.

2. Five Steps to Solving Software Performance Problems, Lloyd G. Williams, Ph.D. and Connie U. Smith, Ph.D., June 2002.

3. Software Performance Engineering, in UML for Real: Design of Embedded Real-Time Systems, Luciano Lavagno, Grant Martin, and Bran Selic, eds., Kluwer, 2003.

4. Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software, Lloyd G. Williams, Ph.D. and Connie U. Smith, Ph.D.

5. Performance Anti-Patterns: Want your apps to run faster? Here’s what not to do. Bart Smaalders, Sun Microsystems.

