Software performance engineering for embedded systems: Part 3 – Collecting data and using it effectively
Software performance optimization
Bart Smaalders [5] does a good job of summarizing some common mistakes in software performance optimization:
Fixing Performance at the End of the Project: Failing to formulate performance goals or benchmarks, and waiting until late in the project to measure and address performance issues, will almost guarantee project delays or failure.
Measuring and Comparing the Wrong Things: A common mistake is to benchmark the wrong thing and be surprised later; don't ignore competitive realities when choosing what to measure.
Good benchmarks
Smaalders also defines what a good benchmark is:
* Repeatable. Comparison experiments can be conducted relatively easily and with a reasonable degree of precision.
* Observable. If poor performance is observed, the developer has some breadcrumbs to follow. A complex benchmark should not deliver just a single number, which gives the developer no additional information about where performance problems might be. The Embedded Microprocessor Benchmark Consortium (EEMBC) does a good job of providing not just benchmark results but also the compiler options used, the version of the software, and so on. This additional data is useful when comparing benchmarks.
* Portable. Comparisons must be possible against your competitors and even your own previous releases. Maintaining a history of the performance of previous releases is a valuable tool for understanding your own development process.
* Easily understood. All relevant stakeholders should be able to understand the comparisons in a brief presentation.
* Realistic. Measurements need to reflect customer-experienced realities and use cases.
* Runnable. All developers must be able to quickly ascertain the effects of their changes. If it takes days to get performance results, it won't happen very often. (A minimal timing-harness sketch follows this list.)
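As a rough illustration of the repeatable and runnable properties, here is a minimal timing-harness sketch in C. It assumes a POSIX clock_gettime(); on a bare-metal target you would read a hardware cycle counter instead, and workload_under_test() is just a placeholder for your own routine. Reporting both the minimum and the mean gives the developer a hint about run-to-run variance rather than a single opaque number.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Placeholder for the routine under test; substitute your own workload. */
static void workload_under_test(void)
{
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        acc += i;
}

int main(void)
{
    enum { RUNS = 50 };
    double min_ms = 1e9, total_ms = 0.0;

    for (int r = 0; r < RUNS; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        workload_under_test();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
                    (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
        total_ms += ms;
        if (ms < min_ms)
            min_ms = ms;
    }

    /* Report more than a single number: minimum and mean hint at variance. */
    printf("runs=%d  min=%.3f ms  mean=%.3f ms\n", RUNS, min_ms, total_ms / RUNS);
    return 0;
}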
Mistakes to avoid
Avoid selecting benchmarks that don't really represent your customer, because your team will end up optimizing for the wrong behavior. Resist the temptation to optimize for the benchmark itself; it provides only a short-term "feel good" and will not translate into real-world gains.
Algorithmic Antipathy: Algorithm selection requires realistic benchmarks and workloads so that decisions can be based on real data rather than intuition or guesswork. The best time to do performance analysis is in the early phases of the project; this is usually the opposite of what actually happens. Clever compilation options and C-level optimizations are ineffective against an O(n²) algorithm, especially for large values of n. Poor algorithm selection is a primary cause of poor software system performance. Figure 25 compares a DFT algorithm, with complexity O(n²), to an FFT algorithm, with complexity O(n log n). As the figure shows, the FFT's advantage grows as the number of data points in the transform increases.

Figure 25. A DFT algorithm versus an FFT algorithm, showing that algorithmic complexity has a big impact on performance
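To make the complexity difference concrete, here is a minimal sketch of a direct DFT in C99 (illustrative only, not the benchmark code behind Figure 25). The two nested loops over n points make the work grow as O(n²); an FFT reuses intermediate sums to compute the same result in roughly O(n log n) operations, which is why it pulls further ahead as n grows.

#include <complex.h>

/* Direct DFT: for each of the n output bins we visit all n inputs, so the
 * work grows as O(n^2).  An FFT computes the same result in O(n log n). */
void dft(const double complex *x, double complex *X, int n)
{
    const double PI = 3.14159265358979323846;

    for (int k = 0; k < n; k++) {            /* n output bins            */
        double complex sum = 0.0;
        for (int t = 0; t < n; t++) {        /* n input samples per bin  */
            double angle = -2.0 * PI * (double)k * (double)t / (double)n;
            sum += x[t] * cexp(I * angle);
        }
        X[k] = sum;
    }
}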
Reusing Software: Software reuse is a noble goal, but the development staff must be careful not to violate the assumptions made when the reused software was originally developed. If the software was not designed or optimized for the new use cases in which it will be deployed, there will be surprises later in the development process.
Iterating Because That's What Computers Do Well: If your embedded application is doing unneeded or unappreciated work, for example computing statistics too frequently, then eliminating that waste is a lucrative area for performance work. Keep in mind that what matters most is the end state of the program, not the exact series of steps used to get there. Often a shortcut is available that will allow you to reach the goal more quickly. Smaalders describes this as shortening the race course rather than speeding up the car: with a few exceptions, such as correctly used memory prefetch instructions, the only way to go faster in software is to do less.
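A trivial, hypothetical illustration of "doing less": rather than recomputing a summary statistic every time a sample arrives, record the sample cheaply and defer the computation until a reader actually asks for the result. The stats_t type and its functions below are invented for this sketch; the same deferral idea applies to any derived value that is produced far more often than it is consumed.

#include <stdbool.h>
#include <stddef.h>

/* Invented example: samples are recorded cheaply, and the summary is
 * computed only when a reader asks for it, not on every new sample. */
typedef struct {
    double sum;
    size_t count;
    double cached_mean;
    bool   dirty;              /* set when new samples have arrived */
} stats_t;

void stats_add(stats_t *s, double sample)
{
    s->sum += sample;
    s->count++;
    s->dirty = true;           /* defer the work; nothing recomputed here */
}

double stats_mean(stats_t *s)
{
    if (s->dirty && s->count > 0) {
        s->cached_mean = s->sum / (double)s->count;
        s->dirty = false;      /* cache the answer until new data arrives */
    }
    return s->cached_mean;
}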
Premature and Excessive Optimization: Software that is carefully tuned and optimized is fine, but if hand-unrolled loops, register declarations, inline functions, assembly-language inner loops, and other optimizations contribute only a small overall improvement to system performance (usually because they are not in a critical path of the software use case), then the effort is not worth the return on investment. It's important to understand where the hot spots are before focusing the tuning effort. Premature optimization can even hurt benchmark performance, by increasing the instruction-cache footprint enough to cause misses and pipeline stalls, or by confusing the compiler's register allocator.
As Smaalders describes, low-level cycle shaving has its place, but only at the end of the performance effort, not during initial code development. Donald Knuth is often quoted as saying, "Premature optimization is the root of all evil." Excessive optimization is just as bad: there are diminishing returns, and the developer needs to understand when it's time to stop (i.e., when is the goal met, and when can we ship?). Figure 26 shows an example. This algorithm's benchmark starts at 521 cycles with "out of the box" C code and improves progressively as the algorithm is further optimized using intrinsics, hand-coded assembly for inner loops, and finally full assembly. Developers need to recognize where the curve starts to flatten and further performance gains become harder to obtain.

Figure 26. There are diminishing returns to performance optimization
Focusing on What You Can See Rather Than on the Problem: Each line of code at the top level of the application causes, in general, large amounts of work elsewhere farther down in the software stack. As a result, inefficiencies at the top layer have a large multiplier magnifying their impact, making the top of the stack a good place to look for possible speed-ups.
Software Layering: Software developers use layering to provide various levels of abstraction in their software. This can be useful, but it has consequences. Improper abstraction can increase the stack's data-cache footprint, TLB (translation look-aside buffer) misses, and function-call overhead. Too much data hiding can also lead to an excessive number of arguments for function calls and to the creation of new structures just to hold those arguments. The problem is exacerbated when it is not fixed before the software is deployed to the field: once there are multiple users of a new layer of software, modifications become more difficult, and the performance trade-offs tend to accumulate over time.
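A hypothetical sketch of the pattern: everything below is invented for illustration, but the shape is familiar, one layer packs its arguments into a structure only for the next layer to unpack them again, adding stack traffic and call overhead on every request without adding behavior.

/* Invented example of an abstraction layer that exists only to repackage
 * arguments: app_send() builds a request structure on the stack, the layer
 * unpacks it again, and nothing of value is added in between. */
typedef struct {
    int         device_id;
    const void *buf;
    unsigned    len;
    unsigned    flags;        /* carried along but never used below */
    unsigned    timeout_ms;
} io_request_t;

/* Stub for the lowest layer; pretend everything was written. */
static int hw_write(int device_id, const void *buf, unsigned len)
{
    (void)device_id;
    (void)buf;
    return (int)len;
}

/* The extra layer: unpacks the structure that the caller just packed. */
static int layer_write(const io_request_t *req)
{
    return hw_write(req->device_id, req->buf, req->len);
}

int app_send(int dev, const void *buf, unsigned len)
{
    io_request_t req = { dev, buf, len, 0, 0 };   /* built on every call */
    return layer_write(&req);
}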
Excessive Numbers of Threads: Software threads are familiar to embedded developers. A common mistake is using a different thread for each unit of pending work. Although this is a simple programming model to implement, it can lead to performance problems if taken to an extreme. The goal is to limit the number of threads to a reasonable number (roughly the number of CPUs) and to use some of the programming guidelines mentioned in the chapter on Multicore Software for Embedded Systems.
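A minimal POSIX-threads sketch of the bounded-pool alternative: a fixed number of workers, roughly one per CPU, pull items from a shared queue instead of one thread being spawned per work item. NUM_WORKERS, QUEUE_DEPTH, and process_item() are placeholders chosen for this sketch.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_WORKERS   4     /* roughly the number of CPUs, not one per work item */
#define QUEUE_DEPTH  64

static int queue[QUEUE_DEPTH];
static int q_head, q_tail, q_count;
static int done;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

static void process_item(int worker_id, int item)   /* placeholder for real work */
{
    printf("worker %d handled item %d\n", worker_id, item);
}

static void *worker(void *arg)
{
    int id = *(int *)arg;

    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_count == 0 && !done)
            pthread_cond_wait(&q_cond, &q_lock);
        if (q_count == 0 && done) {              /* no more work coming */
            pthread_mutex_unlock(&q_lock);
            return NULL;
        }
        int item = queue[q_head];
        q_head = (q_head + 1) % QUEUE_DEPTH;
        q_count--;
        pthread_mutex_unlock(&q_lock);

        process_item(id, item);                  /* work happens outside the lock */
    }
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    int ids[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        pthread_create(&workers[i], NULL, worker, &ids[i]);
    }

    for (int i = 0; i < 100; i++) {              /* 100 work items, only 4 threads */
        pthread_mutex_lock(&q_lock);
        while (q_count == QUEUE_DEPTH) {         /* simple back-pressure when full */
            pthread_mutex_unlock(&q_lock);
            usleep(1000);
            pthread_mutex_lock(&q_lock);
        }
        queue[q_tail] = i;
        q_tail = (q_tail + 1) % QUEUE_DEPTH;
        q_count++;
        pthread_cond_signal(&q_cond);
        pthread_mutex_unlock(&q_lock);
    }

    pthread_mutex_lock(&q_lock);
    done = 1;                                    /* let idle workers drain and exit */
    pthread_cond_broadcast(&q_cond);
    pthread_mutex_unlock(&q_lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}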
Asymmetric Hardware Utilization: Embedded CPUs are much faster than the memory systems connected to them. Embedded processor designs today use multiple levels of cache to hide memory-access latency, and multilevel TLBs are becoming common in embedded systems as well. These caches and TLBs use varying degrees of associativity to spread the application's load across the cache, but this technique is often accidentally thwarted by other performance optimizations. Iteration and analysis are important to understand these potential side effects on system performance.
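One common way a well-meaning optimization can thwart associativity, sketched hypothetically below: buffers laid out with an exact power-of-two stride can all map to the same cache sets, so walking "down a column" repeatedly evicts the same few lines. Padding each row by one cache line spreads the accesses across more sets. The sizes used here are illustrative; the right padding depends on the actual cache geometry of the target.

/* Hypothetical layout problem: with an exact power-of-two row stride, every
 * row of the matrix can land in the same cache sets.  Padding each row by
 * one cache line shifts successive rows into different sets. */
#define CACHE_LINE  64
#define ROWS         8
#define ROW_BYTES 4096                 /* power-of-two stride: conflict-prone */

static unsigned char matrix_bad[ROWS][ROW_BYTES];
static unsigned char matrix_padded[ROWS][ROW_BYTES + CACHE_LINE];

unsigned long column_sum_bad(int col)
{
    unsigned long sum = 0;
    for (int r = 0; r < ROWS; r++)     /* same set index on every iteration */
        sum += matrix_bad[r][col];
    return sum;
}

unsigned long column_sum_padded(int col)
{
    unsigned long sum = 0;
    for (int r = 0; r < ROWS; r++)     /* set index advances by one line per row */
        sum += matrix_padded[r][col];
    return sum;
}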
Not Optimizing for the Common Case: We spoke about this earlier. It's important to identify the performance use cases and focus the optimization efforts on these important performance drivers.
Read Part 1, "What is SPE?"
Read Part 2, "The importance of performance measurements"
Rob Oshana, author of the soon-to-be-published “Software engineering for embedded systems” (Elsevier), is director of software R&D, Networking Systems Group, Freescale Semiconductor.
Used with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2012. For more information about “Software engineering for embedded systems” and other similar books, go here.
References
1. Shyam Kumar Doddavula, Nidhi Timari, and Amit Gawande, "A Maturity Model for Application Performance Management Process Evolution: A model for evolving an organization's application performance management process."
2. Lloyd G. Williams and Connie U. Smith, "Five Steps to Solving Software Performance Problems," June 2002.
3. "Software Performance Engineering," in UML for Real: Design of Embedded Real-Time Systems, Luciano Lavagno, Grant Martin, and Bran Selic, eds., Kluwer, 2003.
4. Lloyd G. Williams and Connie U. Smith, Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software.
5. Bart Smaalders, "Performance Anti-Patterns: Want your apps to run faster? Here's what not to do," Sun Microsystems.

