Tricks and techniques for performance tuning your embedded system using patterns: Part 1

Peter Barry and Gerard Hartnett, Intel Corp.

June 04, 2007

Peter Barry and Gerard Hartnett, Intel Corp.June 04, 2007

Performance tuning is one of the black arts of embedded system development. You will almost certainly spend some portion of your development schedule on optimization and performance activities. Unfortunately, these activities usually seem to occur when the ship-date is closing in and everyone is under the most pressure.

However, help is at hand. We have helped a number of customers tune many diverse applications. From these real-life experiences we have developed a useful toolbox of tricks and techniques for performance tuning, which are summarized in this two part series.

These best-known methods appear in "pattern" form. Part 1 will deal with some general approaches and patterns while Part 2 will cover the types of patterns you will run into in typical networking applications as well as some general code tuning guidelines. Part 3 deals with some general code tuning guidelines. While many of the patterns are generally applicable on any performance-tuning work, some are specific to Intel XScale core and the Intel IXP4XX network processors.

What are Patterns?
Each performance improvement suggestion is documented in the form of a pattern (Alexander 1979) (Gamma et al. 1995). A pattern is "a solution to a problem in a context," a literary mechanism to share experience and impart solutions to commonly occurring problems. Each pattern contains these elements:

Name - for referencing a problem/solution pairing.
Context - the circumstance in which we solve the problem that imposes constraints on the solution. Problem - the specific problem to be solved.
Solution - the proposed solution to the problem. Many problems can have more than one solution and the "goodness" of a solution to a problem is affected by the context in which the problem occurs. Each solution takes certain forces into account. It resolves some forces at the expense of others. It may even ignore some forces.
Forces - the often contradictory considerations we must take into account when choosing a solution to a problem.

A pattern language is the organization and combination of a number of interrelated patterns. Where one pattern references another pattern we use the following format: "(see Pattern Name)."

You may not need to read each and every pattern. You certainly do not need to apply all of the patterns to every performance optimization task. You might however find it useful to scan all of the context and problem statements to get an overview of what is available in this pattern language.

This first set of patterns proposes general approaches and tools you might use when embarking on performance tuning work. These patterns are not specific to any processor or application.

Defined Performance Requirement
Context: You are a software developer starting a performance improvement task on an application or driver.

Problem: Performance improvement work can become a never-ending task. Without a goal, the activity can drag on longer than productive or necessary.

Solution: At an early stage of the project or customer engagement, define a relevant specific, realistic, and measurable performance requirement. Document that performance requirement as a specific detailed application and configuration with a numerical performance target. "Make it as fast as possible" is not a specific performance requirement.  Also, "the application must be capable of 10-gigabit per second wire-speed routing of 64-byte packets with a 266-megahertz CPU" is not a realistic performance requirement.

Forces: You might be inclined to avoid defining a requirement because:
1) A performance target can be hard to define.
2) Waiting to have a goal might affect your product's competitiveness.
3) A performance target can be a moving target; competitors do not stand still. New competitors come along all the time.
4) Without a goal, the performance improvement work can drag on longer than productive.

Performance Design
Context: You are a software developer designing a system. You have a measurable performance requirement (see Defined Performance Requirement).
Problem: The design of the system does not meet the performance requirement.
Solution: At design time, describe the main data path scenario. Walk through the data path in the design workshop and document it in the high level design.

When you partition the system into components allocate a portion of the clock cycles to the data path portion of each component. Have a target at design time for the clock cycle consumption of the whole data path. Ganssle (1999) gives notations and techniques for system design performance constraints.

During code inspections, hold one code inspection that walks through the most critical data path.

Code inspections are usually component-based. This code inspection should be different and follow the scenario of the data path.

If you use a polling mechanism, ensure the CPU is shared appropriately. It can also be useful to analyze the application's required bus bandwidth at design time to decide if the system will be CPU or memory bandwidth/latency limited.

Forces: You might be deterred from making code inspections because:
1) It can be difficult to anticipate some system bottlenecks at design time.
2) The design of the system can make it impossible to meet the performance requirement. If you discover this late in the project, it might be too difficult to do anything about it.

Premature Code Tuning Avoided
Context: You are implementing a system, and you are in the coding phase of the project. You do have a good system-level understanding of the performance requirements and the allocation of performance targets to different parts of the system because you have a performance design (see Performance Design).

Problem: It is difficult to know how much time or effort to spend thinking about performance or efficiency when initially writing the code.

Solution: It is important to find the right balance between performance, functionality, and maintainability. Some studies have found 20 percent of the code consumes 80 percent of the execution time; others have found less than 4 percent of the code accounts for 50 percent of the time (McConnell 1993).

KISS - Keep it simple. Until you have measured and can prove a piece of code is a system-wide bottleneck, do not optimize it. Simple design is easier to optimize. The compiler finds it easier to optimize simple code.

If you are working on a component of a system, you should have a performance budget for your part of the data path (see Performance Design).

In the unit test, you could have a performance test for your part of the data path. At integration time, the team could perform a performance test for the complete assembled data path.

"The best is the enemy of the good. Working toward perfection may prevent completion. Complete it first, then perfect it. The part that needs to be perfect is usually small."
— Steve McConnell.

For further information, see Chapters 28 and 29 of Code Complete (McConnell 1993) and question 20.13 in the comp.lang.c FAQ Web site (Summit 1995).

Forces: You may find the following things troublesome:
1) Efficient code is not necessarily "better" code. It might be difficult to understand and maintain.
2) It is almost impossible to identify performance bottlenecks before you have a working system.
3) If you spend too much time doing micro-optimization during initial coding, you might miss important global optimizations.
4) If you look at performance too late in a project, it can be too late to do anything about it.

Step-by-Step Records
Context: You are trying a number of optimizations to fix a particular bottleneck. The system contains a number of other bottlenecks.

Problem: Sometimes it is difficult when working at a fast pace to remember optimizations made only a few days earlier.

Solution: Take good notes of each experiment you have tried to identify bottlenecks and each optimization you have tried to increase performance. These notes can be invaluable later. You might find you are stuck at a performance level with an invisible bottleneck. Reviewing your optimization notes might help you identify incorrect paths taken or diversionary assumptions.

When a performance improvement effort is complete, it can be very useful to have notes on the optimization techniques that worked. You can then put together a set of best-known methods to help other engineers in your organization benefit from your experience.

Forces: Writing notes can sometimes break the flow of work or thought.

Slam-Dunk Optimization
Context: You have made a number of improvements that have increased the efficiency of code running on the Intel XScale core.

Problem: The latest optimizations have not increased performance. You have hit some unidentified performance-limiting factor. You might have improved performance to a point where environmental factors, protocols, or test equipment are now the bottleneck.

Solution: It is useful to have a code modification identified that you know should improve performance, for example:

1) An algorithm on the data path that can be removed temporarily such as IP checksum.
2) Increase the processor clock speed.

In one application, we implemented a number of optimizations that should have improved performance but did not. We then removed the IP checksum calculation and performance still did not increase.

These results pointed to a hidden limiting factor, an unknown bottleneck. When we followed this line of investigation, we found a problem in the way we configured a physical layer device, and when we fixed this hidden limiting factor, performance improved immediately by approximately 25 percent. We retraced our steps and reapplied the earlier changes to identify the components of that performance improvement.

Forces: Increasing the processor clock speed improves performance only for CPU bound applications.

Best Compiler for Application
Context: You are writing an application using a compiler. You have a choice of compilers for the processor architecture you are using.

Problem: Different compilers generate code that has different performance characteristics. You need to select the right one for your application and target platform.

Solution: Experiment with different compilers and select the best performing compiler for your application and environment.

Performance can vary between compilers. For example, we ran the Dhrystone MIPS benchmark on a 533-megahertz Intel IXP425 Network Processor. The following compilers are listed in order of their relative performance, at the time of writing:

1) Greenhills v3.61 (Green Hills Software, Inc.)
2) ADS v1.2 (ARM Ltd.)
3) Tornado 2.2 Diab (Wind River Systems)
4) Tornado 2.2 GCC (Wind River Systems)

The Intel XScale core GNU-PRO compiler, when measuring the Java CaffineMark benchmark, generates approximately 10-percent better performance than the open source GCC compiler. For more information about the GNU-PRO compiler, refer to the Intel Developer Web site.

Intel is currently working to improve the GNU-PRO compiler and is expecting to further optimize its code generation for the Intel XScale core.

Forces: When selecting a compiler, you might find that:
1) Some compilers are more expensive than others.

2) Some compilers and operating systems might not match. For example, the compiler you want to use generates the wrong object file format for your tool chain or development environment.

3) A particular compiler might optimize a particular benchmark better than another compiler, but that is no guarantee it will optimize your specific application in the same way.

4) You might be constrained in your compiler choice because of tools support issues. If you are working in the Linux kernel, you might have to use GCC. Some parts of the kernel use GCC-specific extensions.

Compiler Optimizations
Context: You have chosen to use a C compiler (see Best Compiler for Application).

Problem: You have not enabled all of the compiler optimizations.

Solution: Your compiler supports a number of optimization switches. Using these switches can increase global application performance for a small amount of effort. Read the documentation for your compiler and understand these switches. In general, the highest-level optimization switch is the -O switch. In GCC, this switch takes a numeric parameter. Find out the maximum parameter for your compiler and use it.

Typical compilers support three levels of optimization. Try the highest. In GCC, the highest level is -O3. However in the past "O3 code generation had more bugs than "O2, the most-used optimization level. The Linux kernel is compiled with "O2. If you have problems at "O3 you might need to revert to "O2.

Moving from -O2 to -O3 made an improvement of approximately 15 percent in packet processing in one application tested. In another application, -O3 was slower than -O2.

You can limit the use of compiler optimizations to individual C source files. Introduce optimization flags, one by one, to discover the ones that give you benefit.

Other GCC optimization flags that can increase performance are:
-fomit-frame-pointer -mapcs

Forces: You may find the following things troublesome:
1) Optimizations increase generated code size.
2) Some optimizations might not increase performance.
3) Compilers support a large number of switches and options. It can be time consuming to read the lengthy documentation.
4) Optimized code is difficult to debug.
5) Some optimizations can reveal compiler bugs. For example, "fomit-frame-pointer "mapcs reveals a post-increment bug in GCC 2.95.X for the Intel XScale core.
6) Enabling optimization can change timings in your code. It might reveal latent undiscovered problems.

Data Cache
Context: You are using a processor that contains a data cache. The core is running faster than memory or peripherals.

Problem: The processor core is spending a significant amount of time stalled waiting on an external memory access. You have identified this problem using the performance monitoring function on your chosen processor to quantify the number of cycles for which the processor is stalled. In some applications, we have observed a significant number of cycles are lost to data-dependency stalls.

Solution: In general, the most efficient mechanism for accessing memory is to use the data cache. Core accesses to cached memory do not need to use the internal bus, leaving it free for other devices. In addition, accessing data in cache memory is faster than accessing it from the SDRAM.

The cache unit can make efficient use of the internal bus. On the IXP4XX network processor the core fetches an entire 32-byte cache line, making use of multiple data phases. This fetch reduces the percentage of overhead cycles required for initiating and terminating bus cycles, when compared with issuing multiple bus cycles to read the same 32 bytes of data without using the cache.

The cache supports several features that give you flexibility in tailoring the system to your design needs. These features affect all applications to some degree; however the optimal settings are application-dependent. It is critical to understand the effects of these features and how to fine-tune them for the usage-model of a particular application. We cover a number of these cache features in later sections.

In one application that was not caching buffer descriptors and packet data, developers enabled caching and saw an approximate 25-percent improvement in packet-processing performance.

Tornado 2.1.1 netBufLib does not allocate packet descriptors and packet memory from cached memory. Later versions of Tornado fixed this issue for the IXP42X product line.

Choose data structures appropriate for a data cache. For example, stacks are typically more cache efficient than linked list data structures.

In most of the applications we have seen, the instruction cache is very efficient. It is worth spending time optimizing the use of the data cache.

Forces: Consider the following things:
1) If you cache data-memory that the core shares with another bus master, you must manage cache flush/invalidation explicitly.

2) If you use caching, you need to ensure no two-packet buffers ever share the same cache line. Bugs that are difficult to diagnose may result.

3) Be careful what you cache. Temporal locality refers to the amount of time between accesses to the data. If you access a piece of data once or access it infrequently, it has low temporal locality.

If you mark this kind of data for caching, the cache replacement algorithm can cause the eviction of performance sensitive data by this lower priority data.

5) The processor implements a round-robin line replacement algorithm.

The techniques discussed here are suggested solutions to the problems proposed, and as such, they are provided for informational purposes only. Neither the authors nor Intel Corporation can guarantee that these proposed solutions will be applicable to your application or that they will resolve the problems in all instances.

The performance tests and ratings mentioned here were measured using specific computer systems and/or components, and the results reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration can affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

Next in Part 2:  Patterns for use in networking applications

This article was excerpted from Designing Embedded Networking Applications, by Peter Barry and Gerard Hartnett and published by Intel Press. Copyright © 2005 Intel Corporation. All rights reserved.

Peter Barry and Gerard Hartnett are senior Intel engineers. Both have been leaders in the design of Intel network processor platforms and are regularly sought out for their expert guidance on network components.

(Alexander 1979) Alexander, Christopher 1979. The Timeless Way of Building. Oxford University Press
(Gamma et al. 1995) Gamma, Erich, Richard Helm, Ralph Johnson and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Addison Wesley
(Ganssle 1999) Ganssle Jack. 1999. The Art of Designing Embedded Systems. Newnes-Elsever
(McConnell 1993) McConnell. 1993. Code Complete: A Practical Handbook of Software Construction. Microsoft Press

Reader Response

Check out chapters on SW Performance Patterns and Anti-patterns - goto and book by Dr. Connie U. Smithand Dr. Lloyd G. Williams have book titled Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software, published by Addision-Wesley.

-Craig Fullerton
Chandler, AZ

Loading comments...