Tricks and techniques for performance tuning your embedded system using patterns: Part 1

Performance tuning is one of the black arts of embedded system development. You will almost certainly spend some portion of your development schedule on optimization and performance activities. Unfortunately, these activities usually seem to occur when the ship date is closing in and everyone is under the most pressure.

However, help is at hand. We have helped a number of customers tune many diverse applications. From these real-life experiences we have developed a useful toolbox of tricks and techniques for performance tuning, which are summarized in this series.

These best-known methods appear in “pattern” form. Part 1 deals with some general approaches and patterns, Part 2 covers the types of patterns you will run into in typical networking applications, and Part 3 deals with some general code tuning guidelines. While many of the patterns are generally applicable to any performance-tuning work, some are specific to the Intel XScale core and the Intel IXP4XX network processors.

What are Patterns?
Each performance improvement suggestion is documented in the form of a pattern (Alexander 1979) (Gamma et al. 1995). A pattern is “a solution to a problem in a context,” a literary mechanism to share experience and impart solutions to commonly occurring problems. Each pattern contains these elements:

Name - for referencing a problem/solution pairing.
Context - the circumstance in which we solve the problem that imposes constraints on the solution.
Problem - the specific problem to be solved.
Solution - the proposed solution to the problem. Many problems can have more than one solution and the “goodness” of a solution to a problem is affected by the context in which the problem occurs. Each solution takes certain forces into account. It resolves some forces at the expense of others. It may even ignore some forces.
Forces - the often contradictory considerations we must take into account when choosing a solution to a problem.

A pattern language is the organization and combination of a number of interrelated patterns. Where one pattern references another pattern we use the following format: “(see Pattern Name).”

You may not need to read each and every pattern. You certainly do not need to apply all of the patterns to every performance optimization task. You might, however, find it useful to scan all of the context and problem statements to get an overview of what is available in this pattern language.

This first set of patterns proposes general approaches and tools you might use when embarking on performance tuning work. These patterns are not specific to any processor or application.

Defined Performance Requirement
Context: You are a software developer starting a performance improvement task on an application or driver.

Problem: Performance improvement work can become a never-ending task. Without a goal, the activity can drag on longer than is productive or necessary.

Solution: At an early stage of the project or customer engagement, define a relevant, specific, realistic, and measurable performance requirement. Document that performance requirement as a specific detailed application and configuration with a numerical performance target. “Make it as fast as possible” is not a specific performance requirement. Also, “the application must be capable of 10-gigabit per second wire-speed routing of 64-byte packets with a 266-megahertz CPU” is not a realistic performance requirement.
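A quick calculation shows why the second requirement is unrealistic. The sketch below assumes Ethernet framing, where each frame carries 20 extra bytes on the wire (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap). The function names are hypothetical and only illustrate the arithmetic.

```c
#include <stdint.h>

/* Assumption: Ethernet framing. Each frame costs 20 extra bytes on the
 * wire: 8 bytes of preamble/SFD plus a 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Frames per second at wire speed for a given line rate and frame size. */
uint64_t wire_speed_pps(uint64_t line_rate_bps, uint32_t frame_bytes)
{
    uint64_t bits_per_frame =
        (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
    return line_rate_bps / bits_per_frame;
}

/* CPU cycles available for each packet, rounded down (0 if pps is 0). */
uint64_t cycles_per_packet(uint64_t cpu_hz, uint64_t pps)
{
    return pps ? cpu_hz / pps : 0;
}
```

Under these assumptions, 10 gigabits per second of 64-byte frames is roughly 14.9 million packets per second, which leaves a 266-megahertz CPU about 17 cycles per packet. The same calculation at 100 megabits per second leaves about 1,787 cycles per packet, which is a far more plausible starting point for a requirement.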

Forces: You might be inclined to avoid defining a requirement because:
1) A performance target can be hard to define.
2) Waiting to have a goal might affect your product's competitiveness.
3) A performance target can be a moving target; competitors do not stand still. New competitors come along all the time.
4) Without a goal, the performance improvement work can drag on longer than is productive.

Performance Design
Context: You are a software developer designing a system. You have a measurable performance requirement (see Defined Performance Requirement).
Problem: The design of the system does not meet the performance requirement.
Solution: At design time, describe the main data path scenario. Walk through the data path in the design workshop and document it in the high-level design.

When you partition the system into components, allocate a portion of the clock cycles to the data path portion of each component. Have a target at design time for the clock cycle consumption of the whole data path. Ganssle (1999) gives notations and techniques for system design performance constraints.
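One way to keep such an allocation honest is to hold it in a small table and check that the parts still fit the whole-path target. The following is a minimal sketch; the structure, function names, and any numbers you put in the table are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per component on the data path. Component names and cycle
 * counts are placeholders chosen by the design team. */
struct cycle_budget {
    const char *component;
    uint32_t    cycles;      /* cycles allocated per packet */
};

/* Sum the per-component allocations. */
uint32_t budget_total(const struct cycle_budget *b, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += b[i].cycles;
    return total;
}

/* Returns 1 if the allocations fit within the whole-data-path target. */
int budget_fits(const struct cycle_budget *b, size_t n, uint32_t target)
{
    return budget_total(b, n) <= target;
}
```

For example, a path split into receive driver, classification, forwarding, and transmit driver with 400, 300, 500, and 400 cycles totals 1,600 cycles, which fits a 1,787-cycle target but not a 1,500-cycle one.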

During code inspections, hold one code inspection that walks through the most critical data path.

Code inspections are usually component-based. This code inspection should be different and follow the scenario of the data path.

If you use a polling mechanism, ensure the CPU is shared appropriately. It can also be useful to analyze the application's required bus bandwidth at design time to decide if the system will be CPU or memory bandwidth/latency limited.

Forces: You might be deterred from making code inspections because:
1) It can be difficult to anticipate some system bottlenecks at design time.
2) The design of the system can make it impossible to meet the performance requirement. If you discover this late in the project, it might be too difficult to do anything about it.

Premature Code Tuning Avoided
Context: You are implementing a system, and you are in the coding phase of the project. You do have a good system-level understanding of the performance requirements and the allocation of performance targets to different parts of the system because you have a performance design (see Performance Design).

Problem: It is difficult to know how much time or effort to spend thinking about performance or efficiency when initially writing the code.

Solution: It is important to find the right balance between performance, functionality, and maintainability. Some studies have found 20 percent of the code consumes 80 percent of the execution time; others have found less than 4 percent of the code accounts for 50 percent of the time (McConnell 1993).

KISS – Keep it simple. Until you have measured and can prove a piece of code is a system-wide bottleneck, do not optimize it. Simple design is easier to optimize. The compiler finds it easier to optimize simple code.

If you are working on a component of a system, you should have a performance budget for your part of the data path (see Performance Design).

In the unit test, you could have a performance test for your part of the data path. At integration time, the team could perform a performance test for the complete assembled data path.
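A unit-level performance test can be as simple as timing the code under test with a free-running counter and comparing the average against the component's budget. The sketch below is one possible shape, not a prescribed harness: the counter is passed in as a function pointer so the same code can read a hardware cycle counter on the target or a stand-in on the host. All names are hypothetical, and the fake counter exists only to make the example deterministic.

```c
#include <stdint.h>

typedef uint32_t (*read_cycles_fn)(void);  /* returns a free-running counter */
typedef void (*work_fn)(void);             /* code under test */

/* Average cycles per call of work() over `iterations` calls.
 * Unsigned subtraction tolerates one wrap of the 32-bit counter. */
uint32_t average_cycles(work_fn work, uint32_t iterations,
                        read_cycles_fn read_cycles)
{
    if (iterations == 0)
        return 0;
    uint32_t start = read_cycles();
    for (uint32_t i = 0; i < iterations; i++)
        work();
    uint32_t elapsed = read_cycles() - start;
    return elapsed / iterations;
}

/* Unit-test style check against the component's cycle budget. */
int within_budget(uint32_t measured, uint32_t budget)
{
    return measured <= budget;
}

/* Host-side stand-ins: the "work" advances a fake counter by a fixed
 * 120 cycles per call. On the target, replace both with real code. */
static uint32_t fake_counter;
uint32_t fake_read_cycles(void) { return fake_counter; }
void fake_work(void) { fake_counter += 120; }
```

On real hardware the measurement includes loop overhead and cache effects, so it is worth running enough iterations for the average to settle before comparing it with the budget.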

“The best is the enemy of the good. Working toward perfection may prevent completion. Complete it first, then perfect it. The part that needs to be perfect is usually small.”
— Steve McConnell.

For further information, see Chapters 28 and 29 of Code Complete (McConnell 1993) and question 20.13 in the comp.lang.c FAQ Web site (Summit 1995).

Forces: You may find the following things troublesome:
1) Efficient code is not necessarily “better” code. It might be difficult to understand and maintain.
2) It is almost impossible to identify performance bottlenecks before you have a working system.
3) If you spend too much time doing micro-optimization during initial coding, you might miss important global optimizations.
4) If you look at performance too late in a project, it can be too late to do anything about it.

Step-by-Step Records
Context: You are trying a number of optimizations to fix a particular bottleneck. The system contains a number of other bottlenecks.

Problem: Sometimes it is difficult when working at a fast pace to remember optimizations made only a few days earlier.

Solution: Take good notes of each experiment you have tried to identify bottlenecks and each optimization you have tried to increase performance. These notes can be invaluable later. You might find you are stuck at a performance level with an invisible bottleneck. Reviewing your optimization notes might help you identify incorrect paths taken or diversionary assumptions.

When a performance improvement effort is complete, it can be very useful to have notes on the optimization techniques that worked. You can then put together a set of best-known methods to help other engineers in your organization benefit from your experience.

Forces: Writing notes can sometimes break the flow of work or thought.
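If note-taking interrupts your flow, part of the record can be produced by the test harness itself. The sketch below formats one line of an experiment log with the measured throughput before and after a change. The record layout, field names, and units (thousands of packets per second) are assumptions made for illustration.

```c
#include <stdio.h>

/* One line of an optimization log. */
struct tuning_record {
    const char *change;       /* what was tried */
    double      before_kpps;  /* throughput before the change */
    double      after_kpps;   /* throughput after the change */
};

/* Percentage change in throughput; 0 if there is no baseline. */
double percent_change(const struct tuning_record *r)
{
    if (r->before_kpps == 0.0)
        return 0.0;
    return (r->after_kpps - r->before_kpps) * 100.0 / r->before_kpps;
}

/* Format the record as a comma-separated line; returns snprintf's result. */
int format_record(char *buf, size_t len, const struct tuning_record *r)
{
    return snprintf(buf, len, "%s,%.1f,%.1f,%+.1f%%",
                    r->change, r->before_kpps, r->after_kpps,
                    percent_change(r));
}
```

Appending one such line per experiment to a file gives you a history you can review later, including the changes that made no difference.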

Slam-Dunk Optimization
Context: You have made a number of improvements that have increased the efficiency of code running on the Intel XScale core.

Problem: The latest optimizations have not increased performance. You have hit some unidentified performance-limiting factor. You might have improved performance to a point where environmental factors, protocols, or test equipment are now the bottleneck.

Solution: It is useful to have a code modification identified that you know should improve performance, for example:

1) An algorithm on the data path that can be removed temporarily, such as IP checksum.
2) Increase the processor clock speed.

In one application, we implemented a number of optimizations that should have improved performance but did not. We then removed the IP checksum calculation and performance still did not increase.

These results pointed to a hidden limiting factor, an unknown bottleneck. When we followed this line of investigation, we found a problem in the way we configured a physical layer device, and when we fixed this hidden limiting factor, performance improved immediately by approximately 25 percent. We retraced our steps and reapplied the earlier changes to identify the components of that performance improvement.

Forces: Increasing the processor clock speed improves performance only for CPU-bound applications.
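A compile-time switch is one simple way to remove the checksum temporarily for such an experiment. The sketch below uses the standard Internet checksum (RFC 1071); the macro name PERF_EXPERIMENT_SKIP_CHECKSUM and the function names are hypothetical, and the bypass is meant for lab measurements only.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071) over `len` bytes of header. */
uint16_t ip_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)          /* sum 16-bit big-endian words */
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                              /* odd trailing byte */
        sum += (uint32_t)data[i] << 8;
    while (sum >> 16)                         /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the received header's checksum verifies. */
int rx_checksum_ok(const uint8_t *hdr, size_t len)
{
#ifdef PERF_EXPERIMENT_SKIP_CHECKSUM
    (void)hdr;
    (void)len;
    return 1;   /* experiment only: accept everything, never ship this */
#else
    /* A header that includes a correct checksum field sums to zero. */
    return ip_checksum(hdr, len) == 0;
#endif
}
```

If throughput does not move when the experiment build skips the checksum, the data path is probably not the limiting factor, and the search should widen to configuration, protocols, or test equipment.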

Best Compiler for Application
Context: You are writing an application using a compiler. You have a choice of compilers for the processor architecture you are using.

Problem: Different compilers generate code that has different performance characteristics. You need to select the right one for your application and target platform.

Solution: Experiment with different compilers and select the best performing compiler for your application and environment.

Performance can vary between compilers. For example, we ran the Dhrystone MIPS benchmark on a 533-megahertz Intel IXP425 Network Processor. The following compilers are listed in order of their relative performance, at the time of writing:

1) Green Hills v3.61 (Green Hills Software, Inc.)
2) ADS v1.2 (ARM Ltd.)
3) Tornado 2.2 Diab (Wind River Systems)
4) Tornado 2.2 GCC (Wind River Systems)

The Intel XScale core GNU-PRO compiler, when measuring the Java CaffeineMark benchmark, generates approximately 10-percent better performance than the open source GCC compiler. For more information about the GNU-PRO compiler, refer to the Intel Developer Web site.

Intel is currently working to improve the GNU-PRO compiler and is expecting to further optimize its code generation for the Intel XScale core.

Forces: When selecting a compiler, you might find that:
1) Some compilers are more expensive than others.

2) Some compilers and operating systems might not match. For example, the compiler you want to use generates the wrong object file format for your tool chain or development environment.

3) A particular compiler might optimize a particular benchmark better than another compiler, but that is no guarantee it will optimize your specific application in the same way.

4) You might be constrained in your compiler choice because of tools support issues. If you are working in the Linux kernel, you might have to use GCC. Some parts of the kernel use GCC-specific extensions.

Compiler Optimizations
Context: You have chosen to use a C compiler (see Best Compiler for Application).

Problem: You have not enabled all of the compiler optimizations.

Solution: Your compiler supports a number of optimization switches. Using these switches can increase global application performance for a small amount of effort. Read the documentation for your compiler and understand these switches. In general, the highest-level optimization switch is the -O switch. In GCC, this switch takes a numeric parameter. Find out the maximum parameter for your compiler and use it.

Typical compilers support three levels of optimization. Try the highest. In GCC, the highest level is -O3. However, in the past -O3 code generation had more bugs than -O2, the most-used optimization level. The Linux kernel is compiled with -O2. If you have problems at -O3 you might need to revert to -O2.

Moving from -O2 to -O3 made an improvement of approximately 15 percent in packet processing in one application tested. In another application, -O3 was slower than -O2.

You can limit the use of compiler optimizations to individual C source files. Introduce optimization flags, one by one, to discover the ones that give you benefit.

Other GCC optimization flags that can increase performance are:
-funroll-loops -fomit-frame-pointer -mapcs -falign-labels=32

Forces: You may find the following things troublesome:
1) Optimizations increase generated code size.
2) Some optimizations might not increase performance.
3) Compilers support a large number of switches and options. It can be time consuming to read the lengthy documentation.
4) Optimized code is difficult to debug.
5) Some optimizations can reveal compiler bugs. For example, -fomit-frame-pointer -mapcs reveals a post-increment bug in GCC 2.95.X for the Intel XScale core.
6) Enabling optimization can change timings in your code. It might reveal latent undiscovered problems.

Data Cache
Context: You are using a processor that contains a data cache. The core is running faster than memory or peripherals.

Problem: The processor core is spending a significant amount of time stalled waiting on an external memory access. You have identified this problem using the performance monitoring function on your chosen processor to quantify the number of cycles for which the processor is stalled. In some applications, we have observed that a significant number of cycles are lost to data-dependency stalls.

Solution: In general, the most efficient mechanism for accessing memory is to use the data cache. Core accesses to cached memory do not need to use the internal bus, leaving it free for other devices. In addition, accessing data in cache memory is faster than accessing it from the SDRAM.

The cache unit can make efficient use of the internal bus. On the IXP4XX network processor the core fetches an entire 32-byte cache line, making use of multiple data phases. This fetch reduces the percentage of overhead cycles required for initiating and terminating bus cycles, when compared with issuing multiple bus cycles to read the same 32 bytes of data without using the cache.

The cache supports several features that give you flexibility in tailoring the system to your design needs. These features affect all applications to some degree; however, the optimal settings are application-dependent. It is critical to understand the effects of these features and how to fine-tune them for the usage model of a particular application. We cover a number of these cache features in later sections.

In one application that was not caching buffer descriptors and packet data, developers enabled caching and saw an approximate 25-percent improvement in packet-processing performance.

Tornado 2.1.1 netBufLib does not allocate packet descriptors and packet memory from cached memory. Later versions of Tornado fixed this issue for the IXP42X product line.

Choose data structures appropriate for a data cache. For example, stacks are typically more cache efficient than linked list data structures.
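As an illustration of the stack point, consider a pool of free packet buffers. A minimal sketch of an array-based stack follows; the pool depth and all names are hypothetical. The pointers sit in contiguous memory, so consecutive push and pop operations tend to touch the same cache line, whereas a linked free list has to read a "next" pointer stored inside each buffer, which may pull a cold line into the cache.

```c
#include <stddef.h>

#define POOL_DEPTH 64   /* hypothetical number of buffers in the pool */

/* Free-buffer pool kept as an array-based stack of pointers. */
struct buf_stack {
    void  *slot[POOL_DEPTH];
    size_t top;             /* number of buffers currently on the stack */
};

/* Return a buffer to the pool; -1 if the stack is already full. */
int buf_stack_push(struct buf_stack *s, void *buf)
{
    if (s->top == POOL_DEPTH)
        return -1;
    s->slot[s->top++] = buf;
    return 0;
}

/* Take the most recently freed buffer; NULL if the pool is empty. */
void *buf_stack_pop(struct buf_stack *s)
{
    return s->top ? s->slot[--s->top] : NULL;
}
```

A side effect of the last-in, first-out order is that the buffer handed out next is the one most recently released, which is also the one most likely to still be in the cache.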

In most of the applications we have seen, the instruction cache is very efficient. It is worth spending time optimizing the use of the data cache.

Forces: Consider the following things:
1) If you cache data memory that the core shares with another bus master, you must manage cache flush/invalidation explicitly.

2) If you use caching, you need to ensure that no two packet buffers ever share the same cache line. Otherwise, bugs that are difficult to diagnose may result.

3) Be careful what you cache. Temporal locality refers to the amount of time between accesses to the data. If you access a piece of data once or access it infrequently, it has low temporal locality. If you mark this kind of data for caching, the cache replacement algorithm can cause the eviction of performance-sensitive data by this lower priority data.

4) The processor implements a round-robin line replacement algorithm.
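The cache-line sharing rule for packet buffers can be enforced when the buffer pool is laid out: round each buffer's size up to a whole number of cache lines and start the pool on a line boundary. The helpers below are a sketch of that arithmetic, using the 32-byte line size given above for the IXP4XX. The function names are hypothetical, and addresses are passed as integers to keep the example independent of any particular allocator.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* line size given above for the IXP4XX */

/* Round a buffer size up to a whole number of cache lines. */
size_t round_up_to_line(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1u) & ~(size_t)(CACHE_LINE_BYTES - 1u);
}

/* Index of the cache line that holds a given address. */
uintptr_t line_index(uintptr_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Returns 1 if two buffers of `len` bytes (len > 0) starting at addresses
 * a and b occupy completely separate cache lines, 0 otherwise. */
int buffers_share_no_line(uintptr_t a, uintptr_t b, size_t len)
{
    uintptr_t a_first = line_index(a), a_last = line_index(a + len - 1);
    uintptr_t b_first = line_index(b), b_last = line_index(b + len - 1);
    return a_last < b_first || b_last < a_first;
}
```

For example, two 40-byte buffers packed back to back from a line-aligned base overlap in the second cache line, while spacing them by round_up_to_line(40), which is 64 bytes, keeps them apart.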

Conclusion
The techniques discussed here are suggested solutions to the problems proposed, and as such, they are provided for informational purposes only. Neither the authors nor Intel Corporation can guarantee that these proposed solutions will be applicable to your application or that they will resolve the problems in all instances.

The performance tests and ratings mentioned here were measured using specific computer systems and/or components, and the results reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration can affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

Next in Part 2: Patterns for use in networking applications

This article was excerpted from Designing Embedded Networking Applications, by Peter Barry and Gerard Hartnett and published by Intel Press. Copyright © 2005 Intel Corporation. All rights reserved.

Peter Barry and Gerard Hartnett are senior Intel engineers. Both have been leaders in the design of Intel network processor platforms and are regularly sought out for their expert guidance on network components.

References:
(Alexander 1979) Alexander, Christopher. 1979. The Timeless Way of Building. Oxford University Press
(Gamma et al. 1995) Gamma, Erich, Richard Helm, Ralph Johnson and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley
(Ganssle 1999) Ganssle, Jack. 1999. The Art of Designing Embedded Systems. Newnes-Elsevier
(McConnell 1993) McConnell, Steve. 1993. Code Complete: A Practical Handbook of Software Construction. Microsoft Press

Reader Response


Check out the chapters on software performance patterns and anti-patterns at http://www.spe-ed.com/. Dr. Connie U. Smith and Dr. Lloyd G. Williams have a book titled Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software, published by Addison-Wesley.

-Craig Fullerton
Chandler, AZ

