Tricks and techniques for performance tuning your embedded system using patterns: Part 3

In addition to the performance optimization patterns for general-purpose embedded designs (Part 1) and for networking (Part 2), there are some general code-tuning guidelines that are applicable to most processors. In many cases, these optimizations can decrease the readability, maintainability, or portability of your code. Be sure you are optimizing code that needs optimization (see Premature Code Tuning Avoided).

Reordered Struct
Context: You have identified a bottleneck segment of code on your application data path. The code uses a large struct.

Problem: The struct spans a number of cache lines.

Solution: Reorder the fields in a struct to group the frequently accessed fields together. If all of the accessed fields fit on a cache line, the first access pulls them all into cache, potentially avoiding data-dependency stalls when accessing the other fields. Organize all frequently written fields into the same half-cache line.
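
A sketch of the idea with a hypothetical flow-table entry (the field names are illustrative only, not from any real product):

    /* Before: the two fields touched on every packet sit at opposite ends
       of the struct, so a single lookup can pull in two cache lines. */
    struct flow_entry_before {
        unsigned long hit_count;        /* written per packet        */
        char          name[48];         /* management plane only     */
        unsigned long creation_time;    /* management plane only     */
        unsigned long next_hop;         /* read per packet           */
    };

    /* After: the frequently accessed fields are grouped at the front so
       one cache-line fill covers them, and the frequently written field
       shares its half-cache line with the field read alongside it. */
    struct flow_entry_after {
        unsigned long hit_count;        /* hot: written per packet   */
        unsigned long next_hop;         /* hot: read per packet      */
        unsigned long creation_time;    /* cold                      */
        char          name[48];         /* cold                      */
    };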

Forces: Reordering structs might not be feasible. Some structs might map to a packet definition.

Supersonic ISR
Context: Your application uses multiple interrupt service routines (ISRs) to signal the availability of data on an interface and to trigger the processing of that data.

Problem: Interrupt service routines can interfere with other ISRs and with packet-processing code.

Solution: Keep ISRs short. Design them to be re-entrant.
An ISR should just give a semaphore, set a flag, or enqueue a packet. You should dequeue and process the data outside the ISR. This way, you obviate the need for interrupt locks around data in an ISR.

Interrupt locks in a frequent ISR can have hard-to-measure effects on the overall system. For more detailed interrupt design guidelines, see Doing Hard Time (Douglass 1999).
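
A minimal sketch of this split, using hypothetical BSP/RTOS primitives (hw_ack_rx_interrupt, rxq_enqueue, rx_sem_give, and so on stand in for whatever your platform provides):

    /* Hypothetical platform primitives; substitute your own. */
    extern void *hw_ack_rx_interrupt(void);   /* ack IRQ, return packet */
    extern void  rxq_enqueue(void *pkt);
    extern void *rxq_dequeue(void);
    extern void  rx_sem_give(void);
    extern void  rx_sem_take(void);
    extern void  process_packet(void *pkt);

    /* The ISR only acknowledges the interrupt, queues the packet, and
       signals the task level. No parsing, statistics, or allocation here. */
    void rx_isr(void)
    {
        void *pkt = hw_ack_rx_interrupt();

        rxq_enqueue(pkt);
        rx_sem_give();
    }

    /* All real protocol work happens outside interrupt context. */
    void rx_task(void)
    {
        void *pkt;

        for (;;) {
            rx_sem_take();                      /* block until the ISR signals */
            while ((pkt = rxq_dequeue()) != 0)
                process_packet(pkt);
        }
    }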

Forces: Consider the following things when designing your ISR:
1) Posting to a semaphore or queue can cause extra context switches, which reduce the overall efficiency of a system.

2) Bugs in ISRs usually have a catastrophic effect. Keeping them short and simple reduces the probability of bugs.

Assembly-Language-Critical Functions
Context: You have identified a C function that consumes a significant portion of the data path.

Problem: The code generated for this function might not be optimal for your processor.

Solution: Re-implement the critical function directly in assembly language. Use the best compiler for the application (see Best Compiler for Application) to generate initial assembly code, then hand-optimize it.
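
As a hypothetical illustration only (not code from the IXP400 software), the fragment below hand-codes a 32-bit byte swap with the classic ARM instruction sequence, wrapped in GCC inline assembly for an ARM/XScale target. The assembly your compiler emits with the -S option is usually the best starting point for this kind of hand-tuning, and it is worth re-measuring afterwards.

    static inline unsigned long swap32(unsigned long x)
    {
        unsigned long t;

        __asm__ ("eor %1, %0, %0, ror #16\n\t"   /* t = x ^ rotate(x, 16) */
                 "bic %1, %1, #0x00FF0000\n\t"   /* clear byte 2 of t     */
                 "mov %0, %0, ror #8\n\t"        /* x = rotate(x, 8)      */
                 "eor %0, %0, %1, lsr #8"        /* x ^= t >> 8           */
                 : "+r" (x), "=&r" (t));
        return x;
    }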

Forces: Keep the following things in mind:
1) Modern compiler technology is beginning to outperform the ability of humans to optimize assembly language for sophisticated processors.

2) Assembly language is more difficult to read and maintain.

3) Assembly language is more difficult to port to other processors.

Inline Functions
Context: You have identified a small C function that is called frequently on the data path.

Problem: The overhead associated with entering and exiting a function can become significant for a small function that is called frequently on the application data path.

Solution: Declare the function inline. This way, the function gets inserted directly into the code of the calling function.
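
A minimal sketch with a hypothetical header struct: a one-line accessor called for every packet is a good inlining candidate, because the call and return would otherwise cost more than the body itself.

    struct pkt_hdr {
        unsigned short proto;     /* EtherType */
        unsigned short len;
    };

    /* "static inline" lets the compiler paste the body into each caller
       and drop the call/return overhead entirely. */
    static inline int is_ipv4(const struct pkt_hdr *hdr)
    {
        return hdr->proto == 0x0800;   /* EtherType for IPv4 */
    }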

Forces: Be aware of the following:
1) Inline functions can increase the code size of your application and add stress to the instruction cache.

2) Some debuggers have difficulty showing the thread of execution for inline functions.

3) A function call itself can limit the compiler's ability to optimize register usage in the calling function.

4) A function call can cause a data-dependency stall when a register waiting for an SDRAM access is still in flight.

Cache-Optimizing Loop
Context: You have identified a critical loop that is a significant part of the data-path performance.

Problem: The structure of the loop, or the data on which it operates, could be "thrashing" the data cache.

Solution: You can consider a number of loop/data optimizations:
1) Array merging – the loop uses two or more arrays. Merge them into a single array of structs (see the sketch below).
2) Induction variable interchange
3) Loop fusion

For more details, see the Intel IXP400 software programmer's guide (Intel 2004c).
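
A sketch of array merging with hypothetical per-flow counters; the loop that updates both counters for a flow then touches one cache line instead of two.

    /* Before: two parallel arrays; updating both counters for one flow
       touches two widely separated cache lines. */
    unsigned long byte_counts[1024];
    unsigned long pkt_counts[1024];

    /* After: one array of structs; both counters for flow i normally
       share a cache line. */
    struct flow_stats {
        unsigned long byte_count;
        unsigned long pkt_count;
    };
    struct flow_stats stats[1024];

    void count_packet(int flow, unsigned long len)
    {
        stats[flow].byte_count += len;
        stats[flow].pkt_count++;
    }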

Forces: Loop optimizations can make the code harder to read, understand, and maintain.

Minimizing Local Variables
Context: You have identified a function that needs optimization. It contains a large number of local variables.

Problem: A large number of local variables might incur the overhead of storing them on the stack. The compiler might generate code to set up and restore the frame pointer.

Solution: Minimize the number of local variables. The compiler may then be able to store all the locals and parameters in registers.
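
A hypothetical sketch: the summing loop below keeps only the one accumulator it actually needs, so the compiler has a realistic chance of holding everything in registers and skipping the frame-pointer setup.

    unsigned int sum16(const unsigned short *buf, int words)
    {
        unsigned int sum = 0;    /* the only local the loop needs */

        while (words--)
            sum += *buf++;       /* no intermediate temporaries to spill */

        return sum;
    }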

Forces: Removing local variables can decrease the readability of the code or require extra calculations during the execution of the function.

Explicit Registers
Context: You have identified a function that needs optimization. A local variable or a piece of data is used frequently in the function.

Problem: Sometimes the compiler does not identify a register optimization.

Solution: It is worth trying explicit register hints on local variables that are used frequently in a function.

It can also be useful to copy a part of a packet that is used frequently in a data-path algorithm into a local variable declared register. An optimization of this kind produced a performance improvement of approximately 20 percent in one customer application.

Alternatively, you could add a local variable or register to explicitly "cache" a frequently used global variable. Some compilers will not keep global variables in local variables or registers on their own.

If you know the global is not modified by an interrupt handler and the global is modified a number of times in the same function, copy it to a register local variable, make updates to the local, and then write the new value back out to the global before exiting the function. This technique is especially useful when updating packet statistics in a loop handling multiple packets.
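
A minimal sketch of the statistics case, with hypothetical names; it assumes ip_rx_count is never modified by an interrupt handler while the loop runs.

    extern void process_packet(void *pkt);         /* hypothetical helper */
    extern unsigned long ip_rx_count;              /* global statistic    */

    void process_burst(void *pkts[], int n)
    {
        register unsigned long rx = ip_rx_count;   /* copy the global in  */
        register int i;

        for (i = 0; i < n; i++) {
            process_packet(pkts[i]);
            rx++;                                  /* update the local    */
        }

        ip_rx_count = rx;                          /* write back once     */
    }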

Forces: The register keyword is only a hint to the compiler.

Optimized Hardware Register Use
Context: The data path code does multiple reads or writes to one or more hardware registers.

Problem: Multiple read-operation-writes on hardware registers can cause the processor to stall.

Solution: First, break up read-operation-write statements to hide some of the latencies when dealing with hardware registers. For example:
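
The sketch below uses hypothetical register addresses and a made-up enable bit. Writing each access as a single read-operation-write statement serializes the read, modify, and write for each register; splitting the statements and issuing both reads first gives the processor useful work to do while the first read is still in flight.

    #define REG_A ((volatile unsigned long *)0xC8000000)   /* hypothetical */
    #define REG_B ((volatile unsigned long *)0xC8000004)   /* hypothetical */
    #define ENABLE_BIT 0x1UL

    void enable_both(void)
    {
        /* Instead of:  *REG_A |= ENABLE_BIT;  *REG_B |= ENABLE_BIT;  */
        unsigned long a = *REG_A;     /* issue both reads early          */
        unsigned long b = *REG_B;

        a |= ENABLE_BIT;              /* modify while the reads complete */
        b |= ENABLE_BIT;

        *REG_A = a;                   /* write back                      */
        *REG_B = b;
    }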

Second, search the data path code for multiple writes to the same hardware register. Combine all the separate writes into a single write to the actual register.

For example, some applications disable hardware interrupts using multiple set/resets of bits in the interrupt enable register. In one such application, when we manually combined these write instructions, performance improved by approximately 4 percent.
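
A sketch of the combined write, again with a hypothetical interrupt-enable register and bit masks:

    #define INT_EN ((volatile unsigned long *)0xC8003000)   /* hypothetical */
    #define RX_INT 0x1UL
    #define TX_INT 0x2UL

    void disable_rx_tx_slow(void)
    {
        *INT_EN &= ~RX_INT;              /* two reads and two writes */
        *INT_EN &= ~TX_INT;
    }

    void disable_rx_tx_fast(void)
    {
        *INT_EN &= ~(RX_INT | TX_INT);   /* one read and one write   */
    }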

Forces: Manually separated read-operation-write code expands the code slightly. It can also add local variables and could trigger the creation of a frame pointer.

Avoiding the OS Packet-Buffer Pool
Context: The application uses a system packet buffer pool.

Problem: Memory allocation or calls to packet-buffer pool libraries can be processor intensive. In some operating systems, these functions lock interrupts and use local semaphores to protect simultaneous access to shared heaps.

Pay special attention to packet-buffer management at design time. Are buffers being allocated on the application data path? Where in the design are packet buffers replenished into the Intel IXP400 software access layer?

Solution: Avoid allocating or interacting with the RTOS packet buffer pool on the data path. Pre-allocate packet buffers outside the data path and store them in lightweight software pools/queues.

Stacks or arrays are typically faster than linked lists for packet-buffer pool collections, as they require fewer memory accesses to add and remove buffers. Stacks also improve data cache utilization.
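
A minimal sketch of such a lightweight pool, built on a fixed-size stack of pre-allocated buffers (the size and buffer type are hypothetical, and no locking is shown, so it is only safe where a single context, or whatever RTOS discipline you add, guards access):

    #define POOL_SIZE 256

    static void *pool[POOL_SIZE];      /* filled with buffers at init time */
    static int   top;                  /* number of free buffers           */

    void pool_put(void *buf)           /* return a buffer to the pool      */
    {
        if (top < POOL_SIZE)
            pool[top++] = buf;
    }

    void *pool_get(void)               /* grab a buffer on the data path   */
    {
        return (top > 0) ? pool[--top] : 0;
    }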

Forces: OS packet-buffer pools implement buffer collections. Writing another lightweight collection duplicates functionality.

C-Language Optimizations
Context: You have identified a function or segment of code that is consuming a significant portion of the Intel XScale core's clock cycles on the data path. You might have identified this code using profiling tools (see Profiling Tools) or a performance measurement.

Problem: A function or segment of C-code needs optimization.

Solution: You can try a number of C-language-level optimizations:
1) Pass large function parameters by reference, never by value. Values take time to copy and use registers.

2) Avoid array indexing. Use pointers.

3) Minimize loops by collecting multiple operations into a single loop body.

4) Avoid long if-then-else chains. Use a switch statement or a state machine.

5) Use int (the natural word size of the processor) to store flags rather than char or short.

6) Use unsigned variants of variables and parameters where possible. Doing so might allow some compilers to make optimizations.

7) Avoid floating-point calculations on the data path.

8) Use decrementing loop variables, for example, for (i=10; i--;) { do something } or, even better, do { something } while (i--). Look at the code generated by your compiler in this case. The Intel XScale core has the ability to modify the processor condition codes using adds and subs rather than add and sub, saving a cmp instruction in loops (see the sketch after this list). For more details, see the Intel IXP400 software programmer's guide (Intel 2004c).

9) Place the condition that is most frequently true first in if-else statements.

10) Place frequent case labels first.

11) Write small functions, but balance size with complexity and performance. The compiler likes to reuse registers as much as possible and cannot do so in complex nested code. However, some compilers automatically use a number of registers on every function call. Extra function-call entries and returns can cost a large number of cycles in tight loops.

12) Use the function return value, as opposed to an output parameter, to return a value to the calling function. Return values on many processors are stored in a register by convention.

13) For critical loops, use Duff's device, a devious, generic technique for unrolling loops. See question 20.35 in the comp.lang.c FAQ (Summit 1995). For other similar tips, see Chapter 29 in Code Complete (McConnell 1993).
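
A minimal sketch of item 8, with a hypothetical buffer-clearing loop: the decrementing form lets the loop test fall out of the subtract's condition codes, saving a compare per iteration on processors such as the Intel XScale core.

    void clear_words_up(unsigned long *buf, int n)
    {
        int i;

        for (i = 0; i < n; i++)        /* compares i against n every pass   */
            buf[i] = 0;
    }

    void clear_words_down(unsigned long *buf, int n)
    {
        while (n--)                    /* the test comes from the decrement */
            *buf++ = 0;
    }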

Forces: Good compilers make a number of these optimizations.

Conclusion
The techniques discussed here are suggested solutions to the problems proposed, and as such, they are provided for informational purposes only. Neither the authors nor Intel Corporation can guarantee that these proposed solutions will be applicable to your application or that they will resolve the problems in all instances.

The performance tests and ratings mentioned here were measured using specific computer systems and/or components, and the results reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration can affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

To read Part 1, go to A review of general patterns.
To read Part 2, go to Patterns for use in networking applications.

This article was excerpted from Designing Embedded Networking Applications, by Peter Barry and Gerard Hartnett, published by Intel Press. Copyright © 2005 Intel Corporation. All rights reserved.

Peter Barry and Gerard Hartnett are senior Intel engineers. Both have been leaders in the design of Intel network processor platforms and are regularly sought out for their expert guidance on network components.

References:
(Alexander 1979) Alexander, Christopher. 1979. The Timeless Way of Building. Oxford University Press.
(Gamma et al. 1995) Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.
(Ganssle 1999) Ganssle, Jack. 1999. The Art of Designing Embedded Systems. Newnes-Elsevier.
(McConnell 1993) McConnell, Steve. 1993. Code Complete: A Practical Handbook of Software Construction. Microsoft Press.
