Tricks and techniques for performance tuning your embedded system using patterns: Part 3

Peter Barry and Gerard Hartnett, Intel Corp.

June 23, 2007


In addition to the performance optimization patterns for general purpose embedded designs (Part 1) and for networking (Part 2), there are some general code tuning guidelines that are applicable to most processors. In many cases, these optimizations can decrease the readability, maintainability, or portability of your code. Be sure you are optimizing code that needs optimization (see Premature Code Tuning Avoided).

Reordered Struct
Context: You have identified a bottleneck segment of code on your application data path. The code uses a large struct.

Problem: The struct spans a number of cache lines.

Solution: Reorder the fields in a struct to group the frequently accessed fields together. If all of the accessed fields fit on a cache line, the first access pulls them all into cache, potentially avoiding data-dependency stalls when accessing the other fields. Organize all frequently written fields into the same half-cache line.
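As an illustration, here is a hypothetical connection record with its hot fields grouped at the front so they fit in one 32-byte cache line; every field name is invented for this sketch:

```c
#include <stddef.h>

/* Hypothetical connection record: the fields touched on every packet
 * are grouped first so one cache-line fill brings them all in; the
 * rarely read statistics follow. */
struct conn {
    /* -- frequently accessed on the data path -- */
    unsigned int   src_ip;
    unsigned int   dst_ip;
    unsigned short src_port;
    unsigned short dst_port;
    void          *next_hop;
    unsigned int   flags;
    /* -- rarely accessed (statistics, timestamps) -- */
    unsigned long long bytes_in;
    unsigned long long bytes_out;
    unsigned int       last_seen;
};
```

With this ordering, `offsetof` confirms the hot fields end inside the first 32 bytes, so a single line fill covers them.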

Forces: Reordering structs might not be feasible. Some structs might map to a packet definition.

Supersonic ISR
Context: Your application uses multiple interrupt service routines (ISR) to signal the availability of data on an interface and trigger the processing of that data.

Problem: Interrupt service routines can interfere with other ISRs and with packet-processing code.

Solution: Keep ISRs short. Design them to be re-entrant.
An ISR should just give a semaphore, set a flag, or en-queue a packet. You should de-queue and process the data outside the ISR. This way, you obviate the need for interrupt locks around data in an ISR.

Interrupt locks in a frequent ISR can have hard-to-measure effects on the overall system. For more detailed interrupt design guidelines, see Doing Hard Time (Douglass 1999).

Forces: Consider the following things when designing your ISR:
1) Posting to a semaphore or queue can cause extra context switches, which reduce the overall efficiency of a system.

2) Bugs in ISRs usually have a catastrophic effect. Keeping them short and simple reduces the probability of bugs.

Assembly-Language-Critical Functions
Context: You have identified a C function that consumes a significant portion of the data path.

Problem: The code generated for this function might not be optimal for your processor.

Solution: Re-implement the critical function directly in assembly language. Use the best compiler for the application (see Best Compiler for Application) to generate initial assembly code, then hand-optimize it.

Forces: Keep the following things in mind:
1) Modern compiler technology is beginning to outperform humans at optimizing assembly language for sophisticated processors.

2) Assembly language is more difficult to read and maintain.

3) Assembly language is more difficult to port to other processors.

Inline Functions
Context: You have identified a small C function that is called frequently on the data path.

Problem: The overhead of function entry and exit can become significant for a small function that is called frequently on the application data path.

Solution: Declare the function inline. This way, the function gets inserted directly into the code of the calling function.
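For example, a hypothetical checksum helper declared static inline so the compiler can paste its body directly into the calling loop, removing the call and return:

```c
/* Hypothetical checksum-accumulate helper, hot on the data path.
 * 'static inline' (typically in a header) lets the compiler insert
 * the body into each caller instead of emitting a call. */
static inline unsigned int csum_add(unsigned int sum, unsigned short word)
{
    sum += word;
    return (sum & 0xFFFF) + (sum >> 16);  /* fold the carry back in */
}

unsigned short csum16(const unsigned short *buf, int nwords)
{
    unsigned int sum = 0;
    while (nwords--)
        sum = csum_add(sum, *buf++);      /* no call overhead when inlined */
    return (unsigned short)~sum;
}
```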

Forces: Be aware of the following:
1) Inline functions can increase the code size of your application and add stress to the instruction cache.

2) Some debuggers have difficulty showing the thread of execution for inline functions.

3) A function call itself can limit the compiler's ability to optimize register usage in the calling function.

4) A function call can cause a data-dependency stall if it uses a register whose value is still in flight from an SDRAM access.

Cache-Optimizing Loop
Context: You have identified a critical loop that is a significant part of the data-path performance.

Problem: The structure of the loop or the data on which it operates could be "thrashing" the data cache.

Solution: You can consider a number of loop/data optimizations:
1) Array merging - when the loop uses two or more arrays, merge them into a single array of structs.
2) Induction variable interchange
3) Loop fusion
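The array-merging transformation might look like this sketch (all names hypothetical): two parallel arrays walked in the same loop become one array of structs, so each iteration touches a single cache line instead of two separate line streams.

```c
#define NFLOWS 256

/* Before: two parallel arrays, two cache-line streams per iteration.
 *   unsigned int pkt_count[NFLOWS];
 *   unsigned int byte_count[NFLOWS];
 */

/* After: one array of structs; both counters for a flow share a line. */
struct flow_stats {
    unsigned int pkt_count;
    unsigned int byte_count;   /* adjacent in memory with pkt_count */
} stats[NFLOWS];

void account(int flow, unsigned int bytes)
{
    stats[flow].pkt_count++;           /* same cache line as below */
    stats[flow].byte_count += bytes;
}
```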

For more details, see the Intel IXP400 software programmer's guide (Intel 2004c).

Forces: Loop optimizations can make the code harder to read, understand, and maintain.

Minimizing Local Variables
Context: You have identified a function that needs optimization. It contains a large number of local variables.

Problem: A large number of local variables might incur the overhead of storing them on the stack. The compiler might generate code to set up and restore the frame pointer.

Solution: Minimize the number of local variables. The compiler may be able to store all the locals and parameters in registers.

Forces: Removing local variables can decrease the readability of code or require extra calculations during the execution of the function.

Explicit Registers
Context: You have identified a function that needs optimization. A local variable or a piece of data is frequently used in the function.

Problem: Sometimes the compiler does not identify a register optimization.

Solution: It is worth adding explicit register hints to local variables that are frequently used in a function.

It can also be useful to copy a frequently used part of a packet into a local variable declared register when that field is used repeatedly in a data-path algorithm. An optimization of this kind improved performance by approximately 20 percent in one customer application.

Alternatively, you could add a local variable or register to explicitly "cache" a frequently used global variable; some compilers will not keep a global variable in a local variable or register on their own.

If you know the global is not modified by an interrupt handler and the global is modified a number of times in the same function, copy it to a register local variable, make updates to the local and then write the new value back out to the global before exiting the function. This technique is especially useful when updating packet statistics in a loop handling multiple packets.
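A sketch of that technique, assuming `total_pkts` (a hypothetical global statistic) is never modified by an interrupt handler:

```c
/* Hypothetical global packet counter, updated once per packet. */
unsigned long total_pkts;

void process_burst(void **pkts, int n)
{
    /* Safe only because no interrupt handler writes total_pkts. */
    register unsigned long count = total_pkts;  /* read global once */

    for (int i = 0; i < n; i++) {
        /* ... process pkts[i] ... */
        count++;                                /* update the local */
    }

    total_pkts = count;                         /* write back once */
}
```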

Forces: The register keyword is only a hint to the compiler.

Optimized Hardware Register Use
Context: The data path code does multiple reads or writes to one or more hardware registers.

Problem: Multiple read-operation-write sequences on hardware registers can cause the processor to stall on each read in turn.

Solution: First, break up read-operation-write statements to hide some of the read latency when dealing with hardware registers.
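For example, here is a sketch using two hypothetical memory-mapped registers passed in as volatile pointers. The "before" form stalls on each read in turn; the rewritten form issues both reads up front so their latencies overlap:

```c
/* Hypothetical registers passed as volatile pointers so the sketch
 * is testable; on real hardware these would be fixed MMIO addresses. */
void update_regs(volatile unsigned int *reg_a, volatile unsigned int *reg_b)
{
    /* Before: each statement stalls on its own read.
     *   *reg_a |= 0x01;
     *   *reg_b |= 0x02;
     */

    /* After: issue both reads first, then operate, then write. */
    unsigned int a = *reg_a;   /* read A: latency starts      */
    unsigned int b = *reg_b;   /* read B overlaps A's latency */
    a |= 0x01;
    b |= 0x02;
    *reg_a = a;
    *reg_b = b;
}
```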

Second, search the data path code for multiple writes to the same hardware register. Combine all the separate writes into a single write to the actual register.

For example, some applications disable hardware interrupts using multiple set/resets of bits in the interrupt enable register. In one such application when we manually combined these write instructions, performance improved by approximately 4 percent.
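A sketch of this kind of combining, with hypothetical interrupt-source bit names — three read-modify-write accesses to the enable register collapse into one:

```c
/* Bit positions are invented for this sketch. */
enum { IRQ_RX = 1u << 0, IRQ_TX = 1u << 1, IRQ_ERR = 1u << 2 };

void irq_disable_sources(volatile unsigned int *int_enable)
{
    /* Before: three separate register accesses.
     *   *int_enable &= ~IRQ_RX;
     *   *int_enable &= ~IRQ_TX;
     *   *int_enable &= ~IRQ_ERR;
     */

    /* After: one read, one combined mask, one write. */
    *int_enable &= ~(unsigned int)(IRQ_RX | IRQ_TX | IRQ_ERR);
}
```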

Forces: Manually separated read-operation-write code expands code slightly. It can also add local variables and could trigger the creation of a frame pointer.

Avoiding the OS Packet-Buffer Pool
Context: The application uses a system packet buffer pool.

Problem: Memory allocation and calls to packet buffer pool libraries can be processor intensive. In some operating systems, these functions lock interrupts and use local semaphores to protect simultaneous access to shared heaps.

Pay special attention to packet-buffer management at design time. Are buffers being allocated on the application data path? Where in the design are packet buffers replenished into the Intel IXP400 software access layer?

Solution: Avoid allocating or interacting with the RTOS packet buffer pool on the data path. Pre-allocate packet buffers outside the data path and store them in lightweight software pools/queues.

Stacks or arrays are typically faster than linked lists for packet buffer pool collections, as they require fewer memory accesses to add and remove buffers. Stacks also improve data cache utilization, because the most recently freed buffer is the first to be reused.
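A minimal stack-based pool sketch (single-threaded, names hypothetical — a pool shared between contexts would need locking or a per-context pool):

```c
#include <stddef.h>

#define POOL_DEPTH 128

/* Lightweight packet-buffer pool implemented as a stack of pointers.
 * Replenished outside the data path; O(1) get and put, LIFO order so
 * the most recently freed (still cache-warm) buffer is reused first. */
static void *pool[POOL_DEPTH];
static int   top;               /* number of buffers in the pool */

int pool_put(void *buf)         /* replenish, outside the data path */
{
    if (top == POOL_DEPTH)
        return -1;              /* pool full */
    pool[top++] = buf;
    return 0;
}

void *pool_get(void)            /* allocation on the data path */
{
    return top ? pool[--top] : NULL;
}
```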

Forces: OS packet-buffer pools implement buffer collections. Writing another light collection duplicates functionality.

C-Language Optimizations
Context: You have identified a function or segment of code that is consuming a significant portion of the Intel XScale core's clock cycles on the data path. You might have identified this code using profiling tools (see Profiling Tools) or a performance measurement.

Problem: A function or segment of C-code needs optimization.

Solution: You can try a number of C-language level optimizations:
1) Pass large function parameters by reference, never by value. Passing by value copies the data and consumes registers.

2) Avoid array indexing. Use pointers.

3) Minimize loops by collecting multiple operations into a single loop body.

4) Avoid long if-then-else chains. Use a switch statement or a state machine.

5) Use int (the natural word size of the processor) to store flags rather than char or short.

6) Use unsigned variants of variables and parameters where possible. Doing so might allow some compilers to make optimizations.

7) Avoid floating point calculations on the data path.

8) Use decrementing loop variables, for example, for (i = 10; i--;) { /* something */ } or, even better, do { /* something */ } while (i--). Look at the code your compiler generates in this case. The Intel XScale core can update the processor condition codes using adds and subs rather than add and sub, saving a cmp instruction in loops. For more details, see the Intel IXP400 software programmer's guide (Intel 2004c).

9) Place the most frequently true statement first in if-else statements.

10) Place frequent case labels first.

11) Write small functions but balance size with complexity and performance. The compiler likes to reuse registers as much as possible and cannot do it in complex nested code. However, some compilers automatically use a number of registers on every function call. Extra function call entries and returns can cost a large number of cycles in tight loops.

12) Use the function return as opposed to an output parameter to return a value to a calling function. Return values on many processors are stored in a register by convention.

13) For critical loops, use Duff's device, a devious, generic technique for unrolling loops. See question 20.35 in the comp.lang.c FAQ (Summit 1995). For other similar tips, see Chapter 29 of Code Complete (McConnell 1993).
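For reference, here is the memcpy-style variant of Duff's device, unrolled eight ways; the function name and word type are invented for this sketch. The switch jumps into the middle of the do-while to handle the leftover count, so any positive count works with one unrolled loop:

```c
/* Duff's device: copy 'count' words, eight per loop iteration.
 * The switch dispatches into the unrolled loop body to consume the
 * count % 8 remainder on the first pass. */
void copy_words(unsigned int *to, const unsigned int *from, int count)
{
    if (count <= 0)
        return;
    int n = (count + 7) / 8;            /* total loop iterations */
    switch (count % 8) {
    case 0: do { *to++ = *from++;       /* fall through */
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```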

Forces: Good compilers make a number of these optimizations.

The techniques discussed here are suggested solutions to the problems proposed, and as such, they are provided for informational purposes only. Neither the authors nor Intel Corporation can guarantee that these proposed solutions will be applicable to your application or that they will resolve the problems in all instances.

The performance tests and ratings mentioned here were measured using specific computer systems and/or components, and the results reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration can affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

To read Part 1 go to A review of general patterns.
To read Part 2, go to Patterns for use in networking applications.

This article was excerpted from Designing Embedded Networking Applications, by Peter Barry and Gerard Hartnett and published by Intel Press. Copyright © 2005 Intel Corporation. All rights reserved.

Peter Barry and Gerard Hartnett are senior Intel engineers. Both have been leaders in the design of Intel network processor platforms and are regularly sought out for their expert guidance on network components.

(Alexander 1979) Alexander, Christopher. 1979. The Timeless Way of Building. Oxford University Press.
(Gamma et al. 1995) Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.
(Ganssle 1999) Ganssle, Jack. 1999. The Art of Designing Embedded Systems. Newnes/Elsevier.
(McConnell 1993) McConnell, Steve. 1993. Code Complete: A Practical Handbook of Software Construction. Microsoft Press.
