Tricks and techniques for performance tuning your embedded system using patterns: Part 2

Peter Barry and Gerard Hartnett, Intel Corp.

June 11, 2007

In addition to the performance tuning patterns in Part 1 that can be employed in any embedded systems design, there are a number of patterns that can be applied to networking performance in general. These techniques are not typically specific to the IXP42X product line.

Bottleneck Hunting
Context: You have a running functional system. You have a performance requirement (see Defined Performance Requirement). A customer is measuring performance lower than that requirement.

Problem: You can have a number of performance bottlenecks in the designed system but unless you identify the current limiting factor, you might optimize the wrong thing. One component of the system might be limiting the flow of network packets to the rest of the system.

Solution: Performance improvement really starts with bottleneck hunting. Only when you find the performance-limiting bottleneck can you work on optimizations to remove it. A system typically has a number of bottlenecks. You first need to identify the current limiting bottleneck and then remove it. You then need to iterate through the remaining bottlenecks until the system meets its performance requirements.

First, determine if your application is CPU or I/O bound. In a CPU-bound system, the limiting factor or bottleneck is the number of cycles needed to execute some algorithm or part of the data path. In an I/O-bound system, the bottleneck is external to the processor. The processor has enough CPU cycles to handle the traffic, but the traffic flow is not enough to make full use of the available processor cycles.

To determine if the system is CPU or I/O bound, try running the processor at a number of different clock speeds. If you see a significant change in performance, your system is probably CPU bound.
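The clock-speed experiment can be reduced to a simple comparison of how much throughput changes relative to how much the clock changes. The following Python sketch is illustrative only: the function name, the 0.5 cut-off, and the sample measurements are assumptions made for this example, not values from any Intel documentation.

```python
def classify_bound(samples, threshold=0.5):
    """Classify a system as CPU or I/O bound.

    samples is a list of (clock_mhz, throughput) measurements. The function
    compares the relative throughput gain with the relative clock gain
    between the slowest and fastest clock setting. The threshold is an
    arbitrary illustrative cut-off, not a calibrated value.
    """
    if len(samples) < 2:
        raise ValueError("need at least two measurements")
    ordered = sorted(samples)
    (clk_lo, tput_lo), (clk_hi, tput_hi) = ordered[0], ordered[-1]
    if clk_hi == clk_lo or tput_lo <= 0:
        raise ValueError("need two distinct clock speeds and positive throughput")
    clock_gain = (clk_hi - clk_lo) / clk_lo
    tput_gain = (tput_hi - tput_lo) / tput_lo
    scaling = tput_gain / clock_gain  # close to 1.0 means throughput tracks the clock
    label = "cpu-bound" if scaling >= threshold else "io-bound"
    return label, scaling


# Hypothetical measurements in megabits per second at three clock speeds.
print(classify_bound([(266, 40.0), (400, 59.0), (533, 79.0)]))
print(classify_bound([(266, 60.0), (533, 62.0)]))
```

A scaling value near 1.0 suggests throughput tracks the clock (probably CPU bound); a value near 0 suggests the limit lies outside the processor.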

Next, look at the software components of the data path; these might include:

1) Low-level device drivers specific to a piece of hardware. These device drivers could conform to an OS-specific interface.

2) A network interface service mechanism running on the Intel XScale core. This mechanism might be a number of ISRs or a global polling loop.

3) Adapter components or glue code that adapts the hardware-specific drivers or the underlying Intel IXP400 software APIs to an RTOS or network stack.

4) Encapsulation layers of the networking stack

5) The switching/bridging/routing engine of the networking stack or the RTOS

6) Intel IXP400 software access APIs. These functions provide an abstraction of the underlying microcode and silicon.

7) IXP42X NPE microcode

If some algorithm in a low-level device driver is limiting the flow of data into the system, you might waste your time if you start tweaking compiler flags or optimizing the routing algorithm.

It is best to look first at the components that are new or unique to a particular system. Typically, these are the low-level device drivers or the adapter components unique to this system. Other projects have already used the routing algorithm, the Intel IXP400 software, and the NPE microcode.

Concentrate on the unique components first, especially if these components are on the edge of the system. In one wireless application we discovered the wireless device driver was a bottleneck that limited the flow of data into the system.

Many components of a data path can contain packet buffers. Packet counters inserted in the code can help you identify queue overflows or underflows. Typically, the code that consumes the packet buffer is the bottleneck.
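A minimal sketch of such packet counters follows, written in Python for brevity. The class, the counter names, and the queue sizes are hypothetical and chosen only for illustration; a real data path would keep equivalent counters in the driver or stack code.

```python
from collections import deque


class CountedQueue:
    """Bounded packet queue that counts overflows and underflows."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.items = deque()
        self.enqueued = 0
        self.dequeued = 0
        self.overflows = 0   # producer found the queue full
        self.underflows = 0  # consumer found the queue empty

    def put(self, pkt):
        if len(self.items) >= self.capacity:
            self.overflows += 1
            return False
        self.items.append(pkt)
        self.enqueued += 1
        return True

    def get(self):
        if not self.items:
            self.underflows += 1
            return None
        self.dequeued += 1
        return self.items.popleft()


def suspect_queue(queues):
    """Return the name of the queue with the most overflows, or None."""
    worst = max(queues, key=lambda q: q.overflows, default=None)
    return worst.name if worst is not None and worst.overflows else None


rx = CountedQueue("rx", capacity=4)
tx = CountedQueue("tx", capacity=4)
for seq in range(10):   # producer outruns the consumer of the rx queue
    rx.put(seq)
tx.get()                # tx consumer finds nothing to send
print(rx.overflows, tx.underflows, suspect_queue([rx, tx]))
```

An overflowing queue points at the code that consumes it, while an underflowing queue indicates that its consumer is being starved by something upstream.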

This process is typically iterative. When you fix the current bottleneck, you then need to loop back and identify the next one.

Forces: Be aware of the following:
1) Most systems have multiple bottlenecks.
2) Early bottleneck hunting—before you have a complete running system—increases the risk of misidentified bottlenecks and wasted tuning effort.

Evaluating Traffic Generator and Protocols
Context: You are using a network-traffic generator and protocols to measure the performance of a system.

Problem: The performance test or protocol overheads can limit the measured performance of your application.

Solution: Identifying the first bottleneck is a challenge. First, you need to eliminate your traffic generators and protocols as bottlenecks and analyze the invariants. Typical components in a complete test system might include:

1) Traffic sources, sinks, and measurement equipment.
2) The device under test (DUT) for which you are tuning the performance.
3) Physical connections and protocols between traffic sources and the DUT.

Your test environment might use a number of different types of traffic sources, sinks, and measurement equipment. You need to first make sure they are not the bottleneck in your system.

Equipment, like Smartbits and Adtech testers, is not typically a bottleneck. However, using a PC with FTP software to measure performance can be a bottleneck. You need to test the PC and FTP software without the DUT to make sure your traffic sources can reach the performance you require.

Running this test can also flush out bottlenecks in the physical media or protocols you are using. In addition, you need to make sure the overhead inherent in the protocols you are using makes the performance you require feasible. For example:

1) You cannot expect 100 megabits per second over Ethernet with 64-byte packets due to inter-frame gap and frame preamble. You can expect to get at most 76 megabits per second.

2) You cannot expect to get 8 megabits per second over an ADSL link; you can expect to get at most 5.5 megabits per second.

3) You cannot expect to get 100 megabits per second on FTP running over Ethernet. You must take IP protocol overhead and TCP acknowledgements into account.

4) You cannot expect 52 megabits per second on 802.11a/g networks due to CTS/RTS overhead and protocol overhead.
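The Ethernet figure in item 1 can be checked with a short calculation. The sketch below assumes standard Ethernet framing, in which every frame is accompanied on the wire by 8 bytes of preamble and start-of-frame delimiter and a minimum inter-frame gap of 12 byte times, and it counts the whole frame (header and FCS included) as throughput. The function name is illustrative.

```python
PREAMBLE_SFD_BYTES = 8      # 7-byte preamble plus 1-byte start-of-frame delimiter
INTER_FRAME_GAP_BYTES = 12  # minimum inter-frame gap, expressed in byte times


def ethernet_frame_throughput(link_mbps, frame_bytes):
    """Return (frames per second, frame megabits per second) at line rate."""
    wire_bytes = frame_bytes + PREAMBLE_SFD_BYTES + INTER_FRAME_GAP_BYTES
    frames_per_sec = link_mbps * 1_000_000 / (wire_bytes * 8)
    frame_mbps = link_mbps * frame_bytes / wire_bytes
    return frames_per_sec, frame_mbps


for size in (64, 1518):
    fps, mbps = ethernet_frame_throughput(100, size)
    print(f"{size}-byte frames: {fps:.0f} frames/s, {mbps:.1f} Mb/s")
```

With 64-byte frames, each frame occupies 84 byte times on the wire, so only 64/84 of the link rate (about 76 percent) carries frame data. With maximum-size 1518-byte frames the same overhead is much less significant.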

Characteristics of the particular protocol or application could also be causing the bottleneck. For example, if the FTP performance is much lower (by a factor of 2) than the large-packet performance with a traffic generator (Smartbits), the problem could be that the TCP acknowledgement packets are getting dropped. This problem can sometimes be a buffer management issue.

FTP performance can also be significantly affected by the TCP window sizes on the FTP client and server machines.
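The effect of the window size can be estimated with the usual rule of thumb that a single TCP connection can have at most one window of data in flight per round trip. The sketch below is a simplified upper bound that ignores slow start, packet loss, and acknowledgement overhead; the window sizes and round-trip time are hypothetical values chosen for illustration.

```python
def tcp_window_limited_mbps(window_bytes, rtt_seconds):
    """Upper bound on one TCP connection's throughput: one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    return window_bytes * 8 / rtt_seconds / 1_000_000


# Hypothetical test setup: 2 ms round trip through the DUT.
for window in (8192, 65535):
    print(window, round(tcp_window_limited_mbps(window, 0.002), 1))
```

In this example, an 8 KB window limits the transfer to roughly 33 megabits per second regardless of how fast the DUT can forward packets, so a small window on the client or server can look like a DUT bottleneck.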

Forces: Test equipment typically outperforms the DUT.

Environmental Factors
Context: You are finding it difficult to identify the bottleneck.

Problem: Environmental factors can cause a difficult-to-diagnose bottleneck.

Solution: Check the environmental factors.
When testing a wireless application, you might encounter radio interference in the test environment. In this case, you can use a Faraday cage to radio-isolate your test equipment and DUT from the environment. Antenna configuration is also important. The antennas should not be too close together (less than 1 meter apart), and they should be upright, not lying down. You also need to make sure you shield the DUT to protect it from antenna interference.

Check shared resources. Is your test equipment or DUT sharing a resource, such as a network segment, with other equipment? Is that other equipment making enough use of the shared resource to affect your DUT performance?

Check all connectors and cables. If you are confident you are making improvements but the measurements are not giving the improvement you expect, try changing all the cables connecting your DUT to the test equipment. As a last resort, try a replacement DUT. We have seen a number of cases where a device on the DUT had degraded enough to affect performance.

Polled Packet Processor
Context: You are designing the fundamental mechanism that drives the servicing of network interfaces.

Problem: Some fundamental mechanisms can expose you to more overhead and wasted CPU cycles. These wasted cycles can come from interrupt preamble/dispatch and context switches.

Solution: You can categorize most applications as interrupt driven, polling driven, or a combination of both.

When traffic overloads a system, the system runs optimally if it is running in a tight loop, polling the interfaces on which it knows traffic is queued. If the application driver is interrupt-based, check how many packets you handle per interrupt. To get better packet-processing performance, handle more packets per interrupt, possibly by using a polling approach in the interrupt handler.
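The idea of handling more packets per interrupt can be modeled with a handler that drains a receive ring up to a fixed budget on each invocation. This is a simplified simulation in Python, not driver code; the ring, the handler, and the budget values are assumptions made for the example.

```python
from collections import deque


def service_rx(rx_ring, handle_packet, budget):
    """Drain up to `budget` packets from rx_ring in one handler invocation."""
    handled = 0
    while rx_ring and handled < budget:
        handle_packet(rx_ring.popleft())
        handled += 1
    return handled


def invocations_needed(num_packets, budget):
    """Count the handler invocations needed to drain a backlog of packets."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    ring = deque(range(num_packets))
    calls = 0
    while ring:
        service_rx(ring, lambda pkt: None, budget)
        calls += 1
    return calls


# One packet per interrupt versus a polling loop with a budget of 16.
print(invocations_needed(64, 1), invocations_needed(64, 16))
```

Each invocation carries the fixed cost of interrupt preamble and dispatch, so draining 64 queued packets in 4 invocations instead of 64 spreads that fixed cost over 16 times as many packets.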

Some systems put the packet on a queue from the interrupt handler and then do the packet processing in another thread. In this kind of a system, you need to understand how many packets the system handles per context switch.

To improve performance, increase the number of packets handled per context switch.

Other systems drive packet processing from a timer interrupt. In this case, you need to make sure the timer frequency and the number of packets handled per interrupt are not limiting the networking performance of your system. In addition, this kind of system is not optimally efficient when it is in overload. Systems based on Linux are usually interrupt-based.
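For a timer-driven design, the ceiling on packet rate is simply the timer frequency multiplied by the number of packets handled per tick. The sketch below makes that check explicit; the function names and the sample timer rates are illustrative assumptions.

```python
import math


def max_packets_per_second(timer_hz, packets_per_tick):
    """Upper bound on packet rate for timer-driven packet processing."""
    return timer_hz * packets_per_tick


def packets_per_tick_needed(target_pps, timer_hz):
    """Smallest per-tick batch that can sustain target_pps at this timer rate."""
    return math.ceil(target_pps / timer_hz)


# A 100 Hz timer handling 32 packets per tick cannot exceed 3200 packets/s.
print(max_packets_per_second(100, 32))
# Sustaining about 148,810 packets/s from a 1 kHz timer needs 149 packets per tick.
print(packets_per_tick_needed(148_810, 1000))
```

The figure of roughly 148,810 packets per second corresponds to 64-byte frames at line rate on 100 megabit per second Ethernet, assuming standard preamble and inter-frame gap overhead.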

Forces: Be sure to consider the following:
1) Reducing wasted CPU cycles can complicate the overall architecture or design of an application.
2) Some IP stacks or operating systems can restrict the options in how you design these fundamental mechanisms.
3) You might need to throttle the amount of CPU given to packet processing to allow other processing to happen even when the system is in overload.
4) Applying these techniques might increase the latency in handling some packets.

Edge Packet Throttle
Context: The bottleneck of your system is now the IP forwarding or transport parts of the IP stack.

Problem: You might be wasting CPU cycles processing packets, only to drop them when a queue fills later in the data path.

Solution: When a system goes into overload, it is better to let the frames back up in the RX queue and let the edges of your system, the NPE and PHY devices, throttle reception. You can avoid wasting core cycles by checking a bottleneck indicator, such as queue full, early in the data path code.
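The early check can be sketched as a toy data path in which the receive loop tests the downstream queue before it takes a packet off the RX ring. The class and attribute names below are hypothetical, and the `work_done` counter merely stands in for the per-packet processing cost.

```python
from collections import deque


class Forwarder:
    """Toy data path: an RX ring feeding a bounded TX queue."""

    def __init__(self, tx_capacity):
        self.rx_ring = deque()
        self.tx_queue = deque()
        self.tx_capacity = tx_capacity
        self.work_done = 0  # counts expensive per-packet processing steps

    def tx_full(self):
        return len(self.tx_queue) >= self.tx_capacity

    def poll(self):
        """Forward packets until the RX ring is empty or the TX queue is full."""
        forwarded = 0
        # Check the bottleneck indicator before spending any cycles on a packet.
        while self.rx_ring and not self.tx_full():
            pkt = self.rx_ring.popleft()
            self.work_done += 1  # stands in for parsing, lookup, and so on
            self.tx_queue.append(pkt)
            forwarded += 1
        return forwarded


fwd = Forwarder(tx_capacity=2)
fwd.rx_ring.extend(range(5))
print(fwd.poll(), fwd.work_done, len(fwd.rx_ring))
```

In this sketch, the three packets that cannot be forwarded stay in the RX ring untouched, so no processing cycles are spent on packets that would be dropped at the full TX queue.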

For example, on VxWorks, you can make the main packet-processing task (netTask) the highest priority task. This technique is one easy way to implement a "self-throttling" system. Alternatively, you could make the buffer replenish code a low-priority task, which would ensure receive buffers are only supplied when you have available CPU.

Forces: Be aware of the following:
1) Checking a bottleneck indicator might weaken the encapsulation of an internal detail of the IP stack.
2) Implementing an early check wastes some CPU cycles when the system is in overload.

Detecting Resource Collisions
Context: You make a change and performance drops unexpectedly.

Problem: A resource collision effect could be causing a pronounced performance bottleneck. Examples of such effects we have seen are:

1) TX traffic is being generated from RX traffic; Ethernet is running in half-duplex mode. The time it takes to generate the TX frame from an RX frame corresponds to the inter-frame gap. When the TX frame is sent, it collides with the next RX frame.

2) The Ethernet interface is running full duplex, but traffic is being generated in a loop and the frame transmissions occur at times the MAC is busy receiving frames.

Solution: These kinds of bottlenecks are difficult to find and can only be checked by looking at driver counters and the underlying PHY devices. Error counters on test equipment can also help.
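One way to use such counters is to snapshot them before and after the change that caused the drop and compare the two snapshots. The sketch below assumes the counters can be read into a dictionary; the counter names are hypothetical and do not correspond to any particular driver or PHY.

```python
def counter_deltas(before, after):
    """Return the counters that grew between two snapshots, with their increase."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


# Hypothetical driver counters sampled around a test run.
before = {"rx_frames": 1000, "tx_collisions": 0, "rx_crc_errors": 2}
after = {"rx_frames": 2000, "tx_collisions": 140, "rx_crc_errors": 2}
print(counter_deltas(before, after))
```

A collision or error counter that starts climbing only after the change is a strong hint that a resource collision, rather than the changed code itself, is responsible for the drop.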

Forces: Counters might not be available or easily accessible.

The techniques discussed here are suggested solutions to the problems proposed, and as such, they are provided for informational purposes only. Neither the authors nor Intel Corporation can guarantee that these proposed solutions will be applicable to your application or that they will resolve the problems in all instances.

The performance tests and ratings mentioned here were measured using specific computer systems and/or components, and the results reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration can affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

To read Part 1 go to "A review of general patterns."

This article was excerpted from Designing Embedded Networking Applications, by Peter Barry and Gerard Hartnett and published by Intel Press. Copyright © 2005 Intel Corporation. All rights reserved.

Peter Barry and Gerard Hartnett are senior Intel engineers. Both have been leaders in the design of Intel network processor platforms and are regularly sought out for their expert guidance on network components.

