Tricks and techniques for performance tuning your embedded system using patterns: Part 2

In addition to the performance tuning patterns in Part 1 that can be employed in any embedded systems design, there are a number of patterns that can be applied to networking performance in general. These techniques are not typically specific to the IXP42X product line.

Bottleneck Hunting
Context: You have a running functional system. You have a performance requirement (see Defined Performance Requirement). A customer is measuring performance lower than that requirement.

Problem: You can have a number of performance bottlenecks in the designed system, but unless you identify the current limiting factor, you might optimize the wrong thing. One component of the system might be limiting the flow of network packets to the rest of the system.

Solution: Performance improvement really starts with bottleneck hunting. It is only when you find the performance-limiting bottleneck that you can work on optimizations to remove it. A system typically has a number of bottlenecks. You first need to identify the current limiting bottleneck, then remove it. You then need to iterate through the remaining bottlenecks until the system meets its performance requirements.

First, determine if your application is CPU or I/O bound. In a CPU-bound system, the limiting factor or bottleneck is the number of cycles needed to execute some algorithm or part of the data path. In an I/O-bound system, the bottleneck is external to the processor. The processor has enough CPU cycles to handle the traffic, but the traffic flow is not enough to make full use of the available processor cycles.

To determine if the system is CPU or I/O bound, try running the processor at a number of different clock speeds. If you see a significant change in performance, your system is probably CPU bound. If performance barely changes, the bottleneck is more likely outside the processor.
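The clock-scaling test can be turned into a rough number. The following is a minimal sketch, not a method from the original text: the function name `classify_bound`, the 0.5 threshold, and the sample measurements are all assumptions chosen for illustration. It compares the relative gain in throughput with the relative gain in clock speed.

```python
def classify_bound(samples, threshold=0.5):
    """Classify a system as CPU bound or I/O bound from clock-scaling tests.

    samples: list of (clock_mhz, throughput_mbps) pairs measured on the DUT.
    threshold: hypothetical cut-off for how closely throughput must track
    the clock speed before the system is called CPU bound.
    """
    low = min(samples)    # sample taken at the lowest clock speed
    high = max(samples)   # sample taken at the highest clock speed
    clock_gain = high[0] / low[0] - 1.0
    if clock_gain <= 0:
        raise ValueError("need measurements at two different clock speeds")
    throughput_gain = high[1] / low[1] - 1.0
    # Sensitivity near 1.0: throughput scales with the clock (CPU bound).
    # Sensitivity near 0.0: throughput ignores the clock (I/O bound).
    sensitivity = throughput_gain / clock_gain
    label = "CPU bound" if sensitivity >= threshold else "I/O bound"
    return label, sensitivity


# Made-up measurements, for illustration only.
print(classify_bound([(266, 40.0), (400, 60.0), (533, 80.0)]))
print(classify_bound([(266, 90.0), (533, 92.0)]))
```

A sensitivity between the two extremes usually means more than one bottleneck is in play, so treat the label as a starting point for the hunt rather than a verdict.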

Next, look at the software components of the data path; these might include:

1) Low-level device drivers specific to a piece of hardware. These device drivers could conform to an OS-specific interface.

2) A network interface service mechanism running on the Intel XScale core. This mechanism might be a number of ISRs or a global polling loop.

3) Adapter components or glue code that adapts the hardware-specific drivers or the underlying Intel IXP400 software APIs to an RTOS or network stack.

4) Encapsulation layers of the networking stack.

5) The switching/bridging/routing engine of the networking stack or the RTOS.

6) Intel IXP400 software access APIs. These functions provide an abstraction of the underlying microcode and silicon.

7) IXP42X NPE microcode.

If some algorithm in a low-level device driver is limiting the flow of data into the system, you might waste your time if you start tweaking compiler flags or optimizing the routing algorithm.

It is best to look first at the components that are new or unique to a particular system. Typically, these are the low-level device drivers or the adapter components unique to this system. Other projects have already used the routing algorithm, the Intel IXP400 software, and the NPE microcode.

Concentrate on the unique components first, especially if these components are on the edge of the system. In one wireless application we discovered the wireless device driver was a bottleneck that limited the flow of data into the system.

Many components of a data path can contain packet buffers. Packet counters inserted in the code can help you identify queue overflows or underflows. Typically, the code that consumes the packet buffer is the bottleneck.
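As an illustration of such counters, here is a minimal sketch of a bounded queue that records overflows and underflows. The class `CountedQueue` and its field names are hypothetical; a real data path would keep equivalent counters inside the driver or stack code that owns the buffer.

```python
from collections import deque


class CountedQueue:
    """Bounded packet queue instrumented with diagnostic counters."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.items = deque()
        self.enqueued = 0
        self.dequeued = 0
        self.overflows = 0    # producer found the queue full
        self.underflows = 0   # consumer found the queue empty

    def put(self, packet):
        """Add a packet; count an overflow and drop it if the queue is full."""
        if len(self.items) >= self.capacity:
            self.overflows += 1
            return False
        self.items.append(packet)
        self.enqueued += 1
        return True

    def get(self):
        """Remove a packet; count an underflow if the queue is empty."""
        if not self.items:
            self.underflows += 1
            return None
        self.dequeued += 1
        return self.items.popleft()


# A producer that outruns the consumer shows up as a growing overflow count.
rx_queue = CountedQueue("rx", capacity=4)
for packet_id in range(10):
    rx_queue.put(packet_id)
print(rx_queue.name, rx_queue.enqueued, rx_queue.overflows)
```

A steadily rising overflow count on one queue points at the code that drains it, while a rising underflow count suggests the stage feeding it is the slower one.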

This process is typically iterative. When you fix the current bottleneck, you then need to loop back and identify the next one.

Forces: Be aware of the following:
1) Most systems have multiple bottlenecks.
2) Early bottleneck hunting, before you have a complete running system, increases the risk of misidentified bottlenecks and wasted tuning effort.

Evaluating Traffic Generator and Protocols
Context: You are using a network-traffic generator and protocols to measure the performance of a system.

Problem: The performance test or protocol overheads can limit the measured performance of your application.

Solution: Identifying the first bottleneck is a challenge. First, you need to eliminate your traffic generators and protocols as bottlenecks and analyze the invariants. Typical components in a complete test system might include:

1) Traffic sources, sinks, and measurement equipment.
2) The device under test (DUT) for which you are tuning the performance.
3) Physical connections and protocols between traffic sources and the DUT.

Your test environment might use a number of different types of traffic sources, sinks, and measurement equipment. You need to first make sure they are not the bottleneck in your system.

Equipment like Smartbits and Adtech testers is not typically a bottleneck. However, using a PC with FTP software to measure performance can be a bottleneck. You need to test the PC and FTP software without the DUT to make sure your traffic sources can reach the performance you require.

Running this test can also flush out bottlenecks in the physical media or protocols you are using. In addition, you need to make sure the overhead inherent in the protocols you are using makes the performance you require feasible. For example:

1) You cannot expect 100 megabits per second over Ethernet with 64-byte packets due to inter-frame gap and frame preamble. You can expect to get at most 76 megabits per second.

2) You cannot expect to get 8 megabits per second over an ADSL link; you can expect to get at most 5.5 megabits per second.

3) You cannot expect to get 100 megabits per second on FTP running over Ethernet. You must take IP protocol overhead and TCP acknowledgements into account.

4) You cannot expect 52 megabits per second on 802.11a/g networks due to CTS/RTS overhead and protocol overhead.
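The Ethernet figure in the first example can be checked with a short calculation. The sketch below assumes the standard per-frame overhead on the wire: 8 bytes of preamble and start-of-frame delimiter plus a 12-byte minimum inter-frame gap. The function names are illustrative only.

```python
PREAMBLE_AND_SFD_BYTES = 8   # 7-byte preamble + 1-byte start-of-frame delimiter
INTER_FRAME_GAP_BYTES = 12   # minimum inter-frame gap, expressed in byte times


def max_ethernet_throughput_mbps(frame_bytes, link_mbps=100.0):
    """Upper bound on frame throughput for back-to-back frames of one size."""
    wire_bytes = frame_bytes + PREAMBLE_AND_SFD_BYTES + INTER_FRAME_GAP_BYTES
    return link_mbps * frame_bytes / wire_bytes


def max_frames_per_second(frame_bytes, link_mbps=100.0):
    """Upper bound on the frame rate for back-to-back frames of one size."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD_BYTES + INTER_FRAME_GAP_BYTES) * 8
    return link_mbps * 1_000_000 / wire_bits


print(round(max_ethernet_throughput_mbps(64), 2))   # prints 76.19
print(int(max_frames_per_second(64)))               # prints 148809
```

With 64-byte frames, each frame occupies 84 byte times on the wire, so 64/84 of the link rate is the ceiling. Larger frames amortize the same 20 bytes of overhead over more payload, which is why large-packet tests get much closer to line rate.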

Characteristics of the particular protocol or application could also be causing the bottleneck. For example, if the FTP performance is much lower (by a factor of 2) than the large-packet performance with a traffic generator (Smartbits), the problem could be that the TCP acknowledgement packets are getting dropped. This problem can sometimes be a buffer management issue.

FTP performance can also be significantly affected by the TCP window sizes on the FTP client and server machines.
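One way to see why the window size matters: a TCP sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by the window size divided by the round-trip time. The sketch below works through that bound. The function names and the window and round-trip values are assumptions for illustration, not measurements from the original text.

```python
def tcp_window_limit_mbps(window_bytes, rtt_seconds):
    """Throughput ceiling imposed by one window of data per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


def window_needed_bytes(target_mbps, rtt_seconds):
    """Window size (the bandwidth-delay product) needed to sustain a rate."""
    return target_mbps * 1_000_000 * rtt_seconds / 8


# Illustrative values: an 8 KB window and a 10 ms round-trip time.
print(tcp_window_limit_mbps(8192, 0.010))
print(window_needed_bytes(100.0, 0.010))
```

If the ceiling computed for the client and server window sizes is below the rate you are trying to measure, the FTP test is limited by the end hosts rather than by the DUT.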

Forces: Test equipment typically outperforms the DUT.

Environmental Factors
Context: You are finding it difficult to identify the bottleneck.
Problem: Environmental factors can cause a difficult-to-diagnose bottleneck.

Solution: Check the environmental factors.
When testing a wireless application, you might encounter radio interference in the test environment. In this case, you can use a Faraday cage to radio-isolate your test equipment and DUT from the environment. Antenna configuration is also important. The antennas should not be too close together (less than 1 meter apart). They should be erect, not lying down. You also need to make sure you shield the DUT to protect it from antenna interference.

Check shared resources. Is your test equipment or DUT sharing a resource, such as a network segment, with other equipment? Is that other equipment making enough use of the shared resource to affect your DUT performance?

Check all connectors and cables. If you are confident you are making improvements but the measurements are not showing the improvement you expect, try changing all the cables connecting your DUT to the test equipment. As a last resort, try a replacement DUT. We have seen a number of cases where a device on the DUT had degraded enough to affect performance.

Polled Packet Processor
Context: You are designing the fundamental mechanism that drives the servicing of network interfaces.

Problem: Some fundamental mechanisms can expose you to more overhead and wasted CPU cycles. These wasted cycles can come from interrupt preamble/dispatch and context switches.

Solution: You can categorize most applications as interrupt or polling driven or a combination of both.

When traffic overloads a system, it runs optimally if it is running in a tight loop, polling interfaces for which it knows there is traffic queued. If the application driver is interrupt-based, look to see how many packets you handle per interrupt. To get better packet processing performance, handle more packets per interrupt, possibly by using a polling approach in the interrupt handler.

Some systems put the packet on a queue from the interrupt handler and then do the packet processing in another thread. In this kind of system, you need to understand how many packets the system handles per context switch. To improve performance, increase the number of packets handled per context switch.

Other systems drive packet processing from a timer interrupt. In this case, you need to make sure the timer frequency and number of packets handled per interrupt are not limiting the networking performance of your system. In addition, this kind of system is not optimally efficient in overload. Systems based on Linux are usually interrupt-based.
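The effect of handling more packets per interrupt can be modeled in a few lines. The following is a simplified sketch, not driver code: `service_interrupt`, `count_interrupts`, the `budget` parameter, and the cycle counts are all hypothetical. It treats an overloaded system as one where the receive ring is always full, and counts how many times the interrupt preamble/dispatch cost is paid.

```python
from collections import deque


def service_interrupt(rx_ring, process, budget):
    """Drain up to `budget` packets from rx_ring in one handler invocation."""
    handled = 0
    while rx_ring and handled < budget:
        process(rx_ring.popleft())
        handled += 1
    return handled


def count_interrupts(num_packets, budget):
    """Model an overloaded system: all packets are already queued."""
    rx_ring = deque(range(num_packets))
    interrupts = 0
    while rx_ring:
        interrupts += 1   # one preamble/dispatch cost is paid per invocation
        service_interrupt(rx_ring, lambda packet: None, budget)
    return interrupts


def cycles_per_packet(work_cycles, overhead_cycles, packets_per_interrupt):
    """Amortized cost when the fixed overhead is shared across a batch."""
    return work_cycles + overhead_cycles / packets_per_interrupt


# Made-up cycle counts, for illustration only.
for budget in (1, 8, 32):
    print(budget, count_interrupts(1000, budget),
          cycles_per_packet(500, 300, budget))
```

The `budget` cap is also where the forces listed for this pattern show up: a larger budget wastes fewer cycles, but it holds the CPU longer per invocation and can add latency for other work.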

Forces: Be sure to consider the following:
1) Reducing wasted CPU cycles can complicate the overall architecture or design of an application.
2) Some IP stacks or operating systems can restrict the options in how you design these fundamental mechanisms.
3) You might need to throttle the amount of CPU given to packet processing to allow other processing to happen even when the system is in overload.
4) Applying these techniques might increase the latency in handling some packets.

Edge Packet Throttle
Context: The bottleneck of your system is now the IP forwarding or transport parts of the IP stack.

Problem: You might be wasting CPU cycles processing packets only to drop them when a queue fills later in the data path.

Solution: When a system goes into overload, it is better to leave the frames to back up in the RX queue and let the edges of your system, the NPE and PHY devices, throttle reception. You can avoid wasting core cycles by checking a bottleneck indicator, such as queue full, early in the data path code.

For example, on VxWorks, you can make the main packet-processing task (netTask) the highest priority task. This technique is one easy way to implement a “self-throttling” system. Alternatively, you could make the buffer replenish code a low-priority task, which would ensure receive buffers are only supplied when you have available CPU.
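The early check on a bottleneck indicator can be sketched as follows. This is a simplified model rather than VxWorks or IXP400 code: the function `receive_with_early_check`, the queue names, and the `stats` dictionary are hypothetical. The receive loop looks at the downstream queue before doing any per-packet work, and leaves frames in the receive ring when that queue is full.

```python
from collections import deque


def receive_with_early_check(rx_ring, tx_queue, tx_capacity, process, stats):
    """Process queued frames only while the downstream queue has room."""
    while rx_ring:
        if len(tx_queue) >= tx_capacity:
            # Bottleneck indicator set: stop here and leave the remaining
            # frames in the RX ring so the edge of the system throttles.
            stats["throttled"] += 1
            break
        tx_queue.append(process(rx_ring.popleft()))
        stats["processed"] += 1


rx_ring = deque(range(10))
tx_queue = deque()
stats = {"processed": 0, "throttled": 0}
receive_with_early_check(rx_ring, tx_queue, 4, lambda frame: frame, stats)
print(stats, len(rx_ring))
```

No frame is processed and then dropped in this model: every cycle spent in `process` produces a frame that fits in the downstream queue. The cost is that the receive code now depends on the state of a queue it would otherwise not need to know about.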

Forces: Be aware of the following:
1) Checking a bottleneck indicator might weaken the encapsulation of an internal detail of the IP stack.
2) Implementing an early check wastes some CPU cycles when the system is in overload.

Detecting Resource Collisions
Context: You make a change and performance drops unexpectedly.

Problem: A resource collision effect could be causing a pronounced performance bottleneck. Examples of such effects we have seen are:

1) TX traffic is being generated from RX traffic; Ethernet is running in half-duplex mode. The time it takes to generate the TX frame from an RX frame corresponds to the inter-frame gap. When the TX frame is sent, it collides with the next RX frame.

2) The Ethernet interface is running full duplex, but traffic is being generated in a loop and the frame transmissions occur at times the MAC is busy receiving frames.

Solution: These kinds of bottlenecks are difficult to find and can only be checked by looking at driver counters and the underlying PHY devices. Error counters on test equipment can also help.
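One simple way to use such counters is to snapshot them before and after a test run and look at which error counters moved. The sketch below assumes the counters have already been read into dictionaries. The counter names (`tx_collisions`, `rx_overruns`) and the values are hypothetical and depend entirely on the driver and PHY in use.

```python
def counter_deltas(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def changed_error_counters(before, after, error_names):
    """Return only the named error counters that increased during the run."""
    deltas = counter_deltas(before, after)
    return {name: deltas[name] for name in error_names
            if deltas.get(name, 0) > 0}


# Hypothetical snapshots taken before and after a test run.
before = {"tx_frames": 1000, "tx_collisions": 2, "rx_overruns": 0}
after = {"tx_frames": 5000, "tx_collisions": 950, "rx_overruns": 0}
print(changed_error_counters(before, after, ["tx_collisions", "rx_overruns"]))
```

Comparing the deltas from a run before the change with a run after the change makes a collision effect stand out even when the absolute counter values are large.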

Forces: Counters might not be available or easily accessible.

Conclusion
The techniques discussed here are suggested solutions to the problems proposed, and as such, they are provided for informational purposes only. Neither the authors nor Intel Corporation can guarantee that these proposed solutions will be applicable to your application or that they will resolve the problems in all instances.

The performance tests and ratings mentioned here were measured using specific computer systems and/or components, and the results reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration can affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

To read Part 1, go to “A review of general patterns.”

This article was excerpted from Designing Embedded Networking Applications, by Peter Barry and Gerard Hartnett, published by Intel Press. Copyright © 2005 Intel Corporation. All rights reserved.

Peter Barry and Gerard Hartnett are senior Intel engineers. Both have been leaders in the design of Intel network processor platforms and are regularly sought out for their expert guidance on network components.

