Overcoming Latency in PCIe Systems - Embedded.com

Overcoming PCI Express (PCIe) latency isn't simply a matter of choosing the lowest-latency components from among those suitable for an embedded-system design, but it's a good place to start.

It's also a matter of architecting operations to reduce or eliminate the sensitivity of system performance to latency. It's impossible to mask all the latency, so the less there is to begin with, the better.

Defining Latency
Latency is the delay between starting and completing an action. For a switch, it's the time between the start-of-packet (SoP) symbol on an input pin and the SoP symbol on an output pin for the same packet forwarded through the switch.

From an endpoint's perspective, the latency includes the packet transmission time, since the endpoint can't use the data until it has seen the cyclic redundancy check (CRC) at the end and checked for errors. At the highest level, the overall task latency, which may include multiple switch latencies, is what really matters. At issue is whether resources are idled during the waiting period implied by a task or transfer latency, and whether the waiting time prevents a deadline from being met.

A PCIe switch's latency can be decomposed into the time required to receive the header, a pipeline delay and a queuing delay. The pipeline delay is the length of time for a packet to traverse an otherwise empty switch and is solely a function of the switch's design.

The queuing delay depends to a large extent upon the traffic pattern but can also be dependent on flow control credits, as well as the switch's arbitration and scheduling policies. Deficiencies in a switch's implementation or architecture most often show up when dealing with flows consisting primarily of short packets.
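The decomposition above can be written down as a small model. This is only a sketch: the class name, the field names, and the nanosecond figures are illustrative assumptions, not measurements of any real switch.

```python
from dataclasses import dataclass


@dataclass
class SwitchLatency:
    """Hypothetical per-packet latency budget for one switch hop."""
    header_ns: float    # time to receive enough of the header to route
    pipeline_ns: float  # traversal time through an otherwise empty switch
    queuing_ns: float   # waiting behind other packets or for credits

    def total_ns(self) -> float:
        # Switch latency, SoP in to SoP out, is the sum of the three parts.
        return self.header_ns + self.pipeline_ns + self.queuing_ns


def endpoint_latency_ns(switch: SwitchLatency, transmission_ns: float) -> float:
    # An endpoint also waits for the whole packet, CRC included.
    return switch.total_ns() + transmission_ns


# Assumed numbers, purely for illustration.
hop = SwitchLatency(header_ns=8.0, pipeline_ns=100.0, queuing_ns=40.0)
print(hop.total_ns(), endpoint_latency_ns(hop, transmission_ns=64.0))
```

Only the queuing term depends on traffic, which is why it pays to measure a switch under several packet mixes.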

Therefore, a switch's performance should be evaluated with short packets, long packets, and, of course, with a packet mix and flow pattern representative of the application.

Latency and Throughput
Engineers often try to extract full-wire-speed performance from the system interconnect, but that can be a mistake if low latency is also required. The closer the egress link of a switch port is to being saturated, the deeper the queue in the buffers behind it.

With large buffers, it's seldom necessary to throttle back an input, so maximum throughput is obtained. The price is the latency of the queues that develop. With traffic patterns that lend themselves to mathematical analysis (e.g., uniform distribution and Poisson arrival times), the average queue depth can be estimated from queuing theory.

Without getting into the mathematical details, Figure 1 below provides a rough guide as to the queue depth that will develop behind a switch egress port, based on the number of ports feeding an output and the degree of utilization of its output link. For simplicity, equal-sized packets were assumed. Keep in mind that on the order of five percent of the link is consumed in data link layer (DLL) overhead.

Figure 1: Average Queue Depth Behind PCIe Switch Egress Port
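For a feel of how quickly queues grow with utilization, here is a minimal sketch using the textbook M/D/1 result (Poisson arrivals, equal-sized packets, one output link). This single-queue approximation is an assumption made for illustration. It is not necessarily the model behind Figure 1, and it ignores the number of input ports.

```python
def md1_mean_queue_depth(utilization: float) -> float:
    """Mean number of packets waiting (excluding the one in service)
    for an M/D/1 queue: Lq = rho^2 / (2 * (1 - rho))."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization ** 2 / (2.0 * (1.0 - utilization))


for rho in (0.45, 0.5, 0.9, 0.95):
    print(f"utilization {rho:.2f}: mean queue depth {md1_mean_queue_depth(rho):.2f}")
```

Under this model, halving the utilization from 0.9 to 0.45, as doubling the output link width would do for the same offered load, drops the mean queue depth from about 4 packets to well under 1.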

Link Width and Latency
There is another, less-esoteric relationship between bandwidth and latency. A switch cannot forward a packet until it has seen enough of the packet's header to determine its egress port. The wider the input link, the less time required to see the complete header.

On an x16 link, the entire header may be visible in a single clock, depending upon how the SoP symbol aligns on the link. On an x8 link, it takes as few as two clock cycles to see the entire header. Each halving of the link width doubles this component of the switch latency.
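The scaling described above can be tabulated with a few lines. This is a best-case sketch that assumes the SoP symbol is favorably aligned and takes the single clock at x16 as the baseline. The function name and the `clocks_at_x16` parameter are hypothetical.

```python
VALID_WIDTHS = (1, 2, 4, 8, 16)


def header_clocks(link_width: int, clocks_at_x16: int = 1) -> int:
    """Best-case clocks needed to see the whole header on a link of the
    given width, assuming each halving of width doubles the time."""
    if link_width not in VALID_WIDTHS:
        raise ValueError(f"link width must be one of {VALID_WIDTHS}")
    return clocks_at_x16 * (16 // link_width)


for width in reversed(VALID_WIDTHS):
    print(f"x{width}: {header_clocks(width)} clock(s)")
```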

The situation is more complicated when the egress link is wider than the ingress link. A switch can't initiate cut-through on a packet until it has received enough of it so that the faster egress link won't run dry of packet to send before the rest of the packet comes in.

Roughly speaking, if the egress is twice the width of the ingress, then half the packet must be received before forwarding starts. Ironically, using an egress link that is wider than the ingress link will increase the latency measured to the SoP symbol, but decrease it when measuring to the end-of-packet (EoP) symbol. An endpoint can't make use of the packet until it checks the CRC at the end of the packet.
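The "half the packet" rule generalizes to other width ratios. For the egress not to run dry, it must not finish sending before the ingress finishes receiving, so forwarding can start once a fraction 1 - (ingress width / egress width) of the packet has arrived. The sketch below assumes both links run at the same per-lane rate and ignores header decode time and framing overhead.

```python
def cut_through_threshold_bytes(packet_bytes: int,
                                ingress_lanes: int,
                                egress_lanes: int) -> float:
    """Bytes that must be received before forwarding can start without
    the egress link underrunning (same per-lane rate assumed)."""
    if egress_lanes <= ingress_lanes:
        # Egress is no faster than ingress, so it can never run dry.
        return 0.0
    return packet_bytes * (1.0 - ingress_lanes / egress_lanes)


print(cut_through_threshold_bytes(256, ingress_lanes=4, egress_lanes=8))
print(cut_through_threshold_bytes(256, ingress_lanes=4, egress_lanes=16))
```

With an x4 ingress and an x8 egress this returns half the packet, matching the rule of thumb in the text.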

Thus, using wider links can have three beneficial effects: reducing the cut-through latency, the queuing delay, and the packet transmission time. As can be seen in Figure 1, doubling the output link width can shrink the queue depth from near-maximum to nearly empty.

Latency Sensitivity of Reads
A read is generally considered to be a blocking operation in that once a read request is initiated, no additional instructions in its thread of processing can be undertaken until it is completed. Simple applications have the following work flow:

1. Make a read request
2. Wait for data
3. Process the data
4. Loop back to 1

In this simple example, the latency of the read directly affects the throughput. If the read latency is much smaller than the processing time, then latency isn't a problem. When it's not small, users look for ways to mask the latency by doing useful work during it.
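The effect of read latency on the simple blocking loop can be put in numbers. This is a sketch under the assumption that each iteration costs exactly one read latency plus one processing time, with nothing overlapped. The function names and timing values are illustrative.

```python
def loop_iterations_per_second(read_latency_s: float, processing_s: float) -> float:
    """Throughput of a read / wait / process loop with no overlap."""
    return 1.0 / (read_latency_s + processing_s)


def fraction_of_time_waiting(read_latency_s: float, processing_s: float) -> float:
    """Share of each iteration spent stalled on the read."""
    return read_latency_s / (read_latency_s + processing_s)


# Assumed example: 1 us read latency against 9 us and 1 us of processing.
for processing in (9e-6, 1e-6):
    rate = loop_iterations_per_second(1e-6, processing)
    stall = fraction_of_time_waiting(1e-6, processing)
    print(f"processing {processing * 1e6:.0f} us: {rate:,.0f} iter/s, {stall:.0%} stalled")
```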

A multithreaded processor could switch threads, for example, doing some other work during the latency. Optimizing compilers issue the read early to minimize the wait.

Bus interface units often have an ability to issue multiple read requests before being forced to wait for a completion. If, for example, N outstanding read requests are supported, and the completion to the first read request arrives before the Nth read request is sent, then latency is said to have been masked and full throughput can be achieved after that initial waiting period.
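The masking condition for N outstanding reads can be checked directly. The sketch assumes requests are issued at a fixed interval and that every read sees the same latency. Both are simplifications, and the function names are hypothetical.

```python
import math


def latency_is_masked(n_outstanding: int, issue_interval_ns: float,
                      read_latency_ns: float) -> bool:
    """True if the first completion arrives no later than the moment the
    Nth request is sent, i.e. at (N - 1) * issue_interval."""
    return read_latency_ns <= (n_outstanding - 1) * issue_interval_ns


def min_outstanding_to_mask(issue_interval_ns: float,
                            read_latency_ns: float) -> int:
    """Smallest N for which latency_is_masked() holds."""
    return math.ceil(read_latency_ns / issue_interval_ns) + 1


# Assumed example: one request every 50 ns, 200 ns read latency.
print(min_outstanding_to_mask(50, 200))
print(latency_is_masked(4, 50, 200), latency_is_masked(5, 50, 200))
```

Adding a switch in the path raises the read latency, which in this model raises the number of outstanding requests a device needs in order to stay at full throughput.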

In practice, devices have varying degrees of ability to mask latency. In a system such as a PC or server, where there is no control as to what is plugged into an open slot, latency is therefore always an issue.

Bridging Legacy PCI Devices to PCIe
When bridging PCI to PCIe, the bridge must make a guess as to how much data the device will consume on a read. If the bridge guesses wrong, performance suffers.

An advanced bridge will use the version of the PCI read command as a hint. In response to a simple MemRd, it will fetch only a single bus width of data. In response to a RdLin command, it will typically prefetch a cache line of data.

Use of the RdMult command on PCI should result in the prefetch of multiple cache lines. After prefetching the data, the bridge should retain it in a cache after an initial disconnection by the PCI device in case the device returns for more data.
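The command-as-hint policy described above amounts to a small lookup. The following is a sketch only. The bus width, cache line size, and number of lines fetched for RdMult are assumed defaults, and real bridges expose such settings through device-specific configuration.

```python
from enum import Enum


class PciRead(Enum):
    MEM_RD = "MemRd"    # plain memory read
    RD_LIN = "RdLin"    # memory read line
    RD_MULT = "RdMult"  # memory read multiple


def prefetch_bytes(cmd: PciRead,
                   bus_width_bytes: int = 4,      # assumed 32-bit PCI bus
                   cache_line_bytes: int = 64,    # assumed cache line size
                   lines_for_mult: int = 4) -> int:  # assumed prefetch depth
    """How much data a hypothetical bridge fetches for each read command."""
    if cmd is PciRead.MEM_RD:
        return bus_width_bytes
    if cmd is PciRead.RD_LIN:
        return cache_line_bytes
    return cache_line_bytes * lines_for_mult


for cmd in PciRead:
    print(cmd.value, prefetch_bytes(cmd))
```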

When the PCIe-to-PCI bridge's prefetch policy isn't adequate, it can help to insert a PCI-to-PCI bridge in the path to the device. The bridge can be configured, for example, to translate a MemRd or RdLin command into a Read Multiple command, and to keep the data longer in its internal prefetch cache. For both PCI-to-PCI and PCIe-to-PCI bridges, it's necessary to do device-specific configuration to enable advanced prefetch features.

DMA I/O and Read Latency
The DMA I/O subsystem at the heart of PCs and servers is inherently latency-sensitive. I/O is accomplished using a DMA controller in each I/O device to move data between it and main memory located next to the CPU.

The DMA controller (DMAC) follows a chain of descriptors located in memory. Each descriptor describes a unit of work assigned to the DMAC, requiring the DMAC to move a block of data from the device to memory or from memory to the device.

The DMAC reads a descriptor, then assigns a DMA engine to do the data movement dictated by the descriptor. While the data is being moved, it reads the next descriptor. If the DMA engine completes its assignment before the next descriptor read completes, it is forced to idle for lack of work.

Typically, a workload for tasks such as networking consists of a mix of short and long data blocks (packets) to be moved to and from memory. When the data block is relatively long, latency is masked. For short blocks, such as those used for Ethernet control packets, descriptor read latency can lead to a loss of throughput.

To avoid this, a sophisticated DMAC may read several descriptors ahead and maintain a cache of prefetched descriptors. However, there is always a limit to the size of the cache and to the number of descriptors available to be prefetched.
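A rough steady-state model shows how block size, descriptor read latency, and prefetch depth interact. It assumes all blocks take the same time to move, descriptors are always available to fetch, and up to `prefetch_depth` descriptor reads overlap. These are simplifying assumptions, and the names are hypothetical.

```python
def dma_engine_utilization(transfer_ns: float,
                           descriptor_latency_ns: float,
                           prefetch_depth: int = 1) -> float:
    """Fraction of time the DMA engine is busy moving data.

    With depth 1 the next descriptor is read while the current block
    moves, so the engine idles whenever the block finishes first.
    Deeper prefetch overlaps several descriptor reads.
    """
    if prefetch_depth < 1:
        raise ValueError("prefetch_depth must be at least 1")
    return min(1.0, prefetch_depth * transfer_ns / descriptor_latency_ns)


# Assumed example: 500 ns descriptor read latency.
print(dma_engine_utilization(1000, 500))    # long block, latency masked
print(dma_engine_utilization(100, 500))     # short block, engine mostly idle
print(dma_engine_utilization(100, 500, 5))  # short block, five reads ahead
```

In this model a longer descriptor read latency, such as one added by a switch hop, directly raises the prefetch depth needed to keep short-block traffic at full rate.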

A particular device may be capable of masking descriptor read latency when directly attached to a Northbridge (NB), but its throughput may suffer when it is connected to the NB through a switch. System designers are best advised to use the lowest-latency switches available to maximize the performance of their I/O subsystems.

Accelerators and Switch Latency
An increasingly common usage model is the attachment of multiple accelerators to a processor complex to increase performance for certain applications. One example is the use of graphics processors for floating-point acceleration.

In the accelerator model, the host processor offloads a computation to the accelerators, then waits for the result. It may or may not have useful work to perform while waiting. Only if the waiting time is less than the time required to complete the operation without an accelerator is there a gain in throughput.

The amount of time the host waits is the sum of:

1. Synchronization time at start of computation
2. Time for accelerator to read data from memory
3. Time for accelerator to produce the result
4. Time for accelerator to write the result back to memory
5. Synchronization time at end of computation

Each of these operations, except for the computation time itself (#3), includes the interconnect latency. All the usual games of masking latency with concurrency apply.

Nevertheless, when you consider that typical accelerators operate in the GHz range while interconnect latency is generally greater than 150 nanoseconds, you can see it is necessary to offload a relatively long computation in order to gain throughput by offloading work to the accelerator.
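The five-term sum and the break-even test can be sketched as follows. The class and field names are hypothetical, and the figures are assumed for illustration: each of the four interconnect-bound steps is charged a single 150 ns traversal, which is a lower bound and not a measurement.

```python
from dataclasses import dataclass


@dataclass
class OffloadTimes:
    sync_start_ns: float  # 1. synchronization at start of computation
    read_ns: float        # 2. accelerator reads data from memory
    compute_ns: float     # 3. accelerator produces the result
    write_ns: float       # 4. accelerator writes the result back
    sync_end_ns: float    # 5. synchronization at end of computation

    def wait_ns(self) -> float:
        # Total time the host waits for the offloaded work.
        return (self.sync_start_ns + self.read_ns + self.compute_ns
                + self.write_ns + self.sync_end_ns)


def offload_gains_throughput(offload: OffloadTimes, host_only_ns: float) -> bool:
    # Offloading helps only if waiting is shorter than doing the work locally.
    return offload.wait_ns() < host_only_ns


job = OffloadTimes(150, 150, 200, 150, 150)  # assumed values
print(job.wait_ns())
print(offload_gains_throughput(job, host_only_ns=700))
print(offload_gains_throughput(job, host_only_ns=2000))
```

With these assumed numbers the interconnect accounts for 600 ns of an 800 ns wait, so the offload pays off only for computations that would take the host noticeably longer than that.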

Every step decrease in switch or interconnect latency widens the range of problems to which accelerators may be profitably applied.

Given enough time and resources, engineers can usually figure out how to mask any fixed amount of latency. Often this effort consumes most of their development time and contributes significantly to the end cost of their product. No more dramatic example of this exists than the die area consumed by cache, cache controllers, and support for multiple threads on high-end microprocessor chips.

Efforts to mask latency achieve varying degrees of success. System interconnects have varying degrees of latency. In practice, we see some devices showing latency sensitivity in some slots of some systems. These disturbing observations lead to additional effort to find the root cause and ways to reduce the latency sensitivity, thus extending the time to market.

The availability of low-latency switches makes the job of everyone producing a PCIe-based infrastructure easier. Industry-leading switches drop latency to as low as 110 ns, or 87 percent lower than competing devices on the market. Low-latency switches such as these should be the first choice of system engineers interested in producing high-performance systems.

Jack Regula is Chief Technology Officer at PLX Technology.
