One of the most common uses of PCI Express (PCIe) switches is in fan-out applications, which present designers with a significant challenge: reducing the congestion that inevitably occurs in high-traffic embedded systems.
In the DMA I/O model prevalent in workstations and servers, DMA controllers in the I/O device endpoints both write blocks of data to host memory and read blocks from it. The host connection in this application is a point of aggregation that's usually wider than any of the endpoint connections without necessarily being as wide as the sum of the widths of all of them. If it isn't as wide as that sum, then congestion and bandwidth sharing are primary concerns.
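The width comparison above can be put as a simple ratio. The following is a minimal sketch, assuming every link runs at the same signaling rate so that lane counts stand in for bandwidth; the function name and the x8/x4 port widths are hypothetical, chosen only for illustration.

```python
def oversubscription(host_lanes, endpoint_lanes):
    """Ratio of aggregate endpoint width to host width.

    Assumes all links run at the same signaling rate, so lane count
    is a stand-in for bandwidth. A result above 1.0 means the host
    port is narrower than the sum of the endpoint ports.
    """
    return sum(endpoint_lanes) / host_lanes


# Hypothetical fan-out: one x8 host port feeding four x4 endpoints.
ratio = oversubscription(8, [4, 4, 4, 4])
print(ratio)  # 2.0
```

A ratio above 1.0 is the case in which the endpoints must share the host link.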
Even in the ideal case of host bandwidth equaling or exceeding the sum of all the devices' bandwidth, read completions are often delivered to an endpoint faster than it can consume them, leading to congestion and possibly to the “starvation” of other endpoints. The PCIe specification prohibits endpoints from flow-controlling completions. An endpoint is required to reserve buffers in advance for all the read data that it requests so that it can consume the data from the PCIe link at wire speed when it returns. This eliminates queuing in the interconnect when the source and the sink have the same bandwidth. It doesn't begin to address the problem when, as is so often the case, the source (host) has a wider, faster connection than the endpoint.
This problem is compounded by the observed behavior of root complexes (RCs) and the endpoints themselves. RCs typically service read requests in first-in-first-out order, rather than in a round-robin fashion among the queued requests from different devices, which would avoid choking them. Endpoint designs often spring from a PCI legacy, where it was necessary to read ahead very aggressively in order to gain a fair share of bandwidth. The problem is that a device with a narrow PCIe link can request a large block of data from the host, thus blocking all other read requests in the queue at the RC. When the request is served, the resulting completions back up in the switch, blocking other downstream completions until they are sunk by the endpoint. This throttles the downstream throughput of the RC's link to the rate of the endpoint link, which is often only a fraction of the RC link's rate.
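The blocking effect can be illustrated with a deliberately crude model. The sketch below assumes the RC serves requests strictly in FIFO order and that the switch provides no completion buffering, so each request holds the downstream path until its endpoint has drained it. The device names, rates, and request sizes are hypothetical, and rates are in arbitrary bytes per time unit.

```python
def fifo_finish_times(requests, endpoint_rate, host_rate):
    """Finish time of each read request under FIFO service with blocking.

    requests      -- list of (device, nbytes) in arrival order
    endpoint_rate -- dict mapping device to its link rate
    host_rate     -- rate of the RC's link
    Worst-case assumption: no buffering in the switch, so completions
    for one request block the path until the endpoint has sunk them.
    """
    t = 0.0
    finish = []
    for device, nbytes in requests:
        # Completions drain at the slower of the two links.
        rate = min(host_rate, endpoint_rate[device])
        t += nbytes / rate
        finish.append((device, t))
    return finish


rates = {"slow": 1, "fast": 4}          # hypothetical narrow and wide endpoints
queue = [("slow", 4096), ("fast", 512)]  # slow device's large read arrives first
print(fifo_finish_times(queue, rates, host_rate=8))
# [('slow', 4096.0), ('fast', 4224.0)]
```

On its own, the fast endpoint's 512-byte read would take 128 time units in this model; queued behind the slow endpoint's large read, it finishes at 4224.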
To combat this downstream congestion, the user can try to configure the read behavior of the endpoints. Unfortunately, there are no PCIe-architected mechanisms for traffic shaping or rate limiting, and the desired device knobs often don't exist. The designer may be able to reduce the maximum read request size. This may help but it isn't a complete solution.
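For reference, the maximum read request size is held in the Max_Read_Request_Size field, bits 14:12 of the Device Control register in the PCI Express capability structure, encoded as 128 bytes shifted left by the field value (000b = 128 bytes through 101b = 4096 bytes). The sketch below only manipulates a 16-bit register value; the helper names are hypothetical, and how the register is actually read and written is platform-specific and not shown.

```python
MRRS_SHIFT = 12
MRRS_MASK = 0x7 << MRRS_SHIFT  # Device Control bits 14:12

# Valid sizes: 128, 256, 512, 1024, 2048, 4096 bytes -> codes 0..5
_MRRS_CODES = {128 << code: code for code in range(6)}


def get_mrrs(devctl):
    """Decode Max_Read_Request_Size (bytes) from a Device Control value."""
    return 128 << ((devctl & MRRS_MASK) >> MRRS_SHIFT)


def set_mrrs(devctl, size_bytes):
    """Return a Device Control value with Max_Read_Request_Size replaced."""
    if size_bytes not in _MRRS_CODES:
        raise ValueError("size must be a power of two from 128 to 4096")
    cleared = devctl & ~MRRS_MASK & 0xFFFF
    return cleared | (_MRRS_CODES[size_bytes] << MRRS_SHIFT)


devctl = 0x5810                 # hypothetical register value, MRRS = 4096
print(get_mrrs(devctl))         # 4096
print(hex(set_mrrs(devctl, 512)))  # 0x2810
```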
What does provide such a solution is Read Pacing, a feature built into a new generation of PCIe Gen 2 (5 GT/s) switches. Because the PCIe specification allows posted writes and completions to bypass read requests, Read Pacing enables excess upstream read requests (the requests for data in excess of what's required to mask the round-trip latency between endpoint and RC) to be delayed in the switch. This both avoids blocking other devices' read requests in the RC and limits the completion queue size in the switch. Additionally, upstream request flow is metered so that completions arrive at the rate at which they may be sunk. A small completion queue is allowed to develop, but never one large enough to block completions to other devices.
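The general idea can be sketched as a byte budget per endpoint: enough outstanding read data to cover the round trip, with anything beyond that held in the switch. The following is an illustrative model only, not a description of how any particular switch implements Read Pacing; the class, the function, and the numbers are all hypothetical.

```python
from collections import deque


def pacing_budget(endpoint_rate, round_trip):
    """Bytes that must be in flight to mask the round-trip latency."""
    return int(endpoint_rate * round_trip)


class ReadPacer:
    """Holds back read requests that exceed a per-endpoint byte budget."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.in_flight = 0    # requested bytes not yet returned
        self.held = deque()   # requests delayed in the switch

    def submit(self, nbytes):
        """Queue a read request; return the sizes forwarded upstream now."""
        self.held.append(nbytes)
        return self._release()

    def complete(self, nbytes):
        """Account for returned data; return the sizes forwarded as a result."""
        self.in_flight -= nbytes
        return self._release()

    def _release(self):
        forwarded = []
        # Always let one request through when idle, so a request larger
        # than the budget cannot stall forever.
        while self.held and (self.in_flight == 0
                             or self.in_flight + self.held[0] <= self.budget):
            n = self.held.popleft()
            self.in_flight += n
            forwarded.append(n)
        return forwarded


pacer = ReadPacer(pacing_budget(endpoint_rate=512, round_trip=2.0))  # 1024 bytes
print([pacer.submit(512) for _ in range(4)])  # [[512], [512], [], []]
print(pacer.complete(512))                    # [512]
```

In this model the third and fourth requests wait in the switch rather than in the RC's queue, and each is forwarded only as earlier data returns.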
Jack Regula is the chief scientist at PLX Technology.