Using nextgen PCI Express switches to eliminate network I/O bottlenecks

Controllers in today's network-connected embedded systems are often overwhelmed by the data streaming to and from the various I/O sources; it can be difficult for the system's root complex to absorb high-speed bursty traffic such as 10 Gigabit Ethernet when it competes with very fast streaming data from sources such as InfiniBand and Fibre Channel (FC) storage elements.

For example, when a few bytes of Ethernet data get stuck behind large packets of FC data in the root complex, the latency introduced by this congestion will severely impact system response time and create bandwidth limitations (see Table 1 below).

Table 1. Ethernet latency/bandwidth tradeoffs

The next generation of PCI Express (PCIe) switches has added many new features to mitigate the effects of having to process competing data protocols, thereby improving overall system performance.

Advanced new features such as Read Pacing, enhanced port configuration flexibility, dynamic buffer memory allocation, and the deployment of PCIe Gen 2 signaling are reducing I/O bottlenecks, providing dramatic improvements in system performance in server and storage controllers.

Performance Limited by “Endpoint Starvation”
When two or more endpoints are connected to a root complex through a PCIe switch, with unbalanced upstream versus downstream link widths (and hence unbalanced bandwidths) and an uneven number of read requests being made by the endpoints, one endpoint inevitably dominates the bandwidth of the root complex queue. The other endpoints suffer reduced performance as a result. This is known as “endpoint starvation,” which can make it appear as if the system is congested and not performing optimally.

Figure 1 below shows a typical root complex connected to two endpoints through a PCIe switch. In this example, there is a x8 upstream port and two x4 downstream ports. The FC HBA is a good example of an endpoint that could dominate the bandwidth of the root complex queues.

In this example, the FC HBA makes several 2KB read requests, which are then queued by the root complex, filling up its queues.

Figure 1. Endpoint starvation

While the queues are full, the Ethernet NIC makes two 1KB read requests. The Ethernet NIC must wait for the root complex to service all of the read requests from the FC HBA before its own requests are serviced. Thus the NIC is “starved.”
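The starvation scenario above can be sketched as a strict first-in, first-out completion queue. The request counts and the completion bandwidth below are illustrative assumptions, not figures from the article:

```python
from collections import deque

# Hypothetical sketch of Figure 1's scenario: the root complex returns
# completions strictly in arrival order, so the NIC's small reads wait
# behind every FC read queued ahead of them.
LINK_BYTES_PER_US = 2000  # assumed completion bandwidth, bytes per microsecond

queue = deque()
queue.extend(("FC", 2048) for _ in range(8))   # eight 2KB FC HBA reads arrive first
queue.extend(("NIC", 1024) for _ in range(2))  # two 1KB Ethernet NIC reads arrive last

elapsed_us = 0.0
nic_done_us = None
while queue:
    src, size = queue.popleft()
    elapsed_us += size / LINK_BYTES_PER_US  # time to return this completion
    if src == "NIC":
        nic_done_us = elapsed_us

# The NIC's 2KB of data finishes only after all 16KB of FC data:
print(f"NIC reads complete at {nic_done_us:.2f} us")  # 9.22 us
```

With these numbers, the NIC's 2KB of data is delayed behind 16KB of FC traffic even though the NIC needs only about one microsecond of completion bandwidth itself.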

Read Pacing “Feeds” the Starving Endpoint
Endpoint starvation is solved, and the endpoint is “fed,” with a new PCIe switch feature called Read Pacing, which is available on the latest Gen 2 PCIe switches.

Read Pacing provides increased system performance through a more balanced allocation of bandwidth to the downstream ports of the switch. With Read Pacing, the switch can apply rules to prevent one port from overwhelming the completion bandwidth or buffering in the system.

Figure 2 below shows the same example, with an FC HBA and an Ethernet NIC on the downstream ports of a switch that aggregates traffic into a root complex. The FC HBA makes several 2KB read requests.

Figure 2. Read pacing eliminates endpoint starvation

With Read Pacing, the switch limits how many of the FC HBA's read requests are forwarded at a time. Programmable registers in the switch control the number of read requests forwarded to the root complex.

As the Ethernet NIC makes its two 1KB read requests, the switch allows both read requests through, thus balancing the flow of data from both endpoints. As shown in Figure 2, a 2KB read for the FC HBA through the root complex is immediately followed by two 1KB reads for the Ethernet NIC, resulting in balanced traffic for each endpoint.

Read Pacing allows the Ethernet NIC to be serviced more frequently without impacting the bandwidth of the FC HBA. Hence, endpoint starvation is eliminated with Read Pacing. The chart below compares the performance achieved with and without Read Pacing in a real-world system, where the FC HBA issues 16 4KB read requests ahead of the Ethernet NIC's single 1KB read request.
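The pacing behavior can be sketched as a simple arbitration loop. The pacing limit of one outstanding FC read per round is an assumed register value chosen for illustration; real devices expose this through device-specific configuration registers:

```python
from collections import deque

# Hypothetical sketch of Read Pacing on the text's example: the FC HBA
# has sixteen 4KB reads queued ahead of the NIC's single 1KB read. The
# switch forwards at most PACE_LIMIT FC reads before letting another
# port's pending request through.
PACE_LIMIT = 1  # assumed value of the programmable pacing register

fc = deque([4096] * 16)   # sixteen 4KB FC HBA read requests
nic = deque([1024])       # one 1KB Ethernet NIC read request

order = []
while fc or nic:
    for _ in range(PACE_LIMIT):
        if fc:
            order.append(("FC", fc.popleft()))
    if nic:
        order.append(("NIC", nic.popleft()))

# Bytes of FC traffic the NIC must wait behind before being serviced:
nic_pos = next(i for i, (src, _) in enumerate(order) if src == "NIC")
paced_wait = sum(size for _, size in order[:nic_pos])
print(f"paced: NIC waits behind {paced_wait} bytes")   # 4096 bytes
print(f"unpaced: NIC waits behind {16 * 4096} bytes")  # 65536 bytes
```

With pacing, the NIC's read slots in after a single FC read instead of after all sixteen, while the FC HBA still receives its full request stream.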

Increase Performance by Optimizing Buffer Size Dynamically
Early PCIe switch architectures provided each port with a fixed amount of buffer RAM. Figure 3 below compares a typical buffer allocation, seen in older switch designs, with the new Dynamic Allocation scheme found in the latest Gen 2 switches.

Figure 3. Dynamic allocation leads to more buffers

In this example, a six-port switch is designed with a total of 30 packet buffers, with five buffer segments available on each port. If only four ports are used, then the buffers allocated to the two unused ports are wasted.

Since a larger buffer translates into better performance, it would be nice if that unused memory could be used to increase the size of the buffers on the four ports that are being used.

In the latest Gen 2 switches, it is possible to do just that. This feature is known as Dynamic Buffer Allocation: a shared memory pool is available to any port, and buffer sizes are allocated dynamically depending on the number of ports in use.
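The arithmetic of the Figure 3 comparison can be sketched directly. The even split across active ports is one plausible redistribution policy; the article does not specify the exact algorithm:

```python
# Sketch of the Figure 3 comparison: a six-port switch with 30 packet
# buffers. Statically, each port gets 5 buffers whether it is used or
# not; with Dynamic Buffer Allocation the pool is divided among only
# the ports actually in use (even split assumed for illustration).
TOTAL_BUFFERS = 30
NUM_PORTS = 6

def static_buffers(ports_in_use):
    # fixed 5 per port; buffers on unused ports are wasted
    return {p: TOTAL_BUFFERS // NUM_PORTS for p in ports_in_use}

def dynamic_buffers(ports_in_use):
    # the whole pool is shared among active ports
    per_port = TOTAL_BUFFERS // len(ports_in_use)
    return {p: per_port for p in ports_in_use}

active = ["P0", "P1", "P2", "P3"]  # only four of six ports populated
print(static_buffers(active))   # 5 buffers each; 10 buffers wasted
print(dynamic_buffers(active))  # 7 buffers each (30 // 4)
```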

Increasing Performance by Sizing Buffers Dynamically
Figure 4 below compares a static buffer-per-port scheme with a dynamic scheme on a switch configured with three differing port widths. Since the narrower ports require less bandwidth than the wider ports, they should require fewer packet buffers as well.

Figure 4. Dynamic allocation allows more appropriate buffer sizing

In this example, a x8 upstream port is servicing three downstream ports: one x1 port, one x4 port, and one x8 port. With a static fixed-buffer-per-port architecture, the x1 port is allowed the same buffer size as the x8 ports. Not only is this not the optimal buffer assignment, but there are two unused groups of packet buffers.

With Dynamic Allocation, buffers are assigned to each port based on its width. Since there are no unused buffers, a larger total amount of buffering is available, increasing the buffer size that may be applied to the ports that need the extra bandwidth.

In this example, in the bottom half of Figure 4, ten packet buffers are allocated to each of the x8 ports, whereas six buffers are given to the x4 port and four buffers are available for the x1 port. Thus the amount of buffering available on a given port is dynamically assigned based on the traffic loading on each port, resulting in higher overall system performance.
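The width-based assignment above can be written out using the exact buffer counts the example gives per port width (these are the article's illustrative values, not a general formula):

```python
# Sketch of the Figure 4 allocation: buffers per port, keyed by lane
# width, using the counts stated in the text (x8 -> 10, x4 -> 6, x1 -> 4).
BUFFERS_BY_WIDTH = {8: 10, 4: 6, 1: 4}

ports = [("upstream x8", 8), ("downstream x1", 1),
         ("downstream x4", 4), ("downstream x8", 8)]
allocation = {name: BUFFERS_BY_WIDTH[width] for name, width in ports}
total = sum(allocation.values())
print(allocation)
print("total:", total)  # all 30 buffers in use, none wasted
```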

Real-World Implementation of Dynamic Buffer Allocation
A real-world implementation of Dynamic Allocation can be seen in Figure 5 below. Here, a 24-lane PCIe Gen 2 switch is configured with a x8 upstream port, a x8 downstream port, and two x4 downstream ports.

Figure 5. Dynamic allocation using a 24-lane switch

This switch's configuration has been set up by the user with assigned buffer space for each port and an uncommitted common (or shared) buffer pool per 16 lanes. The buffers have been assigned proportional to the port width, i.e., the x8 ports each have 10 packet buffers and the x4 ports have four each.

A common buffer memory pool is set up with five packet buffers for each of the 16 downstream lanes. Each of the ports may dynamically grab buffers as needed to support its own traffic bandwidth.

For example, a port may grab buffers when its assigned buffer memories are full; conversely, a port may return buffers to the pool when they are empty. This dynamic reallocation has two benefits in switch design: it makes full use of the on-chip buffer memory, and it requires less overall memory to achieve optimal performance.
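The grab/return behavior can be sketched as a small shared-pool model. The pool size of five buffers and the per-port budgets follow the article's example figures; the class and method names are purely illustrative:

```python
# Hypothetical sketch of the grab/return behavior described for Figure 5:
# each port has an assigned buffer budget, plus a shared pool it may
# borrow from when its own buffers fill and must return to when idle.
class Port:
    def __init__(self, name, assigned):
        self.name = name
        self.assigned = assigned  # buffers permanently assigned to this port
        self.borrowed = 0         # buffers currently borrowed from the pool

class SharedPool:
    def __init__(self, size):
        self.free = size

    def grab(self, port, n):
        """Borrow up to n buffers when the port's assigned buffers are full."""
        granted = min(n, self.free)
        self.free -= granted
        port.borrowed += granted
        return granted

    def release(self, port):
        """Return all borrowed buffers once the port's traffic drains."""
        self.free += port.borrowed
        port.borrowed = 0

pool = SharedPool(size=5)               # assumed pool size from the example
x8_ds = Port("x8 downstream", assigned=10)

pool.grab(x8_ds, 3)   # a burst on the x8 port borrows 3 extra buffers
print(pool.free)      # 2 buffers left for the other ports
pool.release(x8_ds)
print(pool.free)      # back to 5 once the burst drains
```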

Port Flexibility Improves Performance, Simplifies Layout
In previous generations of PCIe switches, one port was fixed as the upstream port while all other ports were defined as downstream, with severely limited lane-count/port-count combinations.

A new wave of PCIe Gen 2 switches now offers flexible and versatile port configuration schemes, with ports configurable as x1, x2, x4, x8, and x16 for maximum port bandwidth ranging from 250MB/s (x1 port, Gen 1 signaling) to 8GB/s (x16 port, Gen 2 signaling), with several intervals in between. This makes it easier to optimize lane bandwidth, power dissipation, and port layout trace width from port to port.

In addition, these new switches support auto-negotiation of the port width, reducing the number of active lanes in a port to match the connected endpoint. For example, if a NIC with a x4 port is connected to a x8 (or x16) port on the switch, the switch will automatically reduce the number of active lanes for that port down to a x4 configuration.
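The bandwidth range and the width negotiation described above reduce to simple arithmetic. The per-lane rates are the standard PCIe effective figures after 8b/10b encoding (250MB/s for Gen 1, 500MB/s for Gen 2):

```python
# Sketch of the port-bandwidth arithmetic and width auto-negotiation
# described in the text. Per-lane effective rates after 8b/10b encoding:
MB_PER_LANE = {1: 250, 2: 500}  # PCIe generation -> MB/s per lane

def port_bandwidth_mb(width, gen):
    """Maximum port bandwidth in MB/s for a given lane width and generation."""
    return width * MB_PER_LANE[gen]

def negotiate_width(switch_port_width, endpoint_width):
    # The link trains down to the narrower of the two partners.
    return min(switch_port_width, endpoint_width)

print(port_bandwidth_mb(1, 1))   # 250 MB/s: x1 port, Gen 1 signaling
print(port_bandwidth_mb(16, 2))  # 8000 MB/s (8GB/s): x16 port, Gen 2 signaling
print(negotiate_width(8, 4))     # x4 NIC on a x8 switch port trains to x4
```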

Selectable Upstream Port Simplifies High-Performance Layout
These newer switches also support a movable upstream port; in fact, any port can be defined as the upstream port in these devices. Upstream-port assignment can thus be optimized to meet the needs of the traffic through each port of the switch.

Additionally, the layout of a system board is enhanced by this flexible upstream port assignment. Figure 6 below illustrates how, in a storage application, a flexible upstream port assignment allows high-speed traces to be spread evenly on a system board with a 16-lane switch configured with one x4 upstream (US) port and three x4 downstream (DS) ports. The system on the left uses a switch with a fixed US port.

Figure 6. Port flexibility enhances board layout

The fixed US port creates severe trace congestion, since the DS ports are required to route through the SATA connectors, creating an undesirable crosstalk environment. The photo on the right shows the same system with a switch that has a flexible US port. This flexibility allows the layout designer to avoid routing the high-speed PCIe lanes through the equally fast SATA II data paths, thus reducing crosstalk, enhancing signal integrity, and improving transmission margin.

Dual Cast
In addition to balancing bandwidth and improving buffer allocation, these new switches also support Dual Cast, a feature that copies data packets from one ingress port to two egress ports, allowing for higher performance in dual-graphics, storage, security, and redundant applications.

Figure 7. Dual Cast Fibre Channel HBA

Without Dual Cast, the CPU must generate twice the number of packets, requiring twice the processing power. Figure 7 above illustrates a redundant storage array, where a PCIe Gen 2 switch uses Dual Cast to store data on two RAID disk arrays. Additionally, the same card can be used for non-redundant applications.
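The packet-count saving can be sketched as follows: the host writes each block once, and the replication to both egress ports happens in the switch. The function and port names are illustrative only:

```python
# Hypothetical sketch of Dual Cast: the switch, not the CPU, replicates
# each ingress packet to two egress ports, so the host generates each
# packet only once instead of twice.
def dual_cast(packet, egress_a, egress_b):
    """Copy one ingress packet to both egress queues."""
    egress_a.append(packet)
    egress_b.append(bytes(packet))  # independent copy for the second array

raid_a, raid_b = [], []                          # the two RAID disk arrays
host_writes = [b"block-%d" % i for i in range(3)]  # CPU generates each packet once
for pkt in host_writes:
    dual_cast(pkt, raid_a, raid_b)

# 3 packets generated by the host; both arrays receive all 3 copies.
print(len(host_writes), len(raid_a), len(raid_b))  # 3 3 3
```

Without Dual Cast, the host would have had to build and transmit six packets to achieve the same redundancy.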

This new generation of PCIe switches supports Gen 2 signaling, doubling the per-lane throughput of the previous devices. Furthermore, new data-flow architectures are being deployed in these switches to optimize bandwidth and memory utilization while minimizing latency and power dissipation. Each of these features makes a significant contribution to dramatic improvements in system and I/O performance in embedded systems.

Steve Moore is senior product marketing manager at PLX Technology, Sunnyvale, Calif. He can be reached
