Minimizing latency in diverse embedded system design environments

With speeds of up to 5.0GT/s for PCI Express (PCIe) Gen 2 and up to 8GT/s with the forthcoming PCIe Gen 3, it's easy to see why this powerful interconnect standard is used across multiple high-performance embedded applications, as well as in servers and storage systems.

These applications are designed to take advantage of PCIe's high-throughput capabilities, but they also face performance-limiting latency that is masked by the large amounts of data moving through the system. Device latency, therefore, plays a hidden yet critical role in how well embedded and communications systems perform.

This article will address how latency issues in embedded (and other) systems have been successfully and measurably countered in PCIe's first two generations and what designers can expect with PCIe Gen 3 on the horizon.

“High performance” is, in many instances, associated with high throughput. Though bandwidth and performance go hand in hand, other factors contribute significantly to overall performance in systems whose applications are not throughput-bound. In such applications, latency — specifically device latency — plays the larger role in overall performance.

System latency can be narrowed down to two key contributors: the amount of time for an endpoint to respond to a read request, and the amount of time it takes a packet to traverse across a device such as a PCIe switch. For posted transactions, the system latency is the sum of the latency for the individual components.

For non-posted transactions, such as memory and configuration reads, the latency is doubled to account for the round-trip delay. Depending on the number of endpoints in a system, multiple switches can be cascaded to increase the number of PCIe connections. The more switches in a PCIe fabric, the higher the aggregate device latency.
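As a rough sketch of how these contributions add up, the following C model compares posted (write) and non-posted (read) latency through a cascade of switches. The per-switch and endpoint-response figures are assumptions chosen for illustration, not measured values:

```c
#include <stdio.h>

/* Hypothetical figures for illustration only; actual values are
 * device- and design-specific. */
#define SWITCH_LATENCY_NS   150u  /* one pass through one switch */
#define ENDPOINT_RESP_NS    400u  /* endpoint read-response time */

/* Posted write: the packet crosses each switch once. */
static unsigned posted_latency_ns(unsigned num_switches)
{
    return num_switches * SWITCH_LATENCY_NS;
}

/* Non-posted read: the request and its completion each cross every
 * switch, so the switch latency is doubled for the round trip. */
static unsigned read_latency_ns(unsigned num_switches)
{
    return 2u * num_switches * SWITCH_LATENCY_NS + ENDPOINT_RESP_NS;
}

int main(void)
{
    for (unsigned n = 1u; n <= 3u; n++)
        printf("%u switch(es): write %u ns, read %u ns\n",
               n, posted_latency_ns(n), read_latency_ns(n));
    return 0;
}
```

Each additional cascaded switch adds its latency once to every write and twice to every read, which is why flat, high-port-count fabrics tend to outperform deep cascades in latency-sensitive designs.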

Figure 1: Latency as measured in a PCIe switch

Protocol efficiency, although not directly impacted by system latency, also plays a key role in overall performance. PCIe is a packet protocol: each packet consists of a header of up to 4DW, which carries the routing information, and the actual data payload. Per the PCIe specification, a packet can carry up to 4KB of data payload. Each PCIe packet, regardless of the size of its data payload, must include a header.

Storage and server applications move many megabytes of data through the system at any given time. A posted transaction from a Fibre Channel (FC) controller, for example, can transfer up to 4KB of data. This large transaction is broken down into multiple PCIe packets, depending on the PCIe maximum payload size (MPS), typically 128B or 256B (determined by the least-capable device in the system).

The completion(s) associated with read requests are likewise broken down into multiple PCIe packets, depending on the system's PCIe MPS. To reduce protocol overhead and increase system throughput, an FC controller issues multiple outstanding requests at once, masking the system latency.
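A short sketch makes the segmentation arithmetic concrete. It assumes roughly 24B of per-packet overhead (physical-layer framing, sequence number, a 4DW header, and LCRC); the exact figure varies with header size and optional ECRC:

```c
#include <stdio.h>

/* Assumed per-packet overhead of ~24B; varies with header size
 * and whether ECRC is used. */
#define TLP_OVERHEAD_B  24u

static void show_segmentation(unsigned transfer_b, unsigned mps_b)
{
    unsigned packets = (transfer_b + mps_b - 1u) / mps_b;  /* round up */
    unsigned wire_b  = transfer_b + packets * TLP_OVERHEAD_B;
    printf("%uB transfer, %uB MPS: %u packets, efficiency %.1f%%\n",
           transfer_b, mps_b, packets, 100.0 * transfer_b / wire_b);
}

int main(void)
{
    show_segmentation(4096u, 128u);  /* 32 packets, ~84% efficient */
    show_segmentation(4096u, 256u);  /* 16 packets, ~91% efficient */
    return 0;
}
```

Doubling the MPS from 128B to 256B halves the packet count and recovers several points of efficiency, which is why a single low-MPS device in the fabric can drag down the whole system.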

Communication systems consist of a data plane and a control plane. The requirements for the data plane are very demanding in terms of throughput and quality of service, so proprietary, application-specific ASICs are normally used. These ASICs are also connected to a control plane, which is managed by a control processor.

PCIe is increasingly used in the control plane of communication systems. Unlike the data plane, the control plane is not bandwidth- or throughput-intensive. Instead, what matters more is moving small blocks of data to and from the PCIe endpoints quickly and efficiently.

A control plane is used to issue configuration transactions to the configuration registers of all the ASICs. These transactions include initial configuration of the device as well as other critical configuration updates.

These configuration register accesses to the ASICs are small — typically 4 bytes at a time — and almost always to a large number of registers. During system run-time, the control plane is also used to gather status updates from the ASICs by reading the status registers.
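In C, such accesses typically reduce to volatile loads and stores against a memory-mapped register window. The base address and register offsets below are hypothetical, purely for illustration:

```c
#include <stdint.h>

/* Hypothetical base address at which one ASIC's register block is
 * mapped into the control processor's address space over PCIe. */
#define ASIC_REGS_BASE  0x80000000u

#define REG_CTRL    0x0000u  /* example offsets, illustration only */
#define REG_STATUS  0x0004u

static inline uint32_t asic_read32(uint32_t offset)
{
    /* Each read is a 4-byte non-posted PCIe transaction; the CPU
     * stalls until the completion returns across the fabric. */
    return *(volatile uint32_t *)(uintptr_t)(ASIC_REGS_BASE + offset);
}

static inline void asic_write32(uint32_t offset, uint32_t value)
{
    /* Writes are posted; the CPU does not wait for a response. */
    *(volatile uint32_t *)(uintptr_t)(ASIC_REGS_BASE + offset) = value;
}
```

Note the asymmetry: the posted writes cost the processor little, but every one of those 4-byte reads exposes the full round-trip latency of the fabric.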

Figure 2: PCIe in communications systems

Both types of access require fast delivery of the data. Designers should keep in mind that it is the control processor that issues both the configuration accesses and the status-register reads to the ASICs.

When issuing a read request, the processor waits for the read completion before it issues the next read request. This is done for every configuration and status register being read, and for every ASIC in the system.
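A minimal sketch of this serialized sweep shows how the round-trip latency multiplies. The ASIC and register counts are assumptions, and asic_read_reg() is a hypothetical blocking helper (for example, built on the volatile access shown earlier):

```c
#include <stdint.h>

#define NUM_ASICS        8u
#define NUM_STATUS_REGS  16u

/* Hypothetical blocking helper: a 4-byte read of one register
 * on one ASIC. */
extern uint32_t asic_read_reg(unsigned asic, unsigned reg);

/* Serialized status sweep: each read blocks for a full round trip
 * before the next begins. With round-trip latency T, the processor
 * idles for roughly NUM_ASICS * NUM_STATUS_REGS * T in total, so
 * every nanosecond shaved off T is paid back 128 times per sweep. */
void poll_all_status(uint32_t status[NUM_ASICS][NUM_STATUS_REGS])
{
    for (unsigned a = 0u; a < NUM_ASICS; a++)
        for (unsigned r = 0u; r < NUM_STATUS_REGS; r++)
            status[a][r] = asic_read_reg(a, r);
}
```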

In the example shown in Figure 2 above, reducing the system latency has a direct impact on system performance. The lower the system latency, the faster the data travels across the PCIe fabric, reducing the amount of time the processor sits idle waiting on completions.

This same model applies to embedded applications, where PCIe is used as the control plane — in some cases, as both the control and data plane. A printer application (Figure 3, below) is an example where PCIe serves as both the control and data plane, and where the ASICs are controlled via the PCIe interface.

Figure 3: Printer block diagram

Today, Gen 2-based PCIe switches support cut-through technology, resulting in latencies as low as 140ns, the lowest available in the industry.

The cut-through mechanism allows the ingress port of a switch to forward the PCIe packet to the egress port, even though the entire packet has not been received. In contrast, a store-and-forward switch will wait for the entire packet to be received, store it in its buffers, and then forward it to the egress port, thus increasing the overall device latency.
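The difference is easy to model: a store-and-forward switch's latency grows with packet size because it must absorb the whole packet before forwarding it, while a cut-through switch's delay is roughly fixed. In the sketch below, the store-and-forward base delay is an assumption; the 2B/ns ingest rate follows from a x4 Gen 2 link (4 lanes at 5GT/s with 8b/10b encoding):

```c
#include <stdio.h>

#define CUT_THROUGH_NS      140.0  /* fixed cut-through latency (cited above) */
#define SAF_BASE_NS          50.0  /* assumed internal pipeline delay         */
#define LINK_BYTES_PER_NS     2.0  /* x4 Gen 2: 4 * 5GT/s * 8b/10b / 8 bits   */

int main(void)
{
    /* Store-and-forward must first receive the whole packet, so its
     * latency includes the packet's serialization time on the wire. */
    for (unsigned pkt = 128u; pkt <= 2048u; pkt *= 2u) {
        double saf = SAF_BASE_NS + pkt / LINK_BYTES_PER_NS;
        printf("%4uB packet: store-and-forward ~%.0f ns, cut-through ~%.0f ns\n",
               pkt, saf, CUT_THROUGH_NS);
    }
    return 0;
}
```

At a 2KB payload, the store-and-forward penalty exceeds a microsecond per hop, whereas cut-through stays flat regardless of packet size.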

In addition to their low latency, these cut-through switches provide a high number of ports, from four up to 16, for increased connectivity. This increases the number of PCIe connections and removes the need to cascade multiple switches, resulting in lower overall system latency. And although most PCIe devices on the market today support 128B or 256B data payloads, these PCIe switches support an MPS of up to 2KB.

Furthermore, these devices also provide support for two Virtual Channels (VCs). Having multiple VCs allows differentiation of traffic for packets coming into the switch.

Higher-priority packets are assigned their own buffer structure, separate from the rest of the packets. This provides a dedicated path through the switch, which in turn results in lower latency for that traffic class and eliminates queuing delays caused by congestion.

The built-in DMA engine in some of these devices also helps reduce system latency. When used in systems where the data flows have been carefully considered, the DMA engine in the PLX switch provides enough outstanding read-request tags to mask the round-trip latency of a request-completion pair.
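Sizing that tag count follows the familiar bandwidth-delay-product rule of thumb: keep enough read data in flight to cover one full round trip. The link bandwidth and round-trip figures below are assumptions for illustration:

```c
#include <stdio.h>

int main(void)
{
    double link_gbps     = 16.0;   /* assumed: x4 Gen 2 after 8b/10b   */
    double round_trip_ns = 1000.0; /* assumed request-completion time  */
    double read_size_b   = 256.0;  /* bytes returned per read request  */

    /* Bandwidth-delay product: bytes that must be in flight to keep
     * the link busy for one full round trip. */
    double in_flight_b = link_gbps / 8.0 * round_trip_ns;

    printf("~%.0f outstanding requests needed\n",
           in_flight_b / read_size_b);   /* 2000B / 256B -> ~8 tags */
    return 0;
}
```

If the DMA engine supports fewer tags than this, the link goes idle between completions and read throughput drops accordingly.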

Figure 4: Control Plane application with 15 x1 endpoints

Summary
Of the two main factors determining overall embedded system performance — throughput and latency — it's the latency that plays the more significant role. For this reason, there are PCIe switches available today with the industry's lowest latency, the largest number of ports for control-plane applications, the highest number of Virtual Channels, and a built-in DMA engine with multiple-outstanding-read capability to mask latency.

These PCIe switches improve system performance by delivering low device latency and by providing enough ports that multiple switches need not be cascaded, eliminating the added latency cascading would introduce.

Although the features described above are essential in reducing overall system latency, ultimately the key to defeating latency is structuring data flows so that the CPU or DMA engine does not sit idle waiting for its operations to complete.

With Gen 3 on the horizon, the PCIe standard continues to make progress. PCIe switch leaders are once again improving their switch architectures, which will result in higher-bandwidth switches with significantly lower device latency and increased overall performance. Concurrently, designers have begun developing Gen 3-based systems for embedded, storage, server, communications and other applications.

Miguel Rodriguez is senior marketing engineer at PLX Technology, Sunnyvale, Calif. He can be reached at mrodriguez@plxtech.com.
