Free up bandwidth in PCI Express designs

Designs using PCI Express, for all the performance bang that comes from its 2.5 GT/s (gigatransfers per second) Gen 1 and 5 GT/s Gen 2 speeds, usually address one target at a time. This means an identical write packet must be sent to two addresses sequentially instead of simultaneously, a process that tends to exact a penalty on system throughput.

To counter this, dual-cast techniques allow a PCI Express switch to make a copy of a posted write packet and send it to a second address, saving the original source the trouble of transmitting the same data twice to two different locations.

Dual casting enhances performance in applications where data must be copied simultaneously to multiple locations within a PCI Express fabric, essentially doubling throughput between the originating source and the target addresses. Designers can implement dual casting to eliminate an unnecessary step, help boost system throughput, and get a little “free bandwidth” out of it.

Table 1 shows the standard types of system read and write cycles allowed in PCI Express. The transaction layer packets, or TLPs, are the payloads that communicate information between source and destination. It is through this exchange of data back and forth that the operating system executes work across the system.

Of the transaction types available in PCI Express, only a memory write and a message are posted transactions, meaning a communication sequence for which no return action is required of the recipient. For example, a system message, such as a surprise error event, is a transaction in which the initiator informs the target but doesn't expect a return message; the initiator might no longer be functional enough to accept one.

Similarly, a transaction for which the user simply wants to deposit information, such as a write to memory, doesn't require a system response. Today's memory systems are very robust; if the integrity of the packet presented to the memory device can be guaranteed, there is little need to read it back for comparison. (Note that PCI Express provides a separate mechanism, called a replay buffer, which ensures safe link-to-link TLP navigation.) Hence, writing to memory requires traffic flow in one direction, with no requirement for a return action.

Non-posted transactions
In contrast, a non-posted transaction is one that requires a response by the recipient. A memory read is a request for a return of information. TLPs flow in both directions; the initiator first issues a read request and the memory then returns the requested information. The completer is obligated to “complete” the request. Interestingly, a read completion can be with or without data, depending on the request.

As noted in Table 1, two basic types of transactions perform system reads or writes: memory cycles and IO cycles. The use of IO cycles has a history in legacy (conventional PCI) devices. Conceptually, IO reads and writes are targeted at registers, typically within an IO device. An IO write loads a value into a register, then returns a completion confirming delivery. IO writes are implicitly atomic transactions and are not the preferred means of high-bandwidth data delivery.

While IO reads and IO writes are part of the standard, usage of these mechanisms is really the result of legacy endpoint devices. PCI Express encourages the use of a memory write over an IO write because it is twice as bandwidth-efficient as the legacy approach. That said, PCI Express still maintains the ability to generate IO read and IO write cycles, and compatibility with legacy PCI operation (through the use of a PCI Express-to-PCI bridge).
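The posted/non-posted split described above can be summarized in a few lines. This is a minimal sketch covering only the transaction types named in this article; the names `TlpKind`, `is_posted` and `min_tlps_per_transaction` are illustrative, not taken from any specification or driver API.

```python
from enum import Enum


class TlpKind(Enum):
    MEMORY_READ = "memory read"
    MEMORY_WRITE = "memory write"
    IO_READ = "IO read"
    IO_WRITE = "IO write"
    MESSAGE = "message"


# Posted transactions need no completion from the recipient.
POSTED = frozenset({TlpKind.MEMORY_WRITE, TlpKind.MESSAGE})


def is_posted(kind):
    """True when the initiator does not wait for a completion."""
    return kind in POSTED


def min_tlps_per_transaction(kind):
    """Request only if posted; request plus at least one completion otherwise."""
    return 1 if is_posted(kind) else 2


if __name__ == "__main__":
    for kind in TlpKind:
        print(kind.value, "posted" if is_posted(kind) else "non-posted")
```

Counting packets this way shows why an IO write, which must wait for a completion, puts more traffic on the link than a memory write carrying the same data.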

Dual cast
Transactions that require the completer to return a response increase link traffic. Dual casting avoids that cost by building on the same posted memory write principles and features.

Dual casting is the process of taking a single memory-write transaction targeted at two separate memory locations and performing the host operation in a single PCI Express transaction, rather than two. Because no return completion is necessary and link-to-link signal integrity is managed by the protocol itself, data delivery for memory write-intensive mirroring applications is effectively doubled.
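As a rough illustration of the idea, the sketch below models a switch that copies any memory write landing in a configured address window to the same offset in a second region. The window-and-mirror scheme, the class names, and the addresses are all assumptions made for this example; they do not describe the register interface of any particular switch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MemoryWrite:
    address: int
    payload: bytes


@dataclass
class DualCastSwitch:
    """Toy switch: writes inside the window are also copied to mirror_base."""
    window_base: int
    window_size: int
    mirror_base: int
    delivered: list = field(default_factory=list)

    def forward(self, tlp):
        self.delivered.append(tlp)  # original destination, always delivered
        offset = tlp.address - self.window_base
        if 0 <= offset < self.window_size:
            # Hypothetical dual-cast hit: the switch, not the source, makes the copy.
            self.delivered.append(MemoryWrite(self.mirror_base + offset, tlp.payload))


def source_tlps_needed(blocks, dual_cast):
    """Packets the source must transmit to land each block in two places."""
    return blocks if dual_cast else 2 * blocks


if __name__ == "__main__":
    switch = DualCastSwitch(window_base=0x1000_0000, window_size=0x1000,
                            mirror_base=0x2000_0000)
    switch.forward(MemoryWrite(0x1000_0040, b"frame"))
    print([hex(t.address) for t in switch.delivered])
```

In this model the source transmits one packet per block instead of two, which is where the freed source-port bandwidth comes from; the downstream links still each carry one copy.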

Performance improves through both increased available source-port bandwidth and reduced CPU utilization. As embedded-system designers become aware of and use this feature, performance gains will be seen in graphics, storage, computing and imaging.

Dual casting is ideally suited to enhance the performance of dual-GPU systems. As Figure 1 shows, two GPUs are used to paint a single screen. Because graphics processing has a predefined set of steps running in parallel and in processing order, the CPU can simultaneously cast drawing commands to both GPUs for processing.

Each GPU then renders its specified portion of the screen. As shown in the figure, once per frame, GPU2 then transfers its image to GPU1 via the peer-to-peer PCI Express communication feature already built into the PCI Express switch.

GPU1 updates the screen with both images, providing more realistic, high-bandwidth video. By using dual casting, CPU utilization is reduced, providing more cycles for general-purpose processing of other activities.

An entire industry has arisen out of the need to have and ensure accurate copies of data generated within a server system. Disk mirroring (RAID 1) can be made easier and faster with the introduction of dual-casting operation. The availability of dual casting, coupled with the native speed of PCI Express, can be used to mirror data without compromising system performance.

In standard mirror applications, the user can choose synchronous or asynchronous duplication. (Asynchronous means that no confirmation is made of the copy set of data, so as to maximize system performance.)

With dual casting implemented, there is no need for such a choice. There is neither a loss in data integrity nor a drop in performance, because the mirror copy is created and verified the same as any other posted transaction. (Note, however, that a memory write is a standard PCI Express posted transaction and, as such, does not receive a completion TLP. The DLLP ACK/NAK protocol is still maintained to ensure packet integrity.)
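The link-level protection mentioned above can be sketched as follows: the transmitter keeps a copy of every TLP in a replay buffer until the link partner acknowledges it, and retransmits whatever is still outstanding after a NAK. This is a simplified model with illustrative names; it leaves out the replay timer, replay counts and CRC checking that real hardware uses, and it assumes far fewer packets are outstanding than the 12-bit sequence space allows.

```python
from collections import OrderedDict

SEQ_MODULO = 4096  # data link layer sequence numbers are 12 bits wide


class ReplayBuffer:
    """Toy transmitter-side replay buffer for a single link."""

    def __init__(self):
        self._next_seq = 0
        self._pending = OrderedDict()  # sequence number -> TLP bytes, oldest first

    def send(self, tlp):
        """Keep a copy of the TLP until the link partner acknowledges it."""
        seq = self._next_seq
        self._pending[seq] = tlp
        self._next_seq = (seq + 1) % SEQ_MODULO
        return seq

    def ack(self, seq):
        """ACK: everything up to and including seq arrived intact, so drop it."""
        if seq not in self._pending:
            return  # stale or duplicate acknowledgement
        while self._pending:
            oldest, _ = self._pending.popitem(last=False)
            if oldest == seq:
                break

    def nak(self, last_good_seq):
        """NAK: drop what did arrive, then return the rest for retransmission."""
        self.ack(last_good_seq)
        return list(self._pending.values())

    def pending(self):
        """Sequence numbers still awaiting acknowledgement, oldest first."""
        return list(self._pending.keys())
```

Because this handshake runs independently on every link, each copy of a dual-cast write is protected on its own hop without any completion TLP travelling back to the source.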

Figure 2 highlights a dual-cast storage application where data is mirrored (see blue arrows) on separate systems. Additionally, system read availability is enhanced as each CPU can independently access the mirrored information.

Non-transparent (NT) bridging is used to connect two PCI Express root complexes together without one device overwriting the allocated register space of the other.

Dual-hosted communications / redundancy
The same data mirror, as shown in Figure 3, can also act as a redundant, dual-host communication system. A memory write from Host 1 or an endpoint can be mirrored in the redundant system. (Figure 3 shows a packet sent upstream to both hosts.) As the main processor or endpoint completes its activity, the memory write packet data is updated and sent via dual casting across the link.

A separate process completion message is sent via the NT link to the back-up processor. In the event the backup processor doesn't receive a process completion message within an expected timeout window, the backup is able to assume operation of the system. Minimal reinitialization is necessary because the data is available and duplicated for the second host.
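The takeover rule described above amounts to a watchdog on the backup host. The sketch below is one possible shape for it, with hypothetical names and a caller-supplied clock (for example `time.monotonic()`); how the completion message actually arrives over the NT link is outside its scope.

```python
class BackupHostWatchdog:
    """Backup-side timer: take over if completion messages stop arriving."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.last_completion = now  # time the last completion message was seen
        self.active = False         # True once the backup has assumed operation

    def on_completion_message(self, now):
        """Call whenever a process completion message arrives over the NT link."""
        self.last_completion = now

    def poll(self, now):
        """Return True if the backup host is (or has just become) active."""
        if not self.active and now - self.last_completion > self.timeout_s:
            # The mirrored data is already local, so little reinitialization is needed.
            self.active = True
        return self.active


if __name__ == "__main__":
    wd = BackupHostWatchdog(timeout_s=0.5, now=0.0)
    wd.on_completion_message(now=0.4)
    print(wd.poll(now=0.8))  # still within the window since the last message
    print(wd.poll(now=1.0))  # window exceeded, backup assumes operation
```

Passing the time in explicitly keeps the logic easy to test; a real system would also need a policy for handing control back once the primary host recovers.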

Computing / Imaging
An additional avenue of performance enhancement is along the lines of GPU cluster-based computing. In these systems, a CPU manages the activities of several GPUs using specialized software for distributed computing.

In computational applications such as medical imaging, seismic analysis, fluid dynamics/dispersion, financial analysis, and computational biology, GPU-based processing provides a significant advantage in processing power, size and power consumption over standard CPU clusters.

With PCI Express now the de facto interface for all high-performance GPU cards, dual casting provides another means of potentially improving performance in these applications.

Free lunch?
“Free bandwidth” may sound like a bold statement, but I believe that dual casting, and PLX's Dual Cast in particular, greatly increases the value of PCI Express in many bandwidth-demanding applications, including graphics, storage, computing, communications and embedded systems.

With the availability of PCI Express Gen 2 switches, embedded system designers using Dual Cast will continue to extend the performance and value of systems based on this powerful, versatile interconnect technology.

Reggie Conley is director of applications engineering at PLX Technology, Sunnyvale, Calif. PLX specializes in PCIe switches and bridges; Dual Cast is a patented product from PLX.
