PCIe, the serial interconnect upgrade to the bus-based PCI and PCI-X standards, was defined to provide increased, scalable performance and alleviate the signal-integrity and board-layout issues introduced by the historical widening of the parallel buses.
The need for this upgraded interconnect standard was most acute in demanding desktop, enterprise computing, and storage applications, and that need drove the PCI Special Interest Group (PCI-SIG) and component suppliers to tailor the initial specification and system solutions to the emergent requirements of those applications.
Despite the overwhelming momentum of PCIe and its ubiquity in computing and storage applications, however, adoption in embedded and communications applications has been largely limited.
Historically, the adoption of new interconnect technologies in these applications trails adoption in other markets due to longer design cycles and product lifetimes. In the transition to PCIe specifically, the embedded and communications markets have also lagged because PCI and PCI-X, used predominantly in the control plane, continued to ably meet system performance needs.
Today, next-generation designs and product refreshes naturally gravitate toward PCIe to leverage the rich ecosystem of off-the-shelf processor, peripheral, and switching solutions that feature PCIe as the native interface. Even so, adoption remains limited, because the PCIe specification and PCIe-based interconnect and switching solutions have had to evolve to address the specific needs of these markets.
The PCIe specification defines a tree-based PCIe topology with a single root and multiple leaves that is well suited for efficient connectivity between a single computing complex and its associated local I/O. This structure, a natural fit for server and storage applications, does not contemplate and does not readily lend itself to system interconnectivity in multi-root systems.
Advanced communications and embedded systems often feature distributed computing and intelligence and, over time, had adapted PCI and PCI-X constructs to support such architectures. Adoption of PCIe as the primary system interconnect requires extensions to the specification to support constructs for optimized resource utilization, efficient data transmission and sharing, and system coherency between peers in multi-root systems.
Work has been ongoing within the PCIe ecosystem to extend the PCIe specification to meet the needs of demanding embedded and communications applications. This work is progressing with a critical eye on implementations that enable the desired feature extensions without adding any burden to the extensive user base or requiring any changes to the existing ecosystem or usage models.
In May 2008, the PCI-SIG added multicast capability to the PCIe standard through an engineering change notice (ECN) to revision 2.0 of the PCIe base specification. This added capability provides powerful functionality for data movement and sharing among distributed system elements and removes a key barrier to adoption of PCIe as the primary system interconnect in demanding embedded and communications applications.
PCIe Multicast optimizes system resources and enables efficient data transmission to multiple system elements with reduced latency and increased coherency. Importantly, the implementation of PCIe Multicast provides these key system benefits as extensions to the existing PCIe specification while adding no burden and requiring no modification to the existing ecosystem or usage models.
System benefits of multicast
Multicast is defined as the delivery of data packets to a group of destinations simultaneously, conserving resources and system bandwidth by avoiding unnecessary data duplication.
In a system with distributed or replicated intelligence, such as those found in embedded and communications applications, multicast provides an efficient one-to-many data distribution mechanism. Examples include sending simultaneous boot-time or reset commands and images to shorten reset sequencing and system downtime, and simultaneously updating critical routing and policy-control information to ensure system data coherency.
System resource optimization through the reduction of overhead in transmitting the same data to multiple recipients via multicast is illustrated in Figure 1. In this simple model, a single PCIe switch without multicast is contrasted with one that supports multicast.
The multicast-equipped system/switch provides more efficient resource utilization by replacing four sequential, looped transactions with a single multicast transaction managed by the switch, allowing the initiating CPU to resume other tasks sooner. This efficiency can be leveraged to boost performance, as the system compute resources can take on additional tasks, or to reduce system cost and power, as fewer or lower-performance compute resources are needed.
In addition to optimizing system resources, converting looped unicast transactions into a single multicast transaction reduces delivery latency and increases coherency among system peers. Revisiting the simple model in Figure 1, and assuming the sequential transmission of data occurs as numbered, the fourth endpoint is out of date relative to the first endpoint until all interim iterations are complete.
In a single-root system, such a gap is relatively benign, as most data is transmitted by the single host and the four transactions depicted would complete before further action is taken. However, in systems with distributed intelligence and significant peer-to-peer traffic, the gap introduced by iterative unicast transactions creates potential data-ordering concerns: peer-to-peer packets may arrive at each endpoint at different times relative to that endpoint's receipt of the looped unicast packet.
Consider a scenario where the information being sent around the loop is a routing-table update for each of four network processing units (NPUs) on packet-processing line cards in a communications system. The table updates for each endpoint incur increasing latency based on sequential delivery and completion, and the resultant gaps allow the line cards "awaiting" their update to continue routing received packets based on the now out-of-date table.
Multicasting data additionally results in better link utilization throughout the system, often removing bottlenecks for increased performance or allowing the provisioning of smaller, more efficiently used links to save power and reduce the complexity of the board design. The PCIe multicast protocol, adhering to the definition mentioned earlier, makes copies of data only when “branches” are taken.
Figure 2 depicts a PCIe interconnect structure with multiple hops possible through the pair of PCIe switches. Sending identical, looped unicast data to Endpoints 1, 2, 3, and 4 as shown on the left, results in Links 1 and 4 being traversed multiple times by the same data. The system on the right leverages the PCIe multicast capability and transmits the data only once on each link with distribution managed by the multicast functionality in each case where the data would need to logically be copied.
Implementing PCIe multicast
As discussed earlier, the rapid and widespread adoption of the PCIe standard has led to a rich ecosystem and vast user base that often relies on previous versions of the specification. Therefore, PCIe Multicast, like any proposed extension of the specification, must not burden or negatively impact the existing ecosystem or usage models. Specifically, PCIe Multicast was defined so as not to require hardware modification for existing root complexes or endpoints, nor new transaction layer packet (TLP) formats.
Optimizing functionality subject to these constraints, PCIe Multicast is defined as an address-based multicast functionality that uses a segment of the common PCIe memory space and a simple programming model to route standard PCIe posted memory-write TLPs to multiple recipients in up to 64 multicast groups (MCGs).
Although an MCG can contain zero or one member, systems realize benefits only when an MCG has two or more members, and thus a PCIe switch (or series of switches) must provide the connectivity between the initiator and the MCG members. Multicast traffic can be initiated by any device in the PCIe hierarchy and transmitted to any number of participants attached to a switch port with a Multicast Capability Structure.
A switch with a Multicast Capability Structure for every port is capable of multicasting packets from any of its ports to any of its ports. Root complexes and endpoints can realize benefits from the inclusion of Multicast Capability Structures, but this is an optional feature for those devices.
Following system enumeration, which is not affected by the inclusion of the multicast capability, system software configures the multicast address space by opening a multicast window in the PCIe memory space beginning at the Multicast Window Base Address, as shown in Figure 3.
From the base address, the multicast window is configured as a contiguous address range that may be divided into equal-size sub-ranges, one for each of up to 64 supported MCGs. There is no practical limit on the size of the multicast window (it can span up to 2^63 bytes), and the number of groups in the window may be configured in the range of 1 to 64.
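The address arithmetic implied by this layout can be sketched in C. This is an illustrative model only: the register and function names below are hypothetical, and a real implementation would read the window base and sub-range size from the device's configured Multicast Capability Structure rather than from constants.

```c
#include <stdint.h>

/* Illustrative configuration values (not actual spec register names):
   each MCG sub-range is 2^20 bytes (1 MiB), 64 groups supported. */
#define MCG_SUBRANGE_SHIFT 20
#define MC_NUM_GROUPS      64

/* Multicast window base address chosen by system software;
   here placed above the 4-GB boundary for illustration. */
static const uint64_t mc_base = 0x200000000ULL;

/* First byte of MCG n's sub-range within the multicast window. */
uint64_t mcg_base(uint8_t n)
{
    return mc_base + ((uint64_t)n << MCG_SUBRANGE_SHIFT);
}

/* Recover the MCG number from an address that falls inside the window:
   subtract the window base, then take the sub-range index bits. */
uint8_t addr_to_mcg(uint64_t addr)
{
    return (uint8_t)((addr - mc_base) >> MCG_SUBRANGE_SHIFT);
}
```

A posted memory write targeting `mcg_base(n)` plus any offset inside the sub-range is thus delivered to every member of MCG *n*.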
PCIe devices in the system that support multicast are required to have a Multicast Capability Structure for each PCIe function enabled for multicast. In a PCIe switch, this means each switch port transmitting or receiving multicast data must implement a Multicast Capability Structure in its associated virtual PCI-to-PCI (P2P) bridge.
Within each Multicast Capability Structure, identically configured control registers maintain the following information: multicast window base address, number of MCGs, and MCG window size. Additionally, each capability structure holds an independently configured and maintained vector of 64 control bits that enable or disable receipt of TLPs from MCGs 0 through 63 and thus govern the membership of each PCIe function in each of the MCGs. The registers and control bits are readable and writable at any time during device operation.
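The per-function membership vector described above can be modeled as a 64-bit bitmap, one bit per MCG. The structure and function names in this sketch are illustrative, not taken from the specification:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative model of the per-function multicast-receive-enable
   vector: bit n enables receipt of TLPs addressed to MCG n (0..63). */
typedef struct {
    uint64_t mc_receive_enable;
} mc_cap_struct;

/* Register the function as a recipient for MCG mcg. */
void mc_enable_group(mc_cap_struct *cap, uint8_t mcg)
{
    cap->mc_receive_enable |= (1ULL << mcg);
}

/* Remove the function from MCG mcg. */
void mc_disable_group(mc_cap_struct *cap, uint8_t mcg)
{
    cap->mc_receive_enable &= ~(1ULL << mcg);
}

/* Test membership: may a TLP for MCG mcg be forwarded here? */
bool mc_is_member(const mc_cap_struct *cap, uint8_t mcg)
{
    return (cap->mc_receive_enable >> mcg) & 1;
}
```

Because the vector is writable at any time, software can add or drop a port from a group during operation, for example when a line card is hot-swapped.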
Transmission and routing of multicast PCIe TLPs vary slightly from unicast TLPs. These differences are best illustrated with the basic functional PCIe switch diagram in Figure 4. Upon receiving a TLP from the root, address decoding at the ingress port determines that the TLP is a multicast TLP (logically speaking, this is where the initial transaction becomes multiple transactions).
Decoded multicast TLPs without errors are forwarded to the switch's virtual PCI bus. Unlike unicast traffic, which has different routing rules depending upon whether the TLP was received on the primary or secondary side of the P2P bridge, multicast TLPs are symmetrically forwarded without regard to whether the P2P bridge is associated with an upstream or downstream port.
All switch ports (in other words, P2P bridge functions) connected to the virtual PCI bus receive the multicast TLP and examine the MCG ID in the address against the status of the bit in the multicast-receive-enable vector. The multicast-receive-enable vector indicates, on a per-MCG basis, if a P2P bridge function is allowed to forward the TLP toward its destination. This allows each P2P bridge function in the switch to register itself as a recipient of multicast TLPs on a per-group basis.
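The fan-out decision on the virtual PCI bus can be sketched as follows. This is a behavioral model, not switch RTL; the function name and the representation of ports as array indices are assumptions made for illustration:

```c
#include <stdint.h>

/* Behavioral sketch of multicast fan-out on the switch's virtual PCI
   bus. Each port's P2P bridge has its own 64-bit receive-enable
   vector; the TLP's MCG ID has already been decoded at ingress.
   Returns a bitmask of the ports that forward a copy of the TLP. */
uint32_t mc_fanout(const uint64_t port_enable[], int num_ports, uint8_t mcg)
{
    uint32_t egress = 0;
    for (int p = 0; p < num_ports; p++) {
        /* Forward only where the port has registered for this MCG. */
        if ((port_enable[p] >> mcg) & 1)
            egress |= 1u << p;
    }
    return egress;
}
```

Data is thus copied only at the ports where membership demands it, matching the "copies only when branches are taken" behavior described earlier.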
Once a P2P bridge function within the switch accepts a multicast TLP, it performs egress processing on the TLP. Egress processing of multicast TLPs varies depending on the capabilities of the link partner on the switch port associated with the P2P bridge. If the link partner is provisioned with a Multicast Capability Structure, as in the PCIe switch-to-switch transmission shown in Figure 5, the TLP is forwarded for further routing without modification. As previously noted, however, PCIe endpoints are not required to implement Multicast Capability Structures to receive multicast TLPs.
To support endpoints without multicast capabilities, system software must guarantee that the base address registers of the endpoint overlap some portion of the multicast address range or the PCIe switch must employ the optionally specified multicast overlay mechanism.
Because ensuring overlap between the endpoint's function address range and the multicast address range burdens the system designer and may require unique code bases for every product SKU, well-designed switch implementations include the multicast overlay feature within the Multicast Capability Structure on every switch port, enabling maximum flexibility and leveraging currently available endpoints.
The address overlay functionality, as depicted in Figure 6, is a mechanism that may be used to remap the address of a received multicast TLP from the multicast window to the endpoint's base address register (BAR) window. The address overlay is performed by the switch ports.
Each switch port may be configured with a different address overlay value to allow independent mapping into the BAR window associated with each endpoint. Conversion between 32-bit and 64-bit addresses is supported (for example, the multicast region may be located above the 4-GB boundary, and the endpoint's BAR may be below the 4-GB boundary, or vice-versa).
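The overlay operation amounts to replacing the upper bits of the multicast address with a per-port overlay base while preserving the low-order offset bits. The sketch below assumes a power-of-two overlay region; the function and parameter names are illustrative, not the specification's field names:

```c
#include <stdint.h>

/* Illustrative multicast address overlay: substitute the upper bits
   of the incoming multicast address with the egress port's overlay
   base so the TLP lands in the endpoint's BAR window. overlay_size
   is the number of low-order address bits preserved (i.e., the
   overlay region spans 2^overlay_size bytes). */
uint64_t mc_overlay(uint64_t mc_addr, uint64_t overlay_bar,
                    unsigned overlay_size)
{
    uint64_t low_mask = (1ULL << overlay_size) - 1;

    /* Keep the offset within the region, replace everything above. */
    return (overlay_bar & ~low_mask) | (mc_addr & low_mask);
}
```

Because the substitution covers the full upper address, a multicast window above the 4-GB boundary maps cleanly onto a 32-bit endpoint BAR below it, and vice versa, as the article notes.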
The recent addition of multicast capability to PCIe provides a necessary extension that enables optimized system resource utilization, increased performance through decreased system latency, and efficient, coherent data transmission between peers in multi-root embedded and communications systems. Defined with a critical eye toward creating no burden for the existing PCIe ecosystem or user base, the extended functionality requires no changes to root or endpoint hardware and no new TLP formats. The address-based PCIe multicast implementation, with its simple programming model, is enabled within a PCIe switch and offers functionality and flexibility well beyond previously attempted proprietary implementations, such as dualcasting.
Matt Jones is a product marketing manager in the Enterprise Computing Division at IDT. Matt holds a BS in electrical engineering and a BA in economics from Stanford University.