PCI Express (PCIe), like the legacy PCI bus it evolved from, was architected to serve as a simple DMA I/O subsystem for a single host processor.
And like PCI, it's already being used in a much wider variety of applications and usage models, many of which require support for multiple processors. Not long ago, only the non-transparent bridge was available to address this need, but a number of alternatives are now available.
Address-translation capabilities now available in some root complexes make the crosslink at least marginally useful for host-to-host communications. Additionally, there are now embedded processors with native PCIe interfaces that, in effect, include a non-transparent bridge.
Furthermore, on the horizon is the PCI-SIG standard for multi-root shared I/O and multi-root I/O virtualization, called MR IOV, which is easily extended to support host-to-host communications.
At its inception, the crosslink was the only option in the PCIe specification that began to address the problem of interconnecting host domains. The crosslink is a physical layer option that works around the need for a defined upstream/downstream direction in the link training process.
The crosslink option allows a link to take the lead upstream role in training if, after a random delay, training hasn't been initiated by its link partner. This means that the link will come up at the physical layer, but that alone doesn't guarantee the ability to communicate at the transaction layer.
|Figure 1: Crosslink Connecting Two Host Domains|
The ability to communicate through a crosslink depends upon having different, non-overlapping address and Requester ID (RID) maps on each side of the link. Only addresses that don't correspond to physical devices or memory in the domain on one side will be routed through the crosslink.
Furthermore, the address or ID must fit within the ranges decoded by the base and limit registers of the PCI-to-PCI bridge on the near side of the crosslink. Since it is normal practice to assign a Requester ID of zero to transactions originating at a root complex and to map physical memory starting at a base address of zero, the usefulness of a crosslink can be severely constrained.
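The routing rule just described can be sketched in a few lines of Python. This is a minimal model, not a description of any real bridge: the `Window` class, the function name, and the example memory map (2 GB of local memory at zero, a crosslink window at 4 GB) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Window:
    """An inclusive address range, as decoded by base and limit registers."""
    base: int
    limit: int

    def claims(self, addr: int) -> bool:
        return self.base <= addr <= self.limit


def forwarded_through_crosslink(addr: int,
                                local_ranges: Sequence[Window],
                                crosslink_window: Window) -> bool:
    """Return True if a request to addr would leave through the crosslink."""
    # Anything that decodes to local memory or a local device stays local.
    if any(r.claims(addr) for r in local_ranges):
        return False
    # Otherwise the near-side bridge must claim the address.
    return crosslink_window.claims(addr)


# Hypothetical map: 2 GB of local memory at zero, crosslink window at 4 GB.
LOCAL = [Window(0x0000_0000, 0x7FFF_FFFF)]
CROSSLINK = Window(0x1_0000_0000, 0x1_7FFF_FFFF)
```

With this map, a request to `0x1_0000_1000` is forwarded, while a request to `0x1000` stays local. If the far host also maps its memory from zero, its memory at `0x1000` cannot be reached without some form of translation.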
Allowing two processors with identical memory maps to communicate requires translation. Root complexes capable of translating an address coming up from the PCIe tree are being defined to support I/O virtualization.
With this capability in place, an alias can be established for each processor's memory in the address space of the other processor. The alias is then configured into the base and limit registers of the red and blue bridges in Figure 1 above. When a packet containing the alias address reaches the RC, it is translated there to an address within the local memory. With address translation alone, each processor can write to, but can't read from, the other's memory.
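The alias translation performed at the root complex amounts to a bounds check plus an offset. The sketch below assumes a single contiguous alias window; the `AliasWindow` name and the example addresses are hypothetical and do not correspond to any particular root complex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AliasWindow:
    alias_base: int  # where the remote host sees this memory (assumed value)
    size: int        # window size in bytes
    local_base: int  # start of the local memory behind the alias

    def translate(self, addr: int) -> int:
        """Map an incoming alias address to a local memory address."""
        offset = addr - self.alias_base
        if not 0 <= offset < self.size:
            raise ValueError(f"address {addr:#x} is outside the alias window")
        return self.local_base + offset


# Hypothetical setup: 256 MB of local memory at zero, aliased at 8 GB.
window = AliasWindow(alias_base=0x2_0000_0000, size=0x1000_0000, local_base=0x0)
```

Here a write arriving at `0x2_0000_1234` lands at local address `0x1234`. Nothing in this step touches the Requester ID, which is why reads remain a problem.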
If both processors use a RID of zero and a read request is sent through the crosslink by one processor, its completion will be routed to the other processor and result in an unexpected completion error. We must look to non-transparent bridging (NTB) for the requisite address and RID translation that allows unrestricted communications between host domains.
Given the shortcomings of the crosslink and the legacy of NTB on PCI, it's only natural that NTB would be re-invented for PCIe. A non-transparent bridge is two PCIe endpoints connected back to back such that the base address registers (BARs) in each endpoint are used to create apertures into the address space on the far side of the other endpoint, as shown in Figure 2 below.
Non-transparent bridges are being widely used to support mirroring for storage applications, dual-host/failover for embedded and communications systems, and fabric interfaces for intelligent adapters, and less widely to create bladed compute servers. Address and RID translations on packets passing through the non-transparent bridge remove all the communications restrictions described above for a simple crosslink.
|Figure 2: The Non-transparent Bridge|
The PCIe interface of each side of the non-transparent bridge is defined by the PCIe specification of its Type 0 CSR header. This specification allows as many as six 32-bit BARs to be implemented, and they may be used in pairs to create 64-bit BARs.
Industry practice, as exemplified by PLX Technology's family of PCIe switches that contain a single non-transparent bridge switch port, is to use the first pair of BARs to provide memory and/or I/O mapped access to the non-transparent bridge registers themselves, and the remaining two BAR pairs to create inter-domain windows.
In addition to the Type 0 CSR header registers, a non-transparent bridge contains setup and translation registers for defining the size and address translation associated with each aperture, Requester ID translation lookup tables, and scratchpad and doorbell registers to aid interprocessor communications.
In operation, packets that pass through a non-transparent bridge receive both an address translation and a RID translation. In the latter step, the bridge substitutes its own ID for the original RID in a request packet, then restores the original RID if and when a completion packet returns through the bridge. The simplest form of the RID translation algorithm supports eight requesters on the local side of the bridge and 32 total fabric nodes.
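The substitute-and-restore behavior of RID translation can be sketched as a small lookup table. The 16-bit Requester ID layout (8-bit bus, 5-bit device, 3-bit function) is standard PCIe. Everything else here is an assumption for illustration, in particular the choice to carry the table index in the function field of the bridge's own ID, which conveniently yields eight entries. Real bridges may encode the index differently.

```python
def make_rid(bus: int, dev: int, fn: int) -> int:
    """Pack a 16-bit Requester ID: 8-bit bus, 5-bit device, 3-bit function."""
    return (bus << 8) | (dev << 3) | fn


class RidTranslator:
    """Hypothetical eight-entry RID lookup table for one side of an NTB."""

    SLOTS = 8  # one slot per value of the 3-bit function field (assumption)

    def __init__(self, bridge_bus: int, bridge_dev: int) -> None:
        self.bridge_bus = bridge_bus
        self.bridge_dev = bridge_dev
        self.table = [None] * self.SLOTS

    def translate_request(self, original_rid: int) -> int:
        """Replace a requester's RID with the bridge's ID plus a table index."""
        if original_rid in self.table:
            index = self.table.index(original_rid)
        elif None in self.table:
            index = self.table.index(None)
            self.table[index] = original_rid
        else:
            raise LookupError("RID lookup table is full")
        return make_rid(self.bridge_bus, self.bridge_dev, index)

    def restore_completion(self, translated_rid: int) -> int:
        """Recover the original RID for a completion returning via the bridge."""
        original = self.table[translated_rid & 0x7]
        if original is None:
            raise LookupError("no requester recorded for this completion")
        return original
```

Because the far domain only ever sees the bridge's ID, two hosts that both use a RID of zero no longer collide: each completion is routed back to the bridge, which restores the original requester.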
Processors with Native Endpoint Interface
It's always been possible to utilize multiple processors within a single PCIe hierarchy, provided, of course, that all processors except one utilize an endpoint instead of a root complex interface and only that one host processor sends configuration space transactions into the PCIe fabric.
Until recently, processors with integrated NTB didn't exist. Now, both the Freescale MPC8641 and the AMCC PowerPC 440SPe include native PCIe interfaces that can operate as either an RC or an EP. With these, you can build PCIe backplanes with a processor in every slot without requiring non-transparent bridging in the PCIe fabric.
Limitations of Non-Transparent Bridging
Regardless of whether the non-transparent bridge is integrated in the processor or on the switch, the technique creates a single global address and ID space that is the extension of its host's spaces. Each non-transparent bridge acts as a fence, isolating the processor domain behind it and terminating discovery operations from both sides.
For dual-host and intelligent adapter applications, this is exactly what is required. For blade servers, the non-transparency is an impediment that must be worked around with software. Blade servers using PCI Express need to be able to configure I/O devices reached via the fabric.
Doing this through a non-transparent bridge requires that configuration space transactions be trapped and converted to memory space operations after a management entity has created a global memory map containing all device CSRs.
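One way to picture the workaround is a table that assigns each function's configuration space a slot in a global memory map, so that a trapped configuration access can be reissued as a memory access. The sketch below is purely illustrative: the `GlobalCsrMap` class, its slot allocation scheme, and the base address are assumptions, and only the 4 KB size of PCIe configuration space per function is taken from the PCIe specification.

```python
CONFIG_SPACE_BYTES = 4096  # PCIe configuration space per function


class GlobalCsrMap:
    """Hypothetical global memory map of device CSRs built by a manager."""

    def __init__(self, base: int) -> None:
        self.base = base
        self._slots = {}  # (domain, bus, dev, fn) -> slot index

    def add_function(self, domain: int, bus: int, dev: int, fn: int) -> int:
        """Assign the function a 4 KB slot and return its memory address."""
        key = (domain, bus, dev, fn)
        if key not in self._slots:
            self._slots[key] = len(self._slots)
        return self.base + self._slots[key] * CONFIG_SPACE_BYTES

    def config_to_memory(self, domain: int, bus: int, dev: int, fn: int,
                         register: int) -> int:
        """Convert a trapped configuration access into a memory address."""
        if not 0 <= register < CONFIG_SPACE_BYTES:
            raise ValueError("register offset outside configuration space")
        slot = self._slots[(domain, bus, dev, fn)]  # KeyError if unmapped
        return self.base + slot * CONFIG_SPACE_BYTES + register
```

In this model, software on a blade that tries to read register `0x10` of a device in another domain would have the access trapped and replaced by a memory read at the address returned by `config_to_memory`.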
The Partitioned Switch
A different approach to supporting multiple processors is to partition a switch into two or more isolated host domains. A single larger partitioned switch functions exactly like multiple smaller switches.
Use of a partitioned switch can provide cost, power, and board area savings when a system contains multiple host domains that do not need to communicate via PCIe. Flexibility and system resiliency are improved when partitions are programmable.
Such switches have application in embedded backplanes, such as MicroTCA, allowing multiple hosts to be connected with their I/O devices over a common backplane. Partitioning provides some of the advantages of MR IOV, but does not allow I/O sharing. NTB can be provided between host partitions for host-to-host communications and failover.
The MR IOV Specification
The PCI-SIG's draft MR IOV specification defines a (nearly) software-transparent multi-processor fabric that supports the sharing of I/O devices among multiple root complexes and the virtualization of I/O. It does this by adding a doubleword (DW) to packet headers on the links between switches and between switches and endpoints that identifies the virtual hierarchy (e.g., the host domain) that the packet is traversing.
In this approach, the virtual hierarchy of each host extends through the fabric to each endpoint. An MRA IOV (multi-root aware IOV) switch contains a virtual fan-out switch rooted at each host port. Transaction layer packets are routed between RC and EP without change, allowing their ECRC to remain a complete end-to-end data integrity check.
|Figure 3: MR IOV Fabric Showing Multiple Virtual Switches in an MR IOV Switch|
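The idea of tagging each packet with its virtual hierarchy, while leaving the transaction layer packet itself untouched, can be sketched as follows. The field layout is invented for illustration and is not the MR IOV wire format: the sketch simply prepends one big-endian doubleword holding a hypothetical 8-bit virtual hierarchy number, and the function names are likewise assumptions.

```python
import struct
from typing import Tuple


def add_vh_prefix(vh: int, tlp: bytes) -> bytes:
    """Prepend one doubleword carrying the virtual hierarchy number."""
    if not 0 <= vh <= 0xFF:
        raise ValueError("virtual hierarchy number out of range")
    return struct.pack(">I", vh) + tlp


def strip_vh_prefix(packet: bytes) -> Tuple[int, bytes]:
    """Split a tagged packet into its VH number and the original TLP."""
    if len(packet) < 4:
        raise ValueError("packet is too short to carry a prefix")
    (vh,) = struct.unpack(">I", packet[:4])
    return vh, packet[4:]
```

A switch in this model would use the recovered number to select the virtual fan-out switch for that host. Since the bytes after the prefix come back unmodified, an ECRC computed over the original packet would still verify at the far end.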
The compelling advantage of MR IOV as a blade server backplane is that it allows IOV devices, particularly those for storage and networking, to be removed from each blade and replaced with single shared interfaces reached by a software-transparent PCIe fabric. This replaces separate storage and network fabrics with a single PCIe fabric, and eliminates the cost, power, and board area of separate I/O adapters on each blade.
The MR fabric is software-transparent only after a management and configuration agent called MR PCIM has configured the fabric and coordinated the sharing of I/O devices by dealing with resource assignment and allocation. Once this is done, each host may enumerate its own virtual hierarchy and configure the virtual I/O devices allocated to it by MR PCIM using standard PCIe enumeration software. Device drivers may need minor tweaks but no architectural changes. A virtualization intermediary may be required on each blade.
Host to Host with MR IOV
The MR IOV specification doesn't specify a mechanism for host-to-host communications. This complex and controversial feature was omitted in order to speed closure on the remainder of the specification. PLX Technology proposes to support a global shared address space for host-to-host communications in its future MR IOV switches.
This will allow individual blades to map their local memory into the global space, with protection provided by an Address Translation and Protection Table in their North Bridges and by support of the Access Control Services ECN in the switch.
The switch will provide the address decoding resources to create a global memory map, such as that shown in Figure 4 below, and NTB-like address and requester ID translations for requests that pass between host domains. The global shared space is positioned well above that used for memory mapped I/O, allowing it to coexist with shared and virtualized I/O devices in the same backplane.
|Figure 4: Global Shared Memory Space for Host-to-Host Communications|
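The decode step implied by such a global map can be sketched with simple arithmetic. The layout below is an assumption rather than the layout in the figure: it places the shared space at 1 TB, well above memory mapped I/O, and gives each host an equal 4 GB window. The constants and the function name are hypothetical.

```python
from typing import Optional, Tuple

GLOBAL_BASE = 1 << 40   # assumed start of the global shared space (1 TB)
WINDOW_BYTES = 1 << 32  # assumed window size per host (4 GB)


def decode_global(addr: int, num_hosts: int) -> Optional[Tuple[int, int]]:
    """Return (host index, offset in that host's window), or None.

    None means the address lies outside the global shared space and is
    handled by ordinary memory or memory mapped I/O decoding instead.
    """
    offset = addr - GLOBAL_BASE
    if not 0 <= offset < num_hosts * WINDOW_BYTES:
        return None
    return divmod(offset, WINDOW_BYTES)
```

For a four-host system, `decode_global(GLOBAL_BASE + 2 * WINDOW_BYTES + 0x100, 4)` selects host 2 at offset `0x100`. The NTB-like address and requester ID translations would then be applied before the request enters that host's domain.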
The shared memory support implemented in the switch enables host-to-host communications. Higher-level protocols can be built on top of this. The Geneseo initiative of the PCI-SIG will provide a co-processor/accelerator interface built upon a shared memory model. DMA controllers available in some North Bridges can be harnessed to implement a messaging system, perhaps employing an Ethernet-like API.
With a mix of standard and proprietary extensions, PCIe now spans the range from a uni-processor I/O interconnect for desktop systems to a backplane fabric supporting multiple processors for communications, computation and control applications. It is the only serial interconnect needed to support “inside the box” designs.
Jack Regula is chief technology officer at PLX Technology, a maker of PCIe and other standard interconnect products. An inventor of early switch-fabric technology, Jack has delivered several technical presentations at industry conferences and authored multiple technical articles in leading trade publications.