In embedded designs, enterprise systems and data centers, managers typically look for ways to optimize the use and improve the performance of capital equipment. In the recent past, we have seen the introduction of several methodologies intended to maximize available system resources, such as cloud computing, grid computing, utility computing, cluster computing, server virtualization, and I/O sharing.
Whether it be compute power, storage capacity or bandwidth of the interconnect devices on the switch fabric, each one of these methodologies attempts to get more for less.
The majority of the current network server infrastructure is based on x86 server architecture and RISC CPUs, Ethernet and Fibre Channel (FC) switches, and FC/Ethernet I/O devices. PCI Express (PCIe), which has been commonly perceived as a chip-to-chip single-host interconnect technology, is quietly making headway into switch fabrics that can replace current server and storage fabrics.
This article will discuss how the introduction of new PCIe multi-root (MR) switches impacts the development of future switch fabrics for enterprise systems and data centers.
Let's look at the “pressure points” of an information technology professional tasked to serve a broad client base, making it necessary to administer all of his/her clients' applications, as well as ensure efficient management and maintenance of the data center equipment.
In data centers, servers and storage systems are connected together and to the outside world through routers, switches and directors. Traditionally, data centers use dedicated servers for each application and dedicated switching systems for each traffic type. However, most applications do not run 24/7, leaving servers and switches underutilized.
This traditional method raises several concerns, such as a low rate of return on expensive equipment, highly inefficient use of power caused by underutilized machines, and the large amount of valuable physical space required by these data centers. Running dedicated servers for each individual application/service and a segregated network for each poses another set of challenges from a management perspective, as each application/service calls for a unique skill set.
Ideally, system managers would like to have flexibility to access any host, I/O device, application, database or other resource instantly and reliably. Although we are not quite there yet, capabilities to achieve this goal are being realized in small increments. For example, software-based virtualization techniques started to roll out in 2007 and are now broadly deployed.
This technique, as illustrated in Figure 1 below, enables the use of a single host for multiple applications. It also allows a single task to be split and executed on multiple host CPUs, or tasks to be moved from one CPU to another. Figure 1 illustrates an environment running multiple system images (SIs) and applications on a single host CPU.
|Figure 1. Software-based virtualization techniques|
Although each server blade physically houses only one network interface card (NIC) and one host bus adapter (HBA), the software creates multiple virtual NICs and HBAs for each interconnect technology (Ethernet, FC, etc.).
This works well for serving multiple applications on each server blade but burdens the host CPUs with the task of running multiple virtual NICs and HBAs. Additionally, each server blade still uses a dedicated NIC and HBA as required by the traditional approach.
PCI Express MR-IOV Solution
As noted earlier, PCIe has been perceived as a single-host interconnect technology. But thanks to the recent release of new standards and off-the-shelf silicon solutions, PCIe is now a feasible solution in multi-host systems as a switch fabric technology for data centers and enterprise IT applications. The presence of native PCIe interfaces (ports) in all servers and I/O sub-systems makes PCIe a very attractive candidate for future switch fabrics in small and mid-sized clusters.
In 2007, the PCI-SIG (the consortium responsible for PCI, PCI-X, and PCIe) released the Single-Root I/O Virtualization (SR-IOV) specification that enables the sharing of a single physical resource such as a NIC or HBA in a PCIe system among multiple SIs running on one host.
This requires I/O vendors to develop devices that support the SR-IOV specification. This is the simplest approach to sharing resources or I/O devices among different applications. It solves the problem of the CPU being burdened by having to run the software for virtual I/Os. Instead, it moves the burden to the SR-IOV-capable endpoints.
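As a concrete illustration of what "SR-IOV-capable endpoint" means to host software, the sketch below checks how many virtual functions a device offers. It assumes a Linux host, where the kernel exposes the `sriov_totalvfs` and `sriov_numvfs` attributes in a PCI device's sysfs directory; the helper name `sriov_capacity` and the device address in the usage note are hypothetical, and other operating systems expose this information differently.

```python
from pathlib import Path


def sriov_capacity(device_dir):
    """Return (enabled_vfs, total_vfs) for one PCI device directory.

    Returns None when the directory lacks the SR-IOV attributes,
    which is what a legacy (non-SR-IOV) endpoint looks like.
    Assumes the Linux sysfs layout; not portable to other systems.
    """
    dev = Path(device_dir)
    total_file = dev / "sriov_totalvfs"   # max virtual functions the device supports
    enabled_file = dev / "sriov_numvfs"   # virtual functions currently enabled
    if not (total_file.is_file() and enabled_file.is_file()):
        return None
    enabled = int(enabled_file.read_text().strip())
    total = int(total_file.read_text().strip())
    return enabled, total
```

On a real system the function would be called with a path such as `/sys/bus/pci/devices/0000:03:00.0` (an illustrative address only). Each enabled virtual function can then be handed to a different system image, which is the sharing model the SR-IOV specification describes.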
Last year, the PCI-SIG completed work on its Multi-Root I/O Virtualization (MR-IOV) specification that extends the use of PCIe technology from a single-host domain to a multi-root (i.e., multi-host) domain. The PCI and PCIe standards are developed around a single-host model where one host controls and uses all I/Os in its domain.
The MR-IOV specification enables the use of a single I/O device by multiple hosts and multiple system images simultaneously, as illustrated in Figure 2 below. This illustration shows a multi-host environment where a single MR-IOV-capable NIC and HBA are shared across multiple systems via an MR-IOV switch.
|Figure 2. MR-IOV specification enables the use of a single I/O device by multiple hosts and multiple system images simultaneously.|
Such a solution would unify the servers, switch fabrics and I/Os used in data center applications. In this approach, servers would run multiple applications and SIs, MR-IOV-capable PCIe switches would provide connectivity, and MR-IOV-capable endpoints would provide the sharing and virtualization of I/O resources.
In order to implement MR-IOV specifications, three components of the system need to be developed: MR-IOV PCIe switches, endpoints, and management software. All three of these components must be available simultaneously and work seamlessly.
Figure 3 below illustrates an example of a PCIe MR-IOV switch. This switch would allow association of a switch downstream (or endpoint) port to any of the host ports. The endpoint connected to a downstream port can be legacy, SR-IOV- or MR-IOV-capable, but the endpoints servicing multiple hosts must be MR-IOV-compliant.
|Figure 3. Example of a PCIe MR-IOV switch|
Unfortunately, the complexity and manageability of the MR-IOV switches and endpoints make the implementation of the MR-IOV standard very difficult, as no single vendor has the capabilities and the commitment to develop a completely seamless solution.
Unless the system (software) vendors, PCIe switch vendors, and PCIe I/O (endpoint) developers collaborate and commit resources to develop and deliver an interoperable solution, we will not see an MR-IOV implementation anytime soon.
MR PCIe Switch Solution
While the market waited for an MR-IOV solution, PCIe switch vendors created MR PCIe switches that implement a subset of the features MR-IOV offers, which some industry experts have dubbed a “poor man's MR-IOV.” (In our case, PCIe Gen 2 MR switches range from 96 lanes and 24 ports to 48 lanes and 12 ports, which gives system designers quite a bit of latitude in MR implementation.)
This new generation of MR switches allows multiple hosts to be connected to a single switching device, which can be partitioned under user control in such a way that each host will be connected to a desired set of downstream ports of the switch.
In this MR switch solution, each host operates independently of other hosts and controls I/Os in its domain without seeing or impacting traffic associated with the other hosts. Figure 4 below illustrates the internal architecture of an MR switch where particular sets of downstream ports are associated to particular host ports through partitioning (as shown by the dotted lines in Figure 4).
|Figure 4. Internal architecture of an MR switch|
The internal peer-to-peer (P2P) bridges, virtual bus, and downstream ports for a specific host port are completely isolated from the other host ports and their associated P2P bridges. These switches can either be used as standard PCI Express switches or partitioned into up to eight independent switches under user control (configuration).
The switches can be configured through in-band configuration by the default host, the I2C interface or an optional EEPROM download. Some common configurations are also supported through strapping options.
In order to take advantage of the advanced features of the MR switch, the user must designate one of the host or I2C ports as the manager of the switch. Although the configuration of the switch partitions is done statically at boot time, the user has the ability to change the partitions of the switch or move ports from one partition to another under management control.
However, caution must be exercised when the port associations are being changed, as some requests or data may still be in transit. Once a port has been moved from one partition to the other, the new host must enumerate the device into its domain.
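The partitioning rules described above can be captured in a small model. The following Python sketch is a toy illustration, not a vendor API: the names `MRSwitchModel`, `Partition`, `move_port`, and the `in_flight` counter are all hypothetical. The only details taken from the text are the limit of eight partitions, the isolation of each host's downstream ports, the caution about traffic in transit, and the need for the new host to enumerate a moved port.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Partition:
    """One independent virtual switch: a host port and its downstream ports."""
    host_port: int
    downstream: Set[int] = field(default_factory=set)


class MRSwitchModel:
    """Toy model of a partitionable multi-root switch (hypothetical names)."""

    MAX_PARTITIONS = 8  # the text states up to eight independent switches

    def __init__(self) -> None:
        self.partitions: Dict[int, Partition] = {}
        self.needs_enumeration: Set[int] = set()

    def owner_of(self, port: int) -> Optional[int]:
        """Return the host port that currently owns a downstream port."""
        for host, part in self.partitions.items():
            if port in part.downstream:
                return host
        return None

    def add_partition(self, host_port: int, downstream: Set[int]) -> None:
        if len(self.partitions) >= self.MAX_PARTITIONS:
            raise ValueError("at most eight partitions are supported")
        if host_port in self.partitions:
            raise ValueError("host port already has a partition")
        for port in downstream:
            # A downstream port belongs to exactly one host domain.
            if self.owner_of(port) is not None:
                raise ValueError(f"downstream port {port} is already assigned")
        self.partitions[host_port] = Partition(host_port, set(downstream))

    def move_port(self, port: int, new_host: int, in_flight: int = 0) -> None:
        """Reassign a downstream port; refuse while traffic is still pending."""
        if in_flight:
            raise RuntimeError("drain outstanding requests before moving the port")
        old_host = self.owner_of(port)
        if old_host is None or new_host not in self.partitions:
            raise KeyError("unknown port or partition")
        self.partitions[old_host].downstream.discard(port)
        self.partitions[new_host].downstream.add(port)
        # The new host must enumerate the device into its own domain.
        self.needs_enumeration.add(port)
```

In this model, a manager would create one partition per host at boot and later call `move_port` only after confirming that no requests for that port are outstanding. A real switch exposes these operations through its configuration registers rather than through method calls.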
The reliability and redundancy of a system are extremely important in today's computing infrastructure. These MR switches offer two levels of redundancy or failover. In the normal PCIe switch mode, these devices offer an on-chip standard NT port that allows a secondary (active or passive) host device for failover.
In MR switch mode, 1+1 or N+1 failover capabilities are provided to support high availability and redundancy. In a 1+1 failover configuration, two hosts are paired to provide either active-active (both hosts are active) or active-passive (one host is active while the other waits in standby mode until the primary host fails) failover capability. This can be repeated to create multiple 1+1 host failover ports, as illustrated in Figure 5 below.
|Figure 5. Multiple 1+1 host failover ports|
In N+1 failover mode, one host may be configured as (active or passive) backup for N number of hosts, as illustrated in Figure 6 below. The user can also configure a backup host for the host that has been designated as the backup for the rest of the hosts, providing an additional layer of protection: essentially backing up your backup.
In both N+1 and 1+1 failover, hosts exchange heartbeat and status information through mailbox or scratchpad registers designated for this purpose. Heartbeat and scratchpad registers are readable from both sides and they can be accessed as memory or I/O.
These registers provide a mechanism to generate software-controlled interrupts for failover event generation. When failover occurs, the host port of the failing switch partition is disabled, its downstream ports are moved to the backup host's partition, and a failover event interrupt is generated. The backup host will then be able to enumerate the new downstream ports and associated endpoints.
|Figure 6. In N+1 failover mode, one host may be configured as (active or passive) backup for N number of hosts.|
It is important to note that while the failover mechanism is being executed with the MR switch, the other partitions of the switch continue to operate normally with no traffic interruption or status change. Also, the original downstream ports of the backup host will continue to operate normally while ports from the failing hosts are being moved from one host domain to the other.
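The failover sequence can be sketched in software terms. The Python below is a minimal model under stated assumptions, not the switch's actual register interface: `HeartbeatMonitor`, `fail_over`, the counter-style heartbeat, and the `missed_limit` threshold are hypothetical choices made for illustration. What it takes from the text is the sequence itself: a heartbeat read from a shared scratchpad, the failing host's port being disabled, its downstream ports moving to the backup's partition, and all other partitions staying untouched.

```python
from typing import Dict, Optional, Set


class HeartbeatMonitor:
    """Backup-side watcher for a heartbeat counter in a shared scratchpad."""

    def __init__(self, missed_limit: int = 3) -> None:
        self.missed_limit = missed_limit      # stale polls tolerated (assumed value)
        self.last_seen: Optional[int] = None
        self.missed = 0

    def poll(self, scratchpad_value: int) -> bool:
        """Record one scratchpad read; return True once the peer looks failed."""
        if scratchpad_value != self.last_seen:
            # The primary host advanced its counter, so it is alive.
            self.last_seen = scratchpad_value
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.missed_limit


def fail_over(partitions: Dict[int, Set[int]],
              failed_host: int, backup_host: int) -> Set[int]:
    """Move the failed host's downstream ports into the backup's partition.

    `partitions` maps a host port to its set of downstream ports.
    Returns the ports the backup host must now enumerate.
    """
    moved = partitions.pop(failed_host)       # failing host port is disabled
    partitions[backup_host] |= moved          # backup keeps its original ports too
    return moved
```

In an N+1 arrangement, the backup host would run one such monitor per protected host and call `fail_over` for whichever one stops updating its heartbeat. The dictionary entries for the remaining hosts are never modified, which mirrors the isolation between partitions described above.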
The PCIe efforts of PLX and other silicon vendors have enhanced system performance and reduced the acquisition and operating costs of embedded systems. With MR switches, these vendors have created a practical solution to address multi-host, failover, and I/O sharing needs.
The industries and applications waiting for a full MR-IOV implementation do not have to wait any longer. Instead, they can start implementing key MR-IOV functions through MR switches today. These high-performance switches will allow system developers to service embedded systems' needs more effectively and efficiently than the traditional methods being used in today's systems.
Akber Kazmi is marketing director for PCI Express switching products at PLX Technology.