Some hypervisors, including Xen and VMware ESX Server, discussed briefly in Part 1 of this series of articles, allow the direct assignment of PCI devices to Guest VMs. This is a relatively small extension to the techniques needed to run device drivers inside the Management VM. Assigning PCI devices directly to Guest VMs eliminates the remaining overhead and added latencies of the MQ NIC IO virtualization approach.
Figure 3, below, shows the common architecture for how hypervisors support PCI device assignment: the Hypervisor provides mechanisms to directly access a PCI device's hardware resources, and the Management VM provides a way for Guest VMs to discover the PCI devices assigned to them and their associated resources.
Device discovery by a Guest VM is typically achieved by providing a virtual PCI bus. The Management VM normally owns the physical PCI buses and enumerates all physical devices attached to them. If a PCI device is assigned to a Guest VM, it is enumerated on a virtual PCI (vPCI) bus exported to the Guest VM.
This allows the guest to access the PCI configuration space of the device assigned to it. Importantly, all PCI configuration space accesses by a Guest VM are transferred to the Management VM, which can either pass them through to the device, intercept and emulate them, or discard them. This also allows the Management VM to enable or configure hardware resources required by the Guest VM to use the device.
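The Management VM's handling of intercepted configuration space accesses can be pictured as a small dispatch: for each access it decides whether to pass it through, emulate it, or discard it. The offsets and the policy split below are a hypothetical sketch, not taken from any particular hypervisor:

```c
#include <stdint.h>

/* Illustrative policy for a Guest VM's writes to an assigned device's
 * PCI configuration space. The chosen offsets and actions are examples
 * only; a real Management VM applies a per-device, per-register policy. */
enum cfg_action { CFG_PASS_THROUGH, CFG_EMULATE, CFG_DISCARD };

enum cfg_action classify_cfg_write(uint16_t offset)
{
    if (offset < 0x04)                    /* Vendor/Device ID: read-only */
        return CFG_DISCARD;
    if (offset >= 0x10 && offset < 0x28)  /* BARs: emulate, so the guest
                                             sees virtualized addresses */
        return CFG_EMULATE;
    return CFG_PASS_THROUGH;              /* everything else reaches hardware */
}
```

Emulating the BAR range is what lets the Management VM present virtual addresses to the guest while keeping control of the device's real resources.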
Figure 3: PCI device assignment. Guest VMs can directly access hardware devices, eliminating all IO virtualization overheads.
There are three different types of hardware resources a Guest VM must have access to in order to run a device driver for a physical device: device memory, device IO ports, and device interrupts.
The first two, device memory and IO ports, are described in the device's PCI configuration space as Base Address Registers (BARs). In order for a Guest VM to access device memory, the Management VM instructs the Hypervisor that a given Guest VM is allowed to map the physical addresses at which the device memory is located into its virtual address space.
The Hypervisor can use the memory protection provided by the CPU's MMU (Memory Management Unit) to enforce that a Guest VM only accesses the device memory belonging to the assigned device. Access to IO ports can be restricted in a similar way using the Task State Segment (TSS) on x86 processors.
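Decoding a BAR to find device memory or IO ports follows directly from the PCI specification: bit 0 distinguishes IO-port BARs from memory BARs, and the low flag bits are masked off to recover the base address. A minimal sketch for 32-bit BARs:

```c
#include <stdint.h>
#include <stdbool.h>

/* Decode a 32-bit PCI Base Address Register. Per the PCI specification,
 * bit 0 selects IO space (1) vs. memory space (0); for memory BARs,
 * bit 3 indicates prefetchable memory and bits 3:0 are flag bits. */
typedef struct {
    bool     is_io;        /* true: IO ports, false: device memory */
    bool     prefetchable; /* meaningful for memory BARs only */
    uint32_t base;         /* base address with flag bits masked off */
} bar_info;

bar_info decode_bar(uint32_t bar)
{
    bar_info b;
    b.is_io = bar & 0x1;
    if (b.is_io) {
        b.prefetchable = false;
        b.base = bar & ~0x3u;   /* IO BAR: bits 1:0 are flags */
    } else {
        b.prefetchable = (bar >> 3) & 0x1;
        b.base = bar & ~0xFu;   /* memory BAR: bits 3:0 are flags */
    }
    return b;
}
```

The Management VM performs exactly this kind of decoding when it works out which physical address ranges a Guest VM may be allowed to map.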
Physical interrupts originating from a device need to be handled by the Hypervisor, as interrupts are only delivered to the most privileged software entity. The Hypervisor then virtualizes the physical interrupts and delivers them to the Guest VMs.
In order to reduce interrupt latencies, it is important that physical interrupts are delivered to the same CPU core that the destination Guest VM is using to handle the resulting virtual interrupt.
In the MQ section above we argued that descriptors need to be passed through the Management VM to prevent breaches of VM isolation due to rogue DMA setups. This is not required for PCI device assignment since modern chipsets include IO MMUs, such as Intel's VT-d, which can be set up by the Hypervisor to allow a device to access only certain pages of host memory.
This is achieved by setting up a page table mapping in the IO MMU to map host memory into a device's DMA address space. On memory write and read requests from a PCI device to or from host memory, the chipset selects an IO MMU page table based on the Requester ID used by the PCI device.
Thus, the Hypervisor sets up the IO MMU page tables for a device to map only the memory belonging to a Guest VM when the device is assigned to it. This prevents a Guest VM from intentionally or accidentally accessing other VMs' memory areas via a device's DMA engines.
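Conceptually, the IO MMU keeps one page table per Requester ID and translates every DMA address through it; an unmapped page means the access is blocked, which is what preserves VM isolation. A toy model (tiny sizes, flat table, purely illustrative, nothing like real VT-d structures):

```c
#include <stdint.h>

/* Toy IO MMU: the chipset selects a page table by the Requester ID of
 * the DMA transaction, then translates the device's DMA address to a
 * host physical address. A zero entry means "not mapped". */
#define PAGE_SHIFT 12
#define NUM_PAGES  16          /* 64 KiB DMA address space per device */
#define NUM_RIDS   4

static uint64_t pt[NUM_RIDS][NUM_PAGES];

void iommu_map(uint16_t rid, uint64_t dma_addr, uint64_t host_addr)
{
    pt[rid % NUM_RIDS][(dma_addr >> PAGE_SHIFT) % NUM_PAGES] =
        host_addr & ~((1ull << PAGE_SHIFT) - 1);
}

/* Returns the host physical address, or 0 if the device has no mapping
 * for this page: the DMA would be rejected by the chipset. */
uint64_t iommu_translate(uint16_t rid, uint64_t dma_addr)
{
    uint64_t page = pt[rid % NUM_RIDS][(dma_addr >> PAGE_SHIFT) % NUM_PAGES];
    if (!page)
        return 0;
    return page | (dma_addr & ((1ull << PAGE_SHIFT) - 1));
}
```

The key property is visible in the table layout: two devices with different Requester IDs hit different page tables, so a mapping installed for one Guest VM's device is invisible to every other device.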
Of all the IO virtualization options, direct PCI device assignment has the lowest overhead and the least added latency. The Management VM is not involved in the data path; it just provides infrequent access to the device's PCI configuration space.
The Hypervisor itself is only involved in the virtualization of device interrupts, which can be achieved with relatively low overhead, especially if physical interrupts are delivered to the same CPU cores the recipient Guest VM is executing on.
However, it is clearly infeasible to have a separate PCI device for every Guest VM in a system, even if multifunction devices were used. The PCI-SIG introduced SR-IOV to address this issue.
Introducing PCIe SR-IOV
The PCI Special Interest Group (PCI-SIG) introduced the SR-IOV standard in September 2007, recognizing the need to provide a device-centric approach to IO virtualization.
As such, the SR-IOV standard builds on top of a wide range of existing PCI standards, including PCI Express (PCIe), Alternative Routing-ID Interpretation (ARI), Address Translation Services (ATS), and Function Level Reset (FLR).
From the host perspective, SR-IOV on its own is primarily an extension to the PCI configuration space, defining access to lightweight Virtual Functions (VFs). With SR-IOV, a physical PCI device may contain several device functions. In SR-IOV parlance these are called Physical Functions (PFs).
PFs are standard PCIe devices with their own full PCI configuration space and set of resources. An SR-IOV compliant PF has an additional SR-IOV Extended Capability as part of its configuration space.
This extended capability in the PF's configuration space contains configuration information about all VFs associated with the PF. In particular, it defines the BAR configuration for VFs as well as the type of the VFs.
While the BAR configuration for VFs is described in the associated PF's Extended Capability, each VF also has a standard PCIe configuration space entry. However, certain fields in a VF's configuration space are ignored or undefined.
Of particular note is that the Vendor ID and Device ID fields in a VF's configuration space are not defined and have to be taken from the associated PF's configuration space fields. Due to this arrangement, all VFs of a PF have to be of the same type.
Further, as outlined above, the BAR configuration entries in a VF's configuration space are undefined, as the PF's extended capability defines the BARs for all VFs. Each VF has its own set of MSI/MSI-X vectors, and these are configured using the VF's PCIe configuration space.
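The per-VF slicing of the PF's VF BAR comes down to simple arithmetic: the aperture described by a VF BAR in the SR-IOV capability is divided evenly among the VFs, so each VF's slice is found by an index multiplication. A sketch with illustrative parameter names standing in for the capability's VF BAR registers:

```c
#include <stdint.h>

/* A VF's memory resources are not described in the VF's own config
 * space. The PF's SR-IOV Extended Capability holds a VF BAR whose
 * aperture is partitioned evenly among the VFs: VF n gets the slice
 * starting at vf_bar_base + n * vf_bar_size. */
uint64_t vf_bar_addr(uint64_t vf_bar_base, uint64_t vf_bar_size,
                     unsigned vf_index)   /* 0-based VF index */
{
    return vf_bar_base + (uint64_t)vf_index * vf_bar_size;
}
```

This is one of the pieces of emulation a PCIM performs: when a guest reads a VF's (undefined) BAR, the PCIM synthesizes the value from the PF's capability using this calculation.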
The SR-IOV standard anticipates that host software (including virtualization software) requires a PCI Manager (PCIM) to manage PFs, VFs, their capabilities, configuration, and error handling. However, the standard explicitly does not define any implementation of the PCIM.
An implementation would typically present VFs as normal PCI devices to the OS and/or Hypervisor and mask the differences, for example in BAR configuration and other VF configuration space accesses, through software emulation. Thus a PCIM implementation is very similar in functionality to the vPCI module used for PCI device assignment. In fact, in most implementations the vPCI module and the PCIM implementation cooperate.
SR-IOV for network devices
From a virtualization point of view, SR-IOV capable network devices combine PCI device assignment with the network virtualization techniques of modern MQ devices. With the help of the PCIM, SR-IOV VFs are typically treated as standard PCI devices which are directly assigned to Guest VMs.
Since VFs use different Requester IDs, the chipset's IO MMU can be set up to provide appropriate DMA protection, and with each VF owning its own MSI/MSI-X vectors, interrupts can be directed to the cores executing the Guest VMs. Thus, SR-IOV provides the same low overhead and latency access to IO devices as PCI device assignment.
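Each VF's Requester ID is derived from the PF's: the SR-IOV capability's First VF Offset and VF Stride fields give RID(VF n) = RID(PF) + FirstVFOffset + (n - 1) * VFStride, with n starting at 1. These distinct RIDs are what let the IO MMU select a per-VF page table. Sketched in C:

```c
#include <stdint.h>

/* Compute a VF's 16-bit Requester ID (bus/device/function) from the
 * PF's RID and the First VF Offset and VF Stride fields of the PF's
 * SR-IOV Extended Capability. vf_number is 1-based, as in the spec. */
uint16_t vf_requester_id(uint16_t pf_rid, uint16_t first_vf_offset,
                         uint16_t vf_stride, unsigned vf_number)
{
    return (uint16_t)(pf_rid + first_vf_offset +
                      (vf_number - 1) * vf_stride);
}
```

Note that the resulting RIDs may extend past the PF's own function number range, which is why SR-IOV relies on ARI to allow more than eight functions per device.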
Like modern MQ NICs, SR-IOV capable NICs need to multiplex and de-multiplex traffic between VFs, and they typically implement the same fixed functionality: Layer 2 switching combined with some basic higher-level filtering. Thus, SR-IOV NICs are similarly limited in flexibility as MQ NICs.
Challenges with SR-IOV. SR-IOV is well suited for providing hardware support for virtualizing fixed-function devices such as network cards. However, its design has limitations in supporting highly programmable IO devices.
With SR-IOV, VFs are enumerated in a hardware-based PCI configuration space, and all VFs associated with a PF have to be of the same device type. Programmable IO devices may allow vendors to dynamically create virtual functions and use different types of device functions to provide different interfaces to the IO device.
For example, a networking device may be able to offer a standard NIC interface as well as interfaces for efficient packet capture and network interfaces offloading network security protocols.
With SR-IOV, these three types of network interfaces would have to be represented as three different PFs, each with a set of VFs associated with them. From the SR-IOV standard it is unclear whether the assignment of VFs to different PFs can be easily changed, i.e., it is unclear how VFs of a given type can be created dynamically.
This limitation is a direct result of SR-IOV requiring VFs to be enumerated in hardware, which also results in higher hardware cost and complexity. Despite this additional cost and complexity, a software component, the PCIM, is still required to manage VFs.
Next, in Part 3: Product How-To – Netronome's IOV Solution.
To read Part 1, go to “
Nabil Damouny is the senior director of business development at Netronome Systems. He has a BSEE from the Illinois Institute of Technology (IIT) and an MSECE from the University of California, Santa Barbara (UCSB). He holds 3 patents in computer architecture and remote networking.
Rolf Neugebauer is a Staff Software Engineer at Netronome Systems, where he works on virtualization support for Netronome's line of Intelligent Network Processors. Prior to joining Netronome, Rolf worked at Microsoft and Intel Research. At Intel he was one of the initial researchers developing the Xen hypervisor in collaboration with academics at Cambridge University. Rolf holds a PhD and an MSc from the University of Glasgow.