With rising network traffic and the need for application awareness, content inspection, and security processing, the amount of network IO processing at line rates increases exponentially. This, coupled with the need for virtualization, places a huge burden on the network IO subsystem.
At 10G and beyond this dictates the use of an IO virtualization co-processor (IOV-P). By classifying network traffic into flows, applying security rules, pinning flows to a specific VM (virtual machine) on a specific core on the host, and/or load balancing flows across VMs, the IOV-P enables the overall system to achieve full network performance.
As servers and network appliances in data centers, and control plane functions in infrastructure equipment, are built around commodity multi-core CPUs – specifically x86 architectures – IO communications are becoming dependent on the system interconnect, such as PCIe. An 8-lane PCIe v2 interconnect can easily support over 10G of network IO traffic.
The increasing use of virtualization in servers, appliances and network equipment means that the underlying IO subsystems explicitly have to support virtualization. Virtualized data center servers and appliances using IOV-P-based intelligent network cards provide each Virtual Machine (VM) with its own virtual NIC, allowing a number of VMs to share a single 10GbE physical NIC (Network Interface Card).
Each virtual NIC can have its own IP and MAC address, and can be assigned to a separate VLAN. To the outside world and to the host subsystem, the virtual NIC appears as a distinct and dedicated NIC. In the same way that multiple VMs running on a multi-core server replace multiple physical servers, the IOV-P can replace multiple NICs and help replace or simplify the top-of-the-rack switch and the server load balancer.
The result is higher overall performance, lower cost and easier system management using fewer NICs, cables, and switch ports while achieving full network IO performance. Similar benefits apply to network infrastructure equipment when IOV-P is used for intelligent service blades and trunk cards serving the various line cards.
This three-part series of articles first discusses this new class of network IO virtualization architectures.
Effective Resource Utilization needs Virtualization
As companies grow, their IT infrastructure also grows, leading to an increase in the number of stand-alone servers, storage devices and applications. Unmanaged, this growth can lead to enormous inefficiency, higher expense, availability issues, and systems management headaches, negatively impacting the company's core business. Smaller servers may have utilization rates of 20% or less.
To address these challenges, organizations are implementing a variety of virtualization solutions for server, storage, application, and client environments. These virtualization solutions can deliver real business value through practical benefits, such as decreased IT costs and business risks; increased efficiency, utilization and flexibility; streamlined management; and enhanced business resilience and agility.
Enter Server Virtualization
In virtualized servers running VMware or Xen, the physical NIC becomes isolated from the guest OS used by application software. The guest OS, such as Windows or Linux, uses a NIC driver to talk to a virtual NIC. The virtualization software (Hypervisor) emulates a NIC for each guest OS. One physical server could have 8 or 16 Virtual Machines, each of which runs a guest OS talking to a virtual NIC.
In addition to allowing multiple guest OSes to share a single physical NIC, the Hypervisor typically emulates an Ethernet (L2) switch connecting virtual machines to physical NIC ports. Implementing virtual NIC functions and virtual switching functions within the virtualization software is performance intensive and adds significant overhead in the networking path. This can reduce 10GbE throughput to 1GbE levels.
Introducing Network IO Virtualization
The PCI-SIG IO Virtualization (IOV) working group is developing extensions to PCIe. The first IOV specification maintains a single PCIe root complex (SR-IOV), enabling one physical PCIe device to be divided into multiple Virtual Functions. Each virtual function can then be used by a virtual machine, allowing one physical device to be shared by many virtual machines and their guest OSes.
IO Virtualization – Implementation Options
In any given system there are a limited number of IO devices, typically far fewer than the number of VMs the system may be hosting. As all VMs require access to IO, a Virtual Machine Monitor (VMM) or Hypervisor needs to mediate access to these shared IO devices. In this section we review different IOV implementation options.
Software IO Virtualization. All VMMs and Hypervisors provide IO virtualization implemented in software. Commercial Hypervisor offerings run IO virtualization software in a special management, or otherwise privileged, VM to virtualize IO devices, as depicted in Figure 1 below.
|Figure 1: Software IO Virtualization for network devices. All network traffic is passed through the Management VM, adding significant virtualization overheads and latency.|
The Management VM has access to all IO devices to be shared, and the OS in the Management VM is running the normal device driver for that device (labeled “DD” in the figure). The Management VM then needs to virtualize the device and present it to other VMs.
Conceptually, network device IO virtualization is straightforward. Guest VMs have a virtual network interface with associated MAC and IP addresses. In the Management VM the physical device is visible with a MAC and an IP address.
Thus, the Management VM can use standard network mechanisms such as bridging or routing to direct traffic received from the physical interface to the virtual interfaces in the guest VMs, and to direct traffic received from a guest VM to other guest VMs or to the physical network device.
In Figure 1 above, a software implementation of a normal Ethernet switch (labeled “SW”) performs this de-multiplexing and multiplexing of traffic to and from guest VMs.
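The de-multiplexing step such a software switch performs can be sketched as follows. This is a minimal, illustrative model in Python, not any hypervisor's actual implementation; the class and field names (`SoftSwitch`, `dst_mac`, and so on) are hypothetical.

```python
from collections import defaultdict, deque

class SoftSwitch:
    """Toy model of the Management VM's software switch ("SW" in Figure 1).

    Frames received on the physical NIC are de-multiplexed onto per-guest
    receive queues by destination MAC address; broadcast or unknown unicast
    destinations are flooded to every guest, as a real L2 switch would do.
    """

    def __init__(self):
        self.mac_to_guest = {}            # guest MAC -> guest identifier
        self.queues = defaultdict(deque)  # per-guest receive queues

    def attach_guest(self, guest_id, mac):
        self.mac_to_guest[mac] = guest_id

    def rx_from_physical_nic(self, frame):
        """De-multiplex one frame arriving from the physical interface."""
        dst = frame["dst_mac"]
        if dst in self.mac_to_guest:
            self.queues[self.mac_to_guest[dst]].append(frame)
        else:
            # broadcast or unknown destination: flood to all guests
            for guest_id in set(self.mac_to_guest.values()):
                self.queues[guest_id].append(frame)

sw = SoftSwitch()
sw.attach_guest("vm1", "02:00:00:00:00:01")
sw.attach_guest("vm2", "02:00:00:00:00:02")
sw.rx_from_physical_nic({"dst_mac": "02:00:00:00:00:01", "payload": b"hi"})
# The frame lands only on vm1's queue; vm2's queue stays empty.
```

A real implementation would additionally learn MAC addresses from traffic and handle VLAN membership, but the lookup-and-enqueue core is the same.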
This type of software-based IO virtualization requires an efficient inter-VM communication mechanism to transport packets between the Management VM and Guest VMs. For bulk data transfers either memory copies or virtual memory techniques, such as page mapping or flipping, are deployed.
Further, a signaling mechanism is required allowing VMs to notify each other that they have packets to send. It is important that the inter-VM communication mechanism does not violate basic isolation properties between VMs.
For example, it should not be possible for a Guest VM to corrupt the Management VM or access data in other Guest VMs. Ideally, Guest VMs should also be protected to some extent from a misbehaving Management VM, though this is not completely possible due to the more privileged nature of the Management VM.
In Figure 1 above the inter-VM communication is represented by an entity in the Management VM (the back-end) and a corresponding entity in the Guest VMs (the front-end). (We are using the terminology of Xen in this example, but both Microsoft's Hyper-V and VMware's ESX Server have similar concepts.)
The front-ends in the Guest VMs are normal network device drivers in the Guest OS. However, they exchange network packets with their corresponding back-end in the Management VM using the aforementioned inter-VM communication mechanism.
Software-based IO virtualization provides a great deal of flexibility: within the Management VM the virtual interfaces connected to front-ends can connect to the physical interfaces in arbitrary ways. In the simplest and most common case, the virtual network devices are all connected to a software Ethernet bridge or switch.
For enterprise environments this is typically a VLAN-capable switch. The Management VM may also implement a firewall, or other forms of filtering, to protect Guest VMs, as well as providing logging or other monitoring functions.
In some environments the Management VM may also provide other functions such as Network Address Translation (NAT) or routing. In fact, some Hypervisors allow arbitrary virtual networks to be constructed to interconnect Guest VMs.
The obvious drawback of this flexibility is a significant processing overhead, particularly when dealing with received packets. Each packet is received into buffers owned by the Management VM, which then needs to inspect the packet and determine the recipient Guest VM(s). Subsequently, the Management VM needs to transfer the packet into a receive buffer supplied by the recipient Guest VM.
While different techniques are used by different hypervisors, they all have to copy the packet data or exchange pages using page flipping, both of which incur significant overheads.
For example, a Xen system, without further optimization, spends more than five times as many cycles per received packet as native Linux (see K. K. Ram, J. R. Santos, Y. Turner, A. L. Cox, S. Rixner: “Achieving 10 Gb/s using Safe and Transparent Network Interface Virtualization”, VEE 2009).
The network transmit path incurs less overhead. However, the Management VM still has to inspect the packets transmitted by a Guest VM to determine where to send them. Further, the Management VM may perform some header checks, e.g., to prevent MAC address spoofing, or it may need to rewrite the packet header, for example to add VLAN tags or perform NAT.
This typically requires at least the packet header, if not the entire packet, to be accessible within the Management VM, thus adding extra CPU overheads on the transmit path as well.
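The transmit-path checks described above can be sketched as a small filter function. This is an illustrative model only, assuming hypothetical names (`tx_filter`, `TxCheckError`, dictionary-based frames); real hypervisors perform these checks on raw packet headers.

```python
class TxCheckError(Exception):
    """Raised when a guest's transmitted frame fails a header check."""

def tx_filter(frame, guest_mac, vlan_id=None):
    """Toy model of the Management VM's transmit-path header checks.

    Rejects frames whose source MAC does not match the MAC assigned to
    the guest's virtual NIC (anti-spoofing), and optionally performs the
    header rewrite that adds a VLAN tag before the frame reaches the wire.
    """
    if frame["src_mac"] != guest_mac:
        raise TxCheckError("source MAC spoofing detected")
    if vlan_id is not None:
        frame = dict(frame, vlan=vlan_id)  # header rewrite: tag the frame
    return frame

# A well-behaved guest's frame passes and gets tagged for VLAN 100.
out = tx_filter({"src_mac": "02:00:00:00:00:01", "payload": b"hi"},
                guest_mac="02:00:00:00:00:01", vlan_id=100)
```

The key point the sketch illustrates is that even this minimal check forces the Management VM to read the packet header, which is exactly the transmit-path CPU cost discussed above.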
Software-based IO virtualization, however, has its drawbacks: not only does it add significant CPU overhead for each packet but it also adds significant latencies. Packets both on transmit as well as receive are queued twice (at the device and for the inter-VM communication).
Both the Management VM and the Guest VM may experience scheduling latencies, delaying the time taken to react to interrupts or inter-VM signals and increasing the latency for packet traffic.
Multi Queue NICs
Most modern NICs support multiple send and receive queues (MQ NICs) and many commercial hypervisors make use of these MQ NICs to accelerate network IO virtualization.
There are a number of different approaches for utilizing MQ NICs in a virtualization environment, and the most suitable approach depends heavily on the detailed capabilities of the NIC.
All MQ NICs provide some filtering of incoming packets to decide which receive queue to place them on. Typically, the filter is based on the destination MAC address and/or VLAN tags. Some MQ NICs also offer further filtering based on very simple layer 3 and 4 rules.
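The MAC/VLAN filtering an MQ NIC applies on receive can be sketched as a table lookup. This is a conceptual model in Python, not real NIC firmware; the function and field names are hypothetical, and the fallback to a default queue (typically owned by the Management VM) is one common design among several.

```python
def select_rx_queue(frame, filters, default_queue=0):
    """Toy model of an MQ NIC's receive-side filter.

    'filters' maps (destination MAC, VLAN tag) pairs to receive-queue
    numbers. Frames matching no filter fall back to a default queue,
    here assumed to belong to the Management VM.
    """
    key = (frame["dst_mac"], frame.get("vlan"))
    return filters.get(key, default_queue)

# Two guests: vm1 untagged on queue 1, vm2 on VLAN 100 on queue 2.
filters = {
    ("02:00:00:00:00:01", None): 1,
    ("02:00:00:00:00:02", 100): 2,
}
q = select_rx_queue({"dst_mac": "02:00:00:00:00:02", "vlan": 100}, filters)
```

In hardware this lookup runs at line rate per packet, which is precisely what removes the software switch from the receive path.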
Early models of MQ NICs did not apply any filtering to transmitted packets, thus they could not handle packets destined for other VMs connected to the same NIC. As a result, these MQ NICs required additional software to handle inter-VM network traffic. However, modern MQ NICs typically do not have this limitation, thus simplifying the software support required.
Figure 2 below shows a common architecture for using MQ NICs as an IOV solution in virtualized environments. The main idea is to associate queues (more precisely, sets of queues) with individual Guest VMs.
The OS in the Management VM still runs the device driver for the device. However, since the MQ NIC is performing the multiplexing and de-multiplexing of traffic, the Management VM does not contain a software Ethernet switch.
The Guest VMs still use a generic device driver (labeled “FE” or Front-End) representing virtual network interfaces to their OS. However, unlike the software IO virtualization scenario, they are connected to a different, device-specific component in the Management VM (labeled “BE” or Back-End).
|Figure 2: IO Virtualization with Multi Queue devices. The device performs all multiplexing and de-multiplexing of network traffic, significantly reducing the CPU overheads on the Host.|
Compared to software-based IOV, the receive path with MQ NICs is much more straightforward. A Guest VM will transfer buffer descriptors to the Management VM, which can directly post these descriptors to the queue associated with the Guest VM.
When packets arrive at the device, the filter mechanism on the device will select the destination queue and DMA the packet into the buffer posted by the Guest VM. Subsequently, the descriptor is returned to the Management VM, which will forward it back to the Guest VM.
Buffer descriptors have to be passed through the Management VM, rather than allowing the Guest VM to post descriptors directly, so that the Management VM can check that the memory referred to by the descriptors belongs to the Guest VM. Without this check a Guest VM could either accidentally or maliciously cause the device to access memory belonging to another Guest VM, thus violating isolation between VMs.
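The ownership check the Management VM performs on each descriptor can be sketched as a simple range test. This is an illustrative model under assumed names (`validate_descriptor`, `addr`/`len` fields); a real hypervisor would validate against the guest's actual machine-memory mappings, often with hardware IOMMU support.

```python
class DescriptorError(Exception):
    """Raised when a buffer descriptor references memory outside the guest."""

def validate_descriptor(desc, guest_regions):
    """Toy model of the Management VM's per-descriptor memory check.

    'guest_regions' lists (base, length) ranges of machine memory owned
    by the posting guest. A descriptor whose buffer does not fall wholly
    inside one of those ranges is rejected before it reaches the device,
    preventing DMA into another guest's memory.
    """
    start, end = desc["addr"], desc["addr"] + desc["len"]
    for base, length in guest_regions:
        if base <= start and end <= base + length:
            return desc  # safe to post to the NIC's queue
    raise DescriptorError("descriptor references memory outside the guest")

regions = [(0x10000, 0x4000)]  # one 16 KiB region owned by the guest
ok = validate_descriptor({"addr": 0x10800, "len": 2048}, regions)  # accepted
```

The cost of this check, performed on every descriptor, is the residual per-packet overhead and latency that remains even with MQ NICs.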
The transmit path from a Guest VM to the device is also straightforward. Transmit descriptors are passed from the Guest VM to the Management VM, which passes them on to the device. Once the packet is transmitted, the notifications are passed back to the Guest VM via the Management VM.
As should be obvious from this description, IO virtualization with MQ NICs incurs far less overhead than software-based IO virtualization, since the data does not need to be moved between VMs and the Management VM is not involved in the multiplexing and de-multiplexing of network traffic. Using this type of IO virtualization, close to 10Gb/s line rate can be achieved with modern host hardware.
In the Xen implementation, IO virtualization with MQ NICs still incurs a per-packet overhead of about twice as many CPU cycles per packet when compared to native Linux execution. Further, individual packets still incur significant additional latency, as the descriptors have to be passed through the Management VM.
The use of MQ NICs for IO virtualization severely limits the flexibility offered by software-based IOV, as the packet multiplexing and de-multiplexing is performed by fixed-function hardware. Typically, MQ NICs perform simple filtering at the MAC level in order to implement enough functionality for a simple L2 Ethernet switch.
Nabil Damouny is the senior director of business development at Netronome Systems. He has a BSEE from Illinois Institute of Technology (IIT) and an MSECE from the University of California, Santa Barbara (UCSB). He holds 3 patents in computer architecture and remote networking.
Rolf Neugebauer is a Staff Software Engineer at Netronome Systems, where he works on virtualization support for Netronome's line of Intelligent Network Processors. Prior to joining Netronome, Rolf worked at Microsoft and Intel Research. At Intel he was one of the initial researchers developing the Xen hypervisor in collaboration with academics at Cambridge University. Rolf holds a PhD and an MSc from the University of Glasgow.