Virtualization facilities play a special role in ARMv8-based systems and consist of several components. In ARMv7, the CPU mode for running a hypervisor was an optional extension; in ARMv8, it has become part of the architecture and has been integrated into the privilege-level system under the name EL2. However, this mode only solves problems associated with the CPU accessing system resources, such as memory and peripherals. To improve the efficiency of transactions initiated by devices in a virtualized environment, a number of additional components have been developed for ARMv8-based systems, such as new interrupt controllers and IOMMUs. This article provides an overview of these facilities from the perspective of system software development.
Virtualization in ARMv8-based systems is organized as shown in Figure 1: the EL2 privilege level runs a hypervisor that controls the execution of virtual machine (VM) code and the sharing of resources between VMs. The EL1 (OS kernel, privileged code) and EL0 (unprivileged code) levels are left for VM instances. Address translation is performed in two stages (Figure 2): in the first stage, a so-called intermediate physical address (IPA) is calculated from a virtual address using the stage 1 translation tables maintained by the VM (pointers held in the TTBR0_EL1/TTBR1_EL1 registers); in the second stage, the real physical address is calculated from the IPA using the stage 2 translation table prepared by the hypervisor (the pointer is stored in the VTTBR_EL2 register). Such an organization provides effective privilege separation and isolation of VMs from the hardware. This allows, for example, many instances of an identical VM to run side by side.
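The two translation stages can be illustrated with a minimal Python sketch. It is not how the hardware walks multi-level tables: each stage is modeled as a flat dictionary from page number to page number, 4 KB pages are assumed, and all names (`translate`, `two_stage_translate`, the sample tables) are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(table, addr):
    """Map the page of `addr` through a flat page-number table, keeping the offset."""
    page = addr >> PAGE_SHIFT
    if page not in table:
        raise LookupError(f"translation fault at {addr:#x}")
    return (table[page] << PAGE_SHIFT) | (addr & PAGE_MASK)


def two_stage_translate(stage1, stage2, va):
    """VA -> IPA through the guest's tables, then IPA -> PA through the hypervisor's."""
    ipa = translate(stage1, va)
    return translate(stage2, ipa)


# Two instances of an identical VM: same stage 1 tables, different stage 2 tables.
guest_stage1 = {0x400: 0x80}   # VA page 0x400 -> IPA page 0x80
vm_a_stage2 = {0x80: 0x1000}   # IPA page 0x80 -> PA page 0x1000
vm_b_stage2 = {0x80: 0x2000}   # same IPA page, different physical page

pa_a = two_stage_translate(guest_stage1, vm_a_stage2, 0x400123)
pa_b = two_stage_translate(guest_stage1, vm_b_stage2, 0x400123)
```

Both guests use the same virtual address and the same IPA, yet end up in different physical pages, which is what lets identical VM images coexist.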
Fig 1. Virtualization in ARMv8-based systems (Source: Auriga)
The two-stage translation allows VMs to maintain their own translation tables while also allowing the hypervisor to fully control the final results. The EL2 privilege level is designed specifically to execute hypervisor code and has some differences from the other levels. In particular, it is the lowest privilege level at which the special registers VTTBR_EL2 and VTCR_EL2 are accessible, along with a number of others intended for VM management.
In the original version of the ARMv8 architecture, only one translation table base register is provided for the hypervisor, and another is provided for the current VM. The hypervisor has access to several special registers that set the configuration parameters visible to VMs at the EL1 level, such as CPU identifiers (manufacturer, version, etc.) and the multiprocessor system ID. This makes it possible to present VMs running on the same system with different virtual SMP topologies and with CPUs of different versions and manufacturers.
If an event requiring hypervisor intervention happens in the VM, its processing is performed as follows:
an exception occurs at the EL2 level;
according to its type, the appropriate handler is called from the table (the address is stored in the VBAR_EL2 register);
necessary actions are performed;
if needed, required values are put into registers;
the hypervisor returns to the VM where the exit occurred (or switches to another VM if the hypervisor is designed accordingly).
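The steps above can be sketched as a small dispatcher. This is an illustrative Python model rather than real hypervisor code: the exception class is taken from bits [31:26] of the ESR_EL2 syndrome value, the vCPU is a plain dictionary, and the handler names are hypothetical. One detail worth noting is that a trapped instruction leaves the return address pointing at itself, so the handler skips it, while for HVC the return address already points past the instruction.

```python
EC_SHIFT = 26      # ESR_EL2.EC occupies bits [31:26]
EC_MASK = 0x3F

EC_WFX = 0x01      # trapped WFI or WFE
EC_HVC64 = 0x16    # HVC executed in AArch64 state
EC_SYSREG = 0x18   # trapped MSR, MRS, or system instruction


def exception_class(esr):
    return (esr >> EC_SHIFT) & EC_MASK


def on_wfx(vcpu):
    vcpu["runnable"] = False   # guest is idle: a scheduler may pick another VM
    vcpu["pc"] += 4            # skip the trapped instruction


def on_hvc(vcpu):
    vcpu["regs"][0] = 0        # hypothetical: report success to the guest in x0


def on_sysreg(vcpu):
    vcpu["emulated"] = vcpu.get("emulated", 0) + 1  # count emulated accesses
    vcpu["pc"] += 4            # skip the trapped instruction


def on_unknown(vcpu):
    vcpu["crashed"] = True     # unhandled exit: stop this VM


HANDLERS = {EC_WFX: on_wfx, EC_HVC64: on_hvc, EC_SYSREG: on_sysreg}


def handle_exit(vcpu, esr):
    """Pick a handler by exception class, run it, and return the updated vCPU."""
    HANDLERS.get(exception_class(esr), on_unknown)(vcpu)
    return vcpu
```

A real handler table is reached through VBAR_EL2 and also has to save and restore the guest register file; that part is omitted here.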
Fig 2. Address translation performed in two stages (Source: Auriga)
The events that cause such VM exits are selected by bits of the HCR_EL2 register. They include accesses to system registers, including those available at the EL1 privilege level (e.g., TTBR0_EL1/TTBR1_EL1, FAR_EL1), cache and TLB maintenance instructions, regular exceptions (interrupts, including those from timers, and undefined instructions), and the interrupt and event waiting instructions. Two-stage address translation is also enabled through this register. In addition, a separate hardware timer is available at the EL2 level, which allows a hypervisor to configure a periodic interrupt, usually used to initiate VM switching, similar to the way tasks are switched in modern OSs.
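As a rough illustration, the following Python sketch composes an HCR_EL2 value for a 64-bit guest. The bit positions are those of the architecturally defined fields named in the comments; the helper `guest_hcr` and its parameters are hypothetical, and a real hypervisor would set further bits depending on its design.

```python
HCR_VM = 1 << 0     # enable stage 2 address translation
HCR_FMO = 1 << 3    # route physical FIQs to EL2
HCR_IMO = 1 << 4    # route physical IRQs to EL2
HCR_AMO = 1 << 5    # route physical SError interrupts to EL2
HCR_TWI = 1 << 13   # trap WFI executed by the guest
HCR_TWE = 1 << 14   # trap WFE executed by the guest
HCR_TSC = 1 << 19   # trap SMC (calls to trusted code)
HCR_TVM = 1 << 26   # trap writes to EL1 memory-control registers such as TTBR0_EL1
HCR_RW = 1 << 31    # EL1 runs in AArch64 state


def guest_hcr(trap_wfx=True, trap_vm_regs=False):
    """Build a trap configuration for one guest (hypothetical policy)."""
    value = HCR_VM | HCR_FMO | HCR_IMO | HCR_AMO | HCR_TSC | HCR_RW
    if trap_wfx:
        value |= HCR_TWI | HCR_TWE
    if trap_vm_regs:
        value |= HCR_TVM
    return value
```

On real hardware the resulting value would be written with an `msr hcr_el2, x0` instruction before entering the guest.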
The switching process also includes saving the context of the current VM, loading the context of the next VM, and transferring control to it. VMs can also perform hypervisor calls in much the same way that unprivileged code at the EL0 level performs system calls. To perform such a call, the VM places parameters in registers and executes the “hvc” instruction. This results in an exception taken to the EL2 privilege level, which is processed in the standard way. Typically, this mechanism is used to call functions of the standardized PSCI protocol.
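The hypervisor side of such a call might look like the Python sketch below. The function IDs and return codes are the ones defined by PSCI, and the field layout of the function ID follows the SMC Calling Convention; the register list, the `vm` dictionary, and `handle_hvc` itself are hypothetical stand-ins for real hypervisor state.

```python
PSCI_VERSION = 0x84000000
PSCI_CPU_ON_64 = 0xC4000003

PSCI_SUCCESS = 0
PSCI_NOT_SUPPORTED = -1
PSCI_ALREADY_ON = -4


def decode_function_id(fid):
    """Split a function ID into the fields defined by the SMC Calling Convention."""
    return {
        "fast": bool((fid >> 31) & 1),    # fast (atomic) call
        "smc64": bool((fid >> 30) & 1),   # 64-bit calling convention
        "owner": (fid >> 24) & 0x3F,      # 4 = standard secure service (PSCI)
        "function": fid & 0xFFFF,
    }


def handle_hvc(regs, vm):
    """Emulate a PSCI call: x0 holds the function ID, x1..x3 the arguments."""
    fid = regs[0]
    if fid == PSCI_VERSION:
        regs[0] = (1 << 16) | 0           # major version 1, minor version 0
    elif fid == PSCI_CPU_ON_64:
        target, entry, context = regs[1], regs[2], regs[3]
        if vm["vcpus"].get(target, {}).get("on"):
            regs[0] = PSCI_ALREADY_ON
        else:
            # Start the virtual CPU at `entry` with the context ID in its x0.
            vm["vcpus"][target] = {"pc": entry, "x0": context, "on": True}
            regs[0] = PSCI_SUCCESS
    else:
        regs[0] = PSCI_NOT_SUPPORTED
    return regs
```

The result is returned to the guest in x0, so from the guest's point of view the call looks the same as a call served by firmware.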
The hypervisor can also intercept calls from VMs to trusted code routines. For example, in non-virtualized environments, PSCI is implemented in trusted code, and calls to it are processed at the highest privilege level, EL3. The ARMv8 architecture also contains additional facilities to improve the performance of virtualized environments: in addition to the shareability domains that the hypervisor can assign to reduce cache coherency traffic, each VM can be assigned its own identifier, or VMID. TLB entries are tagged with this identifier, which makes it possible to avoid an “expensive” TLB flush when switching VMs.
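The effect of the VMID can be shown with a toy model in Python. The `TaggedTlb` class is purely illustrative; `make_vttbr` places the VMID in bits [63:48] of a VTTBR_EL2 value, as the architecture does, with the identifier width passed in as a parameter.

```python
class TaggedTlb:
    """Toy TLB whose entries are keyed by (VMID, page), so VMs never see each other's entries."""

    def __init__(self):
        self.entries = {}

    def fill(self, vmid, page, out_page):
        self.entries[(vmid, page)] = out_page

    def lookup(self, vmid, page):
        return self.entries.get((vmid, page))

    def invalidate_vmid(self, vmid):
        # Drop only the entries of one VM, e.g. when its VMID is recycled.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != vmid}


def make_vttbr(vmid, table_base, vmid_bits=8):
    """Compose a VTTBR_EL2 value: VMID in bits [63:48], table base below it."""
    if not 0 <= vmid < (1 << vmid_bits):
        raise ValueError("VMID does not fit the implemented width")
    if not 0 <= table_base < (1 << 48):
        raise ValueError("table base does not fit in 48 bits")
    return (vmid << 48) | table_base
```

Switching VMs then amounts to loading a different VTTBR_EL2 value; entries cached for the previous VM stay in the TLB but cannot match lookups made under the new VMID.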
The original version of ARMv8 provided 8-bit VMIDs, which were later extended to 16 bits. In addition, ARMv8.1 added a second translation table base register for the EL2 level, TTBR1_EL2, as part of the Virtualization Host Extensions (VHE), which gives Type 2 hypervisors (those that are part of a host OS) more possibilities. At the same time, as mentioned above, fully featured virtualization requires VMs to interact with peripheral devices (network adapters, storage controllers, etc.) with minimal hypervisor involvement, as well as the delivery of interrupts from devices to processors.
System memory management unit
These aspects of virtualized environments in ARMv8 systems are handled by two units: the generic interrupt controller (GIC) and the system memory management unit (SMMU) (Figure 3). SMMUs translate I/O addresses in the same way as is done for CPU-initiated memory accesses. The unit supports one- and two-stage translation of I/O addresses. As a result, the benefits of translation and memory protection are available to VMs as well as to the hypervisor: devices are allowed to read from and write to only specific memory address ranges.
Fig 3. The system memory management unit (SMMU) (Source: Auriga)
Moreover, it is sometimes convenient to organize scatter–gather operations on I/O buffers by means of the SMMU. The usage model of translation stages is almost the same as that for the CPU cores (i.e., the output of the first stage produces an IPA unique to the current VM, and the output of the second stage produces the real physical address unique to the entire system). The format of SMMU translation tables is similar to that for the CPU, with some differences in page attributes. Page sizes of 4, 16, and 64 KB are supported as well as one or two translation tables, depending on register settings and the translation stage, and the full 48- or 52-bit address space.
Each involved device has its own translation context (which ultimately selects the associated translation table set). It is possible to share a single context among several devices. Context selection is performed by the unit using the so-called Stream ID, a hardware-dependent device identifier. For PCIe devices (physical or virtual functions), the identifier is the Requester ID (RID), which encodes the bus, device, and function numbers that form the device address in the PCIe configuration space. SMMUs have their own TLBs and support VMIDs to accelerate translation. When an incorrect configuration, a translation error, or another exception is detected, SMMUs assert so-called context interrupts (i.e., interrupts bound to translation contexts).
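For PCIe, the identifier can be computed as in the following Python sketch. The bus/device/function packing is the standard PCIe layout; using the RID directly as the Stream ID is an assumption made here for simplicity (real systems may apply a platform-specific mapping), and the context names are hypothetical.

```python
def pcie_rid(bus, device, function):
    """Pack a PCIe address into a 16-bit Requester ID: bus[15:8], device[7:3], function[2:0]."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def select_context(stream_to_context, stream_id):
    """Return the translation context bound to a stream, or None if the stream is unknown."""
    return stream_to_context.get(stream_id)


# Two functions of one network adapter share a context; a storage controller has its own.
contexts = {
    pcie_rid(1, 0, 0): "vm1-net",
    pcie_rid(1, 0, 1): "vm1-net",
    pcie_rid(2, 0, 0): "vm2-storage",
}
```

An unknown Stream ID yields no context, which corresponds to a transaction the SMMU would not let through.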
SMMU maintenance resembles that of the CPU memory management unit (MMU). However, operations on the processor MMU (TLB invalidation, translation result retrieval, etc.) are performed via special instructions, while for SMMUs, they are performed by accessing context registers. By the end of 2017, there were several versions of the SMMU specification, the latest being 3.1. SMMU versions 3.0 and 3.1 support extended stream IDs and use tables in RAM to match stream IDs to contexts. Such tables can have one or two levels. Table entries contain pointers to context descriptors, which are also stored in memory, as well as the identifier of the VM to which the entry belongs and pointers to the stage 2 translation tables.
Context descriptors, in turn, contain pointers to the stage 1 translation tables. One of the important features of SMMUv3 is the ability to stall transactions until the software responds. This model allows devices to access pages that are not in RAM (e.g., swapped out to a file or swap partition) or that have been allocated speculatively. SMMUs can also automatically set changed (or dirty) page indication bits in the translation tables. This can simplify VM migration, VM state snapshotting, etc. SMMUv3 also supports VM identifier masks, which allow translation tables to be shared between different VMs, thus reducing TLB pressure.
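The chain from a stream ID to the translation tables in SMMUv3 can be pictured with the Python sketch below. It models a two-level stream table as nested dictionaries; the split of 8 bits for the second level, the field names, and the functions are all hypothetical and do not reflect the binary layout of the real structures.

```python
SPLIT = 8  # assumption: the low 8 bits of the stream ID index the second-level table


def lookup_ste(l1_table, stream_id):
    """Walk a two-level stream table and return the stream table entry, if any."""
    l2_table = l1_table.get(stream_id >> SPLIT)
    if l2_table is None:
        return None
    return l2_table.get(stream_id & ((1 << SPLIT) - 1))


def resolve(l1_table, stream_id):
    """Collect what the SMMU needs to translate a transaction from this stream."""
    ste = lookup_ste(l1_table, stream_id)
    if ste is None:
        raise LookupError(f"no stream table entry for stream {stream_id:#x}")
    cd = ste["context_descriptor"]
    return {
        "vmid": ste["vmid"],              # VM the stream belongs to
        "stage1_tables": cd["ttb"],       # owned by the guest
        "stage2_tables": ste["s2ttb"],    # owned by the hypervisor
    }


stream_table = {
    0x01: {0x00: {"vmid": 7, "s2ttb": 0x90000000,
                  "context_descriptor": {"ttb": 0x80000000}}},
}
```

Keeping the stage 2 pointer in the stream table entry and the stage 1 pointer in the context descriptor mirrors the ownership split: the hypervisor controls the former, the guest the latter.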
In SMMUv3, both control and event signaling have been significantly reworked: the unit uses an event queue, which is a ring buffer in memory, and the context interrupts are replaced with an interrupt that signals the appearance of new descriptors in the queue. Page requests from devices are delivered through a separate queue, the so-called page request interface (PRI) queue. Instead of the context registers mentioned above, control blocks in memory are used, and context management is performed by writing command descriptors to the command queue and submitting them.
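A queue of this kind can be modeled as a ring buffer with producer and consumer pointers. The Python sketch below uses one extra wrap bit above the index bits to tell a full queue from an empty one, which is the general scheme SMMUv3 queues follow; the class and the command labels are illustrative only.

```python
class RingQueue:
    """Ring buffer with 2**log2_size slots; each pointer is an index plus one wrap bit."""

    def __init__(self, log2_size):
        self.shift = log2_size
        self.slots = [None] * (1 << log2_size)
        self.prod = 0   # advanced by the side that writes entries
        self.cons = 0   # advanced by the side that reads entries

    def _index(self, ptr):
        return ptr & ((1 << self.shift) - 1)

    def _wrap(self, ptr):
        return (ptr >> self.shift) & 1

    def _advance(self, ptr):
        # Incrementing past the last index flips the wrap bit.
        return (ptr + 1) & ((1 << (self.shift + 1)) - 1)

    def is_empty(self):
        return self.prod == self.cons

    def is_full(self):
        return (self._index(self.prod) == self._index(self.cons)
                and self._wrap(self.prod) != self._wrap(self.cons))

    def push(self, item):
        if self.is_full():
            return False
        self.slots[self._index(self.prod)] = item
        self.prod = self._advance(self.prod)
        return True

    def pop(self):
        if self.is_empty():
            return None
        item = self.slots[self._index(self.cons)]
        self.cons = self._advance(self.cons)
        return item
```

For the command queue, software is the producer and the SMMU the consumer; for the event and PRI queues, the roles are reversed.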
The GIC plays a crucial role in virtualized environment functioning. At the end of 2017, the fourth version of the specification was the latest one, while the second version was the minimal one intended for ARMv8 processors. GIC itself is quite a complicated device due to the necessity of delivering interrupts in multiprocessor systems (existing implementations can have 256 or more hardware threads). However, this article only considers those controller features that are directly related to virtualization. Most of the GIC registers are not virtualized, which leads to VM exits when accessed. At the same time, the specification introduces such concepts as virtual interrupts.
Virtual interrupts can be classified into one of the two virtual groups: 0 and 1. Group 0 holds the so-called fast interrupt requests (FIQs), while Group 1 holds all the others (interrupt requests, IRQs). Virtual interrupts are processed by the processor in exactly the same way as physical ones. The processing of interrupts in virtualized environments based on ARMv8 is organized as follows: physical interrupts from the devices are sent to the EL2 level (to the hypervisor), and the hypervisor activates the corresponding virtual interrupt on the virtual processor if the interrupt is intended for it. Both system and service interrupts can be routed to the hypervisor. The hypervisor handles physical interrupts before they are virtualized, in accordance with the GIC specification.
Support for interrupt virtualization in GIC is backed by a list of events representing virtual interrupts, stored in corresponding registers and handled as virtual IRQs or FIQs. The control of virtual interrupts through the processor register interface resembles that of physical interrupts. Thus, software running on a virtual processor is able to do the following:
set virtual priority masks,
control the way virtual priority is interpreted within groups,
acknowledge virtual interrupts,
lower the priority of virtual interrupts, and
deactivate virtual interrupts.
To manage virtual interrupts, the CPU interface provides a set of system registers located at the same addresses as the physical interrupt control registers. This means that the control mechanism is transparent to the VMs. The number of list registers that hold virtual interrupts is implementation-defined but is limited to 16. If the number of interrupts addressed to a virtual processor exceeds the number of available registers, the hypervisor can store the corresponding events in memory and write them into registers later, as they are freed. Prioritization of interrupts is performed in hardware.
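The overflow handling described above might be organized as in this Python sketch. It is a simplified model: list registers are slots in a Python list, the overflow area is a heap ordered by priority (a lower value means a higher priority, as in the GIC), and the class and method names are hypothetical.

```python
import heapq


class VirtualCpuInterface:
    """Toy model of one vCPU's list registers plus a software overflow queue."""

    def __init__(self, num_list_regs=4):
        self.list_regs = [None] * num_list_regs   # each slot holds (priority, intid)
        self.overflow = []                        # heap of (priority, intid)

    def inject(self, intid, priority):
        """Place a virtual interrupt in a free list register, or queue it in memory."""
        for i, slot in enumerate(self.list_regs):
            if slot is None:
                self.list_regs[i] = (priority, intid)
                return True
        heapq.heappush(self.overflow, (priority, intid))
        return False

    def deactivate(self, intid):
        """Free the register of a finished interrupt and refill it from the overflow queue."""
        for i, slot in enumerate(self.list_regs):
            if slot is not None and slot[1] == intid:
                self.list_regs[i] = (heapq.heappop(self.overflow)
                                     if self.overflow else None)
                return True
        return False
```

In this sketch the refill happens directly in `deactivate`; on real hardware the hypervisor would typically do it when the virtual interface signals that list registers have become free.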
The virtual interface generates interrupts addressed to the hypervisor. They signal events (empty interrupt lists, enabling and disabling of groups, end-of-interrupt signaling for interrupts that are not in registers, etc.) to which the hypervisor should respond accordingly. In addition to private peripheral interrupts (PPIs) and shared peripheral interrupts (SPIs), ARMv8 systems have a separate class of message-signaled interrupts (MSIs) called locality-specific peripheral interrupts (LPIs).
GICv3 and higher versions of the GIC have extended support for this interrupt class and can process interrupt messages in accordance with special rules by means of the interrupt translation service (ITS). Although these features are only indirectly related to virtualization, it is worthwhile to describe them briefly to provide a general view of the changes introduced in GICv4.
Fig 4. Interfaces and interaction of the components with the interrupt controller in a virtualized environment in an ARMv8-based system (Source: Auriga)
When ITSs are used, devices signal events by issuing write transactions addressed to the GITS_TRANSLATER register. A write transaction carries the data being written, which contains the event ID, and the source identifier, which is the same as the one used by the SMMUs. The system software programs the ITS registers so that they point to the device, collection, and interrupt translation tables in memory. These tables contain the rules for handling events from the relevant sources and specify the target CPU core and the interrupt ID to be raised there.
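The lookup performed for each such write can be sketched in Python as follows. The tables are modeled as dictionaries, and the function name and the sample contents are hypothetical; only the overall chain (source identifier and event ID to interrupt ID and collection, collection to target CPU) follows the description above. LPI interrupt IDs start at 8192.

```python
def its_translate(device_table, collection_table, device_id, event_id):
    """Translate (device ID, event ID) into (interrupt ID, target CPU), or None to drop it."""
    translation_table = device_table.get(device_id)
    if translation_table is None:
        return None                       # unknown source: the event is discarded
    entry = translation_table.get(event_id)
    if entry is None:
        return None                       # event not mapped for this device
    intid, collection_id = entry
    target = collection_table.get(collection_id)
    if target is None:
        return None                       # collection not mapped to a CPU
    return intid, target


# One device with two events routed to different CPUs through two collections.
device_table = {0x100: {0: (8192, 0), 1: (8193, 1)}}
collection_table = {0: "cpu0", 1: "cpu3"}
```

Retargeting all interrupts of a collection to another CPU then only requires changing one collection table entry.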
Interrupts set the corresponding fields of entries in the pending interrupt table. In GICv3, this mechanism is defined only for physical interrupts (i.e., those signaled directly by devices). This complicates hypervisor implementation: in particular, the hypervisor must perform in software all the actions that the ITS performs for physical interrupts. GICv4 introduced the ability to generate such interrupts programmatically and to translate LPIs into virtual interrupts that are delivered to the corresponding virtual processors. For this purpose, additional tables describing the affinity of interrupts to target processors were introduced, as well as virtual pending interrupt tables. If, when an interrupt arrives, the physical CPU core mapped to the target vCPU is executing a VM with a different identifier, GICv4 generates a special interrupt to inform the hypervisor about it. To control the translation of virtual interrupts, new command types have been added to the GICv4 ITS command interface.
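The delivery decision made for a virtual LPI can be summarized in a few lines of Python. This is a loose model under stated assumptions: the vCPU is a dictionary with a hypothetical `pending` set standing in for its virtual pending table, and `resident` maps each physical CPU to the vCPU currently running on it. The interrupt that notifies the hypervisor is called a doorbell in GICv4 terminology.

```python
def deliver_virtual_lpi(vcpu, vintid, resident):
    """Record a virtual LPI as pending and report how the target learns about it."""
    vcpu["pending"].add(vintid)   # stands in for the virtual pending table
    if resident.get(vcpu["cpu"]) == vcpu["id"]:
        return "virtual-interrupt"    # target vCPU is running: no hypervisor involvement
    return "doorbell"                 # target is not running: notify the hypervisor


vcpu_a = {"id": "vm1-vcpu0", "cpu": 2, "pending": set()}
vcpu_b = {"id": "vm2-vcpu0", "cpu": 2, "pending": set()}
resident = {2: "vm1-vcpu0"}           # physical CPU 2 currently runs vm1's vCPU
```

In both cases the interrupt stays recorded as pending, so nothing is lost if the hypervisor schedules the target vCPU only later.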
The facilities described in this article provide a solid foundation for implementing virtualized environments, and they are now quite well supported by various hypervisors (of both the first and second types). The architecture of both CPU and system facilities continues to evolve based on requests from software developers, so there will be material for new articles on this topic.
“ARM® Architecture Reference Manual ARMv8, for ARMv8-A architecture profile”
“ARM® System Memory Management Unit Architecture Specification SMMU architecture version 2.0”
“ARM® System Memory Management Unit Architecture Specification SMMU architecture version 3.0 and version 3.1”
“ARM® Generic Interrupt Controller Architecture Specification GIC architecture version 3.0 and version 4.0”
Sergey Temerkhanov is a Principal Software Engineer at Auriga, Inc. He has over 14 years’ experience in the field of embedded software development, including deep expertise in bootloader/firmware porting and development, Linux kernel, Linux device driver development, embedded Linux distribution building and deployment. During his career, Temerkhanov earned strong knowledge of C and C++, ARMv8, and x86 assembly, and profound experience in board bring-up, low-level software and driver development, hardware-assisted debugging, real-time control software development. He spent over 4 years as development team manager, working closely with the hardware development team to design and bring up CPU and FPGA based hardware. He is also experienced in design of special-purpose processor system architectures. Sergey Temerkhanov holds a Master of Technology (MTech) in radio engineering from Bauman Moscow State Technical University.
Igor Pochinok is the Head of Mobile and Embedded Software Systems Laboratory of the MSU RCC (Research Computing Center at Lomonosov Moscow State University). He is an author and co-author of multiple scientific papers, studies, and articles related to embedded software development and testing. Igor Pochinok holds a PhD in Computer Science.