Editor’s Note: In this Product How-to development article the author uses Freescale’s i.MX Quad processor platform to describe the problems faced when using hypervisor virtualization in GPU-based automotive infotainment, navigation and instrumentation modules and some possible predictive scheduling solutions.
Hypervisor-based virtualization has become a hot topic in the field of consumer electronics design because it allows applications to share in the use of the new powerful graphics processor units (GPUs) from companies such as AMD and Nvidia. Designing with hypervisor-based virtualization is not simple to begin with, but the challenges become daunting when applied to automotive applications where the safety of the driver and the passenger is a crucial factor.
Three automotive subsystems that make use of the capabilities of a GPU are the instrument cluster, the electronic navigation subsystem, and the infotainment options (Figure 1). In addition, though not shown in Figure 1, camera-based ADAS (Advanced Driver Assistance Systems) will be included in many next-generation automobiles.
There is now a trend in new automotive designs toward consolidating these applications so that they share a single GPU, which both increases processor utilization and lowers costs (Figure 2).
This sharing is being achieved through the use of hypervisor-based virtualization middleware architectures, which in the automotive environment present a unique set of challenges including:
- Mixed criticality – Unlike the infotainment and navigation subsystems, the instrument cluster is a critical module that immediately affects the safety of the driver and passengers. It needs to access the GPU in a timely manner and with a certain level of determinism.
- Security – Software or media installed on the infotainment system by the driver and/or passengers may be infected with malware that could take over the GPU and affect the instrument cluster output.
Hypervisor virtualization can address these challenges by isolating applications from one another in both memory space and time. Time isolation is inherently difficult because GPUs are not preemptive.
Using Freescale’s i.MX 6 Quad platform (4x Cortex-A9 @ 1.2 GHz; Figure 3) as an illustration, this article describes how these goals can be achieved in an automotive environment.
An important factor that influences the approach taken is the GPU driver architecture. While the platform has three GPUs, the following discussion concentrates on the infotainment cluster’s 3D GPU, as it is inherently the most complicated one and the approaches discussed apply to the other GPU functions as well. The GPU’s software driver architecture is shown in Figure 4.
Figure 4: GPU software driver architecture
The GPU driver software is divided into two sections, one for user space operations and the other for kernel space functions. Data and commands are sent to the GPU through a front end interface. The data sets communicated deal with information about vertex position, attributes, color, and other “normal” vectors.
There are two types of commands: those that modify the state of the GPU and those that do not. LoadState commands modify the state of the GPU and include such things as shader instructions, choosing the type of primitive to draw, and setting the window size or the number of attributes per vertex. Commands that do not modify the GPU state include drawing a primitive, stall, wait, etc.
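The distinction between the two command types can be illustrated with a minimal Python sketch. The `Kind` and `GpuCommand` names and the payload layout are hypothetical and chosen only for illustration; they do not reflect the actual i.MX 6 GPU command encoding.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    # Hypothetical command kinds, named after the examples in the text.
    LOAD_STATE = auto()  # modifies GPU state (shader code, primitive type, window size, ...)
    DRAW = auto()        # draws a primitive using the current state
    STALL = auto()
    WAIT = auto()


@dataclass(frozen=True)
class GpuCommand:
    kind: Kind
    payload: tuple = ()

    @property
    def modifies_state(self) -> bool:
        # In this simplified model only LoadState commands change the GPU state.
        return self.kind is Kind.LOAD_STATE


def split_commands(buffer):
    """Partition a command buffer into state-changing and other commands."""
    state, other = [], []
    for cmd in buffer:
        (state if cmd.modifies_state else other).append(cmd)
    return state, other
```

A scheduler that interleaves command buffers from several partitions would presumably need this distinction, since state set by one partition must not leak into the drawing commands of another.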
The synchronization between hardware and software is in large part hidden from the user. The user-side partition contains command buffers where the application software inserts commands for the GPU. On the kernel side, command queues are implemented, through which the command buffers from the application layer are managed. Communication between the user space driver and the kernel space driver is handled through ioctl commands.
The kernel layer controls the hardware and manages system and video memory, command queues, and events. Only one thread at a time can commit command buffers and have access to the video memory.
The hardware is synchronized with the software through event scheduling. The software can sometimes get out of sync with the hardware, at which point the software can decide whether or not to discard a command buffer or a GPU surface operation to bring things back into sync. By means of an event command, the hardware can be programmed to generate an interrupt upon its execution. This event command can be appended to the command queues, and so a synchronization mechanism can be put in place.
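The event mechanism can be sketched as a toy simulation in Python. The `CommandQueue` class and its method names are assumptions made for illustration, not the real kernel driver interface; the "interrupt" is modeled as a plain method call.

```python
from collections import deque


class CommandQueue:
    """Toy kernel-side queue: command buffers followed by optional event commands."""

    def __init__(self):
        self._queue = deque()
        self.completed = []  # event ids signalled by the simulated interrupt

    def commit(self, command_buffer, event_id=None):
        self._queue.append(("buffer", command_buffer))
        if event_id is not None:
            # The event command is appended right after the buffer it tracks.
            self._queue.append(("event", event_id))

    def _interrupt(self, event_id):
        # Stand-in for the interrupt handler: record that the event fired.
        self.completed.append(event_id)

    def run_hardware(self):
        """Simulate the GPU draining the queue strictly in order."""
        while self._queue:
            kind, item = self._queue.popleft()
            if kind == "event":
                self._interrupt(item)
            # Buffers are simply consumed in this sketch; no rendering is modeled.
```

Because the queue is drained in order, seeing event `n` in `completed` implies that every buffer committed before it has already been executed, which is what lets the software track the hardware's progress.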
Two Approaches to GPU sharing
Using virtualization, there are two ways for the various automotive subsystems to share the resources of the common GPU hardware: GPU sharing at the OpenGL level, and GPU sharing at the kernel driver level.
The first method, as shown in Figure 5, offers a number of benefits, mostly related to development cost, because it is a generic approach that can be ported to other platforms without much overhead.
This approach takes advantage of the client-server architecture of OpenGL and manages graphics sharing by forwarding OpenGL and EGL commands. However, a window manager proxy is also necessary so that important configurations are sent to the driver partition (Figure 5).
In the various automotive subsystems it is important, for driver and passenger safety, that graphical indicators are always placed in the same screen area. To do this, the EGL and OpenGL commands and window configurations are received by a graphics server that checks their provenance and forwards the configuration and the commands to the appropriate modules. In the i.MX 6, the IPU (image processing unit) is the hardware block that manages the displays.
In these cases, the window manager functionality has to be extended with rules that are typical in such graphics-based subsystems, such as guaranteeing that a window is visible and rejecting windows that come from applications that are not trusted.
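The two rules named above can be sketched as a simple admission check. This is a minimal illustration under stated assumptions: the `Window` record, the `admit` function, and the idea of a static set of trusted owners are all hypothetical and not part of any real window manager API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    owner: str          # name of the requesting application (hypothetical field)
    x: int
    y: int
    w: int
    h: int
    critical: bool = False  # e.g. an instrument cluster indicator


def overlaps(a: Window, b: Window) -> bool:
    """Axis-aligned rectangle intersection test."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def admit(request: Window, existing: list, trusted: set) -> bool:
    """Return True if the requested window may be created."""
    if request.owner not in trusted:
        return False  # rule 1: reject windows from untrusted applications
    if not request.critical:
        # rule 2: a non-critical window must not cover a critical one,
        # so the critical window stays visible in its fixed screen area
        if any(w.critical and overlaps(request, w) for w in existing):
            return False
    return True
```

In a real system the check would sit in the graphics server in front of the driver partition, and the trust decision would come from the hypervisor's knowledge of which partition sent the request rather than from a name supplied by the client.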
This method of virtualization permits good isolation of OEM software from the middleware and third-party software, which reduces malware and security problems and facilitates an eventual certification.
But the portability and the isolation come with a price tag: additional software and processor time. For example, because the proxy is done at such a high level in the processing stack, memory copy operations on vertex buffers and textures cannot easily be avoided. To avoid such overhead, a form of shared memory mechanism has to be put in place.
In addition, secure inter-partition communication has to be implemented at the hypervisor level so that isolation and security requirements can be guaranteed.
In contrast to the user-level proxy, the second method (Figure 6) bases the virtual partitions on forwarding the ioctl commands rather than the OpenGL commands themselves.
This approach has the benefit of being closer to the hardware level and so can take advantage of the internal mechanisms of the associated driver software. Fewer ioctl commands are needed when compared to the EGL and OpenGL commands.
With appropriate shared memory usage one can allocate the command buffers, VBOs, and textures so that they are visible in the driver partition and thus avoid unnecessary memory copy operations. But because it is implemented in the middleware, the approach lacks the portability advantage of the first method.
Intelligent GPU scheduling can be employed in both of the virtualization approaches just described. But it cannot easily be localized to the degree necessary, because this particular functional module needs information from both user space and kernel space. Since the GPU is not preemptive, the intelligence in GPU scheduling comes instead from predicting the run-time duration of certain commands or programs and distributing the load so that the hardware can process the critical cases while not completely ignoring the others.
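One way to picture such load distribution is a per-frame admission plan driven by predicted durations. The sketch below is only an illustration under stated assumptions: the `Job` record, the fixed frame budget, and the first-fit policy are hypothetical choices and are not taken from the i.MX 6 driver.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    predicted_ms: float  # output of some run-time prediction model (assumed given)
    critical: bool = False


def plan_frame(pending, frame_budget_ms):
    """Pick the jobs to submit in one frame period.

    Critical jobs are always admitted and submitted first, since a
    non-preemptive GPU cannot be interrupted once a job has started.
    Non-critical jobs are admitted in arrival order while their predicted
    duration still fits in the remaining slack.
    Returns (submitted, deferred).
    """
    critical = [j for j in pending if j.critical]
    slack = frame_budget_ms - sum(j.predicted_ms for j in critical)
    submitted, deferred = list(critical), []
    for job in pending:
        if job.critical:
            continue
        if job.predicted_ms <= slack:
            submitted.append(job)
            slack -= job.predicted_ms
        else:
            deferred.append(job)  # retried next frame, ahead of newer arrivals
    return submitted, deferred
```

Deferred jobs keep their place at the head of the pending list, so non-critical work is delayed rather than dropped. A job whose predicted duration exceeds the whole slack would never be admitted by this policy and would need to be split or handled separately.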
One approach to such scheduling, at least in the case of the i.MX 6, is to make use of its GPU driver to reset the hardware and so prevent malware from taking over completely. While this mechanism is certainly useful, it does not solve the GPU scheduling problem in a complete and satisfactory way.
A better approach is to use a predictive scheduling scheme that has the following mechanisms:
- A model of predicting the run time duration of GPU shaders
- A method of adjusting the model prediction based on measurements of the actual duration
In this case, predicting the run-time duration is usually done by employing a simplified GPU model. This allows the developer to take advantage of the fact that it is specific to the hardware being modeled. Such a model has to take into account the following factors:
- Inputs to the vertex shader
- Vertex shader code
- Input to the fragment shader
- Fragment shader code
While the inputs to the vertex shader are known before the commands are sent to the GPU, the fragments produced by rasterization are not known in advance, but they can be estimated with minimal CPU overhead.
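As a rough illustration of a cheap fragment estimate, the sketch below uses the screen-space area of a triangle, capped by its viewport-clipped bounding box. This is an assumed heuristic, not the method used by the author; it ignores depth testing, overdraw, and exact clipping, and the function name is hypothetical.

```python
def estimate_fragments(tri, width, height):
    """Estimate the fragment count for one screen-space triangle.

    tri is a sequence of three (x, y) pixel coordinates; width and height
    describe the viewport. Returns an approximate pixel count as a float.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    # Triangle area from the 2D cross product of two edge vectors.
    area = abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2.0
    # The bounding box clipped to the viewport bounds the visible area from above.
    bx = max(0.0, min(max(x0, x1, x2), width) - max(min(x0, x1, x2), 0.0))
    by = max(0.0, min(max(y0, y1, y2), height) - max(min(y0, y1, y2), 0.0))
    return min(area, bx * by)
```

The cost is a handful of arithmetic operations per triangle, so for large meshes the estimate could also be computed on a sample of triangles and scaled up.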
Estimating the number of cycles needed to execute the vertex and the fragment shader is usually based on a hardware model that has to be simplified so that it can be used at run time.
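The two mechanisms of the predictive scheme, a run-time model and a measurement-based adjustment, can be combined in a short sketch. Everything here is an assumption made for illustration: the linear cycle model, the per-invocation cycle counts, the `clock_hz` parameter, and the exponential smoothing factor `alpha` are hypothetical, and the model ignores memory latency and the parallelism of real shader cores.

```python
class ShaderCostModel:
    """Simplified run-time predictor with a measured correction factor."""

    def __init__(self, clock_hz, alpha=0.25):
        self.clock_hz = clock_hz  # assumed effective shader clock
        self.alpha = alpha        # smoothing weight for new measurements
        self.correction = 1.0     # multiplicative correction, learned online

    def raw_s(self, n_vertices, vs_cycles, n_fragments, fs_cycles):
        """Uncorrected estimate in seconds from a linear cycle count."""
        cycles = n_vertices * vs_cycles + n_fragments * fs_cycles
        return cycles / self.clock_hz

    def predict_s(self, n_vertices, vs_cycles, n_fragments, fs_cycles):
        """Corrected prediction in seconds."""
        return self.correction * self.raw_s(
            n_vertices, vs_cycles, n_fragments, fs_cycles)

    def update(self, raw_s, measured_s):
        """Move the correction toward the observed measured/raw ratio."""
        if raw_s > 0:
            target = measured_s / raw_s
            self.correction += self.alpha * (target - self.correction)
```

For example, with a raw estimate of 20 ms and a measured duration of 40 ms, a model using `alpha=0.5` raises its correction factor from 1.0 to 1.5, so the next prediction for the same workload becomes 30 ms. The measured duration could come from timestamps taken around the event interrupts of the kernel driver.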
Robert Krutsch is a System Solution Engineer at Freescale Semiconductor, blogger, and signal processing book author. He received BSc and MSc degrees, specializing in automatics and computer science, from the Polytechnic University of Timisoara, and BSc and MSc degrees from the Automation Institute at the University of Bremen.