Ray tracing: the future is now
Editor’s Note: In this Product How-To design article, Imagination’s Peter McGuinness explains the computational and cost difficulties of implementing ray tracing algorithms in price-sensitive consumer apps, and how Imagination’s PowerVR architecture offers an interactive ray tracing approach that solves these problems.
The concept of ray tracing has been bandied about for more than a generation. Since 1968, it has been viewed as a promising technology, especially for visual arts disciplines, and in particular, the entertainment industry. The ray tracing algorithm, as developed into its recursive form, can closely model the behavior of light in the real world so that shadows, reflections, and indirect illumination are available to the artist without any special effort. These effects, which we take for granted in life, are available only with a great deal of effort in a scanline rendering environment, so a practical, economical way of generating them implicitly in real time is highly desirable.
Unfortunately, the computational and system cost of implementations to date has meant that ray tracing has been mostly limited to offline rendering or to very high-cost, high-power systems that can run in real time but lack interactivity. In fact, the wait for practical ray tracing has been so long that at last year’s influential Siggraph conference in Anaheim, the headline session on ray tracing was wryly entitled: “Ray tracing is the future and ever will be”.
Today, however, there’s a novel, comprehensive approach to real-time, interactive ray tracing that addresses these issues and proposes a scalable, cost-efficient solution appropriate for a range of application segments including games consoles and mobile consumer devices.
A brief ray tracing primer
The ray tracing process is simple: once the model is transformed into world space (the 3D coordinate system used for animation and manipulation of models) and a viewport has been defined, a ray is traced from the camera through each pixel of the viewport and intersected with the closest object. This primary ray determines the visibility of objects from the perspective of the camera. Assuming an intersection is found, new rays are generated: a reflection ray, a refraction ray if appropriate, and one illumination ray for each light source.
These secondary rays are then traced, intersected, and new rays generated until a light source is intersected or some other limit is reached. At each bounce, a color contribution to the surface is computed and added into an accumulation buffer; when all rays are resolved the buffer contains the final image.
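The recursion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PowerVR implementation: the scene is a list of spheres, there is a single point light, color is one grayscale value, refraction is omitted, and every name (trace, render, the "diffuse" and "mirror" keys) is hypothetical.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def normalize(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

def hit_sphere(origin, direction, center, radius):
    """Distance to the near surface along a unit-length ray, or None.
    Hits from inside the sphere are ignored in this sketch."""
    oc = sub(origin, center)
    b = dot(oc, direction)
    disc = b * b - (dot(oc, oc) - radius * radius)
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 1e-4 else None

def closest_hit(origin, direction, spheres):
    """Return (distance, sphere) for the nearest intersection, or None."""
    best = None
    for s in spheres:
        t = hit_sphere(origin, direction, s["center"], s["radius"])
        if t is not None and (best is None or t < best[0]):
            best = (t, s)
    return best

def trace(origin, direction, spheres, light_pos, depth=0, max_depth=3):
    hit = closest_hit(origin, direction, spheres)
    if hit is None:
        return 0.0                      # ray left the scene: background
    t, s = hit
    point = add(origin, scale(direction, t))
    normal = normalize(sub(point, s["center"]))
    # Illumination ray: is the light visible from the hit point?
    to_light = sub(light_pos, point)
    light_dist = math.sqrt(dot(to_light, to_light))
    light_dir = scale(to_light, 1.0 / light_dist)
    blocker = closest_hit(point, light_dir, spheres)
    lit = blocker is None or blocker[0] > light_dist
    color = s["diffuse"] * max(0.0, dot(normal, light_dir)) if lit else 0.0
    # Reflection ray: recurse until the depth limit is reached.
    if s["mirror"] > 0.0 and depth < max_depth:
        r = sub(direction, scale(normal, 2.0 * dot(direction, normal)))
        color += s["mirror"] * trace(point, r, spheres, light_pos,
                                     depth + 1, max_depth)
    return color

def render(width, height, spheres, light_pos):
    """One primary ray per pixel from a camera at the origin looking down -z."""
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            u = (x + 0.5) / width * 2.0 - 1.0
            v = 1.0 - (y + 0.5) / height * 2.0
            row.append(trace((0.0, 0.0, 0.0), normalize((u, v, -1.0)),
                             spheres, light_pos))
        image.append(row)
    return image
```

Note that the shadow falls out of the illumination-ray test with no extra pass over the geometry, which is the point the primer makes about lighting effects occurring automatically.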
Once the model is created, the material properties defined, and the lights placed, all of the lighting effects occur automatically during the rendering process.
This is in contrast to the process of creating realistic per-pixel lighting effects in a scanline renderer: in this case, since the world space model is discarded before fragment rendering begins, all the lighting effects must be captured during the content creation process and ‘baked in’ to special purpose textures known as light maps, which then can be used to layer the pre-computed lighting values onto the surface of the objects.
While it is possible in principle to generate light maps dynamically as an inline pre-processing step, and implementations exist that do that, real-time applications generally adopt the pre-baking approach in order to reserve available horsepower for other rendering effects. For example, shadows, which must be computed dynamically in an interactive system, are handled by techniques that involve multiple passes over the scene geometry at run time in order to generate shadow maps, which, similar to the light maps, are applied during the final rendering pass.
Aside from the computational expense of this approach, it has numerous disadvantages due to the need to predict which maps will be needed and carry them as game assets, constraining interactivity and increasing the size of the data set needed to run the application. In addition to these problems, the maps are generated with a limited number of fixed resolutions, which poses the usual resolution problems associated with image-based techniques.
The general acceptance of ray tracing as a desirable technique is illustrated by its use alongside a number of ray tracing-like techniques in the offline light-map baking process used in today’s content-generation middleware packages. These techniques are used to generate the input images for light-map baking and point the way to an incremental method of introducing ray tracing into existing real-time rendering systems, such as OpenGL and DirectX, without abandoning the overall structure of those systems. This is obviously desirable since the use of existing tools and runtimes, as well as all of the sophisticated techniques already known to developers, can then be preserved and enhanced, providing a low-impact migration path to the newer techniques.
This is accomplished by extending the shading language used by the runtime engine so that any shader program can cast rays during its execution. The rendering pipeline is not changed: it is still an immediate, incremental, scanline method of rendering. Existing methods of primary visibility determination are retained, but the rendering system is enhanced with a retained data structure and a method to resolve ray queries into it.
Once this capability is available, and with sufficient runtime performance, the pre-baking method can be abandoned, with the result that the workflow of application development is simplified while the visual quality and dynamism of the end result are significantly enhanced. The problems to date have been that adequate runtime performance has been lacking or too expensive and interactivity has been very limited; in particular, attempts to map ray tracing onto existing GPU hardware have run into serious problems of efficiency.
The main difficulty comes from the fact that current real-time rendering systems exploit screen space data locality to achieve performance through parallelization. The data associated with any task being worked on is coherent both in the sense that adjacent pixels in a triangle reside close together in memory and also in that they tend to share material properties so that access to textures, shader programs, etc. is efficient and amenable to parallelization. This is not the case with a ray-traced system, where visually-adjacent objects can be widely separated in the world-space coordinate system and where rays typically become widely divergent with successive bounces. So while the GPU-centered method works, it is inefficient to the point of impracticality.
There are two further problems related to the retained data structure: building and traversal. Interactive animation requires that the acceleration structure (typically a voxelized tree) be updated in real time, and current methods of updating are both slow and unpredictably expensive. Likewise, a lightweight method of traversal is needed to minimize data fetches from the acceleration structure. The following sections propose solutions for all three of these problems (coherence, building, and traversal) and describe the Imagination Technologies IP core implementation based on them.
The PowerVR ray tracing solution
Since the objective is to add ray tracing capability to a standard GPU, it is useful to discuss the basic organization of that unit. Variations exist, especially in low-level details, but in general GPU shader units are organized into SIMD (Single Instruction, Multiple Data) arrays of ALUs (Arithmetic Logic Units) into which tasks (groups of operations) are dispatched by schedulers for execution.
The operations, called instances in the PowerVR architecture, are chosen to maximize the extent to which they share coherency characteristics such as spatial locality, material properties, etc. in order to maximize the efficiency of fetching their attribute data from memory as well as the parallelism of their execution across wide SIMD arrays.
In the PowerVR architecture (Figure 3), the arrays are grouped into Unified Shading Clusters (USCs). The process of scanline rendering naturally results in a high degree of this type of coherence, so the arrays can be kept busy, with ALU latencies and the inevitable memory access latencies masked to a certain degree by task switching.
There are a number of data masters feeding into the schedulers to handle vertex-, pixel-, and compute-related tasks. Once the shading operation is done, the result is output into a data sink for further processing, depending on what part of the rendering pipeline is being handled.
The ray tracing unit (RTU) can be added to this list as both a data sink and a data master so that it can both receive (sink) new ray queries from the shaders and dispatch (master) ray/triangle intersection results back for shading. It contains registers for a large number of complete ray queries (with user data) attached to a SIMD array of fixed-function "Axis Aligned Bounding Box vs. Ray" testers and "Triangle vs. Ray" testers.
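To make the two fixed-function tests concrete, here is a software sketch of what an "AABB vs. Ray" tester and a "Triangle vs. Ray" tester compute. The article does not say which algorithms the RTU hardware uses; the slab method and the Möller-Trumbore test shown here are common textbook choices substituted for illustration, and the function names are hypothetical.

```python
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_aabb(origin, direction, box_min, box_max):
    """Slab test: True if the ray (t >= 0) overlaps the axis-aligned box."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of slabs: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: distance t along the ray to the hit, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None                 # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv_det         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det
    return t if t > eps else None
```

Both tests are short, branch-light arithmetic kernels with no data-dependent memory access, which is one plausible reason they lend themselves to a fixed-function SIMD array.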
Importantly, there is a coherence gathering unit which assembles memory access requests into one of two types of coherency queues: intersection queues and shading queues, then schedules them for processing. Intersection queues are scheduled on to the SIMD AABB or triangle testers; shading queues are mastered out to the USCs.
Intersection queues are created and destroyed on the fly and represent a list of sibling Bounding Volume Hierarchy (BVH) nodes or triangles to be streamed in from off-chip memory. Initially, the queues are typically full because the root BVH nodes span a large volume of the scene and therefore most rays hit them. When a full queue of rays is to be tested against the root of the hierarchy, the root nodes are read from memory and the hardware can intersect rays against nodes and/or triangles as appropriate.
For each node that hits, a new intersection queue is dynamically created and rays that hit that node are placed into the new child queue. If the child queue is completely full (which is common at the top of the BVH), it is pushed onto a ready stack and processed immediately.
If the queue is not full (which occurs a little deeper in the tree, especially with scattered input rays from the USC), it is retained in a queue cache until more hits occur against that same BVH node at a later time. In this mode, each queue effectively represents a DRAM address from which to start reading at some future time. This has the effect of gathering rays into coherent regions of 3D space, and it dynamically devotes queue storage to the areas of the scene where coherence is hardest to collect.
This process continues in a streaming fashion until the ray traverses to the triangle leaf nodes; when a ray is no longer a member of any intersection queue, the closest triangle has been found.
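The queueing scheme can be mimicked in software to show why it saves memory traffic. The sketch below is a toy model with loudly stated assumptions: the "scene" is one-dimensional, a "ray" is just a position x, a node is hit when x falls inside its interval, and each leaf stands in for one triangle. The queue capacity, the node names, and the policy of flushing the fullest partial queue when no full queue is ready are all hypothetical; the article does not describe how the hardware drains partial queues.

```python
from collections import defaultdict

# Toy BVH: each node covers [lo, hi); nodes with no children are leaves.
BVH = {
    "root":  {"lo": 0, "hi": 8, "children": ["left", "right"]},
    "left":  {"lo": 0, "hi": 4, "children": ["t0", "t1"]},
    "right": {"lo": 4, "hi": 8, "children": ["t2", "t3"]},
    "t0": {"lo": 0, "hi": 2, "children": []},
    "t1": {"lo": 2, "hi": 4, "children": []},
    "t2": {"lo": 4, "hi": 6, "children": []},
    "t3": {"lo": 6, "hi": 8, "children": []},
}

def gather(rays, capacity=4):
    """Traverse (ray_id, x) pairs through BVH using per-node queues.

    Returns (results, log): results maps ray_id to the leaf it reached, in
    completion order; log lists one (node, batch_size) entry per node fetch.
    """
    pending = defaultdict(list)   # queue cache: node -> rays waiting on it
    ready = []                    # stack of (node, batch) ready to process
    results, log = {}, []

    def enqueue(node, ray):
        pending[node].append(ray)
        if len(pending[node]) == capacity:       # full queue: schedule it
            ready.append((node, pending.pop(node)))

    for ray in rays:
        enqueue("root", ray)

    while ready or pending:
        if not ready:
            # Assumed policy: nothing is full, so flush the fullest queue.
            node = max(pending, key=lambda n: len(pending[n]))
            ready.append((node, pending.pop(node)))
        node, batch = ready.pop()
        log.append((node, len(batch)))           # node data fetched once
        children = BVH[node]["children"]
        for ray_id, x in batch:
            if not children:
                results[ray_id] = node           # leaf reached: hit found
                continue
            for child in children:
                if BVH[child]["lo"] <= x < BVH[child]["hi"]:
                    enqueue(child, (ray_id, x))
    return results, log
```

With eight scattered rays and a capacity of four, this model fetches node data eight times in total, whereas traversing each ray on its own would touch three nodes per ray, or 24 fetches. The rays also complete in a different order from the one in which they were submitted.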
At this point, a new shading queue is created, but this time it is coherence gathering on the shading state that is associated with that triangle. Once a shading queue is full, this becomes a task which is then scheduled for shader execution. Uniforms and texturing state are loaded into the common store and parallel execution of the shading task begins: each ray hit result represents a shading instance within that task.
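The same gathering idea applied to shading can be sketched as follows. This is an illustrative model only: the material table, the task size, and the end-of-frame flush of partially filled queues are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

def gather_shading_tasks(hits, triangle_material, task_size):
    """Group (ray_id, triangle) hits into tasks that share shading state.

    Returns a list of (material, [ray_id, ...]) tasks in dispatch order.
    """
    queues = defaultdict(list)    # one shading queue per material
    tasks = []
    for ray_id, triangle in hits:
        material = triangle_material[triangle]
        queues[material].append(ray_id)
        if len(queues[material]) == task_size:
            # Full queue: it becomes a task; each ray is one shading instance.
            tasks.append((material, queues.pop(material)))
    # Assumed policy: dispatch whatever is left when the frame ends.
    for material, ray_ids in queues.items():
        tasks.append((material, ray_ids))
    return tasks
```

Because every instance in a task uses the same material, the uniforms and texturing state need to be loaded only once per task rather than once per ray.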
The behavior is then identical to that of a rasterization fragment shader with the added feature that shaders can create new rays using a new instruction added to the PowerVR shader instruction set, and send them as new ray queries to the RTU.
Because of the coherence gathering, the RTU returns ray/triangle intersection results to the shaders in a different order from that in which the queries entered. A ray that enters the RTU early in the rendering of a frame may be the last to leave, depending on coherence conditions.
This approach to dynamic coherence gathering has the effect of parallelizing on rays instead of pixels, which means that even rays that originate from the ray trees of entirely different pixels can be collected together to exploit all of the coherence that exists in the scene. This then decouples the pipelines, creating a highly latency-tolerant system and enabling an extensive set of reordering possibilities.