Editor’s Note: In this Product How-To design article, Imagination’s Peter McGuinness explains the computational and cost difficulties of implementing ray tracing algorithms in price-sensitive consumer apps, and how Imagination’s PowerVR architecture offers an interactive ray tracing approach that solves these problems.
The concept of ray tracing has been bandied about for more than a generation. Since 1968, it has been viewed as a promising technology, especially for visual arts disciplines, and in particular, the entertainment industry. The ray tracing algorithm, as developed into its recursive form, can closely model the behavior of light in the real world so that shadows, reflections, and indirect illumination are available to the artist without any special effort. These effects, which we take for granted in life, are available only with a great deal of effort in a scanline rendering environment, so a practical, economical way of generating them implicitly in real time is highly desirable.
Unfortunately, the computational and system cost of every implementation to date has mostly limited ray tracing to offline rendering or to very high-cost, high-power systems that can run in real time but lack interactivity. In fact, the wait for practical ray tracing has been so long that at last year’s influential Siggraph conference in Anaheim, the headline session on ray tracing was wryly entitled: “Ray tracing is the future and ever will be”.
Today, however, there’s a novel, comprehensive approach to real-time, interactive ray tracing that addresses these issues and proposes a scalable, cost-efficient solution appropriate for a range of application segments, including game consoles and mobile consumer devices.
A brief ray tracing primer
The ray tracing process is simple: once the model is transformed into world space (the 3D coordinate system used for animation and manipulation of models) and a viewport has been defined, a ray is traced from the camera through each pixel of the viewport and intersected with the closest object. This primary ray determines the visibility of objects from the perspective of the camera. Assuming an intersection is found, new rays are generated: a reflection ray, a refraction ray if appropriate, and one illumination ray for each light source.
These secondary rays are then traced, intersected, and new rays generated until a light source is intersected or some other limit is reached. At each bounce, a color contribution to the surface is computed and added into an accumulation buffer; when all rays are resolved the buffer contains the final image.
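The loop described above can be sketched in a few lines. This is a minimal illustration only, not any production renderer: the two-sphere scene, the single point light, the grayscale "color", and every name here (`SPHERES`, `LIGHT`, `trace`, and so on) are assumptions made for the example, and refraction is left out for brevity.

```python
import math

def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def add(a, b): return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
def scale(a, s): return (a[0] * s, a[1] * s, a[2] * s)
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def normalize(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

# Hypothetical scene: (center, radius, diffuse albedo, reflectivity).
SPHERES = [((0.0, 0.0, -5.0), 1.0, 0.8, 0.3),      # small object
           ((0.0, -101.0, -5.0), 100.0, 0.5, 0.0)]  # large "floor" sphere
LIGHT = (5.0, 5.0, 0.0)  # single point light (assumed)
EPS = 1e-4               # offset that keeps secondary rays off their own surface

def hit_sphere(origin, direction, sphere):
    """Distance to the nearest hit in front of the ray, or None.
    Assumes a unit-length direction and an origin outside the sphere."""
    oc = sub(origin, sphere[0])
    b = dot(oc, direction)
    disc = b * b - (dot(oc, oc) - sphere[1] ** 2)
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > EPS else None

def closest_hit(origin, direction):
    best = None
    for sphere in SPHERES:
        t = hit_sphere(origin, direction, sphere)
        if t is not None and (best is None or t < best[0]):
            best = (t, sphere)
    return best

def trace(origin, direction, depth=0, max_depth=3):
    """Return the grayscale contribution gathered along one ray."""
    hit = closest_hit(origin, direction)
    if hit is None or depth > max_depth:
        return 0.0
    t, (center, radius, albedo, reflectivity) = hit
    point = add(origin, scale(direction, t))
    normal = normalize(sub(point, center))
    start = add(point, scale(normal, EPS))
    # Illumination ray: is the light visible from the hit point?
    to_light = sub(LIGHT, start)
    dist = math.sqrt(dot(to_light, to_light))
    light_dir = scale(to_light, 1.0 / dist)
    blocker = closest_hit(start, light_dir)
    lit = blocker is None or blocker[0] > dist
    color = albedo * max(0.0, dot(normal, light_dir)) if lit else 0.0
    # Reflection ray: recurse and add its weighted contribution.
    if reflectivity > 0.0:
        refl = sub(direction, scale(normal, 2.0 * dot(direction, normal)))
        color += reflectivity * trace(start, refl, depth + 1, max_depth)
    return color
```

Calling `trace((0.0, 0.0, 0.0), (0.0, 0.0, -1.0))` follows one primary ray into the small sphere. A full renderer would call it once per pixel and write each result into the accumulation buffer.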
Once the model is created, the material properties defined, and the lights placed, all of the lighting effects occur automatically during the rendering process.
This is in contrast to the process of creating realistic per-pixel lighting effects in a scanline renderer. In this case, since the world space model is discarded before fragment rendering begins, all the lighting effects must be captured during the content creation process and ‘baked in’ to special purpose textures known as light maps, which then can be used to layer the pre-computed lighting values onto the surface of the objects.
While it is possible in principle to generate light maps dynamically as an inline pre-processing step, and implementations exist that do that, real-time applications generally adopt the pre-baking approach in order to reserve available horsepower for other rendering effects. For example, shadows, which must be computed dynamically in an interactive system, are handled by techniques that involve multiple passes over the scene geometry at run time in order to generate shadow maps, which, similar to the light maps, are applied during the final rendering pass.
Aside from the computational expense of this approach, it has numerous disadvantages due to the need to predict which maps will be needed and carry them as game assets, constraining interactivity and increasing the size of the data set needed to run the application. In addition to these problems, the maps are generated with a limited number of fixed resolutions, which poses the usual resolution problems associated with image-based techniques.
The general acceptance of ray tracing as a desirable technique is illustrated by its use alongside a number of ray tracing-like techniques in the offline light-map baking process used in today’s content-generation middleware packages. These are used to generate the input images for lightmap baking and point the way to an incremental method of introducing ray tracing into existing real-time rendering systems, such as OpenGL and DirectX, without abandoning the overall structure of those systems. This is obviously desirable since the use of existing tools and runtimes, as well as all of the sophisticated techniques already known to developers, can then be preserved and enhanced, providing a low-impact migration path to the newer techniques.
This is accomplished by adding to the shading language used by the runtime engine the capability for any shader program to cast rays during its execution. The rendering pipeline is not changed: it is still an immediate, incremental, scanline method of rendering. Existing methods of primary visibility determination are retained, but the rendering system is enhanced with a retained data structure and a method to resolve ray queries into it.
Once this capability is available, and with sufficient runtime performance, the pre-baking method can be abandoned, with the result that the workflow of application development is simplified while the visual quality and dynamism of the end result are significantly enhanced. The problems to date have been that adequate runtime performance has been lacking or too expensive and interactivity has been very limited; in particular, attempts to map ray tracing onto existing GPU hardware have run into serious problems of efficiency.
The main difficulty comes from the fact that current real-time rendering systems exploit screen space data locality to achieve performance through parallelization. The data associated with any task being worked on is coherent both in the sense that adjacent pixels in a triangle reside close together in memory and also in that they tend to share material properties so that access to textures, shader programs, etc. is efficient and amenable to parallelization. This is not the case with a ray-traced system, where visually-adjacent objects can be widely separated in the world-space coordinate system and where rays typically become widely divergent with successive bounces. So while the GPU-centered method works, it is inefficient to the point of impracticality.
There are two further problems related to the retained data structure: building and traversal. Interactive animation requires that the acceleration structure (typically a voxelized tree) be updated in real time, and current methods of updating are both slow and unpredictably expensive. Likewise, a lightweight method of traversal is needed to minimize data fetches from the acceleration structure. The following sections propose solutions for all three of these problems (coherence, building, and traversal) and describe the Imagination Technologies IP core implementation based on them.
The PowerVR ray tracing solution
Since the objective is to add ray tracing capability to a standard GPU, it is useful to discuss the basic organization of that unit. Variations exist, especially in low-level details, but in general GPU shader units are organized into SIMD (single instruction, multiple data) arrays of arithmetic logic units (ALUs) into which tasks (groups of operations) are dispatched by schedulers for execution.
The operations, called instances in the PowerVR architecture, are chosen to maximize the extent to which they share coherency characteristics such as spatial locality, material properties, etc. in order to maximize the efficiency of fetching their attribute data from memory as well as the parallelism of their execution across wide SIMD arrays.
In the PowerVR architecture (Figure 3), the arrays are grouped into Unified Shading Clusters (USCs). The process of scanline rendering naturally results in a high degree of this type of coherence, so the arrays can be kept busy, with ALU latencies and the inevitable memory access latencies masked to a certain degree by task switching.
There are a number of data masters feeding into the schedulers to handle vertex-, pixel-, and compute-related tasks. Once the shading operation is done, the result is output into a data sink for further processing, depending on what part of the rendering pipeline is being handled.
The ray tracing unit (RTU) can be added to this list as both a data sink and a data master so that it can both receive (sink) new ray queries from the shaders and dispatch (master) ray/triangle intersection results back for shading. It contains registers for a large number of complete ray queries (with user data) attached to a SIMD array of fixed-function “Axis Aligned Bounding Box vs. Ray” testers and “Triangle vs. Ray” testers.
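To make the two tester types concrete, here is a software sketch of each test. The article does not say which algorithms the fixed-function hardware uses. The slab test and the Möller-Trumbore test below are common textbook choices substituted for illustration, and all function names are assumptions.

```python
def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_hits_aabb(origin, direction, box_min, box_max, t_max=float("inf")):
    """Slab test: intersect the ray with each axis-aligned pair of planes
    and check that the three [near, far] intervals overlap."""
    t_near, t_far = 0.0, t_max
    for axis in range(3):
        o, d = origin[axis], direction[axis]
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < box_min[axis] or o > box_max[axis]:
                return False
            continue
        t0 = (box_min[axis] - o) / d
        t1 = (box_max[axis] - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def ray_hits_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: return the distance t to the hit, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:           # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv          # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None
```

The box test returns only a yes/no answer because a BVH node merely decides whether traversal continues. The triangle test returns a distance so that the closest hit can be kept.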
Importantly, there is a coherence gathering unit which assembles memory access requests into one of two types of coherency queues: intersection queues and shading queues, then schedules them for processing. Intersection queues are scheduled on to the SIMD AABB or triangle testers; shading queues are mastered out to the USCs.
Intersection queues are created and destroyed on the fly and represent a list of sibling Bounding Volume Hierarchy (BVH) nodes or triangles to be streamed in from off-chip memory. Initially, the queues are typically full because the root BVH nodes span a large volume of the scene, so most rays hit them. When a full queue of rays is to be tested against the root of the hierarchy, the root nodes are read from memory and the hardware can intersect rays against nodes and/or triangles as appropriate.
For each node that hits, a new intersection queue is dynamically created and rays that hit that node are placed into the new child queue. If the child queue is completely full (which is common at the top of the BVH), it is pushed onto a ready stack and processed immediately.
If the queue is not full (which occurs a little deeper in the tree, especially with scattered input rays from the USC), it is retained in a queue cache until more hits occur against that same BVH node at a later time. In this mode, the queues effectively represent an address in DRAM to start reading from in the future. This has the effect of gathering rays coherently into regions of 3D space, and it dynamically devotes queues to the areas of the scene where coherence is harder to collect.
This process continues in a streaming fashion until the ray traverses to the triangle leaf nodes; when a ray is no longer a member of any intersection queue, the closest triangle has been found.
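The queueing behavior can be mimicked with a toy software model. Everything here is a simplifying assumption made for illustration: the scene is one-dimensional, so a "ray" is just a position and a node's bounding box is an interval. The queue capacity, the node names, and the policy of flushing the fullest partial queue when nothing is ready are likewise invented, not taken from the hardware.

```python
from collections import defaultdict

QUEUE_SIZE = 4  # hypothetical queue capacity

# Toy BVH: each node covers [lo, hi); leaves hold one triangle id.
BVH = {
    "root": {"range": (0, 8), "children": ["L", "R"]},
    "L":    {"range": (0, 4), "children": ["LL", "LR"]},
    "R":    {"range": (4, 8), "children": ["RL", "RR"]},
    "LL":   {"range": (0, 2), "triangle": "t0"},
    "LR":   {"range": (2, 4), "triangle": "t1"},
    "RL":   {"range": (4, 6), "triangle": "t2"},
    "RR":   {"range": (6, 8), "triangle": "t3"},
}

def gather_and_traverse(rays):
    """rays maps ray id -> 1D position. Returns (hits, node_fetches)."""
    queue_cache = defaultdict(list)  # node id -> rays waiting on that node
    ready = []                       # stack of (node id, batch of ray ids)
    hits = {}
    fetches = 0

    def enqueue(node_id, ray_id):
        queue = queue_cache[node_id]
        queue.append(ray_id)
        if len(queue) == QUEUE_SIZE:
            # A full queue moves to the ready stack for processing.
            ready.append((node_id, queue_cache.pop(node_id)))

    for ray_id in rays:
        enqueue("root", ray_id)

    while ready or queue_cache:
        if not ready:
            # Nothing is full: flush the fullest partial queue (assumed policy).
            node_id = max(queue_cache, key=lambda n: len(queue_cache[n]))
            ready.append((node_id, queue_cache.pop(node_id)))
        node_id, batch = ready.pop()
        node = BVH[node_id]
        fetches += 1  # one read of this node's data serves the whole batch
        if "triangle" in node:
            for ray_id in batch:
                hits[ray_id] = node["triangle"]
            continue
        for child_id in node["children"]:
            lo, hi = BVH[child_id]["range"]
            for ray_id in batch:
                if lo <= rays[ray_id] < hi:
                    enqueue(child_id, ray_id)
    return hits, fetches
```

With eight rays spread evenly over the scene, `gather_and_traverse({i: i + 0.5 for i in range(8)})` resolves every ray with 8 node reads. Walking each ray down the three levels of this tree on its own would take 24.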
At this point, a new shading queue is created, but this time it is coherence gathering on the shading state that is associated with that triangle. Once a shading queue is full, this becomes a task which is then scheduled for shader execution. Uniforms and texturing state are loaded into the common store and parallel execution of the shading task begins: each ray hit result represents a shading instance within that task.
The behavior is then identical to that of a rasterization fragment shader with the added feature that shaders can create new rays using a new instruction added to the PowerVR shader instruction set, and send them as new ray queries to the RTU.
The RTU returns ray/triangle intersection results to the shaders in a different order than that in which they entered due to the coherence gathering. A ray that enters the RTU early in the rendering of a frame may be the last to leave depending on coherence conditions.
This approach to dynamic coherence gathering has the effect of parallelizing on rays instead of pixels, which means that even rays that originate from totally different ray trees from other pixels can be collected together to maximize all available coherence that exists in the scene. This then decouples the pipelines, creating a highly latency-tolerant system and enabling an extensive set of reordering possibilities.
Implementing a ray tracing shader
In order for this approach to be effective, a critical mass of in-flight rays needs to be maintained in fast on-chip SRAM. The design employs a non-blocking ray tracing model, which allows the amount of state carried with any ray to be carefully bounded.
For example, a typical depth-first ray tracing shader would cast a ray in order to determine the color of the object that the ray intersects. Since this can occur recursively, a large stack of shader states can build up for every ray, making it impractical to get enough rays in flight to coherence gather effectively while still fitting within precious on-chip memory.
In the non-blocking model, the shader creates the ray, writes all the information needed to resolve that ray later, emits the ray into the RTU, and completes without waiting for the result.
When the ray then executes a shader at some point in the future (non-deterministic due to the coherence gathering going on in the RTU), it knows the pixel it is contributing to and the contribution of the ray, thanks to the color information that was passed along. This can continue recursively down the ray tree, and shaders are free to cast multiple rays per shade. The key is that all the emitted rays within the originating pixel will accumulate back into the pixel with various contribution factors modulated by the various shaders that are hit as the rays bounce around.
The accumulations occur in an on-chip ‘Frame Buffer Accumulator’ (FBA) memory, which caches pixels on chip and acts as an atomic read-modify-write floating point adder for those pixels. A successful implementation of a production renderer using this non-blocking approach found that there are only a small number of behaviors that can't be easily mapped into this model.
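A small simulation shows why the non-blocking model tolerates reordering. It is a sketch under loud assumptions: intersection is skipped entirely (every ray is treated as a hit on the same hypothetical material), the random pick stands in for the RTU returning results out of order, and the names `Ray`, `shade`, and `render` are invented for the example.

```python
import random
from collections import namedtuple

# Each ray carries only what is needed to resolve it later:
# the pixel it contributes to and its accumulated weight.
Ray = namedtuple("Ray", "pixel weight depth")

MAX_DEPTH = 3
EMITTED = 0.5       # hypothetical light added at every hit
REFLECTIVITY = 0.5  # hypothetical fraction passed on to the next bounce

def shade(ray, framebuffer, emit):
    """Non-blocking shader: accumulate, emit, and finish without waiting."""
    framebuffer[ray.pixel] += ray.weight * EMITTED
    if ray.depth < MAX_DEPTH:
        emit(Ray(ray.pixel, ray.weight * REFLECTIVITY, ray.depth + 1))

def render(num_pixels, seed=0):
    rng = random.Random(seed)
    framebuffer = [0.0] * num_pixels  # stands in for the accumulator memory
    in_flight = [Ray(p, 1.0, 0) for p in range(num_pixels)]
    while in_flight:
        # Pick an arbitrary in-flight ray to mimic out-of-order return.
        i = rng.randrange(len(in_flight))
        in_flight[i], in_flight[-1] = in_flight[-1], in_flight[i]
        shade(in_flight.pop(), framebuffer, in_flight.append)
    return framebuffer
```

Because no shader ever waits on a result, there is no per-ray stack to keep alive, and the image is the same whichever order the rays come back in: `render(4, seed=1)` and `render(4, seed=2)` both give 0.9375 for every pixel.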
The final element is building the acceleration structure (Figure 4), which must be a fully dynamic system capable of augmenting a rasterizer. This is done in a ‘Scene Hierarchy Generator’ (SHG), which is placed directly after the vertex shader and implements an algorithm to build the BVH for the RTU in a streaming fashion. The SHG builds the AABB hierarchy in a bottom-up fashion and writes it directly into DRAM.
Internally, the SHG treats the entire world as a log2 sparse octree. It has a core concept of a spatial node, which is an integer address within the log2 octree, i.e. [xyz] and “Level” (which is the log2 size of the voxels at that level). These integral representations of 3D space are re-linked rapidly in a small on-chip SRAM and never leave the chip until they are properly arranged; they are then streamed out as AABB BVH nodes.
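One plausible reading of the spatial node address is sketched here. It assumes world units in which a level-0 voxel has edge length 1, and the function names are hypothetical. The actual bit layout used by the SHG is not described in the article.

```python
import math

def spatial_node(point, level):
    """Integer [x, y, z] address of the voxel containing `point`,
    where voxels at `level` have edge length 2**level (assumed units)."""
    size = 2.0 ** level
    return (tuple(math.floor(c / size) for c in point), level)

def parent(node):
    """Address of the enclosing voxel one level up: halve each coordinate."""
    (x, y, z), level = node
    return ((x >> 1, y >> 1, z >> 1), level + 1)
```

For example, `spatial_node((5.0, 2.5, 7.9), 1)` gives `((2, 1, 3), 1)`, and its parent `((1, 0, 1), 2)` matches the address obtained by voxelizing the same point directly at level 2. With this encoding, re-linking a node to its parent is a matter of integer shifts rather than floating point geometry.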
The SHG reads each input triangle from the vertex shader exactly one time and writes out AABBs into DRAM in a streaming manner, starting at the bottom of the tree and building up. The SHG algorithm makes two key assumptions that allow for this favorable behavior:
- The size of a triangle relative to the overall scene approximately represents the triangle density in that area.
- Triangles are usually members of a mesh and therefore have some inherent spatial locality in their submission order from the vertex shader.
Using the first assumption, the SHG compares the size of each incoming triangle against the overall scene size to determine at which log2 level to voxelize the triangle. Big triangles have lower log2 levels and are higher in the tree; small triangles go deeper. Furthermore, “long and skinny” triangles that are off-axis are pushed deeper so as to get more fine-grain voxelization and therefore more efficient bounding by the AABBs that will later result.
The voxelizers produce nodes for the triangles, which are fed through a 3D voxel cache. This cache determines the spatial grouping of triangles: when the cache has a hash collision or is flushed, nodes are generated. (This 3D caching scheme is built on the second assumption mentioned above and in practice works well.)
These nodes then in turn have a parent node address computed and are fed again through a voxel cache. The algorithm works its way up the tree, writing out grouped AABBs into DRAM ready for the RTU.
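The bottom-up flow can be sketched in software as follows. This is a heavily simplified stand-in rather than the SHG algorithm itself: an unbounded dictionary replaces the fixed-size voxel cache (so there are no hash collisions or flushes), the special handling of long, skinny triangles is omitted, coordinates are assumed non-negative, and triangles are assumed non-degenerate. In this sketch `level` is the log2 of the voxel edge length, so larger values sit nearer the root; the article's own level numbering may run the other way.

```python
import math
from collections import defaultdict

def union(a, b):
    """Smallest AABB (mins, maxs) enclosing boxes a and b."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))

def tri_aabb(tri):
    return (tuple(map(min, *tri)), tuple(map(max, *tri)))

def choose_level(box):
    """Pick the voxel level from the triangle's longest extent."""
    extent = max(hi - lo for lo, hi in zip(*box))
    return math.ceil(math.log2(extent))

def voxel_of(box, level):
    """Address of the voxel holding the box center (an assumed rule)."""
    size = 2.0 ** level
    return tuple(math.floor((lo + hi) / 2 / size) for lo, hi in zip(*box))

def merge(slot, addr, box):
    slot[addr] = union(slot[addr], box) if addr in slot else box

def build_bottom_up(triangles):
    """Return (level, address, aabb) nodes in emission order, root last.
    Assumes every coordinate is >= 0 so that addresses converge to zero."""
    pending = defaultdict(dict)  # level -> {voxel address: merged AABB}
    for tri in triangles:        # each triangle is read exactly once
        box = tri_aabb(tri)
        level = choose_level(box)
        merge(pending[level], voxel_of(box, level), box)
    emitted = []
    level = min(pending)
    while True:
        nodes = pending.pop(level, {})
        for addr, box in nodes.items():
            emitted.append((level, addr, box))
        if len(nodes) == 1 and not pending:
            return emitted       # a single node with nothing above: the root
        for addr, box in nodes.items():
            parent_addr = tuple(c >> 1 for c in addr)
            merge(pending[level + 1], parent_addr, box)
        level += 1
```

Two neighboring unit-sized triangles land in adjacent level-0 voxels, and their boxes merge into a single level-1 parent whose AABB encloses both. Nodes are appended in the order they would be streamed out, leaves first.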
How it plays out for developers and gamers
From a graphical developer’s perspective, the barrier to using ray tracing is lowest if it is possible to retain all of the currently used development flow, including tools and APIs. This means that, rather than switch wholesale to a new rendering scheme (such as the primary ray method of visibility determination described above), it is desirable to create a hybrid system where an incremental scanline algorithm is used for visibility determination but where ray tracing can be selectively added in order to implement specific effects.
A graphical API such as OpenGL does not have the primary ray concept used to determine visibility in a pure ray tracing renderer; this function is performed by the scanline rasterization algorithm, which takes place in the screen space coordinate system, whereas rays must be cast in world space. One way to resolve this issue is to make use of a multipass rendering technique known as deferred shading (Figure 5).
This technique is commonly used in game engine runtimes and consists of a first, geometry, pass which performs the visibility determination followed by a second, shading, pass which executes the shader programs attached to the visible geometry.
The objective of this technique is to reduce the overhead of shading objects which are invisible, but it can also be used to cast the starting rays in a hybrid system (Figure 6). Since the intermediate information stored after the first pass includes world space coordinates as well as things like surface normals, the shader program has everything needed to cast the first rays.
As shown below in Figure 7 and Figure 8, this method benefits from the fact that only visible pixels will cast rays, but it also means that the developer has the choice of casting rays only for selected objects and can therefore easily control how effects are used and where the ray budget is spent.
It can also be easily fitted into existing game engine runtimes so that workflow remains the same and investment in all existing game assets is preserved. This level of control means it is possible to progressively move assets from incremental techniques to ray tracing techniques, giving the developer the flexibility needed to successfully manage that transition.
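The second pass of such a hybrid can be sketched as a loop over the stored geometry data. The G-buffer layout (a list of per-pixel dictionaries with `position`, `normal`, and `material` keys), the material names, and the function `cast_first_rays` are all assumptions made for this example. A real engine would do this work in a shader.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def cast_first_rays(gbuffer, camera, light, ray_traced_materials):
    """Generate world-space rays from the output of the geometry pass.
    Only visible pixels whose material is selected cast any rays."""
    rays = []
    for pixel, texel in enumerate(gbuffer):
        if texel is None:  # background: nothing visible here
            continue
        if texel["material"] not in ray_traced_materials:
            continue       # this object keeps its rasterized shading
        pos, n = texel["position"], texel["normal"]
        # Shadow ray toward the light.
        rays.append((pixel, "shadow", pos, normalize(sub(light, pos))))
        # Reflection ray: mirror the view direction about the normal.
        view = normalize(sub(pos, camera))
        refl = sub(view, tuple(2.0 * dot(view, n) * c for c in n))
        rays.append((pixel, "reflection", pos, refl))
    return rays
```

Passing `{"mirror"}` as `ray_traced_materials` makes only mirror surfaces cast rays, which is one simple way to control where the ray budget is spent.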
Some compelling use cases
The most obvious use of ray tracing in a game is to implement fully dynamic lights with shadowing and reflections generated at runtime. The improvements in realism and interactivity which this makes possible include shadows and reflections that are free of sampling artifacts and the ability to remove constraints on freedom of movement of the player. This enhanced freedom of movement is an enabling technology for applications such as virtual and augmented reality and opens up new applications in areas such as online shopping.
In addition, the developer can now access a broad range of other capabilities that would either be impossible, low quality, or too inefficient using standard techniques.
Some examples of these include:
- Lens effects (Figure 9). By the simple expedient of placing a lens model with the appropriate characteristics between the eye point and the scene rendered, effects like depth of field, fisheye distortion, or spherical aberration can be created as an effect (or corrected for, if desired).
- Stereo and lenticular display rendering. Newly popular computational photography techniques such as light field rendering require the user to generate a number of images of the scene from different viewpoints; two in the case of stereo but many more for use with multi-viewpoint lenticular displays. Scanline rendering incurs the overhead of transforming the geometry multiple times into the various viewports whereas ray tracing can reuse the transformation results and simply cast rays from each needed viewpoint.
- Targeted rendering to points of interest is easily implemented by varying the number of rays spent on each area of the scene, depending on whether it is the main focus of attention or not.
- Line-of-sight calculations are not necessarily a graphical technique but can be used to improve the artificial intelligence of actors in a game by allowing them to cast a ray in order to establish if they are visible to another actor or to a light source. In another use, casting rays can also be useful in implementing collision detection.
Figure 9: Physics, lens distortion correction, and lenticular display rendering can be implemented using ray tracing.
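The line-of-sight use case is the simplest to illustrate. The sketch below assumes obstacles approximated by spheres and tests the straight segment between two actors. The function names and the sphere approximation are choices made for this example, not part of any game engine or of the PowerVR API.

```python
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))

def segment_blocked_by_sphere(a, b, center, radius):
    """True if the segment from a to b passes through the sphere.
    Assumes a and b are distinct points."""
    ab = sub(b, a)
    # Parameter of the point on the segment closest to the sphere center.
    t = dot(sub(center, a), ab) / dot(ab, ab)
    t = max(0.0, min(1.0, t))
    closest = tuple(p + t * d for p, d in zip(a, ab))
    offset = sub(closest, center)
    return dot(offset, offset) < radius * radius

def has_line_of_sight(a, b, occluders):
    """occluders is a list of (center, radius) spheres."""
    return not any(segment_blocked_by_sphere(a, b, c, r) for c, r in occluders)
```

An actor at the origin cannot see another at `(10, 0, 0)` if a unit sphere sits at `(5, 0, 0)`, but can if that sphere is moved to `(5, 3, 0)`.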
The PowerVR ray tracing solution described here is available today for silicon implementation in a cost and power profile suitable for handheld and mobile devices (today’s dominant platforms for games as well as other consumer-centric activities).
The performance and features it offers, along with a low-risk migration path, are compelling to developers who want to simplify their content creation flow while creating more compelling, more realistic games. As this technology is rolled out over the coming months, it will soon be possible to say that the promise of ray tracing is being fulfilled, and the future will finally have arrived.
Peter McGuinness is Director of Multimedia Technology Marketing for Imagination Technologies Group plc. He has an extensive background in the architecture and design of integrated circuits and systems for graphics and video, where he holds a number of patents. He began his career as a silicon chip designer in 1980 at Plessey Research in England, working first in analog and later in digital design, followed by work at Inmos Ltd. (now part of STMicroelectronics) on the Transputer, then at IBM as part of the design team that developed the RAMDAC for IBM's PS/2 computer. He has also worked at Nvidia on the Riva 128 GPU. As director of research at the STMicroelectronics San Diego Advanced Systems technology lab he led development in the area of image based rendering and hardware accelerated ray tracing. He can be contacted at .