Facebook Patent | Systems And Methods For Reducing Rendering Latency
Publication Number: 10553013
Publication Date: 20200204
In one embodiment, a computing system may determine a first orientation in a 3D space based on first sensor data generated at a first time. The system may determine a first visibility of an object in the 3D space by projecting rays based on the first orientation to test for intersection. The system may generate first lines of pixels based on the determined first visibility and output the first lines of pixels for display. The system may determine a second orientation based on second sensor data generated at a second time. The system may determine a second visibility of the object by projected rays based on the second orientation to test for intersection. The system may generate second lines of pixels based on the determined second visibility and output the second lines of pixels for display. The second lines of pixels are displayed concurrently with the first lines of pixels.
This disclosure generally relates to computer graphics, and more particularly to graphics rendering methodologies and optimizations for generating artificial reality, such as virtual reality and augmented reality.
Computer graphics, in general, are visual scenes created using computers. Three-dimensional (3D) computer graphics provide users with views of 3D objects from particular viewpoints. Each object in a 3D scene (e.g., a teapot, house, person, etc.) may be defined in a 3D modeling space using primitive geometries. For example, a cylindrical object may be modeled using a cylindrical tube and top and bottom circular lids. The cylindrical tube and the circular lids may each be represented by a network or mesh of smaller polygons (e.g., triangles). Each polygon may, in turn, be stored based on the coordinates of their respective vertices in the 3D modeling space.
Even though 3D objects in computer graphics may be modeled in three dimensions, they are conventionally presented to viewers through rectangular two-dimensional (2D) displays, such as computer or television monitors. Due to limitations of the visual perception system of humans, humans expect to perceive the world from roughly the same vantage point at any instant. In other words, humans expect that certain portions of a 3D object would be visible and other portions would be hidden from view. Thus, for each 3D scene, a computer-graphics system may only need to render portions of the scene that are visible to the user and not the rest. This allows the system to drastically reduce the amount of computation needed.
Raycasting is a technique used for determining object visibility in a 3D scene. Conventionally, virtual rays are uniformly cast from a virtual pin-hole camera through every pixel of a virtual rectangular screen into the 3D world to determine what is visible (e.g., based on what portions of 3D objects the rays hit). However, this assumes that uniform ray distribution is reasonable when computing primary visibility from a virtual pinhole camera for conventional, rectangular display technologies with a limited field of view (e.g., computer monitors and phone displays). This assumption, however, does not hold for non-pinhole virtual cameras that more accurately represent real optical sensors. Moreover, current VR viewing optics (e.g., as integrated within a head-mounted display), provide a curved, non-uniform viewing surface rather than conventional rectangular displays. As a result, conventional rendering techniques, which are designed and optimized based on the aforementioned assumptions, are computationally inefficient, produce suboptimal renderings, and lack the flexibility to render scenes in artificial reality.
SUMMARY OF PARTICULAR EMBODIMENTS
Particular embodiments described herein relate to a primary visibility algorithm that provides real-time performance and a feature set well suited for rendering artificial reality, such as virtual reality and augmented reality. Rather than uniformly casting individual rays for every pixel when solving the visibility problem, particular embodiments use a bounding volume hierarchy and a two-level frustum culling/entry point search algorithm to accelerate and optimize the traversal of coherent primary visibility rays. Particular embodiments utilize an adaptation of multi-sample anti-aliasing for raycasting that significantly lowers memory bandwidth.
Particular embodiments further provide the flexibility and rendering optimizations that enable a rendering engine to natively generate various graphics features while maintaining real-time performance. Such graphics features–such as lens distortion, sub-pixel rendering, very-wide field of view, foveation and stochastic depth of field blur–may be particularly desirable in the artificial reality context. The embodiments provide support for animation and physically-based shading and lighting to improve the realism of the rendered scenes. In contrast, conventional rasterization pipelines designed for conventional displays (e.g., rectangular monitors or television sets with uniform grids of pixels) are typically implemented in hardware and require multiple passes and/or post processing to approximate these features. Moreover, conventional ray tracers, which primarily focus on Monte Carlo path tracing, do not achieve real-time performance on current VR displays (e.g., with 1080.times.1200.times.2 resolution and 90 Hz refresh-rate requirements). The embodiments described herein, therefore, is particularly suitable for rendering artificial reality and present a concrete, viable alternative to conventional rasterization techniques.
Embodiments of the invention may include or be implemented in conjunction with an artificial reality system. Artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., a virtual reality (VR), an augmented reality (AR), a mixed reality (MR), a hybrid reality, or some combination and/or derivatives thereof. Artificial reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, and any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some embodiments, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an artificial reality and/or used in (e.g., perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers.
The embodiments disclosed herein are only examples, and the scope of this disclosure is not limited to them. Particular embodiments may include all, some, or none of the components, elements, features, functions, operations, or steps of the embodiments disclosed above. Embodiments according to the invention are in particular disclosed in the attached claims directed to a method, a storage medium, a system and a computer program product, wherein any feature mentioned in one claim category, e.g. method, can be claimed in another claim category, e.g. system, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the features thereof are disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject-matter which can be claimed comprises not only the combinations of features as set out in the attached claims but also any other combination of features in the claims, wherein each feature mentioned in the claims can be combined with any other feature or combination of other features in the claims. Furthermore, any of the embodiments and features described or depicted herein can be claimed in a separate claim and/or in any combination with any embodiment or feature described or depicted herein or with any of the features of the attached claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example of a bounding volume hierarchy tree data structure.
FIG. 2 illustrates an example three-level hierarchy for defining locations from which rays/beams are projected.
FIG. 3** illustrates an example of rays and subsample rays associated with a footprint**
FIG. 4 illustrates an example of a beam being cast through a tile.
FIG. 5 illustrates an example of a beam being cast through a block.
FIGS. 6A-C illustrate an example of a method for determining visibility.
FIG. 7 illustrates an example of a focal surface map.
FIG. 8 illustrates an example of a focal surface map and camera parameters.
FIG. 9 illustrates a method for natively generating an image with optical distortion for a VR device.
FIG. 10 illustrates an example of an importance map.
FIG. 11 illustrates an example method for generating an image based on varying multi-sample anti-aliasing.
FIG. 12 illustrates examples comparing a graphics-generation timeline without using beam racing to timelines using beach racing.
FIG. 13 illustrates an example method for generating video frames for a VR display using beam racing.
FIG. 14 illustrates an example computer system.
DESCRIPTION OF EXAMPLE EMBODIMENTS
One of the fundamental problems in computer graphics is determining object visibility. At present, the two most commonly used approaches are ray tracing, which simulates light transport and is dominant in industries where accuracy is valued over speed such as movies and computer-aided designs (CAD). Due to the intense computational requirements of ray tracing, it is traditionally unsuitable for applications where real-time or near real-time rendering is needed. Another approach for determining visibility is z-buffering, which examines each 3D object in a scene and updates a buffer that tracks, for each pixel of a virtual rectangular screen, the object that is currently closest. Typically, z-buffering is implemented by current graphics hardware and lacks the flexibility to handle rendering tasks that deviate from the aforementioned assumptions (e.g., pin-hole camera and/or rectangular screens with uniform pixel distributions). Particular embodiments described herein provide a visibility algorithm that has performance characteristics close to that of z-buffering, but with additional flexibility that enables a wide variety of visual effects to be rendered for artificial reality.
To provide further context, conventional z-buffering is often used for addressing real-time primary visibility problems, largely due to its applicability to uniform primary visibility problems (e.g., for conventional rectangular screens) and the availability and proliferation of inexpensive, specialized hardware implementations. The z-buffer algorithm uses a z-buffer, a uniform grid data structure that stores the current closest hit depth for each sample/pixel. Most implementations of z-buffering assume samples/pixels are laid out in a uniform grid, matching precisely to the organization of the data structure. The uniform nature of the grid structure, combined with the uniform distribution of samples mapped onto this grid, allows for a very efficient algorithm for determining which samples overlap a polygon/triangle. The process of mapping the spatial extent of an object onto the grid is known as rasterization.
The uniform nature of the grid used in the z-buffer algorithm leads to high efficiency but makes the algorithm inflexible. The assumed uniform sample distribution is reasonable when computing primary visibility from a virtual pin-hole camera for almost all direct view display technologies such as TVs, monitors or cell phones. However, these assumptions do not hold for non-pinhole virtual cameras, secondary effects such as shadows and notably for modern virtual reality devices due to the distortion imposed by the viewing optics of a head mounted display, and currently must be worked around on a case-by-case basis.
Algorithms such as the irregular z-buffer still use a uniform grid but allow for flexible number and placement of samples within each grid cell. Irregular z-buffering suffers from load-balancing issues related the conflict between non-uniform sample distributions in a uniform data structure, making it significantly more expensive than traditional z-buffering. Further, having a uniform data structure means that the algorithm supports only a limited field of view and does not support depth of field rendering.
In contrast to z-buffering, ray tracing algorithms take a more general approach to determining visibility by supporting arbitrary point-to-point or ray queries. The ability to effectively model physically-based light transport and naturally compose effects led it to be the dominant rendering algorithm rendering movie scenes. However, the flexibility that ray tracing provides comes at significant cost in performance, which has prevented it from becoming prevalent in consumer real-time applications, such as VR/AR.
Particular embodiments described herein overcome the shortcomings of existing rendering techniques to achieve ray rates in excess of 10 billion rays per second for nontrivial scenes on a modern computer, naturally supporting computer-graphics effects desirable for artificial reality.
Particular embodiments address the visibility problem in computer graphics. In particular embodiments, a rendering system may use a raycaster that uses a three-level (or more) entry-point search algorithm to determine visibility. At a high level, the system may take a hierarchical approach where larger beams (e.g., a coherent bundle of rays) are first cast to determine collision at a broader scale. Based on the hits/misses of the beams, more granular beams or rays may be cast until the visibility problem is solved. It should be noted that even though certain examples provided herein describe beams as representing coherent bundles of primary rays, this disclosure contemplates using beams to represent any type of rays (e.g., primary rays, specular reflection rays, shadow rays, etc.) whose coherent structure may be exploited by the embodiments described herein to achieve computational efficiency. In particular embodiments, the system may be implemented in a heterogeneous manner, with beam traversal occurring on the central processing unit (CPU) and ray-triangle intersection and shading occurring on the graphics processing unit (GPU). In other embodiments, every computation task may be performed by the same type of processing unit.
In order to improve performance, particular embodiments may use an acceleration structure to organize scene geometry. These structures may be based on space partitioning (grids, k-d or k-dimensional tree, binary space partitioning or BSP tree, octree) or object partitioning (bounding volume hierarchy or BVH). By organizing the geometry into spatial regions or bounding them in enclosing volumes, the structures allow a system to avoid testing rays with objects if the rays do not enter the volume bounding the object.
In particular embodiments, an axis-aligned bounding volume hierarchy is a hierarchical tree data structure that stores scene geometry (usually triangles) at the leaves of the tree and an axis-aligned bounding box at each node. The bounding box associated with each node may conservatively enclose all of the geometries associated with the node’s sub-tree. In particular embodiments, rays (or other visibility queries such as beams) may be traversed recursively through the tree from the root and tested against nodes’ children’s bounding volumes. Recursive traversal of a node’s children may only occur in the case of intersection, so rays/beams can avoid traversing portions of the tree whose parent nodes are miss by the rays/beams.
FIG. 1 illustrates an example of a BVH tree data structure 100. Each node (e.g., 110, 120-128, 130-136) in the tree 100 may be associated with a bounding volume in the 3D modeling space in which objects are defined. The tree 100 may have a root node 110 that is associated with a large bounding volume that encompasses the bounding volumes associated with the child nodes 120-128, 130-136. Node 120 may be associated with a bounding volume that contains the bounding volumes of its child nodes 121-128 but not the bounding volumes of nodes 130-136. Node 121 may be associated with a bounding volume that contains the bounding volumes of its child nodes 122-125 but not the bounding volumes of any of the other nodes. Node 122 may be associated with a bounding volume that contains geometries (e.g., triangles) but not any other bounding volume.
In particular embodiments, the system may use a four-way axis-aligned BVH as the geometry acceleration structure. In particular embodiments, a single, combined BVH may be used for all scene geometry. In other embodiments, the system may take a multilevel approach to allow for instancing and to enable more efficient animation by allowing for more granular BVH rebuilds and refits. Rudimentary animation is supported via global BVH refit per frame.
In particular embodiments, the BVH may be laid out in memory in depth-first preorder and store triangles in a contiguous array, in the order they would be touched in a depth-first traversal of the BVH. Additionally, any node with a mix of leaf and internal children may store the leaf children first. With these assumptions, iterating in reverse through the list of BVH nodes may guarantee that a node’s children will always be visited before it will and that all triangles will be visited in a linear, reverse, order. These assumptions enable a linear, non-recursive BVH refit algorithm and improves cache locality during refit, traversal and intersection.
Particular embodiments for computing primary visibility may perform visibility tests using the BVH. As previously described, whether an object (or portion thereof) is visible from a particular viewpoint may be determined by testing whether the object (or the portion thereof) intersects with a ray. Shooting multiple rays from each pixel for every pixel can be computationally expensive and resource intensive, however, especially when the area that needs to be covered is large. For example, if 32 sample rays are used per pixel, 32 intersection tests would need to be performed for each pixel and a ray buffer of sufficient size needs to be allocated to store the results. Shooting so many rays may be especially wasteful in scenes with few objects, since most of the rays would not intersect anything.
Instead of shooting rays, particular embodiments may perform intersection tests using beams. In particular embodiments, the system may perform hierarchical intersection tests using, in order, (1) larger frusta beams that project from a relatively larger “block” (a beam footprint to be described in more detail below), (2) smaller frusta beams that project from “tiles” (also a beam footprint to be described in further detail below), and (3) procedurally generated subsample rays (interchangeably referred to as “subrays” herein). In particular embodiments, unless an intersection is found using the larger frusta beam, intersection tests need not be performed for the sub-beams or rays, thereby avoiding unnecessary computations. In particular embodiments, pixel shading may be based on a single subsample ray’s intersection results rather than the results of all 32 subsample rays. To further optimize performance, particular embodiments may procedurally generate (e.g., which may be pseudo-randomly) the subsample rays on the fly when performing intersection tests, rather than retrieving predefined subsample ray locations from memory. Procedural ray generation has the benefit of not needing to read from memory, thereby saving time and bandwidth.
FIG. 2 illustrates an example three-level hierarchy for defining locations from which rays/beams are projected. In particular embodiments, a ray footprint 210, which may be considered as the first-level of the hierarchy, may correspond to a pixel footprint and be defined by a footprint center 240 and differentials 250 (e.g., each differential 250 may specify the distance of a footprint’s boundary from the center 240 of the footprint 210). In particular embodiments, subsample locations 260a-e are assigned to a footprint 210 for anti-aliasing. In the example shown in FIG. 2, each footprint 210 includes five subsample locations 260a-e. Although the particular example shows five subsample locations 260a-e per footprint 210, any other number of subsample locations may be implemented and/or defined by an application (e.g., 32 subsample locations per footprint 210).
FIG. 3 illustrates an example of rays and subsample rays (or subrays) associated with a footprint. In particular embodiments, a ray 310 may be associated with a footprint 210, with the ray 310 projecting through the center 240 of the footprint 210. As shown in FIG. 2, the footprint 210 may be defined by two vectors 250 (also referred to as differentials) that are mutually perpendicular with each other and the ray direction. The extent of the footprint 210 may be defined by the length of these vectors 250. In particular embodiments, subsample rays (or subrays) may be procedurally generated within this footprint 210 by first transforming a low-discrepancy point set on, e.g., the unit square of the footprint 210, using the coordinate frame defined by the ray 310 direction and footprint vectors 250. Examples of transformed points are represented by the subsample locations 260a-e, illustrated as hollow points. Ray directions may then be added to the transformed points (e.g., subsample locations 260a-e), and the subsample rays 360a-e may be defined to be rays projecting through the original ray’s 310 origin 301 and the newly transformed points 260a-e, respectively. The subsample rays, for example, may be used for multi-sample anti-aliasing. In particular embodiments, for depth-of-field rays, the ray origin may also be chosen using a separate low-discrepancy point set (without translating along the ray direction).
In particular embodiments, shading may be performed one per pixel per triangle, as in regular multi-sample anti-aliasing (MSAA), which saves a large amount of shading computations. In particular embodiments, shading may be performed for every sample to get full super-sample anti-aliasing (SSAA). Since the subrays are procedurally generated rather than predefined (and stored in memory), the ray memory bandwidth may be reduced by the anti-aliasing factor when compared to naively rendering at higher resolution.
In particular embodiments, primary rays (e.g., 310) are assigned footprints (e.g., 210) for anti-aliasing and then aggregated into a second-level hierarchy and a third-level hierarchy with four-sided bounding beams with different granularity. Each beam location in the finer, second-level hierarchy may be referred to as a tile (e.g., an example of a tile is labeled as 220 in FIG. 2). Each tile 220 may include a predetermined number of pixels footprints 210. Although the example shown in FIG. 2 illustrates each tile 220 having 2.times.2 pixel footprints 210, any other arrangements of pixel footprints 210 may also be implemented (e.g., each tile 220 may include 16.times.8 or 128 pixel footprints 210). The tiles 220 may then be aggregated into a coarser collection of blocks 230, which is the term used herein to refer to the third-level hierarchy. Each block 230 may contain a predetermined number of tiles 220. Although the example shown in FIG. 2 illustrates each block 230 containing 2.times.2 tiles 220, any other arrangements of tiles 220 may also be implemented (e.g., each block 230 may include 8.times.8 or 64 tiles). Thus, in an embodiment where each block 230 contains 8.times.8 tiles 220 and each tile 220 contains 16.times.8 pixel footprints 210, each block 230 may represent 8,192 pixel footprints 210. The number of rays represented by a beam stemming from a block 230 can be computed by multiplying the number of pixel footprints 210 in the block 230 by the multi-sampling rate (e.g., 5 subsample rays per pixel footprint 210). Thus, if each block 230 represents 8,192 pixel footprints 210, the block 230 would represent 40,960 subsample rays. In particular embodiments, the choice of defining the ratio between pixels and tiles to be 128:1 and the ratio between tiles and blocks to be 64:1 is based on coarse tuning for particular hardware, but other ratios may be more optimal for other types of hardware.
In particular embodiments, instead of casting rays for all visibility tests, beams may be cast from the blocks and tiles in a hierarchical manner to optimize visibility computation. FIG. 4 illustrates an example of a beam 420 being cast from a point origin 301 (e.g., the camera or viewpoint) into a 3D modeling space through a tile 220 (in the illustrated example, the tile 220 contains 2.times.2 pixel footprints 210). The solid beam 420 in this example resembles a frustum stemming from the tile 220. The volume of the beam 420 may be defined by the vectors projecting through the four corners of the tile 220. An object or triangle intersects with the beam 420 if the object/triangle intersects with any portion of the volume of the beam 420. Similarly, FIG. 5 illustrates an example of a beam 530 being cast from a point origin 301 into a 3D modeling space through a block 230 (in the illustrated example, the block 230 contains 2.times.2 tiles that each contains 2.times.2 pixel footprints 210). The solid beam 530 in this example resembles a frustum stemming from the block 230. The volume of the beam 530 may be defined by the vectors projecting through the four corners of the block 230. An object or triangle intersects with the beam 530 if the object/triangle intersects with any portion of the volume of the beam 530.
Particular embodiments for scene updates and triangle precomputation will now be described. In particular embodiments, before rendering begins, animation may be performed (e.g., the 3D object models in the scene may change) and the BVH is refit. In particular embodiments, bone animation may occur on the CPU, while linear blend skinning and BVH refit may be implemented in a series of CUDA kernels in the following example stages: (1) transform vertices (perform linear blend skinning); (2) clear BVH node bounds; (3) precompute triangles (e.g., by (a) gathering vertices (b) compute edge equations (for a Moller-Trumbore ray-triangle intersection), and (c) computing triangle bounds and atomically update corresponding leaf bounding box); and (4) refit BVH by propagating bounds from leaf nodes up through internal node hierarchy. In particular embodiments, after refit is performed on the GPU, the BVH may be copied back to CPU memory for the block and tile traversal stages. At this point, block and tile bounds may be computed and refit, if needed.
FIGS. 6A-C illustrate an example of a method 600 for determining visibility according to particular embodiments. The illustrated embodiment performs a three-level entry point search algorithm, but additional levels may also be implemented in other embodiments (e.g., using a fourth beam footprint that includes a collection of blocks, a fifth bean unit that includes a collection of fourth beam units, and so on). In particular embodiments, the three-levels are conceptually divided into a block culling phase, a tile culling phase, and a ray sample testing phase.
FIG. 6A illustrates an example of a block culling phase. During this phase, a computing system may traverse through the BVH hierarchy and use beams stemming from blocks (e.g., as shown in FIG. 5) to test for intersections with selected bounding boxes associated with nodes in the BVH. In particular embodiments, each such beam is defined by 128.times.64 pixel footprints. In particular embodiments, the implementation of beam traversal uses an explicit stack AVX implementation. Because block traversal is a culling/entry point search phase, rather than traversing all the way to the leaves, as a traditional ray tracer would do, block traversal only traverses until it reaches a specific stopping criterion (e.g., when 64 entry points have been discovered).
In particular embodiments, a screen for which a scene is to be generated (e.g., a virtual screen in the 3D space that corresponds to the display screen used by the user) may be divided into n number of blocks. For each block, the system may perform a three-level test to determine what is visible from that block. In particular embodiments, the visibility test may be performed by projecting a beam from the block. For ease of reference, a beam projected from a block is referred to as block beam herein.
In particular embodiments, the method may begin at step 610, where an explicit traversal stack (e.g., a data structure used to track which nodes of the BVH is to be tested for intersection) may be initialized with the BVH’s root (e.g., node 110 shown in FIG. 1), which is associated with a bounding volume (BV), which may be a bounding box, for example. The bounding volume may be defined within the 3D space. In certain scenarios, the bounding volume may contain smaller bounding volumes. For example, referring to FIG. 1, every child node (e.g., 120-128 and 130-136) corresponds to a bounding volume within the bounding volume of the root 110. Objects (e.g., primitive geometries such as triangles, larger objects defined by a collection of primitive geometries, etc.) defined in the 3D space may be contained within any number of bounding volume. For example, an object contained by the bounding volume associated with node 125 is also contained within the bounding volumes associated with nodes 123, 121, 120, and 110.
At step 612, the system may access a bounding volume, based on the traversal stack, to test for intersection with the block beam. As an example, initially the system may perform intersection tests with the bounding volume associated with the root node 110, and in later iterations perform intersection tests against child nodes of the root node 110, depending on what is in the traversal stack. In particular embodiments, at each step during traversal, the thickest box along the primary traversal axis in the traversal stack may be tested. This allows the system to more efficiently refine the nodes down to individual surface patches. Despite the overhead of sorting, it has been observed that this improved tile/block culling performance by 5-10%.
At step 614, the system may simulate the projection of a beam, defined by a block, into the 3D space to test for intersection with the selected bounding volume. As shown in FIG. 5, the volume 530 of the block beam may be tested against the bounding volume to determine the extent, if any, of intersection.
At step 616, the system may determine that the outcome of the intersection test is one of the following: (1) a miss–meaning that the beam misses the bounding volume entirely; (2) fully contained–meaning that the beam contains the bounding volume fully/entirely; or (3) partial intersection–meaning that the beam and the bounding volume intersect but the bounding volume is not fully contained within the beam. If the system determines, at step 618, that the test outcome is a miss, the system may remove/discard the subtree of the current node from being candidates to be tested for intersection with the block beam. For example, referring again to FIG. 1, node 132 may represent a miss, meaning that the bounding volume associated with the node 132 does not intersect the block beam at all. As such, the smaller bounding volumes contained within that bounding volume (e.g., bounding volumes associated with nodes 133-136 in the subtree of node 132) need not be tested further, thereby providing substantial computational savings. If instead the system determines, at step 620, that the test outcome is fully contained, the system may accumulate the bounding volume as an entry point and no further traversal of the associated node’s subtree is required as it is transitively fully contained. Referring to FIG. 1, node 121 may represent a bounding volume that is fully contained within the block beam. As such, the smaller bounding volumes contained within that bounding volume (i.e., bounding volumes associated with nodes 122-125 in the subtree of node 121) need not be tested against the block beam (but may be tested in the subsequent tile-culling phase), thereby providing substantial computational savings. If instead the system determines, at step 622, that the test outcome is partially contained (e.g., the bounding volume partially intersects with the block beam), the system may add/insert the subtree associated with the bounding volume into the traversal stack for continued refinement. Referring to FIG. 1, node 110 may represent a bounding volume that partially intersects the block beam. As such, the smaller bounding volumes contained within that bounding volume may be further tested against the block beam. For example, in response to a determination that the bounding volume of node 110 partially intersects with the block beam, the system may insert the top node of each subtrees (e.g., node 120 and 130) into the traversal stack for further intersection tests against the block beam.
In particular embodiments, at step 624, the system may determine whether one or more terminating conditions for the block-culling phase are met. If no terminating condition is met, the system may continue to perform intersection tests against bounding volumes associated with the nodes stored in the traversal stack. For example, after determining that the bounding volume of the root node 110 partially intersects the block beam, the system may, in the next iteration, test whether the smaller sub-bounding volume associated with, e.g., node 120 or 130 intersects with the block beam. This process may continue until a terminating condition is met. For example, traversal may continue until the traversal stack is empty. If so, the system may sort the accumulated entry points (or fully contained bounding volumes) in near depth order and pass them onto the tile-culling phase for further processing. Another terminating condition may be when the sum of the size of the traversal stack and the size of the list of fully contained bounding volumes equals a prespecified value, such as 32, 64, 128, etc. In particular embodiments, the traversal stack and the list may be merged, sorted in near depth order and passed onto the tile-culling phase. Thus, no more than a fixed number of entry points are ever passed from the block-culling phase onto tile-culling phase.
In particular embodiments, during traversal, the separating axis theorem is used to determine separation between bounding volumes and the block beam. When sorting the entry points before hand-off to tile cull, the near plane along the dominant axis of the beam may be used as the key value.
The tile-culling phase picks up where the block culling phase left off. In particular embodiments, each entry point identified during the block-culling phase is further tested using 64 tile-culling phases (e.g., corresponding to the 8.times.8 or 64 tiles in the block, according to particular embodiments). In particular embodiments, tile culling may be implemented in an explicit stack AVX traversal, as in block cull. However, rather than beginning by initializing the traversal stack with the root node in the BVH, the traversal stack may be initialized by copying the output of the associated block cull. In this way, tile cull avoids duplicating a significant amount of traversal, performed during block cull. In particular embodiments, the beam/box tests have similar potential outcomes as in block cull, but traversal may continue until the traversal stack is empty. Once all triangles have been gathered, they are copied through CUDA to the GPU for sample testing. In particular embodiments, in high depth complexity scenes, excessive numbers of triangles may be eagerly gathered and potentially tested, despite the fact that they may be occluded by nearer geometry. Short-circuiting tile traversal may require interleaving tile cull and sample testing, which implies migrating tile cull to a CUDA implementation.
FIG. 6B illustrates an example of a tile-culling phase to process the outcome from a block-culling phase. As previously described, the block-culling phase may generate a list of entry points by projecting a block beam and testing for intersections with bounding volumes. The resulting entry points may be used as the starting points during the subsequent tile-culling phase. The tile-culling phase, in general may test for intersection using tile beams contained within the block beam. As previously described, each second beam is defined by a tile footprint that is smaller than the block footprint.
The tile-culling phase for processing the result of a block-culling phase may begin at step 630, where the system may iteratively select an entry point in the list generated during the block-culling phase and perform tile culling. The entry point, which is associated with a node or bounding volume in the BVH, is known to intersect with the block beam. In the tile-culling phase, the system attempts to determine, at a finer granularity, which tiles of the block intersects with the bounding volume or its sub-volumes. Thus, given a selected entry point, the system, may iteratively project tile beams contained within the block beam to test for intersections.
For a given entry point, the system, at step 632, may iteratively select a tile in the block to perform intersection test. In particular embodiments, prior to testing the entry point against a particular tile beam, the system may initialize a traversal stack to be the bounding volume associated with the entry point. Doing so provides efficiency gains, since the tile-culling phase need not start from the root of the BVH (the work has already been done during the block-culling phase). Referring to FIG. 1 as an example, the system may initialize the traversal stack with node 121, which was determined to be a suitable entry point during the block-culling phase.
At step 634, the system may access a bounding volume, based on the traversal stack, to test for intersection with the block beam. As an example, initially the system may perform intersection tests with the bounding volume associated with the node 121, which was deemed a suitable entry point during the block-culling phase, and in later iterations perform intersection tests against its child nodes, depending on what is in the traversal stack. In particular embodiments, at each step during traversal, the thickest box along the primary traversal axis in the traversal stack may be tested.
At step 636, the system may simulate the projection of a beam, defined by a tile, into the 3D space to test for intersection with the bounding volume. As shown in FIG. 4, the volume 420 of the tile beam may be tested against the bounding volume to determine the extent, if any, of intersection.
At step 638, the system may determine that the outcome of the intersection test is one of the following: (1) a miss–meaning that the beam misses the bounding volume entirely; (2) fully contained–meaning that the beam contains the bounding volume fully/entirely; or (3) partial intersection–meaning that the beam and the bounding volume intersect but the bounding volume is not fully contained within the beam. If the system determines, at step 640, that the test outcome is a miss, the system may remove/discard the subtree of the current node from being candidates to be tested for intersection with the tile beam. If instead the system determines, at step 642, that the test outcome is fully contained, the system may accumulate the triangles/polygons in the bounding volume to be tested in the subsequent phase. No further traversal of the associated node’s subtree is required as it is transitively fully contained. In other words, any additional bounding volume contained within the current bounding volume may be removed from being a candidate to be tested for intersection with the tile beam. If instead the system determines, at step 644, that the test outcome is partially contained (e.g., the bounding volume partially intersects with the tile beam), the system may add/insert the subtree associated with the bounding volume into the traversal stack for continued refinement (e.g., when the process repeats at step 634).
As an example, the system may start with an entry point such as node 121 in FIG. 1. In certain scenarios, the system may determine that the bounding volume associated with node 121 partially intersects the projected tile beam. Based on this determination, the system may insert the subtrees of node 121 (e.g., 122 and 123) into the traversal stack. Then when repeating step 634, the system may test node 122, for example, for intersection by projecting the tile beam against the bounding volume associated with the node 122. In certain scenarios, the system may determine that the projected tile beam fully contains the bounding volume associated with node 122 and adds the triangles/polygons in the volume to a list for sampling.
In particular embodiments, the traversal may continue until the traversal stack is empty. Thus, at step 646, the system may determine whether any nodes remain in the traversal stack. If a node exists in the stack, the system may return to step 634 to test that node against the tile beam. If no more node exists in the stack, then at step 648 the system may determine whether there are additional tiles in the block that have not yet been tested against the original entry point. If so, the system may return to step 632 to test the entry point against an un-tested tile for intersections. Otherwise, the system at step 650 may determine whether additional entry points from the block-culling phase still need to be tested. If so, the system may return to step 630. Otherwise, the system in particular embodiments may pass the gathered triangles/polygons onto the ray sample testing phase.
In particular embodiments, a ray sample testing phase may be performed after the tile-culling phase. In particular embodiments, the ray sample testing phase may be broken into per-tile and per-pixel phases. In particular embodiments, both phases may be completed using a single CUDA kernel with a workgroup size of 128. In the per tile portion, threads may be mapped 1:1 with triangles and in the per-pixel phase threads may be mapped 1:1 with pixels. In particular embodiments, the threads may alternatively be mapped 1:1 with subpixels, in which case the phase may be referred to as the per-subpixel phase. As used in this context, a subpixel is an individual LED, such as red, green or blue, and is distinct from a subsample in the multi-sample anti-aliasing sense. Thus, a subpixel may have many subsamples. The system may support both multi-sample anti-aliasing (MSAA) and super-sample anti-aliasing (SSAA), the distinction being that in MSAA shading is performed only once per pixel per triangle and the results are shared across all subsamples of that pixel that strike the same triangle, and that in SSAA shading is computed separately per subsample. The advantage of MSAA is a potentially large reduction in shading rate. Triangle data for the tile may be gathered into a shared local cache on the GPU for ease of access from all samples. This triangle cache may have 128 entries. In particular embodiments, the per-tile and per-pixel/subpixel phases may alternate until all triangles for a tile have been processed.
FIG. 6C illustrates an example of a sample testing phase to process the triangle intersections identified after a tile-culling phase. After a tiling-culling phase, the system may have identified a list of triangles that intersect with the associated tile beam. During the sample testing phase, the system attempts to further sample the triangles at the finer pixel level and may utilize subsample rays to do so (e.g., for anti-aliasing). In particular embodiments, the aforementioned per-tile phase may be represented by steps 660 to 668 and the per-pixel/subpixel phase may be represented by the steps starting from step 670.
At step 660, before testing the triangles that intersect with a tile beam, the system may perform initializations by, e.g., clearing the per-subsample depth and index values. During the per-tile phase, the system may, at step 662, gather triangle data into a shared memory cache. At step 664, the system may perform back-face and near plane culling on the triangles. At step 666, the system may test tile corner rays against triangles and classify the intersections as fully contained, partial intersection, and miss, similar to the classifications described above. At step 668, the system may perform common origin intersection precomputations (when applicable).
In particular embodiments, once the per-tile phase has completed, each thread may associate itself with a pixel/subpixel, and performs the following steps during the per-pixel/subpixel phase. The system may test for intersection using rays contained within the tile beam with which the triangles intersect. In particular embodiments, the rays (including subsample rays) may be procedurally generated.
In particular embodiments, at step 670, the system may, for each pixel/subpixel in the tile, look up a footprint center and differentials associated with the ray’s footprint. At step 672, the system may transform the center (e.g., FIG. 2, labels 240) and differentials (e.g., FIG. 2, labels 250) into the 3D world space. The center and differentials may define the ray footprint through which rays may be generated and projected into the 3D space to test for intersections with the objects (e.g., triangles).
In particular embodiments, the system may iteratively project rays associated with the ray footprint against each triangle. For example, at step 674, after the ray footprint has been determined, the system may fetch a triangle from cache. The triangle may be tested against each ray sample in the ray footprint iteratively. For example, at step 676, the system may, for each ray sample, compute subsample offset within the pixel footprint via a lookup table (e.g., FIG. 2 at label 260a). At step 678, the system may compute ray-triangle intersection for the ray defined by this subsample (e.g., FIG. 3 at label 360a). At step 680, upon determining that the ray intersects with the triangle, the system may store information associated with the intersection. For example, the system may update subsample depth and triangle index in the case of successful intersection. After a single ray of the pixel/subpixel has been tested against the triangle, the system may determine at step 682 whether additional samples should be made (e.g., based on FIG. 2, each pixel has five subsamples). If so, the system may repeat step 676 to generate another ray and test for intersection. If the triangle has been tested against all the subsamples, the system, at step 684, may determine whether there are more triangles that intersect with the tile beam, as determined during the tile-culling phase. If so, the system may repeat step 674 to sample another triangle using rays associated with the current ray footprint. If no more triangle exists, the system may then determine whether there are more pixels/subpixels within the tile that should be tested against the triangles. If so, the system may repeat step 670 to perform intersection tests using rays from another pixel/subpixel against the triangles that intersect with the tile.
Once all of the triangles have been processed, visibility for the tile is fully resolved and the per-subsample depth and triangle index buffers contain the closest hit for each subsample. In particular embodiments, at this point subsample data may be compressed and emitted to a “gbuffer” in preparation for shading. The “gbuffer” in this case may consist of only visibility information: pairs of triangle indices and subsample masks, which is sufficient to recompute barycentrics and fetch vertex attributes in the shading phase. The “gbuffer” may be allocated to be large enough to hold a fixed number (e.g. 32) of entries in order to handle the case where each subsample of each subpixel strikes a different triangle and is stored in global memory on the GPU. Memory is addressed such that the first triangle for each pixel are adjacent in memory, followed by the second triangles, etc., so in practice only a small prefix of this buffer is actually used. In particular embodiments, compression may perform the following steps: (1) sort the subsamples by triangle index; and (2) iterate over the subsamples and emit, e.g., triangle index and multi-sample mask for each unique triangle index. Once the “gbuffer” has been constructed the sample testing phase is complete and the resolve/shading phase begins.
After visibility has been computed during the sample testing phase, the system may perform shading, aggregates MSAA or SSAA samples and computes final pixel/subpixel color to a buffer that can be presented as a rendered computer-generated scene that includes visible objects defined within the 3D space. In particular embodiments, each sample location is read from the output of the previous stage and ray intersections are computed for each “gbuffer” entry at the pixel center. Then, shading may be performed using attributes interpolated using barycentrics obtained during intersection, and the shading result may be accumulated per pixel. Once all “gbuffer” entries are processed per pixel, the system may perform filmic tonemapping and output the results to the final buffer for display. In the case of SSAA, rays may be generated, intersected, and shaded independently rather than having a single weighted shading result per entry.
In particular embodiments, the resolve/shading phase may include the following steps. For example, the system may look up sample location (e.g., from a linear buffer of samples generated in a previous stage). The system may then compute differentials (e.g., analytically when in closed form, otherwise in finite differencing). Then the system may transform ray and differentials to 3D world space in preparation for intersection. The system may clear the shading result accumulator to 0. Then for each “gbuffer” entry, the system may fetch triangle data, perform ray-triangle intersection for the pixel center and compute depth and barycentric coordinates, clamp baycentric coordinates to triangle bounds, interpolate vertex attributes based on barycentric coordinates, perform shading/lighting, and accumulate shading result. The system may then scale the accumulated shading results by, e.g., 1/subsampling rate. The system may then perform tonemapping and output the results.
Particular embodiments may repeat one or more steps of the method of FIGS. 6A-C, where appropriate. Although this disclosure describes and illustrates particular steps of the method of FIGS. 6A-C as occurring in a particular order, this disclosure contemplates any suitable steps of the method of FIGS. 6A-C occurring in any suitable order. Moreover, although this disclosure describes and illustrates an example method for performing primary visibility computations, including the particular steps of the method of FIGS. 6A-C, this disclosure contemplates any suitable method for performing primary visibility computations, including any suitable steps, which may include all, some, or none of the steps of the method of FIGS. 6A-C, where appropriate. Furthermore, although this disclosure describes and illustrates particular components, devices, or systems carrying out particular steps of the method of FIGS. 6A-C, this disclosure contemplates any suitable combination of any suitable components, devices, or systems carrying out any suitable steps of the method of FIGS. 6A-C.
One advantage of the present embodiments is that it may be implemented within a conventional graphics pipeline. At a high level, an application in need of rendering may issue instructions to a graphics driver, which in turn, may communicate with an associated GPU. Through the graphics application programming interface (API) of the driver, the application may specify how a scene should be rendered. For example, the application may submit geometry definitions that represent objects in a 3D space for which a scene is to be generated. In particular embodiments, the application may also submit a ray buffer that defines how location and trajectory of rays. To avoid needing a significant amount of memory to store the ray buffer and the runtime cost of reading/writing the buffer, particular embodiments may further allow a user to specify a procedure definition that may be used to procedurally generate rays at runtime. Based on the information provided, the rendering system may perform visibility computations, such as using raycasting, as previously described. Once the visibility of the geometries has been determined, the system may proceed with shading and outputting the final scene.
The flexibility provided by the embodiments described herein enable a rendering system to naturally implement a variety of rendering features relevant to virtual reality or augmented reality, in contrast to existing systems that simulate such effect using post-processing. Moreover, existing graphics APIs focus on a specific case of a uniform grid of primary sampling points. While current hardware rasterizers are highly tuned for this use case, rendering for AR/VR displays requires additional high-performance functionality that is more naturally achieved by raycasting in accordance with particular embodiments. Particularly, the existing graphics APIs are incapable of handling the following cases: (1) direct subpixel rendering through known optical distortion on different subpixel arrangements; (2) varying multi-sample and/or shading rate across the screen (e.g., for foveated rendering/scene-aware work rebalance); (3) depth-of-field sampling patterns for depth-of-field approximation for varifocal displays; and (4) beam racing.
Particular embodiments described herein support the aforementioned use cases based on the following features, which may be implemented as enhancements to the visibility-determination portion of the graphics pipeline. For example, to support optical distortion rendering on different subpixel arrangements, the system may allow an application using the rendering engine to specify and render with non-uniform grid sampling patterns and use independent color channel sampling. To support varying multi-sample/shading, the system may allow an application to specify a measure of “importance” per pixel/tile/block. To support depth-of-field, the system may allow an application to specify and render with non-uniform grid sampling patterns and non-point-origin sampling patterns. Each of these features, along with beam racing, will be described in further detail.
At a high-level, when any kind of lens distortion is desired, the distortion may be applied to the rays (e.g., determining ray directions) before bounding beams are computed via principal component analysis. When subpixel rendering is enabled, red, green and blue channels may be considered using separate grids and tiled and blocked separately. When depth of field is present, the beams may be expanded to accommodate the distribution of ray origins. When using a foveated ray distribution, ray bundles may be generated using a top-down divisive algorithm to build tiles containing no more than n pixels (e.g., 128) and blocks containing no more than m (e.g., 64) tiles. The system may support partially occupied tiles and blocks. For most use cases, these bounds may be computed once, at the beginning of time, based upon the lens parameters of the system. However, in the case that parameters change between frames, such as the point of attention during foveated rendering, they may be recalculated on a per-frame basis. The bounding beams may bound the entire footprint of every pixel, rather than just their centers, to support MSAA.
Particular embodiments enable a graphics system to support direct optical-distortion rendering and subpixel rendering. One of the primary differences between head mounted and traditional displays is the use of viewing optics. In addition to allowing a user to focus on the display, the viewing optics add a variety of aberrations to the display as viewed. Notably, head mounted displays usually produce a pin-cushion distortion with chromatic dependency, which causes both color separation and non-uniform pixel spacing. This leads to the user effectively seeing three different displays, one for each color (e.g., red, green, and blue), with three different distortion functions. Traditionally, these artifacts may be corrected during a post-processing image distortion phase. For example, conventional rendering systems, which do not support direct-distortion rendering, would produce a conventional rectangular image. To properly view the image via a head-mounted display, a post-processing stage takes the rectangular image and create a warped image for head-mounted viewing optics. Not only is the conventional multi-stage process inefficient, the resulting effect is suboptimal.