We propose a novel algorithm for the construction of bounding volume hierarchies (BVHs) on multi-core CPU architectures. The algorithm constructs the BVH in a divisive top-down manner using a progressively refined cut of an existing auxiliary BVH. We propose a new strategy for refining the cut that significantly reduces the workload of the individual steps of BVH construction. Additionally, we propose a new method for integrating spatial splits into the BVH construction algorithm. The auxiliary BVH is constructed using a very fast method such as LBVH based on Morton codes. We show that the method provides a very good trade-off between build time and ray tracing performance. We evaluated the method within the Embree ray tracing framework and show that it compares favorably with the Embree BVH builders regarding build time while maintaining comparable ray tracing speed.
Categories and Subject Descriptors (according to ACM CCS): I.3.7 Computer Graphics: Raytracing; I.3.5 Computer Graphics: Object Hierarchies; I.3.7 Computer Graphics: Three-Dimensional Graphics and Realism
Bounding volume hierarchies (BVHs) are one of the most efficient spatial data structures for organizing scene primitives. Most contemporary ray tracing implementations use BVHs for accelerating ray-scene intersections. This makes the task of constructing a high-quality BVH, i.e. a BVH that leads to a low number of intersection evaluations and thus to high ray tracing speed, very important. This problem has been studied intensively, and a number of very efficient solutions currently exist that lead to high-quality BVHs [[
A BVH can be constructed in O(n log n) time even with the relatively expensive full sweep method based on the surface area heuristic (SAH). Recently, several methods have been proposed that lead to BVHs with a lower cost (i.e. a lower expected number of intersections) than that of the full sweep SAH [[
In this paper, we revisit the idea of Hunt et al. [[
Starting from a triangle soup, we first build an auxiliary BVH using a very fast BVH builder. This gives us quick access to the scene structure, i.e. a coarse description of the spatial distribution of the primitives. At the core of the method, we use a new adaptive method for accessing the auxiliary BVH during the construction of the final one. To further improve the quality of the constructed BVH for ray tracing, we propose a fast way of integrating spatial splits in the BVH construction process. We show that this approach is competitive with the fastest available BVH builders for multi‐core architectures, such as AAC [[
Bounding volume hierarchies have a long tradition in rendering and collision detection. Kay and Kajiya [[
A number of parallel methods for BVH construction have been proposed for both GPUs and multi‐core CPUs. Lauterbach et al. [[
Karras [[
Significant interest has been devoted to methods which construct high‐quality BVHs albeit at increased computational time compared to the fastest builders. Walter et al. [[
Hunt et al. [[
Due to the widespread availability of SIMD instructions on today's CPU architectures, it is beneficial to construct multi‐BVHs, i.e. hierarchies with a higher branching factor. Wald et al. [[
The paper is further structured as follows: Section 3 presents an overview of the proposed method, Section 4 presents the details of the proposed method, Section 5 gives the results and their discussion, and Section 6 concludes the paper.
Constructing an efficient BVH is an optimization problem equivalent to hierarchical clustering. The core issue for efficient BVH construction is the large amount of data that has to be organized at the beginning of the computation. A number of successful methods gain speed and efficiency by simplifying the initial phase of the computation through quick partitioning of the data into subsets that are later processed using a more sophisticated algorithm. This is the case of the HLBVH [[
Another example of handling the initial scene complexity is the build‐from‐hierarchy method proposed by Hunt et al. [[
We follow the path visible in the above‐discussed techniques with the aim of providing a highly scalable and efficient BVH construction algorithm. Similarly to HLBVH, Bonsai, or AAC, we use Morton code based spatial sorting to establish the initial scene partitioning. Following Hunt et al. [[
Our algorithm starts by constructing an auxiliary BVH that provides a scalable description of the scene structure. We use the LBVH algorithm [[
The construction of the final BVH starts by finding the initial cut of the auxiliary BVH. This cut is formed by the smallest set of nodes that spans the BVH horizontally such that each node has a surface area smaller than a given threshold. At each step of the algorithm, the nodes on this cut are examined and we search for the best partitioning of these nodes into two sets. We use full sweep SAH in all three axes to find the partitioning. As the size of the cut is small compared to the scene size, this search is very fast. Then the cut is partitioned using the best split found and two new cuts are formed. These cuts are refined according to a new threshold corresponding to the reached subdivision depth. Should a node on the cut be refined, we simply replace the node in the cut with its children. The algorithm then continues by applying the method recursively in the new subtrees with the corresponding refined cuts. The main steps performed on the auxiliary BVH cut when building a node in the final BVH are illustrated in Figure 3.
This section describes the proposed method in detail by presenting and discussing its individual components. For now, let us assume that we have an auxiliary BVH at our disposal. The details of constructing the auxiliary BVH will be given at the end of this section.
Our method starts with identifying the nodes of the auxiliary BVH that will form a cut of the hierarchy. This cut contains nodes whose surface area is just below a given threshold. We will discuss how to determine this threshold in Section 4.3. The threshold is set such that it provides an initial cut with a size below a given hard bound (e.g. 2048 nodes), which is, in general, several orders of magnitude lower than the total number of scene primitives.
The algorithm for finding the cut uses a priority queue and a simple test that checks whether the visited node has a surface area larger than the threshold. If this is the case, its children are put into the priority queue. If this is not the case, or the node is a leaf, it is appended to the cut. When there are no more nodes to visit, or the cut together with the unprocessed nodes has reached the maximum cut size, we terminate the cut search.
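For illustration, the following C++ sketch shows this cut search; it assumes a simple array-of-nodes layout for the auxiliary BVH, and the node structure, function name, and parameters are placeholders rather than the actual implementation.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Illustrative auxiliary-BVH node layout (array of nodes, children by index).
struct AABB { float lo[3], hi[3]; };
struct Node { AABB box; int left = -1, right = -1; };  // left == -1 marks a leaf

inline float surfaceArea(const AABB& b) {
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Sketch of the initial cut search: nodes with surface area above the
// threshold are expanded (their children are pushed into the priority
// queue); leaves and sufficiently small nodes are appended to the cut.
// The search stops when the cut plus the pending nodes reach maxCutSize.
std::vector<int> findInitialCut(const std::vector<Node>& bvh, int root,
                                float areaThreshold, std::size_t maxCutSize)
{
    auto smallerArea = [&](int a, int b) {
        return surfaceArea(bvh[a].box) < surfaceArea(bvh[b].box);
    };
    // Max-heap on surface area: the largest nodes are examined first.
    std::priority_queue<int, std::vector<int>, decltype(smallerArea)> open(smallerArea);
    std::vector<int> cut;
    open.push(root);

    while (!open.empty() && cut.size() + open.size() < maxCutSize) {
        int n = open.top();
        open.pop();
        bool isLeaf = (bvh[n].left < 0);
        if (!isLeaf && surfaceArea(bvh[n].box) > areaThreshold) {
            open.push(bvh[n].left);   // too large: replace by its children
            open.push(bvh[n].right);
        } else {
            cut.push_back(n);         // small enough, or a leaf: keep in the cut
        }
    }
    // Nodes that were not processed also belong to the cut.
    while (!open.empty()) { cut.push_back(open.top()); open.pop(); }
    return cut;
}
```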
The core of the method is splitting the current BVH cut into two disjoint sets. As the cut is small, we can use a relatively expensive evaluation of the best split without a significant performance penalty. Thus, we use full-sweep SAH in all three axes to subdivide the current cut. We sort the cut along all three axes based on node centroids. Then, for each axis, we perform a sweep that computes the cost of the node subsets on the left and right of the sweep plane. For a given position i of the sweep plane, the cost is given as

C(i) = S_L(i) · n_L(i) + S_R(i) · n_R(i),

where S_L(i) and S_R(i) are the surface areas of the bounding boxes of the node subsets on the left and right of the sweep plane, and n_L(i) and n_R(i) are the numbers of cut nodes in these subsets.

There is a minor difference from the traditional SAH cost evaluation: we use the numbers of nodes n_L and n_R forming the cut rather than the numbers of triangles contained in their subtrees.
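The sweep described above can be sketched as follows; the code reuses the illustrative box type from the previous sketch and deliberately keeps the per-axis bookkeeping naive (a real implementation would maintain per-axis index arrays instead of re-sorting the boxes).

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct AABB { float lo[3], hi[3]; };

inline float surfaceArea(const AABB& b) {
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}
inline AABB merge(const AABB& a, const AABB& b) {
    AABB r;
    for (int k = 0; k < 3; ++k) {
        r.lo[k] = std::min(a.lo[k], b.lo[k]);
        r.hi[k] = std::max(a.hi[k], b.hi[k]);
    }
    return r;
}
inline float centroid(const AABB& b, int axis) { return 0.5f * (b.lo[axis] + b.hi[axis]); }

struct ObjectSplit { int axis = -1; int index = -1; float cost = std::numeric_limits<float>::max(); };

// Full-sweep SAH over the bounding boxes of the cut nodes. The split at
// position i puts the first i boxes (in centroid order along the chosen
// axis) into the left subset and the remaining n - i boxes into the right.
ObjectSplit findBestObjectSplit(std::vector<AABB> boxes)
{
    const int n = static_cast<int>(boxes.size());
    ObjectSplit best;
    if (n < 2) return best;
    std::vector<float> rightArea(n);

    for (int axis = 0; axis < 3; ++axis) {
        std::sort(boxes.begin(), boxes.end(), [axis](const AABB& a, const AABB& b) {
            return centroid(a, axis) < centroid(b, axis);
        });
        // Backward sweep: surface area of the union of boxes [i, n).
        AABB acc = boxes[n - 1];
        for (int i = n - 1; i > 0; --i) {
            acc = merge(acc, boxes[i]);
            rightArea[i] = surfaceArea(acc);
        }
        // Forward sweep: evaluate C(i) = S_L(i) * n_L(i) + S_R(i) * n_R(i).
        acc = boxes[0];
        for (int i = 1; i < n; ++i) {
            acc = merge(acc, boxes[i - 1]);
            float cost = surfaceArea(acc) * i + rightArea[i] * (n - i);
            if (cost < best.cost) { best.axis = axis; best.index = i; best.cost = cost; }
        }
    }
    return best;   // index refers to the centroid ordering along best.axis
}
```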
By subdividing the BVH cut into two parts, the number of nodes in each of the two new cuts is reduced. Repeating this step several times would lead to a cut size of 1, and thus we would lose all the information about the scene structure that the cut can provide. Hence we refine the cut to maintain its properties. Hunt et al. [[
Instead, we propose an adaptive way of refining the cut that aims to progressively reduce the cut size, and thus to better balance the computational effort among different levels of the BVH. This approach is similar to adapting the number of clusters handled by the AAC algorithm [[
Our method is based on thresholding the surface area of nodes forming the cut while taking into account the current depth. The cut is refined as follows: for each node forming the cut, we compare its surface area with the threshold. If the surface area of the node is greater than the threshold, the node is replaced by its children. Otherwise, the node is kept in the cut. When refining the cut, we intentionally descend just one level in the tree. This simplifies the implementation as no stack is needed to refine the cut.
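A minimal sketch of this one-level refinement step, using the same illustrative node layout as in the cut-search sketch above:

```cpp
#include <vector>

struct AABB { float lo[3], hi[3]; };
struct Node { AABB box; int left = -1, right = -1; };  // left == -1 marks a leaf

inline float surfaceArea(const AABB& b) {
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// One-level cut refinement: every interior node of the cut whose surface
// area exceeds the threshold is replaced by its two children; smaller
// nodes and leaves are kept. No stack is needed because the refinement
// descends at most one level per call.
std::vector<int> refineCut(const std::vector<Node>& bvh,
                           const std::vector<int>& cut, float areaThreshold)
{
    std::vector<int> refined;
    refined.reserve(cut.size());
    for (int n : cut) {
        bool isLeaf = (bvh[n].left < 0);
        if (!isLeaf && surfaceArea(bvh[n].box) > areaThreshold) {
            refined.push_back(bvh[n].left);
            refined.push_back(bvh[n].right);
        } else {
            refined.push_back(n);
        }
    }
    return refined;
}
```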
We propose to use an adaptive threshold that is based on the depth of the currently computed node in the BVH. The threshold is computed as

T(d) = S / 2^(α·d + δ),
where S is the surface area of the scene bounding box, d is the current depth, and α and δ are parameters of the method. The δ parameter determines the size of the initial cut for d = 0. The α parameter determines how quickly the cut size shrinks with increasing depth. The setting of these parameters depends on the desired trade-off between the build time and the trace speed. We used two settings of these parameters: α = 0.5 and δ = 6 for fast builds (PHR-Fast), and α = 0.55 and δ = 9 for high-quality builds (PHR-HQ).
Note that the formula is inspired by a regular subdivision of the scene into non-overlapping voxels. In such a subdivision, halving the voxel volume at each level scales the surface area of a voxel roughly as S·2^(-2d/3) at depth d, which is why values of α close to (but below) ⅔ are of interest. The bounding boxes of the input hierarchy do not follow this ideal subdivision, but keeping the working set of nodes in the cut with areas close to this function proved beneficial for the performance vs. quality trade-off of the algorithm.
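A small helper corresponding to the threshold formula above (the function name and the way parameters are passed are illustrative):

```cpp
#include <cmath>

// Depth-adaptive area threshold matching the formula above: the scene
// surface area S is scaled down exponentially with the depth d. The
// parameter names mirror the text (alpha: shrink rate, delta: initial cut size).
inline float areaThreshold(float sceneArea, int depth, float alpha, float delta)
{
    return sceneArea / std::pow(2.0f, alpha * static_cast<float>(depth) + delta);
}

// Example settings from the paper:
//   PHR-Fast: areaThreshold(S, d, 0.5f,  6.0f)
//   PHR-HQ:   areaThreshold(S, d, 0.55f, 9.0f)
```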
An example of the cut sizes obtained using the proposed adaptive threshold function is given in Figure 4. As we show and discuss in Section 5, the adaptive threshold leads to consistently better results than the predefined cut size originally proposed by Hunt et al. [[
Spatial splits proposed by Stich et al. [[
When splitting a cut, we also evaluate the result of applying spatial splits to the cut nodes. This contrasts with the traditional spatial split evaluation, which is performed directly on geometric primitives (triangles).
The actual spatial split evaluation is done using an algorithm similar to kD-tree construction. We sort the boundaries of the node bounding boxes and perform a plane sweep while evaluating the split cost. Note, however, that in this case, the classification of a node into the left or right subtree is not based on the centroid of the node but on the two extents of the node bounding box in the given axis. A node is classified as lying in both the left and right subtrees if the sweep plane intersects its bounding box. For each cost evaluation, we clip the left and right bounding boxes by the sweep plane, which potentially reduces the respective areas (S_L and S_R). The cost of the spatial split at sweep plane position i is then evaluated as

C_S(i) = S_L(i) · n_L(i) + S_R(i) · n_R(i),

where S_L(i) and S_R(i) are the surface areas of the clipped left and right bounding boxes, and n_L(i) and n_R(i) are the numbers of cut nodes assigned to the left and right subsets (a node straddling the plane is counted in both).

If the spatial split cost C_S is lower than the cost of the best object split found by the sweep described in Section 4.2, the spatial split is used to subdivide the cut, and the clipped bounding boxes are passed along with the resulting cuts.
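The following sketch illustrates the spatial split evaluation on the cut boxes for a single axis. For brevity it uses a brute-force loop over the candidate planes instead of the plane sweep described above, and the types and names are illustrative only.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct AABB { float lo[3], hi[3]; };

inline float surfaceArea(const AABB& b) {
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}
inline AABB merge(const AABB& a, const AABB& b) {
    AABB r;
    for (int k = 0; k < 3; ++k) {
        r.lo[k] = std::min(a.lo[k], b.lo[k]);
        r.hi[k] = std::max(a.hi[k], b.hi[k]);
    }
    return r;
}
// Clip a box to one side of an axis-aligned plane.
inline AABB clipBelow(AABB b, int axis, float plane) { b.hi[axis] = std::min(b.hi[axis], plane); return b; }
inline AABB clipAbove(AABB b, int axis, float plane) { b.lo[axis] = std::max(b.lo[axis], plane); return b; }

struct SpatialSplit { int axis = -1; float plane = 0.0f; float cost = std::numeric_limits<float>::max(); };

// Spatial split search over the cut boxes along one axis. Candidate planes
// are the box boundaries; a box straddling the plane is counted on both
// sides and contributes its clipped box to both subsets.
SpatialSplit findSpatialSplit(const std::vector<AABB>& boxes, int axis)
{
    std::vector<float> planes;
    for (const AABB& b : boxes) { planes.push_back(b.lo[axis]); planes.push_back(b.hi[axis]); }
    std::sort(planes.begin(), planes.end());

    SpatialSplit best;
    for (float plane : planes) {
        AABB left{}, right{};
        int nLeft = 0, nRight = 0;
        for (const AABB& b : boxes) {
            if (b.lo[axis] < plane) {            // part of the box lies below the plane
                AABB c = clipBelow(b, axis, plane);
                left = nLeft ? merge(left, c) : c;
                ++nLeft;
            }
            if (b.hi[axis] > plane) {            // part of the box lies above the plane
                AABB c = clipAbove(b, axis, plane);
                right = nRight ? merge(right, c) : c;
                ++nRight;
            }
        }
        if (nLeft == 0 || nRight == 0) continue; // degenerate plane position
        float cost = surfaceArea(left) * nLeft + surfaceArea(right) * nRight;
        if (cost < best.cost) { best.axis = axis; best.plane = plane; best.cost = cost; }
    }
    return best;
}
```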
If spatial splits are used, it can happen that some nodes on the refined cut do not intersect the clipped bounding box passed down with the current cut. These nodes are simply skipped, i.e. they are not placed into the refined cut. An illustration of a spatial split applied to the current cut is shown in Figure 5.
To avoid performing spatial splits at lower parts of the tree where their benefit is usually low, we use a simple rule proposed by Stich et al. [[
The proposed spatial splits method is very simple and requires minimal modifications to the other parts of the code. The results show that this method, although very fast, improves the trace performance by up to 20%.
The trace performance of the BVH can be improved by increasing the branching factor of interior nodes, which results in a shallower BVH [[
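One common way to obtain a higher branching factor from a binary hierarchy, shown here purely as an illustration and not necessarily the approach used in the paper, is to collapse binary nodes by repeatedly expanding the gathered child with the largest surface area:

```cpp
#include <vector>

struct AABB { float lo[3], hi[3]; };
struct Node { AABB box; int left = -1, right = -1; };  // binary node; left == -1 marks a leaf

inline float surfaceArea(const AABB& b) {
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Collect up to 'width' children for a wide node by repeatedly expanding the
// gathered child with the largest surface area until the desired branching
// factor is reached or only leaves remain.
std::vector<int> gatherWideChildren(const std::vector<Node>& bvh, int node, int width = 4)
{
    if (bvh[node].left < 0) return { node };   // a leaf cannot be widened
    std::vector<int> children = { bvh[node].left, bvh[node].right };
    while (static_cast<int>(children.size()) < width) {
        int bestIdx = -1;
        float bestArea = -1.0f;
        for (int i = 0; i < static_cast<int>(children.size()); ++i) {
            int c = children[i];
            if (bvh[c].left < 0) continue;     // leaves cannot be expanded further
            float a = surfaceArea(bvh[c].box);
            if (a > bestArea) { bestArea = a; bestIdx = i; }
        }
        if (bestIdx < 0) break;                // only leaves remain
        int c = children[bestIdx];
        children[bestIdx] = bvh[c].left;       // replace the node by its children
        children.push_back(bvh[c].right);
    }
    return children;
}
```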
The parallelization of the method is relatively straightforward. We use a shared work queue which contains entries representing un‐constructed parts of the tree. Each entry contains an index of a node in the new BVH and an array representing the corresponding cut in the auxiliary BVH. A thread pops an entry from this queue, finds a subdivision of the cut and if the node does not become a leaf, it stores the right child in the shared queue so that other threads can fetch this entry and process it. The left branch is processed immediately by the same thread unless it is a leaf node. Note that to prevent large synchronization overhead, we only store the nodes in the shared work queue up to a given depth (e.g. 12) or when we find out that some threads are idle (using a counter of active threads).
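A hedged sketch of such a shared work queue follows; the entry layout, the idle-counter heuristic, and the termination logic are illustrative choices, not the authors' implementation.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// A work-queue entry pairs an un-built node of the final BVH with the cut of
// the auxiliary BVH that covers its primitives.
struct WorkItem {
    int outNode = 0;            // index of the node to build in the final BVH
    std::vector<int> cut;       // corresponding cut in the auxiliary BVH
    int depth = 0;
};

class WorkQueue {
public:
    explicit WorkQueue(int workerCount) : workers(workerCount) {}

    void push(WorkItem item) {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(std::move(item));
        }
        cv.notify_one();
    }

    // Blocks until work is available; returns false once the queue is empty
    // and every worker is idle, i.e. the whole tree has been built.
    bool pop(WorkItem& item) {
        std::unique_lock<std::mutex> lock(m);
        ++idle;
        while (q.empty()) {
            if (done || idle == workers) { done = true; cv.notify_all(); return false; }
            cv.wait(lock);
        }
        --idle;
        item = std::move(q.front());
        q.pop();
        return true;
    }

    // Heuristic input for deciding whether to offload the right child to the
    // shared queue (e.g. below a fixed depth or whenever threads are starving).
    int idleWorkers() const { return idle.load(); }

private:
    std::queue<WorkItem> q;
    std::mutex m;
    std::condition_variable cv;
    std::atomic<int> idle{0};
    int workers;
    bool done = false;
};
```

A worker thread would then loop on pop(), subdivide the popped cut, continue with the left child itself, and push the right child only while the depth limit is not exceeded or idleWorkers() indicates starving threads.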
We use parallel sorting based on Morton codes and a subsequent binary search to identify the nodes of the auxiliary BVH. We first compute Morton codes for all triangles. For that, each thread is assigned a continuous span of ⌈n/t⌉ triangles, where n is the number of scene triangles and t is the number of threads. For parallel sorting, we use an approximate parallel bucket sort. In each thread, we insert the primitives into k buckets. We used k = 2
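For reference, a standard 30-bit Morton code (10 bits per axis, computed from a centroid normalized to the scene bounds) can be implemented as follows; the bit count and quantization here are assumptions, not necessarily the exact values used in the paper.

```cpp
#include <algorithm>
#include <cstdint>

// Spread the lowest 10 bits of v so that two zero bits separate consecutive
// input bits (standard bit-interleaving trick for 30-bit Morton codes).
static inline std::uint32_t expandBits(std::uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code of a triangle centroid given in coordinates normalized
// to the scene bounding box (x, y, z in [0, 1]).
std::uint32_t mortonCode(float x, float y, float z)
{
    auto quantize = [](float f) {
        return static_cast<std::uint32_t>(std::min(std::max(f * 1024.0f, 0.0f), 1023.0f));
    };
    return (expandBits(quantize(x)) << 2) | (expandBits(quantize(y)) << 1) | expandBits(quantize(z));
}
```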
We implemented the proposed method in C++ using standard language constructs. In the current implementation, we did not exploit SIMD instructions. We performed a series of tests comparing the build times, BVH costs, and the ray tracing performance of our method and several reference methods on nine test scenes. As a reference, we used our implementations of the method of Hunt et al. [[
The trace times were evaluated using a simple path tracer provided as a tutorial in Embree. The times are computed as average times for three different representative views of each scene using 1024×1024 resolution and 1 sample per pixel. The measurements were performed using 16 working threads on a PC equipped with 2×Intel Xeon E5-2620. The measured results are summarized in Table 1. Below we discuss the results from different points of view.
Table 1: Performance comparison of the tested methods. The reported numbers are averaged over three different viewpoints for each scene. For computing the SAH cost, we used c_T = 1 and c_I = 1. The leaf cost is computed using the Embree cost model, i.e. triangle quartets are considered a single primitive.
The table shows that the lowest build times among the reference methods are achieved by the Embree Morton method, usually followed by Hunt-50, AAC, Embree SAH, Hunt-500, and Embree Fast-Spatial. Our PHR-Fast method provides build times usually between Embree Morton and Hunt-50; thus, it is the second fastest builder tested. The PHR-HQ method has build times mostly between Embree SAH and Hunt-500. The PHR-HQ build times are about half those of Embree Fast-Spatial.
On scenes with a simple structure (e.g. Buddha), the PHR‐Fast and PHR‐HQ methods perform comparably to Hunt‐50 and Hunt‐500, respectively. On larger scenes, PHR‐Fast and PHR‐HQ achieve about 20% lower build times than Hunt‐50 and Hunt‐500 while leading to slightly higher quality BVHs.
The overall highest BVH quality in terms of trace times was achieved by the Embree Fast‐Spatial method. The PHR‐HQ method closely follows in half of the test scenes. In two tested scenes (Hairball and Soda Hall), PHR‐HQ actually provided marginally better trace performance. In four test scenes, PHR‐HQ provides about 10%–20% lower trace performance than Embree Fast‐Spatial.
The PHR‐Fast method provides trace performance usually between Embree Morton and Embree SAH while being closer to Embree SAH in terms of trace speed. The build times on the other hand are closer to Embree Morton, which makes the PHR‐Fast method a good candidate for interactive applications.
We provide a graphical comparison of build time vs. trace time for all tested methods in Figure 6. The figure also sketches the Pareto front known from multi-objective optimization [[
Spatial splits provide a trace speedup particularly for large architectural scenes with primitives of different sizes. The influence of spatial splits can be seen by comparison with the Hunt-50 and Hunt-500 methods, which do not employ spatial splits. The trace speedup due to our fast spatial splits method is in the range of 2% to 20%. Spatial splits had the largest positive impact in the San Miguel scene.
Our node level spatial splits could also be combined with triangle presplitting [[
Prior to determining the settings for PHR‐Fast and PHR‐HQ, we conducted a series of measurements to evaluate the dependence of the method on the α and δ parameters. We tested values of α slightly below ⅔ (an explanation of this threshold is given at the end of Section 4.3). We determined the range of measured δ values by balancing the amount of work done at each subdivision step with the potential of finding a precise enough split while focusing on the splits near the root. In particular, we tested all the combinations of α and δ from the sets {0.45, 0.5, 0.55, 0.65} and {5, 6, 7, 8, 9, 10}, respectively. From these measurements, we selected the two characteristic settings (PHR‐Fast, PHR‐HQ) that roughly defined the Pareto front for all scenes. In a selected subset of scenes, the representative parameter settings can be slightly different, and thus better performance/quality ratios could be achieved. For example, in San Miguel and Power Plant scenes, the optimal values are α = 0.5, δ = 7 for the high quality scenario (saving about 35% build time while actually decreasing the trace time by about 3%). In Conference and Hairball scenes, it is α = 0.5, δ = 5 for the fast build scenario (saving 9% and 33% build time, respectively, while keeping the same trace time).
We have tested the proposed method with other BVH builders for constructing the initial BVH (AAC and Binning‐SAH); the results were generally worse concerning Pareto optimality. For example, using AAC allowed us to build a BVH with a slightly lower cost (a few percent) for the same parameter settings of PHR; using different PHR parameters, a similar quality BVH could be constructed faster.
The results indicate that the method is configurable in a large range of build time vs. BVH quality trade‐off. On one side, this is a positive feature; on the other hand, this makes it difficult to find optimal parameters (α and δ) for the PHR method. We used two representative settings, but we observed that in a number of scenes setting different values to these parameters would provide better build times or trace performance while not significantly altering the other value.
The PHR‐Fast method seems a good fit for interactive applications. In fact, we could avoid constructing the auxiliary BVH from scratch and reuse it over multiple frames by simple refitting to further reduce the build time of PHR‐Fast. This strategy was already suggested by Hunt et al. [[
The current implementation of the method uses a moderately optimized C++ code, but we did not yet exploit SIMD constructs. We expect that by exploiting SIMD we could further lower the build times while keeping the same BVH quality. On a similar note, the sweep SAH algorithm used at the core of our method could be replaced by SIMD optimized binning SAH.
We proposed a novel scalable algorithm for BVH construction on multi‐core CPU architectures. The algorithm employs a two‐phase process: first it constructs an auxiliary BVH using the LBVH algorithm, followed by constructing the final BVH using progressively refined cuts of the auxiliary BVH. The progressive refinement of the cut size is driven by adapting surface area thresholds based on the current depth of the constructed node. We provided a simple way of integrating spatial splits in the BVH construction process.
The results show that the method yields superior build performance compared to the high‐quality builders implemented in the Embree framework while closely matching their ray tracing performance. In comparison with the strategy of Hunt et al. [[
There are a number of avenues for future work. We see great potential in the integration of static and dynamic content by using the progressive hierarchical refinement to merge a high-quality static hierarchy and a dynamically constructed auxiliary BVH. This technique could have immediate applications in video game technology. The proposed method scales over a large build time vs. quality range. Finding optimal parameters for a given scene and target frame rate in interactive applications remains an open problem.
This research was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS16/237/OHK3/3T/13.
Figure 1: The San Miguel scene rendered in the Embree path tracer [WWB*14]. Visualization of the number of traversal steps for primary rays using BVHs from different builders (red corresponds to 100 traversal steps per ray). From left to right: Embree SAH, our PHR-Fast, our PHR-HQ, and Embree Fast-Spatial. The bottom row shows the build times. Our PHR-Fast method provides a 1.6× lower build time than Embree SAH, while the PHR-HQ method has a 2× lower build time than Embree Fast-Spatial. In both comparisons, the builders provide equivalent ray tracing performance.
Figure 2: Overview of the method. A fast LBVH builder constructs the auxiliary BVH, which is used to build the final high-quality BVH using progressive hierarchical refinement (PHR).
Figure 3: Illustration of the hierarchical cut refinement. (left) When computing the split, we determine the optimal subdivision of the current cut (cut 0) based on full sweep SAH. This results in two sets of nodes on the current cut, corresponding to the nodes that will be contained in the left and right subtrees of the subdivided cut (shown in green and blue). (center) The nodes on the cut are reordered and split into two disjoint cuts. (right) The new cuts (cut 1 and cut 2) are formed by hierarchical refinement based on their surface area.
Figure 4: The average cut size at different BVH depths for the San Miguel scene. PHR-HQ (green) works on substantially larger cuts than PHR-Fast (purple). The adaptive threshold function implies a larger cut size at the top of the tree, which has a crucial impact on the overall tree quality.
Figure 5: Illustration of a spatial split. A spatial split is selected for the nodes forming the current BVH cut. Two subsets of nodes are formed for the left and right subtrees, and the bounding volumes of these subsets are clipped by the splitting plane. The node denoted by the * symbol is straddled by the split plane and therefore placed in both subsets. The left and right subsets are refined to form new cuts. In particular, the node * is replaced by its children (see the images at the bottom). Note that for the left subset, one of the child nodes (shown in red) does not intersect the currently clipped bounding volume, and thus it is discarded from the cut in this branch. The clipped bounding box can be shrunk to reflect this.
Figure 6: The relative performance of the different methods with respect to Embree SAH as the reference method, averaged over all test scenes.
By J. Hendrich; D. Meister and J. Bittner