Abstract: Bounding volume hierarchy (BVH) has been widely adopted as the acceleration structure in broad‐phase collision detection. Previous state‐of‐the‐art BVH‐based collision detection approaches exploited the spatio‐temporal coherence of simulations by maintaining a bounding volume test tree (BVTT) front. A major drawback of these algorithms is that large deformations in the scenes decrease culling efficiency and slow down collision queries. Moreover, for front‐based methods, the inefficient caching on GPU caused by the arbitrary layout of BVH and BVTT front nodes becomes a critical performance issue. We present a fast and robust BVH‐based collision detection scheme on GPU that addresses the above problems by ordering and restructuring BVHs and BVTT fronts. Our techniques are based on the use of histogram sort and an auxiliary structure BVTT front log, through which we analyze the dynamic status of BVTT front and BVH quality. Our approach efficiently handles inter‐ and intra‐object collisions and performs especially well in simulations where there is considerable spatio‐temporal coherence. The benchmark results demonstrate that our approach is significantly faster than the previous BVH‐based method, and also outperforms other state‐of‐the‐art spatial subdivision schemes in terms of speed.
CCS Concepts: • Computing methodologies → Collision detection; Physical simulation
Collision detection (CD) has usually been a major performance bottleneck in physically‐based computer simulations. The simulator needs to check for not only potential primitive overlaps between pairs of objects but also numerous self‐collisions within each model. This challenge is also encountered in other fields like haptics, robotics, and manufacturing.
Generally, a CD pipeline consists of several levels of filtering processes, typically a broad‐phase step followed by a narrow‐phase step. The broad‐phase step coarsely excludes large parts of primitives that cannot collide according to their bounding volumes (BVs) and generates a potential collision set [[
There are two main approaches used in broad‐phase CD: spatial subdivision and object partition. Spatial subdivision algorithms subdivide the space into cells. Many auxiliary data structures have been researched, including binary space partitioning (BSP) trees (including k‐d trees), grids, hierarchical spatial hashing tables, etc [[
Many parallel algorithms have been proposed, which exploit the benefits of GPU architectures for better performance. Several spatial subdivision methods have been developed to accelerate CD. [[
Among BVH‐based techniques, the primary issue is the maintenance of BVHs. In the context of CD, the construction speed is generally preferred over tree quality, therefore linear bounding volume hierarchy (LBVH) [[
However, refitting is not enough for most dynamic scenes consisting of deformable objects. Therefore, many selective restructuring approaches, which trade tree quality for maintenance speed, have been researched and developed. [[
Another practical issue is the memory layout of BVH nodes. [[
There is usually a certain degree of spatio‐temporal coherence between successive frames in physically‐based simulated environments. The traversal path of non‐overlapping nodes of a bounding volume test tree (BVTT), i.e. BVTT front [[
There are other optimization techniques for specific types of deformation, specialized models like hair [[
Motivated by the above issues and techniques, we investigate the optimization of memory layouts of BVHs and BVTT fronts, and present a novel, BVH‐based broad‐phase collision detection scheme on GPU. We make the following contributions:
An efficient computation method for ordering
Instead of performing an expensive comparison sort on BVH and BVTT front, we build our ordering scheme on top of histogram sort and make adjustments to compute a cache‐friendly layout for these nodes in parallel. As a result, our CD traversal gains significant speedup. Additionally, this technique is integrated with our maintenance operations (including restructuring) and results in considerable speedup. This procedure is at least two orders of magnitude faster than other common GPU sorting algorithms (including radix sort) and the overhead is very small.
An adaptive front‐based collision detection scheme
Conventional front‐based collision detection algorithms lack sensitivity to BVH quality degenerations. We develop a framework that can periodically detect such variations through BVTT front log, which contains the updated statistics of the BVTT front, and can also perform efficient restructuring. Our quality detection method is more correlated with front‐based CD performance than inspecting the status of BVHs due to our novel quality metric, and therefore achieves efficiency and robustness.
In this section, we give an overview of previous work most related to our approach, including fast construction of LBVHs, stack‐less BVH traversal and generalized BVTT front tracking.
Linear BVH Construction: BVHs are now widely used in various computer graphics applications because of their low memory footprint and flexibility in being adapted to model deformation. Unlike ray tracing, collision detection tends to trade tree quality for faster construction. [[
Stack‐less Hierarchy Traversal: Traditionally, a stack (depth‐first order) or a queue (breadth‐first order) is needed for CD depending on a certain traversal rule. In GPU‐based implementations, the storage of these structures is usually declared as arrays inside kernels that reside in local memory, and eventually end up in the DRAMs, resulting in a bandwidth‐limited application.
Another problem in self‐collision detection is that, unless an order is imposed on the two components of a pair, every collision pair appears twice, once with its components swapped. To eliminate these redundant pairwise BV tests, the BV of every primitive should be checked only against BVs enclosing primitives with greater (or smaller) indices.
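As a toy illustration of this index rule (plain pair enumeration with no BVs or BVH involved; the helper name is ours), testing each primitive only against primitives with greater indices yields every unordered pair exactly once:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: enumerate candidate pairs among n primitives, testing
// each primitive i only against primitives j > i, so that (i, j) and (j, i)
// are never both produced.
std::vector<std::pair<int, int>> candidatePairs(int n) {
    std::vector<std::pair<int, int>> pairs;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)  // greater indices only: no duplicates
            pairs.push_back({i, j});
    return pairs;
}
```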
The stack‐less hierarchy traversal technique handles the above problems well. The first stack‐less hierarchy traversal algorithm was proposed by [[
They also introduced the concept of the left child levels value (LCL value), indicating how many consecutive left children lie directly above the node (see Figure 2.3), which is essential to our implementation of BVH ordering as discussed in Section 3. The resulting memory layout of our BVH is similar to its pre‐indexed tree (depth‐first order) except that the storage of external nodes (i.e. leaves) is separated from that of internal nodes.
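The escape‐index traversal can be sketched sequentially as follows (a simplification of ours, not the paper's exact data structure: 1‑D intervals stand in for AABBs). Nodes are stored in depth‐first order, so on overlap an internal node descends to `cur + 1` (its left child); otherwise the thread jumps to the node's escape index, which skips the whole subtree without any stack:

```cpp
#include <cassert>
#include <vector>

// Simplified stackless depth-first traversal using escape indices.
// Nodes are laid out in depth-first order; escape = next node in that order
// outside the current subtree (-1 marks the end of traversal).
struct Node {
    float lo, hi;  // 1-D "AABB" for brevity
    int leaf;      // leaf id, or -1 for internal nodes
    int escape;    // index to jump to when this subtree is skipped or done
};

std::vector<int> query(const std::vector<Node>& bvh, float qlo, float qhi) {
    std::vector<int> hits;
    int cur = 0;
    while (cur != -1) {
        const Node& n = bvh[cur];
        bool overlap = (qlo <= n.hi && n.lo <= qhi);
        if (overlap && n.leaf < 0) {
            cur = cur + 1;                    // descend to left child
        } else {
            if (overlap) hits.push_back(n.leaf);  // record overlapping leaf
            cur = n.escape;                   // skip to the next subtree
        }
    }
    return hits;
}

// Hypothetical example tree over four leaves, preorder: I0, I1, L0, L1, I2, L2, L3.
std::vector<Node> exampleBVH() {
    return {
        {0, 7, -1, -1},  // I0 (root)
        {0, 3, -1,  4},  // I1, escape to I2
        {0, 1,  0,  3},  // L0
        {2, 3,  1,  4},  // L1
        {4, 7, -1, -1},  // I2
        {4, 5,  2,  6},  // L2
        {6, 7,  3, -1},  // L3
    };
}
```

A query interval overlapping only the two middle leaves visits just those branches and skips the rest via escape indices.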
BVTT Front‐based Traversal: There is extensive research on exploiting spatio‐temporal coherence for faster collision detection. We refer the reader to the literature [[
Algorithm 1: Conventional BVTT front‐based collision detection
A BVH‐based CD pipeline using the incremental scheme was proposed by [[
In this paper we extend the work of [[
In this section we examine the issues that limit the performance of conventional BVH‐based collision detection pipelines. We observe that low memory bandwidth efficiency and divergent parallel executions are priority problems within existing parallel CD algorithms, and accordingly propose a lightweight ordering scheme that sorts all the nodes in BVHs and BVTT fronts in a cache friendly layout for CD traversal.
In a BVH‐based CD pipeline, all input primitives of each object are sorted according to their Morton codes, then an internal hierarchy is built upon these primitives [[
However, reorganizing the BVH nodes in a depth‐first layout can substantially increase the cache hit rate, since no related BVH node is fetched from DRAM into the cache more than once for each individual primitive. On many‐core architectures like GPUs, consecutive threads checking intersections between spatially adjacent primitive BVs (adjacent due to primitive sorting) and the same BVH tend to share a similar traversal path, letting DRAM work close to the peak global memory bandwidth. In fact, matching the BVH layout to its access pattern is an instance of the coalesced global memory access optimization, because global memory is accessed in chunks of aligned 32, 64, or 128 bytes and cached in L1/L2. Furthermore, a structure‐of‐arrays (SoA) layout is enforced on BVH nodes to avoid bandwidth waste. This improvement not only significantly speeds up BVH‐based CD, but also lays the foundation of our efficient front‐based CD; see the BVH‐related queries at lines 3, 4, 16, 17, 18, 19, 20, 22, 25, 28, 32, 34, 35, 36, 38, 40, 41 in Algorithm 1.
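An illustrative SoA node storage might look like the sketch below (field names are ours, not the paper's exact layout): each AABB bound lives in its own contiguous array, so consecutive threads reading one field of consecutive nodes touch a single aligned memory chunk instead of striding over interleaved structs.

```cpp
#include <cassert>
#include <vector>

// Illustrative structure-of-arrays (SoA) storage for BVH node AABBs:
// one contiguous array per bound, enabling coalesced per-field reads.
struct BVHNodesSoA {
    std::vector<float> minX, minY, minZ;  // AABB lower corners
    std::vector<float> maxX, maxY, maxZ;  // AABB upper corners
};

// Overlap test between nodes i and j: reads exactly one array per axis bound.
bool overlap(const BVHNodesSoA& n, int i, int j) {
    return n.minX[i] <= n.maxX[j] && n.minX[j] <= n.maxX[i] &&
           n.minY[i] <= n.maxY[j] && n.minY[j] <= n.maxY[i] &&
           n.minZ[i] <= n.maxZ[j] && n.minZ[j] <= n.maxZ[i];
}
```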
Previous GPU implementations of BVTT front maintenance generally used atomic add instruction to insert BVTT nodes into front (see lines 24, 42 in Algorithm 1) during sprouting and pruning. Consequently, all newly built nodes are stored randomly in a compacted array. As line 2 in Algorithm 1 suggests, all these nodes are actually independent and each corresponds to a fine‐grained task which is to be scheduled on GPU. More specifically, each front node, i.e. front(i, j), indicates that its thread should traverse the BVH from node j. Because the count of front nodes is numerous, cache miss is also a serious issue for front‐based CD.
Fortunately, layout optimization can be used to improve the performance. Assume adjacent GPU threads start searching from the same BVH node or consecutive BVH nodes and that BVH nodes are also in depth‐first layout, then the same segment of BVH nodes residing within a small region of global memory will usually be concurrently accessed, no matter what operator is executed (sprouting or pruning).
To achieve the above cache‐friendly layout, a naive yet common method is to sort all the front nodes according to their second components in ascending order by comparison. However, the computation overhead of this method is rather high for a real‐time application. Instead, we propose an auxiliary structure, the order log, composed of a count log and an offset log, to realize a histogram sort on both BVH nodes and front nodes. The central idea of our ordering scheme is as follows:
Step 1. Count the number of nodes falling into each segment and store the results in the count log.
Step 2. Compute the exclusive prefix sums of the count log into the offset log, giving each segment's starting position.
Step 3. Scatter each node into its segment using the offset log.
Since the counting in Step 1 is lightweight and Step 3 is inevitable for any sorting algorithm, the performance gap between this method and others exists in Step 2, specifically the prefix sum computation, whose complexity is linear with the number of segments. Fortunately, its parallel implementation is fast enough that its overhead is almost negligible.
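The three steps above can be sketched sequentially as follows (function and variable names are ours; on GPU each step is a parallel kernel and Step 3 typically uses per‐segment atomic counters):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Histogram (counting) sort of elements by segment key. Returns, for each
// output slot, the original index of the element placed there; elements in
// the same segment keep their relative order.
std::vector<int> histogramSort(const std::vector<int>& keys, int numSegments) {
    std::vector<int> countLog(numSegments, 0);
    for (int k : keys) ++countLog[k];                      // Step 1: count per segment
    std::vector<int> offsetLog(numSegments, 0);
    std::exclusive_scan(countLog.begin(), countLog.end(),  // Step 2: exclusive prefix sum
                        offsetLog.begin(), 0);
    std::vector<int> ordered(keys.size());
    std::vector<int> cursor = offsetLog;                   // per-segment write cursors
    for (std::size_t i = 0; i < keys.size(); ++i)          // Step 3: scatter into place
        ordered[cursor[keys[i]]++] = static_cast<int>(i);
    return ordered;
}
```

Because Step 2 operates on one counter per segment rather than one key per element, the scan cost is linear in the number of segments, which is what keeps the overhead small.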
We notice that during depth‐first traversal, the left boundaries of the traversed internal nodes, i.e. the indices of external nodes, are strictly in non‐descending order. Therefore, each external node is treated as a segment and the index of each internal node's left boundary is used as a reference to its corresponding segment. Moreover, the positions of internal nodes within each segment from top to bottom should be in strict ascending order so that the order matches the exact depth‐first order.
Our BVH layout computation starts with counting LCL values of external nodes (see Figure 2.3) in count log. Note that the LCL values in our scheme are off‐by‐one compared to those in [[
After being computed within construction kernel, LCL values are exclusively‐scanned into the offset log. Before going into Step 3, the mapping between original indices and ordered indices of internal nodes (the numbers in the pink boxes shown in Figure 2.3) are calculated by each individual segment (leaf node) from top to bottom using offset log. Finally, all internal nodes are scattered in the ordered array with their tree topology links (parent handle, left/right child handle) updated. The resulting layout suits our stackless depth‐first collision queries very well.
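The mapping can be illustrated with a tiny sequential example (the tree encoding below is our simplification): each internal node's depth‐first index is the offset‐log entry of its leftmost leaf (its segment) plus its position, top to bottom, in that leaf's left‐child chain.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Each internal node is described by the leaf bounding its subtree on the
// left (its segment) and its depth within that segment's left-child chain
// (0 = topmost). The LCL count of a leaf is the length of its chain.
struct InternalNode { int leftmostLeaf; int chainPos; };

std::vector<int> depthFirstIndices(const std::vector<InternalNode>& nodes, int numLeaves) {
    std::vector<int> countLog(numLeaves, 0);
    for (const auto& n : nodes) ++countLog[n.leftmostLeaf];  // LCL value per leaf
    std::vector<int> offsetLog(numLeaves, 0);
    std::exclusive_scan(countLog.begin(), countLog.end(), offsetLog.begin(), 0);
    std::vector<int> df(nodes.size());
    for (std::size_t i = 0; i < nodes.size(); ++i)           // offset + position in chain
        df[i] = offsetLog[nodes[i].leftmostLeaf] + nodes[i].chainPos;
    return df;
}
```

For a four‐leaf tree with internals I0 (root), I1 (left half), and I2 (right half), the leftmost leaves are 0, 0, and 2; the LCL counts per leaf are {2, 0, 1, 0}, and the resulting depth‐first indices are 0, 1, 2, matching preorder.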
Conventionally, the BVTT front stores all front nodes in a single mega array, and all collision queries initiated from the front begin with accessing random BVH nodes. This obviously results in a huge number of cache misses. As described in Subsection 3.1, ordering front(i, j)s by their second components largely addresses this issue and reduces the memory latency (see pipeline a in Figure ). However, this simple workaround does not take Thread Divergence, more precisely branch divergence and loop divergence, into account. Since the hardware serializes different execution paths (due to branch divergence) inside each warp (a group of threads running concurrently on a multiprocessor), and idles the whole warp until the thread with the highest iteration count (due to loop divergence) finishes, there can be an unexpected performance drag as a result of the animation and collision status.
In our collision detection scheme, BVTT front is only enabled when there exists enough temporal coherence between successive frames; hence the front in the previous frame resembles the one in the current frame. As for each individual front node, the workload in sprout or prune correlated with coherent motion is therefore comparatively small. Even if a certain amount of BVH degeneration happens and causes adaptive front update, the depth‐first layout optimization of front nodes can partially counteract this negative impact, because all the front nodes that need considerable maintenance share the same BVH node index j and are therefore aggregated. On the whole, loop divergence is within our tolerance.
Branch divergence is a more critical issue. As shown in Algorithm 1, branch divergence happens whenever consecutive threads adopt different maintenance operations at lines 5, 7, both of which are essential functions for a robust and adaptive framework. Yet, we can alleviate this negative impact by simply categorizing the front nodes front(i, j) according to their second components. Those front(i, j) with j denoting an external (leaf) node of a BVH are grouped into an external front; otherwise, they go into an internal front. This design tactic is based on the fact that, for every front node in the external front, its sprout function generates next to zero overhead because there are no child nodes left to traverse. Even if the thread detects an intersection and enters the sprout branch, the performance loss due to serialization is negligible. Moreover, the strategy of separating external BVH nodes from internal nodes mentioned in Subsection 3.2 helps increase the bandwidth efficiency during the respective collision queries within both external fronts and internal fronts.
A more thorough practice of this strategy is illustrated in Figure . While preparing BVTT fronts for the next frame, we mark each node front(i, j) according to its maintenance branch. Those front nodes that execute prune branch are marked 0, otherwise they are marked 1. We combine this 1‐bit mark with the BVH node index j to get the segment key of node front(i, j) and use it for BVTT front ordering. Due to spatio‐temporal coherence, all front nodes that execute a specific operation will likely enter the same branch in the following frame, and these nodes are further arranged in a depth‐first order layout for better cache utilization. In practice, the mark is not embedded in the segment key, but rather indicates the direction in which to insert corresponding elements, i.e. 1 means to insert from back to front and 0 the opposite. Therefore, the quantity of segments in the external order log equals the number of external (leaf) nodes (e.g. 7 in Figure ) and is thus not doubled.
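The bidirectional insertion can be sketched sequentially as follows (names are ours): the 1‑bit mark selects whether a node fills its segment from the front or from the back, so the number of segments stays equal to the leaf count instead of doubling.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct FrontNode { int j; int mark; };  // BVH node index (segment) and 1-bit branch mark

// Returns the destination slot of each front node: mark 0 (prune branch)
// fills its segment front-to-back, mark 1 (sprout branch) back-to-front.
std::vector<int> orderWithMarks(const std::vector<FrontNode>& front, int numSegments) {
    std::vector<int> count(numSegments, 0);
    for (const auto& f : front) ++count[f.j];
    std::vector<int> offset(numSegments, 0);
    std::exclusive_scan(count.begin(), count.end(), offset.begin(), 0);
    std::vector<int> head = offset, tail(numSegments);
    for (int s = 0; s < numSegments; ++s) tail[s] = offset[s] + count[s] - 1;
    std::vector<int> pos(front.size());
    for (std::size_t i = 0; i < front.size(); ++i)
        pos[i] = front[i].mark ? tail[front[i].j]-- : head[front[i].j]++;
    return pos;
}
```

Within each segment, nodes that took the same branch end up contiguous, so consecutive threads in the next frame will likely execute the same branch together.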
Finally, we observe that the speedup from updating fronts in each frame does not make up for the overhead caused by memory operations at lines 24, 42 in Algorithm 1. Thus, the front update is only active once every few frames in our scheme and we call this maintenance operation preserve fronts (see pipeline b in Figure ).
The BVH‐based collision detection pipeline equipped with our ordering scheme gains remarkable speedup over the previous state‐of‐the‐art approach [[
A robust BVH‐based CD pipeline should be able to detect degenerations in a timely manner and pick out corresponding BVH subtrees before performing further operations. The previous method [[
A better metric should also consider the magnitude of collisions associated with each BVH node. In our front‐based CD, whenever a node degenerates, its related BVTT front nodes most likely sprout during update fronts (see Figure ). More precisely, such a front(i, j) will expand into several front(i, k) whose k indicates a node in the subtree of node j. However, the swelling of front nodes could also be the result of an increasing number of collisions with the subset of primitives that node j covers during regular animations. Based on these patterns, we design a metric function that takes as independent variables the quantity of BVTT front nodes and of collision pairs whose second components belong to the same subtree. For simplicity, the latter parameter is approximated by the number of external front nodes, so that both terms can be measured through the order log already computed for front ordering, referred to as the BVTT front log. Since the overall CD performance is fundamentally determined by the former term, our metric should also grow linearly with it. The optimal metric value needs to remain stable as the collision status changes. Our resulting metric, Formula 1, is as follows:
i is the index of an internal BVH node. Q(i) is our improved quality metric of node i. a, b are the smallest internal node index and the largest in subtree i (subtree rooted at node i), respectively. s, t are the smallest external node index and the largest in subtree i, respectively. intcnt(j) is the j‐th value of internal front count log, while extcnt(k) is the k‐th value of external front count log. intoffset(i) and extoffset(i) stand for the i‐th value of internal front offset log and external front offset log, respectively. Because of our BVH ordering, indices of all the nodes in the same subtree are consecutive; therefore the sum of front node counts within a subtree is simply the subtraction of two prefix sums in offset log (see Figure ).
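Formula 1 itself is missing from this extraction. From the symbol definitions above, and the later observation that the metric approximates the average number of front nodes per AABB collision pair, a plausible reconstruction (ours, hedged) is the ratio of all front nodes whose second components fall in subtree i to the external front nodes of that subtree:

```latex
Q(i) = \frac{\displaystyle\sum_{j=a}^{b} \mathrm{intcnt}(j) + \sum_{k=s}^{t} \mathrm{extcnt}(k)}
            {\displaystyle\sum_{k=s}^{t} \mathrm{extcnt}(k)}
```

Each sum collapses to a difference of two offset‐log entries (e.g. the internal sum equals the offset‐log value just past b minus the one at a), which is what makes the inspection constant‐time per subtree.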
We integrate our metric with several CD pipelines and test it on two benchmarks where fronts are activated. The curve (see Figure ) strongly supports the validity of our metric in that the overall collision time has a positive correlation with the quality curve. The lower the curve is, the faster its update fronts executes, which matches the interpretation of our metric function, i.e. the approximate average number of front nodes it takes to compute an AABB collision pair (mostly between 3 and 4). Another piece of evidence is that the reference quality curve of every benchmark remains steady no matter what its collision status is, while the static curve rises under the influence of BVH quality degenerations.
Since each calculation involves only four integers from the offset log already computed in the front ordering phase, the overhead of quality inspection is almost negligible. In practice, it occupies less than 5 percent of overall CD time in all benchmarks. The detection is performed whenever the BVH structure changes, more specifically right after generate fronts or update fronts (see Figure ).
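The constant‐time lookup can be sketched as follows (the sizing convention is ours: each offset log carries one trailing entry equal to the total count, so range sums are plain subtractions):

```cpp
#include <cassert>
#include <vector>

// O(1) count of front nodes whose second component lies in subtree i, which
// spans internal nodes [a, b] and external nodes [s, t] (consecutive ranges
// thanks to the BVH ordering). Offset logs are exclusive prefix sums with a
// trailing total, so a range sum is a difference of two entries.
int subtreeFrontCount(const std::vector<int>& intOffset,
                      const std::vector<int>& extOffset,
                      int a, int b, int s, int t) {
    return (intOffset[b + 1] - intOffset[a]) + (extOffset[t + 1] - extOffset[s]);
}
```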
However, there are several regions of interest in Figure . Range 1 shows that the optimal BVH structure upon a set of primitives may not be the one built by [[
The subtree candidates for restructuring are selected during quality inspection in update front. In the beginning of the next frame, the quantity of BVH nodes and related front nodes is then evaluated for further instructions (see Formula 2).
The implementation of BVH restructuring is generally the same as in [[
After a BVH is restructured, the portion of its related BVTT front nodes whose second components map to restructured BVH nodes becomes invalid. The rest are valid ones that can still be directly used in front‐based CD. As shown in Figure , reconstructing both BVHs and BVTT fronts is about one order of magnitude slower than preserve fronts (see Figure ). Therefore, a restructuring operation on the BVTT front, analogous to BVH maintenance, is desirable.
The central idea of restructure fronts (see pipeline c in Figure ) is that all valid front nodes are marked, compacted, and later handled by Algorithm 1 as usual, while the invalid front nodes are recalculated. This procedure is similar to the previous ordering scheme, except that the invalid front nodes are filtered out and replenished in an additional phase, during which each invalid front node front(i, j) is examined for further instruction. Normally, if node j covers the leftmost primitive of its subtree, the thread will traverse from front(i, root(j)); otherwise it is discarded. A particular case in the self‐collision front is that, if node i also lies in the subtree rooted at root(j), then node j certainly cannot cover the left boundary of the restructured subtree (its index is smaller than i). Instead, if node j covers primitive i+1, then the thread will traverse from front(i, i+1) within this subtree. In the end, the resulting front, in arbitrary layout, is sorted as in Subsection 3.3.
As described in Subsection 3.3, the BVTT front is sorted based on the value of the second component of each front node, and we use histogram sort as our sorting algorithm. To verify its efficiency, we compare it with the radix sort implementation in Thrust, which is considered one of the fastest sorting algorithms for short keys on a GPU. To perform radix sort, the integer keys are extracted and then sorted using the sort_by_key function from the Thrust library in CUDA Toolkit 8.0. Figure displays the overheads of the two sorting algorithms in three benchmarks. The results show that radix sort is approximately one order of magnitude slower than histogram sort.
The improvement from ordering is as expected. Figure compares the execution statistics of the most frequently used kernel preserve‐fronts (see Figure ), the performance of which is governed by our ordering scheme. It implies that both global memory access and warp execution efficiency benefit from the ordering.
We implement two versions of our CD pipeline. The first version is embedded in the ARCSim simulator in [[
Flag: A single piece of hanging cloth (80K triangles, deformable) flutters in constant wind and with gravity.
Sphere: A piece of hanging cloth (66K triangles, deformable) is hit by a forward/backward moving sphere (1K triangles, rigid).
Victor: A character (104K triangles, rigid) holds still in a red dress (54K triangles, deformable).
DressBlue: A character (27K triangles, rigid) dances in a blue dress (34K triangles, deformable).
The other version is a standalone broad‐phase collision detection application. Five publicly available benchmarks also used in OTG [[
Funnel (18.5K triangles): A piece of cloth is squeezed by a ball and falls through the funnel.
Cloth Ball (aka Whirling Cloth in kDet) (92K triangles): A piece of cloth is draped over a rotating sphere.
Flamenco (49K triangles): A female character dances a flamenco in a red dress with multiple layers of cloth.
N‐body* (146K triangles): Hundreds of spheres and five cones collide in a cuboid space.
Dragon* (252K triangles): A bunny bumps into a dragon and breaks it into pieces.
To verify the effectiveness of our ordering and restructuring techniques, we test the following five strategies.
Unordered curve. Constructs the BVH once, refits it in the rest of the frames. Fronts are active without ordering.
Static curve. Constructs the BVH once, refits it in the rest of the frames. Fronts are active and ordered.
Periodic curve. Replaces update fronts with generate fronts in Strategy 2 every second cycle.
Restructure curve. Enables restructuring scheme on BVTT fronts in addition to Strategy 2. Fronts are rebuilt/restructured along with BVH restructuring.
Reference curve. Rebuilds BVHs and BVTT fronts (i.e. “generate fronts”) every frame to retrieve the ideal quality.
Figure demonstrates the CD performance of the above strategies. We can see that front‐based CD algorithms (i.e. update fronts, preserve fronts, etc.) are significantly faster than BVH‐based CD (i.e. generate fronts), and that applying our ordering scheme to both the BVH and the BVTT front brings further considerable speedup to front‐based CD, based on the comparison between the unordered curve and the static curve.
Although our restructuring scheme has a relatively small effect on the overall CD performance among these benchmarks, its restructure curve is smoother and steadier than its periodic curve. It precisely detects degenerations and only performs restructuring/reconstruction when necessary. Moreover, the quality of the restructured tree is comparable to the reconstructed one. Therefore the restructure curve is faster than the periodic curve and more robust than the static curve due to its adaptivity to degenerate situations.
The summary of performance comparison between several methods is shown in Figure . Both ordering and restructuring improve the overall CD performance including the narrow phase (same implementation as gProximity also used in [[
There are three benchmarks in which our method exhibits only marginal gains. In the benchmarks with suffix * (i.e. N‐body and Dragon), the overall performance benefits solely from the BVH ordering. Since most primitives are uniformly distributed in the scene, GPU workloads are balanced in OTG and kDet, while in our algorithm each thread starts from a different leaf node location and all end up searching the rightmost branch of the BVH (see Figure ), resulting in divergent workloads and a slowdown. The BVTT front is enabled in the ClothBall benchmark, yet it results in insignificant speedup over OTG. According to Figure , the performance is, on average, 1.7ms over the first 70 frames (on GTX1080). The efficiency then drops significantly during the last few frames, when the cloth twists around the ball. Besides the increase in collisions, the inability of our scheme to adapt to such coherence changes affects the performance. In other benchmarks, our scheme takes full advantage of coherence and is therefore much faster.
The memory footprint of our method comes from BVHs and BVTT fronts. When the number of primitives is 524288 and the upper limit of BVTT front size is set to 12000000 (7000000 for the internal front and 5000000 for the external front), the total memory overhead of our entire CD algorithm is approximately 900MB. However, the largest memory budget (from ClothBall) among all standalone benchmarks is actually less than 700MB, with 100000 primitives. Although the memory overhead is larger than the reported 512MB for a nine‐level counter octree plus 64MB for the type octree in OTG [[
We present a BVH‐based broad‐phase collision detection framework on GPU. It demonstrates a significant speedup over prior collision detection algorithms due to our ordering scheme, as well as adaptivity to performance‐correlated degenerations due to our novel quality metric and efficient restructuring operations. However, there are certain limitations in our scheme and room for improvement.
Limitations: The inherent issue in our scheme is the large memory consumption of BVTT fronts, especially for high resolution scenes. Even though other techniques, e.g., normal cone [[
Room for improvement: The metric currently used to decide restructuring and to pick out degenerated subtrees is entirely local in terms of time. It performs well in smooth animations, but in cases where motions are large, the benefit of restructuring may soon be offset by the overhead of restructuring itself (see Figure ). It is a common challenge for restructuring to evaluate the tradeoff between operation overhead and resulting performance gain. In such scenarios, the ideal timing of restructuring should be at critical events when a long‐term deformation happens, e.g., inelastic motions. Therefore in event‐based simulations, the information of events can also be utilized as a supplement to our metric for performance guarantee.
Another defect is that the decision of whether to use BVTT fronts for collision detection is static throughout the whole animation. Once the extent of spatio‐temporal coherence changes quickly, our scheme is not able to dynamically switch BVTT fronts on/off. Even if the evaluation concludes that the pipeline should proceed with BVTT fronts enabled, the fixed length of the maintenance cycle (the length of the preserve fronts loop in Figure ) cannot adjust to such coherence variations.
Although the above issues have little negative impact on benchmarks with consistent spatio‐temporal coherence, they may become performance barriers and need managing for more general applicability.
Topology operators on models are not currently viable in our GPU version of the ARCSim simulator. Even though we believe our restructuring scheme can support topology changes by inserting/deleting primitives, this feature is subject to the establishment of mapping between previous and current primitives and thus has not been realized yet. We intend to continue this work, and apply it to a wider range of scenarios.
During research, we discovered that BVHs have recently been a hot topic in the field of ray tracing. Several GPU‐based restructuring techniques aiming for higher ray throughput have been proposed [[
We would like to thank the reviewers for their constructive comments, Tongtong Wang and Zhongyuan Liu for the assistance in the paper submission. All benchmarks are from the UNC Dynamic Scene Benchmarks collection. The project is supported in part by the National Key Research and Development Program (2017YFB1002700), NSFC (61572423, 61572424, 61732015), the Science and Technology Project of Zhejiang Province (2018C01080), and Zhejiang Provincial NSFC (LZ16F020003). Dinesh Manocha is supported in part by the 1000 National Scholar Program of China and NSFC (61732015).
Supporting Information
PHOTO (COLOR): Traversal path of stackless self‐collision detection of primitive‐1 without BVTT front, assuming that the primitive‐1 overlaps with all other primitives in this BVH.
PHOTO (COLOR): BVH Computation 1. LBVH built upon the ordered primitives by [Ape14]. The index of each internal node is that of the external node which splits its range into two halves, i.e. the rightmost leaf index of its left children. 2. The escape indices of all BVH nodes used for stackless traversal. The integer within the braces denotes whether it is the index of an external node, 1 means true, 0 otherwise. All escape indices falling out of the BVH are marked as −1. 3. Sort all internal nodes in a depth‐first traversal order (specified in the pink boxes). Note that prefix sum computation uses the exclusive scan.
PHOTO (COLOR): In this example, we are checking BV overlaps between two primitives and this BVH. All the front nodes are displayed in the left half with respect to their primitive. The right half exhibits our 3‐step ordering scheme. In Step 1, front nodes categorized into the internal front and the external front are originally in arbitrary order and marked with a 1‐bit tag, the Opt Mark, indicating which branch in the maintenance function is executed. In Step 2, we count the number of nodes in each segment and compute prefix sums. In Step 3, each front node first refers to its corresponding segment by its second component and the Opt Mark tag, then checks the offset log to acquire its segment's starting position, and is finally inserted into this segment.
PHOTO (COLOR): During each frame, our CD framework executes one of the numbered operations ranging from 1 to 5. Three kinds of collision detection pipelines are presented above. Pipeline a. This pipeline focuses on GPU cache optimization. We add a BVTT front ordering phase in circle 1 and circle 3. For further speedup, we disable front maintenance in circle 2. Note that whenever the BVH is rebuilt from scratch, this pipeline is reset to circle 1. Pipeline b. Circle 3 usually takes a lot more time than circle 2. In addition to the overhead of front memory operations, the accumulated deformations in circle 2 also bring down the CD efficiency. Therefore circle 4 is injected into the cycle to alleviate thread divergence in circle 3. Pipeline c. The complete pipeline is integrated with our quality inspection and restructuring. The decision for circle 3 to go to 1 or 5 depends on the current degeneration scale measured through our metric.
PHOTO (COLOR): Three quality curves (a static curve, a periodic curve and a reference curve) and corresponding CD performance curves (total CD time per frame on GTX1080) of the benchmark Funnel. The static curve represents pipeline b in Figure , and the periodic curve is the variation that replaces the “update fronts” operation with “generate fronts” every second cycle. The reference curve rebuilds BVHs and BVTT fronts (i.e. “generate fronts”) every frame in order to retrieve the ideal quality under the assumption that the construction algorithm [Ape14] builds the optimal tree structure for CD. The periodic curve shows that “preserve fronts” and “balance fronts” are approximately one order of magnitude faster than “generate fronts”, and the cost of “update fronts”, which executes Algorithm 1, is about twice that of “preserve fronts”. The two circles marked in red and green display the statistics of “update fronts”. The difference between the static curve and the periodic curve suggests that whichever has a higher quality metric takes more time on the CD query.
PHOTO (COLOR): Histogram Sort vs. Radix Sort. Experiment conducted on GTX1080. The time spent on sorting depends linearly on the size of BVTT front, therefore it fluctuates as the front length varies during animations.
PHOTO (COLOR): Impact of Ordering on ClothBall Benchmark. Tested on GTX780. All statistics are gathered through NVidia Visual Profiler 8.0.
PHOTO (COLOR): Five curves of five strategies for the Whirling Cloth benchmark tested on GTX1080. The red circles indicate a BVH restructuring event. Since the number of related front nodes is above the threshold, the BVTT fronts at both events are reconstructed right before further degenerations. The time interval between these two events is small, so the benefits are unfortunately soon offset by their overhead.
PHOTO (COLOR): Overall CD Performance. We disable BVTT fronts in certain benchmarks marked with suffix * due to a lack of spatio‐temporal coherence. In other benchmarks, we use pipeline b or c in Figure .
By Xinlei Wang; Min Tang; Dinesh Manocha and Ruofeng Tong