Stackless trees

Pierre · Post by **Pierre** » Tue Oct 24, 2006 11:42 am

Hi,

I have a question about Bullet's stackless trees: are they compatible with the usual raycast optimization where, in search of the closest hit, you go to the closest child first during the tree traversal?

As far as I can see it is not possible to jump from a parent node to the second child directly, because then there's no way to go back to the first child later.

Thus, using the stackless tree seems to prevent a rather important optimization for raycast queries. Which is kind of a shame because stackless is a lot sexier than recursive.

Any idea? Did I miss something?

- Pierre

Erwin Coumans · Post by **Erwin Coumans** » Tue Oct 24, 2006 4:43 pm

Bullet's BVH data structures can be traversed stackless and recursive (it is compatible with both because of 'redundant' pointers for the stackless). For raycast you will probably need to use a recursive traversal, because the stackless traversal implicitly assume fixed order of traversal (always left children then right). Right now, Bullet still keeps the child pointers for the recursive traversal. When you get rid of those, the left child is implicitly the next node, but the right child cannot be directly accessed. So you will need to modify the datastructure to allow direct access to the right child. I haven't looked into this in details but one solution to access the right child for recursive raycast traversals would be to store one additional index.

Note that Bullet doesn't implement raycast optimizations for the BVH, it just performs a culling on the aabb of the entire ray with all child aabb's. This optimizations should be on the todo list.

Thanks,
Erwin

Pierre wrote:Hi,

I have a question about Bullet's stackless trees: are they compatible with the usual raycast optimization where, in search of the closest hit, you go to the closest child first during the tree traversal?

As far as I can see it is not possible to jump from a parent node to the second child directly, because then there's no way to go back to the first child later.

Thus, using the stackless tree seems to prevent a rather important optimization for raycast queries. Which is kind of a shame because stackless is a lot sexier than recursive.

Any idea? Did I miss something?

- Pierre

Pierre · Post by **Pierre** » Wed Oct 25, 2006 11:41 am

Bullet's BVH data structures can be traversed stackless and recursive (it is compatible with both because of 'redundant' pointers for the stackless).

Well, sure, if you keep both the pointers and the escape index, then no problem. But increasing the size of a node like this is not a good idea, and probably not an option in the real world. So, at the end of the day, when the code is final, it has to be one or the other.

I haven't looked into this in details but one solution to access the right child for recursive raycast traversals would be to store one additional index.

Not an option either, I think. At the moment the "stackless node" has the same (theoretical) size as the "normal" node, so that's good. But adding an extra index would make it bigger than necessary, and it would lose its appeal for me.

Note that Bullet doesn't implement raycast optimizations for the BVH, it just performs a culling on the aabb of the entire ray with all child aabb's. This optimizations should be on the todo list.

I am interested in stackless trees in general, not just in the context of Bullet. I am listing the pros and cons, etc. So far they seem to be great for some next-gen consoles, but not that interesting on PC.

- Pierre

Erwin Coumans · Post by **Erwin Coumans** » Wed Oct 25, 2006 4:50 pm

Hi Pierre,

I am thinking of (de)compressing the AABB trees using zlib, to manage the size. On Playstation 3 you can just run it all on the SPU, and while decompressing do a stackless traversal for a batch of primitives.

As I said, I haven't looked much yet into how to access the right child directly, but there must be some solution that doesn't requires additional data. I haven't implemented it yet, but perhaps the zlib (de)compression gives better result then the quantizing.

Another idea in this area of AABB trees is to quantizing the nodes/aabb's and perform the aabb versus aabb check on the quantized aabb and not converting the quantized to floating point.

Thanks for the feedback, I'll try to bring up some solution,
What is the size of a single node that you are targetting? still 20 bytes?
Erwin

Pierre wrote:
Bullet's BVH data structures can be traversed stackless and recursive (it is compatible with both because of 'redundant' pointers for the stackless).
Well, sure, if you keep both the pointers and the escape index, then no problem. But increasing the size of a node like this is not a good idea, and probably not an option in the real world. So, at the end of the day, when the code is final, it has to be one or the other.

I haven't looked into this in details but one solution to access the right child for recursive raycast traversals would be to store one additional index.
Not an option either, I think. At the moment the "stackless node" has the same (theoretical) size as the "normal" node, so that's good. But adding an extra index would make it bigger than necessary, and it would lose its appeal for me.

Note that Bullet doesn't implement raycast optimizations for the BVH, it just performs a culling on the aabb of the entire ray with all child aabb's. This optimizations should be on the todo list.
I am interested in stackless trees in general, not just in the context of Bullet. I am listing the pros and cons, etc. So far they seem to be great for some next-gen consoles, but not that interesting on PC.

- Pierre

Erwin Coumans · Post by **Erwin Coumans** » Sun Oct 29, 2006 6:49 pm

Selecting one of the other children as first traversal for raycast makes the worst case worse. I doubt you need this 'optimization'. Better would be to randomize the children, so you get a good average case.

Also, doing a stackless skip-list traversal allows you to perform calculations on the sub-trees in parallel (6 times on a SPU, or many more times on a GPU).

Is this optimization really worth the hassle?

How many bytes are you using for your stackless Node? Can you please give the members you are using?

Thanks,
Erwin

Pierre · Post by **Pierre** » Mon Oct 30, 2006 9:17 am

Selecting one of the other children as first traversal for raycast makes the worst case worse.

What is the worst case? When the ray doesn't touch anything? Then this is correct, in this case the extra CPU time spent on ordering the nodes is wasted. However, sorting the 2 candidate nodes is very cheap (2 dot products) and in practice the overhead is difficult to notice (as you know it's nothing compared to memory access).

I doubt you need this 'optimization'.

You only say that because it makes your life easier, but you would include it in Bullet immediately if it was easy

The optimization is not "vital", but it -definitely- helps.

There is a fundamental problem with the "2 passes" approach we talked about a while back. You said that Opcode would be cleaner if things were separated in distinct phases:
1) traverse the tree, gather touched triangles
2) perform the actual test against reported triangles

This is certainly cleaner, but for raycasts it makes a few optimizations difficult to implement. For example if you traverse closest nodes first and shrink the ray according to current best distance, you stop the traversal earlier (culling a lot of nodes that would otherwise have been visited).

Let's take a pathological case: imagine your mesh is just a vertial pile of triangles, and your ray is vertical as well, touching all of them. If you do 1), then you end up reporting all the nodes of the tree (touching the full tree memory) before performing a single raycast in 2). However, if you do both at the same time, and implement the optimizations I mentioned, then it's likely that you will only fetch a single node (or the few first nodes), and then early exit because the shrunk ray doesn't touch the remaining nodes.

I found that this optimization does help in real cases - at least in a SW version.

Better would be to randomize the children, so you get a good average case.

I doubt this is really useful.

Also, doing a stackless skip-list traversal allows you to perform calculations on the sub-trees in parallel

Maybe, but I think this is off-topic

Is this optimization really worth the hassle?

With usual trees it certainly is, especially since it's very easy to implement. With stackless trees, no, it is not.

How many bytes are you using for your stackless Node?

I'm not the one who rewrote them, but I think it's 20 bytes.

- Pierre

Erwin Coumans · Post by **Erwin Coumans** » Mon Oct 30, 2006 3:25 pm

This is certainly cleaner, but for raycasts it makes a few optimizations difficult to implement. For example if you traverse closest nodes first and shrink the ray according to current best distance, you stop the traversal earlier (culling a lot of nodes that would otherwise have been visited).

It depends which target architecture. Perhaps on PC the recursive AABB-tree traversal works very well interleaved with ray-triangle checks.
For next-gen consoles like Playstation 3 random access is much more tricky then a steaming solution (stackless).

Shortening the ray during traversal is very useful indeed, ideally you can prefetch the triangle data (and overlap the DMA/LV2 cache miss with more traversal).

If storing an extra skip-index for the right-child is too much memory, you could solve this problem by compressing the Node even more: you can save memory by just storing the differences in AABB with the parent, instead of the full object-space AABB.

I've seen some good results using this idea in another SDK. Have you considered delta-compression?
Thanks,
Erwin

Pierre · Post by **Pierre** » Mon Oct 30, 2006 4:30 pm

If storing an extra skip-index for the right-child is too much memory

Actually, storing an extra index might not be such a big deal. But does it solve the problem? I'm not sure I see how a second index helps. There is a node, we decide to jump to the right-child.... and then what? How do we ever backtrack to the left-child? Note that there might be a way, I just don't see it right now.

you could solve this problem by compressing the Node even more: you can save memory by just storing the differences in AABB with the parent

Well, since you mention it.... Isn't it also an issue with stackless trees? How do you pass a parent node pointer to the children? I didn't find an obvious way, except if storing the pointer directly in the node itself (which is obviously not a great way to save memory...)

I've seen some good results using this idea in another SDK. Have you considered delta-compression?

Yes, theoretically, but never actually tried. I don't see how to do that with stackless trees anyway....

- Pierre

Erwin Coumans · Post by **Erwin Coumans** » Mon Oct 30, 2006 4:38 pm

Pierre wrote:
If storing an extra skip-index for the right-child is too much memory
Actually, storing an extra index might not be such a big deal. But does it solve the problem? I'm not sure I see how a second index helps. There is a node, we decide to jump to the right-child.... and then what? How do we ever backtrack to the left-child? Note that there might be a way, I just don't see it right now.

I meant sharing the same AABB tree datastructure, so you can use stackless tree traversal for discrete/AABB overlap, but also use recursive traversal for raycasts (using this additional right child-index).

you could solve this problem by compressing the Node even more: you can save memory by just storing the differences in AABB with the parent
Well, since you mention it.... Isn't it also an issue with stackless trees? How do you pass a parent node pointer to the children? I didn't find an obvious way, except if storing the pointer directly in the node itself (which is obviously not a great way to save memory...)
I've seen some good results using this idea in another SDK. Have you considered delta-compression?
Yes, theoretically, but never actually tried. I don't see how to do that with stackless trees anyway....

- Pierre

Stackless+ delta decompression: During stackless traversal, one solution would be to keep the current transformstack explicitly, and rebuild it (using delta compression/decompression) while traversing. It is not easy, and nodes will probably have variable size, but the skip-indices can compensate for this. (by the way, this is brainstorming, I haven't implemented this)

Hope this helps,
Erwin

Christer Ericson · Post by **Christer Ericson** » Sat Nov 04, 2006 7:21 am

I can imagine different definitions for "stackless trees." How are the stackless trees you guys are discussing set up? What are the perceived benefits and usage-cases?

Erwin Coumans · Post by **Erwin Coumans** » Sat Nov 04, 2006 8:05 am

Christer Ericson wrote:I can imagine different definitions for "stackless trees." How are the stackless trees you guys are discussing set up? What are the perceived benefits and usage-cases?

Hi Christer,

Good to see you again!

The trees are store in an array and skip indices are inserted in the nodes to skip parts of the tree/array that have no AABB overlap. In my profiling comparisons, stackless tree traversals are more efficient then recursive traversals on Playstation 3 SPU (I heard on XBox 360 too), and they can also be used on GPU. Benefits are streaming access (only moving forward in the node-array, no random access), no (stack) memory access needed to save 'return info' for the recursion and easier to parallelize.

Unoptimized stackless trees (setup+traversal) implementation in Bullet:
http://www.continuousphysics.com/Bullet ... tml#l00270

It is not a novel idea, and several older papers discuss it, see page 35,36 for a description:
http://www.continuousphysics.com/ftp/pu ... 001-06.pdf

More recent GPU implementation (master thesis):
http://www.continuousphysics.com/ftp/pu ... thesis.pdf

Typical use case is collision detection and raycasting against static triangle meshes (preferably using batching and double/multi-buffering to hide the latency of the DMA transfer/L2 cache misses).

Real-Time Physics Simulation Forum

Stackless trees

Stackless trees

Re: Stackless trees