Stackless trees

Pierre
Posts: 67
Joined: Mon Jul 25, 2005 8:56 am

Stackless trees

Post by Pierre »

Hi,

I have a question about Bullet's stackless trees: are they compatible with the usual raycast optimization where, in search of the closest hit, you go to the closest child first during the tree traversal?

As far as I can see it is not possible to jump from a parent node to the second child directly, because then there's no way to go back to the first child later.

Thus, using the stackless tree seems to prevent a rather important optimization for raycast queries. Which is kind of a shame because stackless is a lot sexier than recursive.

Any idea? Did I miss something?

- Pierre
User avatar
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Re: Stackless trees

Post by Erwin Coumans »

Bullet's BVH data structures can be traversed stackless and recursive (it is compatible with both because of 'redundant' pointers for the stackless). For raycast you will probably need to use a recursive traversal, because the stackless traversal implicitly assume fixed order of traversal (always left children then right). Right now, Bullet still keeps the child pointers for the recursive traversal. When you get rid of those, the left child is implicitly the next node, but the right child cannot be directly accessed. So you will need to modify the datastructure to allow direct access to the right child. I haven't looked into this in details but one solution to access the right child for recursive raycast traversals would be to store one additional index.

Note that Bullet doesn't implement raycast optimizations for the BVH, it just performs a culling on the aabb of the entire ray with all child aabb's. This optimizations should be on the todo list.

Thanks,
Erwin
Pierre wrote:Hi,

I have a question about Bullet's stackless trees: are they compatible with the usual raycast optimization where, in search of the closest hit, you go to the closest child first during the tree traversal?

As far as I can see it is not possible to jump from a parent node to the second child directly, because then there's no way to go back to the first child later.

Thus, using the stackless tree seems to prevent a rather important optimization for raycast queries. Which is kind of a shame because stackless is a lot sexier than recursive.

Any idea? Did I miss something?

- Pierre
Pierre
Posts: 67
Joined: Mon Jul 25, 2005 8:56 am

Post by Pierre »

Bullet's BVH data structures can be traversed stackless and recursive (it is compatible with both because of 'redundant' pointers for the stackless).
Well, sure, if you keep both the pointers and the escape index, then no problem. But increasing the size of a node like this is not a good idea, and probably not an option in the real world. So, at the end of the day, when the code is final, it has to be one or the other.
I haven't looked into this in details but one solution to access the right child for recursive raycast traversals would be to store one additional index.
Not an option either, I think. At the moment the "stackless node" has the same (theoretical) size as the "normal" node, so that's good. But adding an extra index would make it bigger than necessary, and it would lose its appeal for me.
Note that Bullet doesn't implement raycast optimizations for the BVH, it just performs a culling on the aabb of the entire ray with all child aabb's. This optimizations should be on the todo list.
I am interested in stackless trees in general, not just in the context of Bullet. I am listing the pros and cons, etc. So far they seem to be great for some next-gen consoles, but not that interesting on PC.

- Pierre
User avatar
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

Hi Pierre,

I am thinking of (de)compressing the AABB trees using zlib, to manage the size. On Playstation 3 you can just run it all on the SPU, and while decompressing do a stackless traversal for a batch of primitives.

As I said, I haven't looked much yet into how to access the right child directly, but there must be some solution that doesn't requires additional data. I haven't implemented it yet, but perhaps the zlib (de)compression gives better result then the quantizing.

Another idea in this area of AABB trees is to quantizing the nodes/aabb's and perform the aabb versus aabb check on the quantized aabb and not converting the quantized to floating point.

Thanks for the feedback, I'll try to bring up some solution,
What is the size of a single node that you are targetting? still 20 bytes?
Erwin
Pierre wrote:
Bullet's BVH data structures can be traversed stackless and recursive (it is compatible with both because of 'redundant' pointers for the stackless).
Well, sure, if you keep both the pointers and the escape index, then no problem. But increasing the size of a node like this is not a good idea, and probably not an option in the real world. So, at the end of the day, when the code is final, it has to be one or the other.
I haven't looked into this in details but one solution to access the right child for recursive raycast traversals would be to store one additional index.
Not an option either, I think. At the moment the "stackless node" has the same (theoretical) size as the "normal" node, so that's good. But adding an extra index would make it bigger than necessary, and it would lose its appeal for me.
Note that Bullet doesn't implement raycast optimizations for the BVH, it just performs a culling on the aabb of the entire ray with all child aabb's. This optimizations should be on the todo list.
I am interested in stackless trees in general, not just in the context of Bullet. I am listing the pros and cons, etc. So far they seem to be great for some next-gen consoles, but not that interesting on PC.

- Pierre
User avatar
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

Selecting one of the other children as first traversal for raycast makes the worst case worse. I doubt you need this 'optimization'. Better would be to randomize the children, so you get a good average case.

Also, doing a stackless skip-list traversal allows you to perform calculations on the sub-trees in parallel (6 times on a SPU, or many more times on a GPU).

Is this optimization really worth the hassle?

How many bytes are you using for your stackless Node? Can you please give the members you are using?

Thanks,
Erwin
Pierre
Posts: 67
Joined: Mon Jul 25, 2005 8:56 am

Post by Pierre »

Selecting one of the other children as first traversal for raycast makes the worst case worse.
What is the worst case? When the ray doesn't touch anything? Then this is correct, in this case the extra CPU time spent on ordering the nodes is wasted. However, sorting the 2 candidate nodes is very cheap (2 dot products) and in practice the overhead is difficult to notice (as you know it's nothing compared to memory access).
I doubt you need this 'optimization'.
You only say that because it makes your life easier, but you would include it in Bullet immediately if it was easy :)

The optimization is not "vital", but it -definitely- helps.

There is a fundamental problem with the "2 passes" approach we talked about a while back. You said that Opcode would be cleaner if things were separated in distinct phases:
1) traverse the tree, gather touched triangles
2) perform the actual test against reported triangles

This is certainly cleaner, but for raycasts it makes a few optimizations difficult to implement. For example if you traverse closest nodes first and shrink the ray according to current best distance, you stop the traversal earlier (culling a lot of nodes that would otherwise have been visited).

Let's take a pathological case: imagine your mesh is just a vertial pile of triangles, and your ray is vertical as well, touching all of them. If you do 1), then you end up reporting all the nodes of the tree (touching the full tree memory) before performing a single raycast in 2). However, if you do both at the same time, and implement the optimizations I mentioned, then it's likely that you will only fetch a single node (or the few first nodes), and then early exit because the shrunk ray doesn't touch the remaining nodes.

I found that this optimization does help in real cases - at least in a SW version.
Better would be to randomize the children, so you get a good average case.
I doubt this is really useful.
Also, doing a stackless skip-list traversal allows you to perform calculations on the sub-trees in parallel
Maybe, but I think this is off-topic :)
Is this optimization really worth the hassle?
With usual trees it certainly is, especially since it's very easy to implement. With stackless trees, no, it is not.
How many bytes are you using for your stackless Node?
I'm not the one who rewrote them, but I think it's 20 bytes.

- Pierre
User avatar
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

This is certainly cleaner, but for raycasts it makes a few optimizations difficult to implement. For example if you traverse closest nodes first and shrink the ray according to current best distance, you stop the traversal earlier (culling a lot of nodes that would otherwise have been visited).
It depends which target architecture. Perhaps on PC the recursive AABB-tree traversal works very well interleaved with ray-triangle checks.
For next-gen consoles like Playstation 3 random access is much more tricky then a steaming solution (stackless).

Shortening the ray during traversal is very useful indeed, ideally you can prefetch the triangle data (and overlap the DMA/LV2 cache miss with more traversal).

If storing an extra skip-index for the right-child is too much memory, you could solve this problem by compressing the Node even more: you can save memory by just storing the differences in AABB with the parent, instead of the full object-space AABB.

I've seen some good results using this idea in another SDK. Have you considered delta-compression?
Thanks,
Erwin
Pierre
Posts: 67
Joined: Mon Jul 25, 2005 8:56 am

Post by Pierre »

If storing an extra skip-index for the right-child is too much memory
Actually, storing an extra index might not be such a big deal. But does it solve the problem? I'm not sure I see how a second index helps. There is a node, we decide to jump to the right-child.... and then what? How do we ever backtrack to the left-child? Note that there might be a way, I just don't see it right now.
you could solve this problem by compressing the Node even more: you can save memory by just storing the differences in AABB with the parent
Well, since you mention it.... Isn't it also an issue with stackless trees? How do you pass a parent node pointer to the children? I didn't find an obvious way, except if storing the pointer directly in the node itself (which is obviously not a great way to save memory...)
I've seen some good results using this idea in another SDK. Have you considered delta-compression?
Yes, theoretically, but never actually tried. I don't see how to do that with stackless trees anyway....

- Pierre
User avatar
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

Pierre wrote:
If storing an extra skip-index for the right-child is too much memory
Actually, storing an extra index might not be such a big deal. But does it solve the problem? I'm not sure I see how a second index helps. There is a node, we decide to jump to the right-child.... and then what? How do we ever backtrack to the left-child? Note that there might be a way, I just don't see it right now.
I meant sharing the same AABB tree datastructure, so you can use stackless tree traversal for discrete/AABB overlap, but also use recursive traversal for raycasts (using this additional right child-index).
you could solve this problem by compressing the Node even more: you can save memory by just storing the differences in AABB with the parent
Well, since you mention it.... Isn't it also an issue with stackless trees? How do you pass a parent node pointer to the children? I didn't find an obvious way, except if storing the pointer directly in the node itself (which is obviously not a great way to save memory...)
I've seen some good results using this idea in another SDK. Have you considered delta-compression?
Yes, theoretically, but never actually tried. I don't see how to do that with stackless trees anyway....

- Pierre
Stackless+ delta decompression: During stackless traversal, one solution would be to keep the current transformstack explicitly, and rebuild it (using delta compression/decompression) while traversing. It is not easy, and nodes will probably have variable size, but the skip-indices can compensate for this. (by the way, this is brainstorming, I haven't implemented this)

Hope this helps,
Erwin
Christer Ericson
Posts: 23
Joined: Mon Jun 27, 2005 4:30 pm
Location: Los Angeles

Post by Christer Ericson »

I can imagine different definitions for "stackless trees." How are the stackless trees you guys are discussing set up? What are the perceived benefits and usage-cases?
User avatar
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

Christer Ericson wrote:I can imagine different definitions for "stackless trees." How are the stackless trees you guys are discussing set up? What are the perceived benefits and usage-cases?
Hi Christer,

Good to see you again!

The trees are store in an array and skip indices are inserted in the nodes to skip parts of the tree/array that have no AABB overlap. In my profiling comparisons, stackless tree traversals are more efficient then recursive traversals on Playstation 3 SPU (I heard on XBox 360 too), and they can also be used on GPU. Benefits are streaming access (only moving forward in the node-array, no random access), no (stack) memory access needed to save 'return info' for the recursion and easier to parallelize.

Unoptimized stackless trees (setup+traversal) implementation in Bullet:
http://www.continuousphysics.com/Bullet ... tml#l00270

It is not a novel idea, and several older papers discuss it, see page 35,36 for a description:
http://www.continuousphysics.com/ftp/pu ... 001-06.pdf

More recent GPU implementation (master thesis):
http://www.continuousphysics.com/ftp/pu ... thesis.pdf

Typical use case is collision detection and raycasting against static triangle meshes (preferably using batching and double/multi-buffering to hide the latency of the DMA transfer/L2 cache misses).