My exotic approach to GPU collision detection

Erwin Coumans · Post by **Erwin Coumans** » Sun Oct 01, 2006 7:23 pm

It all depends on the setup.

Usually the broadphase takes out most of the pairs, so 1024 sphere will never be a n^2 problem.

You can modify the Bullet CcdPhysicsDemo, to increase the stack of objects to 1024, change this line (release mode)
const int gNumObjects = 120;
, and change the shape in btSphereShape, and enable profiling.
Profiling can be enabled by editing Bullet/src/LinearMath/btQuickprof.h,
uncomment the line at the top //#define USE_QUICKPROF 1
Then recompile the sources and press 'p'. It will show profiling on-screen, and ouput the data to a comma separated file. Perhaps I should enable profiling by default, it doesn't take much cpu cycles probably (and its output still needs to be enabled by the user anyway).

By default, that CcdPhysicsDemo (only that demo) registers a very basic SphereSphere collision algorithm (see REGISTER_CUSTOM_COLLISION_ALGORITHM):
http://svn.sourceforge.net/viewvc/bulle ... iew=markup

Thanks,
Erwin

Jack · Post by **Jack** » Sun Oct 01, 2006 7:57 pm

Erwin Coumans wrote:It all depends on the setup.

Usually the broadphase takes out most of the pairs, so 1024 sphere will never be a n^2 problem.

But broadphase is n^2 (exact (n^2 - n)/2). No?
And it is not faster then sphere-sphere. Right?
So, it is only a broadphase will eat 16 ms for 1024 objects....

Erwin Coumans · Post by **Erwin Coumans** » Sun Oct 01, 2006 8:02 pm

Jack wrote:
Erwin Coumans wrote:It all depends on the setup.

Usually the broadphase takes out most of the pairs, so 1024 sphere will never be a n^2 problem.

But broadphase is n^2 (exact (n^2 - n)/2). No?
And it is not faster then sphere-sphere. Right?
So, it is only a broadphase will eat 16 ms for 1024 objects....

No,

The CPU sweep and prune broadphase that Bullet uses is much cheaper. With typical motion (framecoherence) the cost roughly depends on the number of moving objects. For n objects, the cost is around O(n), but with small amount of moving objects it can be cheaper like O(m).

So it makes sense to run the broadphase on the CPU, and then upload a CollisionMap texture to GPU.

Erwin

Jack · Post by **Jack** » Mon Oct 02, 2006 10:42 am

I took a look at your code. I am not sure, but I suspect that you send collision pairs to GPU one by one.

Code: Select all

glBeginQueryARB ( GL_SAMPLES_PASSED_ARB, query ) ;
glMultiDrawArraysEXT ( GL_QUADS, (GLint*)& collstart, (GLint*)& colllength,  1 ) ;
numPasses++ ;
glEndQueryARB   ( GL_SAMPLES_PASSED_ARB ) ;

How many times above code is executed per frame?

SteveBaker · Post by **SteveBaker** » Mon Oct 02, 2006 4:11 pm

Jack wrote:I took a look at your code. I am not sure, but I suspect that you send collision pairs to GPU one by one.
Code: Select all
glBeginQueryARB ( GL_SAMPLES_PASSED_ARB, query ) ;
glMultiDrawArraysEXT ( GL_QUADS, (GLint*)& collstart, (GLint*)& colllength,  1 ) ;
numPasses++ ;
glEndQueryARB   ( GL_SAMPLES_PASSED_ARB ) ;
How many times above code is executed per frame?

Typically once or twice only per frame - irrespective of the number of objects.

The deal is that drawing one polygon compares one object against ALL of the other objects using the frag shader - so there is a lot of parallelism. The glMultiDrawArrays call sends one polygon for every object (in a VBO for speed) - so in this formulation (which is just for the demo) it's an N^2 calculation - but performed as N polygons each of N pixels. The amount of parallelism you have is determined by the number of fragment shaders in your GPU.

The snag is that there is potentially insufficient storage space in the shader's output buffer to store ALL of the collision pairs that one 'target' object might undergo - so I count the number of pixels actually drawn (which equates to the number of collision pairs generated) and loop around generating different ones each pass until no more are drawn. The number of iterations of that loop depends on the worst case number of collisions that one object might undergo divided by the number of collisions you can store in one pass (which is 1 right now - but can be more in future).

This isn't great - but it's a reasonable 'proof of concept' - which is about where we are with this little experiment right now.

I tried measuring the number of collisions produced by each object in turn - in order that second and subsequent passes would require me only to re-send those objects that actually collided with something on the previous iteration - but the overhead of counting the pixel hits for each polygon drawn using the 'query' mechanism actually took longer than just drawing all of the polygons each time and testing whether any of them generated a collision...so simple is better as is so often the case with GPU work.

Probably the best solution of all will be to batch them into groups of a few hundred polygons - but we have a pretty good idea of what this can do now and that level of final polish will only be needed if/when we ever make a GPU-based Bullet that uses GPU collision detection.

Right now, we are really just seeing what can be done and where the big speedups are.

Jack · Post by **Jack** » Mon Oct 02, 2006 4:57 pm

SteveBaker wrote:
Jack wrote:I took a look at your code. I am not sure, but I suspect that you send collision pairs to GPU one by one.
Code: Select all
glBeginQueryARB ( GL_SAMPLES_PASSED_ARB, query ) ;
glMultiDrawArraysEXT ( GL_QUADS, (GLint*)& collstart, (GLint*)& colllength,  1 ) ;
numPasses++ ;
glEndQueryARB   ( GL_SAMPLES_PASSED_ARB ) ;
How many times above code is executed per frame?
Typically once or twice only per frame - irrespective of the number of objects.

The deal is that drawing one polygon compares one object against ALL of the other objects using the frag shader - so there is a lot of parallelism. The glMultiDrawArrays call sends one polygon for every object (in a VBO for speed) - so in this formulation (which is just for the demo) it's an N^2 calculation - but performed as N polygons each of N pixels. The amount of parallelism you have is determined by the number of fragment shaders in your GPU.

The snag is that there is potentially insufficient storage space in the shader's output buffer to store ALL of the collision pairs that one 'target' object might undergo - so I count the number of pixels actually drawn (which equates to the number of collision pairs generated) and loop around generating different ones each pass until no more are drawn. The number of iterations of that loop depends on the worst case number of collisions that one object might undergo divided by the number of collisions you can store in one pass (which is 1 right now - but can be more in future).

This isn't great - but it's a reasonable 'proof of concept' - which is about where we are with this little experiment right now.

I tried measuring the number of collisions produced by each object in turn - in order that second and subsequent passes would require me only to re-send those objects that actually collided with something on the previous iteration - but the overhead of counting the pixel hits for each polygon drawn using the 'query' mechanism actually took longer than just drawing all of the polygons each time and testing whether any of them generated a collision...so simple is better as is so often the case with GPU work.

Probably the best solution of all will be to batch them into groups of a few hundred polygons - but we have a pretty good idea of what this can do now and that level of final polish will only be needed if/when we ever make a GPU-based Bullet that uses GPU collision detection.

Right now, we are really just seeing what can be done and where the big speedups are.

1) I see now. You execute it N times. Each execution may be repeated several times (bet it is 1 in most cases). So you make N GPU calls. It is not good. You lose on this! But until Shader Model 4.0 you can do nothing, I think.
2) Why N^2. You have (N^2-N)/2 pairs. You need something like:
for(i=0; i<TOTAL; i++)
for(j=i+1; j<TOTAL; j++)
{
}
This hepls you to get rid of redundant "IFs" in the fragment shader.
3) "N polygons each of N pixels" probably is not good. Vertex shaders can not fully load fragment shaders in this case.
4) You "polygons" are pionts (D3DPT_POINTLIST). Right?
5) Do not use texture fetch in Vertex Shader. Try to embed all the data you needed in Vertex Declaration.

I think there is no acceptable solution Shader Model 4.0. We need ability to generete variable length output streams (It will be allowed in Geometry Shader). Then collision detection can be done in ONE DrawPrimitive call.

SteveBaker · Post by **SteveBaker** » Mon Oct 02, 2006 7:19 pm

Jack wrote: 1) I see now. You execute it N times. Each execution may be repeated several times (bet it is 1 in most cases). So you make N GPU calls.

I don't know what you mean by "GPU call". For 1024 objects, colliding against 1024 objects, each frame I do ONE call to glMultiDrawArrays which draws 1024 polygons (which are stored off on the graphics card as VBO's). At the end, I read back the number of collision pairs it generated (using the occlusion query). If there were collisions, I have to run that glMultiDrawArrays again.
[/quote]
It is not good. You lose on this! But until Shader Model 4.0 you can do nothing, I think.

The problem is that with only one output FBO, I can't write out all of the collision pairs in one pass because each 'target' object might be hit multiple times. Hence the multiple calls to glMultiDrawArrays...but there won't usually be more than a couple of those per frame for non-pathalogical cases. Shader Model 4 permits me to write to more FBO's - but still there may be pathalogical cases where I can't store *ALL* of the collision pairs. So the multipass approach is still needed - it's just going to consume fewer passes in the future. (There are still some complicated issues associated with even that - but let's not go there now).

2) Why N^2. You have (N^2-N)/2 pairs. You need something like:
for(i=0; i<TOTAL; i++)
for(j=i+1; j<TOTAL; j++)
{
}
This hepls you to get rid of redundant "IFs" in the fragment shader.
For subsequent processing, it'll be easier if I have A-collided-with-B as well as B-collided-with-A...so what I'm doing is still OK. We gotta see the big picture here - it's not just about collision detection!

3) "N polygons each of N pixels" probably is not good. Vertex shaders can not fully load fragment shaders in this case.
Eh? The vertex shader is just about idle. Unless N is less than about 64, the vertex shader can keep up.
4) You "polygons" are pionts (D3DPT_POINTLIST). Right?
I'm using OpenGL - but no, the primitive is GL_QUADS....oh - that's why you think I'm doing it all wrong. Sheesh! Give me more credit than that!
5) Do not use texture fetch in Vertex Shader. Try to embed all the data you needed in Vertex Declaration.
I don't understand what you are saying here...but don't sweat it - I have this right.
I think there is no acceptable solution Shader Model 4.0. We need ability to generete variable length output streams (It will be allowed in Geometry Shader). Then collision detection can be done in ONE DrawPrimitive call.

Well, it's better - but you still need to do more than one pass in pathalogical cases where too many objects can collide with one object. It's also possible that this multipass approach might be faster because I don't think we can selectively kill() writes to individual buffers in Shader Model 4. (I could be wrong about that - but it's an ikky problem).

Jack · Post by **Jack** » Tue Oct 03, 2006 7:10 am

1) You Draw N polygons. Each polygon consists of N pixels. Right?
2) Your render target texture contains only N pixels. Or N*N?
3) If N*N then I do not understand why you need second pass...

For example, in my approach I draw only ONE triangle. Triangle cathetus are N-1 pixels. Area: (N^2-N)/2. This triangle contains all collisions. And I do not use vertex shader at all.

SteveBaker · Post by **SteveBaker** » Tue Oct 03, 2006 2:39 pm

Jack wrote:1) You Draw N polygons. Each polygon consists of N pixels. Right?

Yes.

2) Your render target texture contains only N pixels. Or N*N?

only N.

3) If N*N then I do not understand why you need second pass...

It's not N*N.