Bullet on GPU

Erwin Coumans · Post by **Erwin Coumans** » Mon Sep 18, 2006 8:29 pm

From what I've heard, Havok does the other way around, narrowphase on GPU, broadphase on CPU, and that has proved to work very fast on both Nvidia and ATI, using Cg I think. I'm not sure how Cg compares to GLSL.

Broadphase is an N*N problem, with random access, and works very fast on CPU.

Reading back AABBs for all objects is less than 500 KB for 16,000 objects. Adding transforms makes it much bigger. Multiple megabytes might be too much for readback.

Narrowphase is the actual contact point calculation. That is actually the main thing that I wanted to run on the GPU, next to applying the physics response (applying impulse etc.) and rendering. I still hope we can keep the render from GPU as option, next to rendering from CPU.

SteveBaker · Post by **SteveBaker** » Mon Sep 18, 2006 11:59 pm

Erwin Coumans wrote:From what I've heard, Havok does the other way around, narrowphase on GPU, broadphase on CPU, and that has proved to work very fast on both Nvidia and ATI, using Cg I think. I'm not sure how Cg compares to GLSL.

Cg and GLSL are essentially identical. There are annoying little syntactic differences - but nothing significant.

Broadphase is an N*N problem, with random access, and works very fast on CPU.

So with thousands of objects it's less than (say) 0.1msec?

Reading back AABBs for all objects is less than 500 KB for 16,000 objects. Adding transforms makes it much bigger. Multiple megabytes might be too much for readback :)

Well, an AABB is three floats for one vertex and either three floats for size or three for the opposite diagonal vertex. Six floats total. A transform is three floats for the translate and four for a quaternion rotate. Seven floats total. However, the GPU basically works in units of four 'things' so the actual practical difference between 6 and 7 isn't that much.
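To put rough numbers on the readback sizes discussed here, the arithmetic can be sketched as follows. This assumes 4-byte floats and that each record is padded up to whole RGBA texels of four floats; the function names are invented for this illustration and are not part of Bullet or GPUphysics.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "the sketch assumes 4-byte floats");

// Bytes needed for `objects` records of `floatsPerObject` floats each,
// tightly packed with no padding.
std::size_t packedBytes(std::size_t objects, std::size_t floatsPerObject) {
    assert(floatsPerObject > 0);
    return objects * floatsPerObject * sizeof(float);
}

// The same records, but with each one rounded up to whole RGBA texels
// (units of four floats), as a float texture would store them.
std::size_t texelBytes(std::size_t objects, std::size_t floatsPerObject) {
    assert(floatsPerObject > 0);
    const std::size_t texelsPerObject = (floatsPerObject + 3) / 4;  // round up
    return objects * texelsPerObject * 4 * sizeof(float);
}
```

For 16,000 objects, `packedBytes` gives 384,000 bytes for 6-float AABBs and 448,000 bytes for 7-float transforms, while `texelBytes` gives 512,000 bytes for either, since both round up to two texels.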

Narrowphase is the actual contact point calculation. That is actually the main thing that I wanted to run on the GPU, next to applying the physics response (applying impulse etc.) and rendering. I still hope we can keep the render from GPU as option, next to rendering from CPU.

Correct me if I'm wrong - but isn't Narrowphase going to be a big ugly pile of special cases (triangle versus triangle intersections are *nasty* to do) - performed on the relatively small number of objects that passed the previous two phases? That's the opposite of what the GPU is good at. The GPU can do low-complexity algorithms on VAST numbers of objects in parallel - but high-complexity algorithms on just a few objects is a better job for the CPU (especially if there are loops and/or conditionals involved - those things are death for SIMD machines like the GPU).

However, I truly don't understand what's going on under the hood here - so please educate me!

If we actually had 16,000 randomly moving cubes, do you have a feel for how many might be in broad, mid and narrow phases and how much CPU time each of those three steps takes?

My feeling was that Broadphase would have to do very rough comparisons on 16000 objects (which could be an NxN problem unless we're smart about it) - then if there were maybe 1000 near-collisions (overlapping AABB's perhaps) detected, midphase would do more detailed testing on 1000 collision pairs - of which a fair number might be rejected because the AABB's overlap - but the actual objects don't. Then the narrowphase would figure the consequences of those collisions on maybe a couple of hundred objects. My gut was telling me that Broadphase would be pretty mindless repetition - Midphase would be more sophisticated (but on maybe only 10% of the objects) and narrowphase would be seriously complicated code - but running on a couple of handfuls of objects.

If that 'gut feel' is right - then Broadphase is the perfect job for the GPU. Midphase maybe appropriate and narrow phase would be a definite job for the CPU.

What's wrong with this view of the process?

Erwin Coumans · Post by **Erwin Coumans** » Tue Sep 19, 2006 12:18 am

Correct me if I'm wrong - but isn't Narrowphase going to be a big ugly pile of special cases (triangle versus triangle intersections are *nasty* to do) - performed on the relatively small number of objects that passed the previous two phases? That's the opposite of what the GPU is good at. The GPU can do low-complexity algorithms on VAST numbers of objects in parallel - but high-complexity algorithms on just a few objects is a better job for the CPU (especially if there are loops and/or conditionals involved - those things are death for SIMD machines like the GPU).

However, I truly don't understand what's going on under the hood here - so please educate me!

If we actually had 16,000 randomly moving cubes, do you have a feel for how many might be in broad, mid and narrow phases and how much CPU time each of those three steps takes?

My feeling was that Broadphase would have to do very rough comparisons on 16000 objects (which could be an NxN problem unless we're smart about it) - then if there were maybe 1000 near-collisions (overlapping AABB's perhaps) detected, midphase would do more detailed testing on 1000 collision pairs - of which a fair number might be rejected because the AABB's overlap - but the actual objects don't. Then the narrowphase would figure the consequences of those collisions on maybe a couple of hundred objects. My gut was telling me that Broadphase would be pretty mindless repetition - Midphase would be more sophisticated (but on maybe only 10% of the objects) and narrowphase would be seriously complicated code - but running on a couple of handfuls of objects.

If that 'gut feel' is right - then Broadphase is the perfect job for the GPU. Midphase maybe appropriate and narrow phase would be a definite job for the CPU.

What's wrong with this view of the process?

Broadphase is likely very bad for GPU because it requires random access. How do you want to solve this on a GPU? Wouldn't it require N passes for N objects?

On the CPU you just use Sweep and Prune, which is an incremental sort. It is already implemented in Bullet, so that work is done. This is not brute-force N*N but almost constant time, when there is a lot of coherence. With less coherence it becomes more expensive, say linear time. When we experience extremely chaotic motion, we might need to add some pre-sorting like radix sort, as Pierre Terdiman described, or we can use a hash grid.
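The idea of an incremental sort that benefits from frame-to-frame coherence can be sketched on a single axis as below. This is a simplified "sort and sweep" variant written for illustration, not Bullet's implementation: the `Interval` type, the `sweepAndPrune` name and the caller-owned `order` array are all assumptions. The insertion sort is cheap when `order` is already nearly sorted from the previous frame and degrades as motion becomes more chaotic.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Extent of one AABB along the sweep axis (hypothetical type).
struct Interval { float min, max; };

// `order` holds object indices and is kept by the caller between frames,
// so the insertion sort below starts from last frame's ordering.
std::vector<std::pair<std::size_t, std::size_t>>
sweepAndPrune(const std::vector<Interval>& boxes, std::vector<std::size_t>& order) {
    assert(order.size() == boxes.size());

    // Incremental step: insertion sort by interval start.
    for (std::size_t i = 1; i < order.size(); ++i) {
        const std::size_t key = order[i];
        std::size_t j = i;
        while (j > 0 && boxes[order[j - 1]].min > boxes[key].min) {
            order[j] = order[j - 1];
            --j;
        }
        order[j] = key;
    }

    // Sweep: each interval is only compared with those starting before it ends.
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < order.size(); ++i) {
        const Interval& a = boxes[order[i]];
        for (std::size_t j = i + 1; j < order.size(); ++j) {
            const Interval& b = boxes[order[j]];
            if (b.min > a.max) break;  // sorted by min: nothing later overlaps a
            pairs.emplace_back(order[i], order[j]);
        }
    }
    return pairs;
}
```

A full broadphase would repeat this on three axes (or confirm each candidate pair with a 3D AABB test) and would typically keep the pair list between frames instead of rebuilding it.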

For most scenes each rigid body has a very small number of overlapping neighbours. So N objects lead to O(N) narrowphase units, with a small constant. If we approximate objects with simple shapes (sphere or multi-sphere) the narrowphase is simple. The first goal would be to get a prototype working with approximated shapes. We can approximate moving objects by a voxelization of spheres, call this a compound. Then for two compounds, the midphase selects the touching spheres. Those will be passed to the narrowphase.
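To show how simple the sphere case is, here is a minimal sketch of a sphere-versus-sphere narrowphase that returns a contact normal and penetration depth. The `Vec3`, `Sphere` and `Contact` types and the `collideSpheres` name are made up for this example and do not come from Bullet.

```cpp
#include <cassert>
#include <cmath>

struct Vec3   { float x, y, z; };
struct Sphere { Vec3 center; float radius; };

struct Contact {
    bool  touching;  // false when the spheres are separated
    Vec3  normal;    // unit vector pointing from a towards b
    float depth;     // penetration depth, >= 0 when touching
};

Contact collideSpheres(const Sphere& a, const Sphere& b) {
    assert(a.radius > 0.0f && b.radius > 0.0f);
    const Vec3 d{b.center.x - a.center.x,
                 b.center.y - a.center.y,
                 b.center.z - a.center.z};
    const float dist2 = d.x * d.x + d.y * d.y + d.z * d.z;
    const float rsum  = a.radius + b.radius;

    // Compare squared distances first so separated pairs need no sqrt.
    if (dist2 > rsum * rsum) {
        return Contact{false, Vec3{0.0f, 0.0f, 0.0f}, 0.0f};
    }

    const float dist = std::sqrt(dist2);
    // Coincident centres have no defined direction; pick an arbitrary axis.
    const Vec3 n = dist > 0.0f ? Vec3{d.x / dist, d.y / dist, d.z / dist}
                               : Vec3{1.0f, 0.0f, 0.0f};
    return Contact{true, n, rsum - dist};
}
```

For a compound of spheres, the midphase would pick candidate sphere pairs and each pair would go through a test like this one.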

In collision detection/physics we typically try to avoid the complex cases like concave-versus-concave and working with moving triangles. Moving objects are best represented by (compounds of) convex objects with volume. This helps contact generation and penetration depth estimation.
So working with moving tetrahedra might be a better start to add detail, once the sphere cases are working. We could use either separating axis based (SAT) tetrahedron-versus-tetrahedron, or GJK-based with inline support mapping for a tetrahedron. But let's focus on the sphere case first.

SteveBaker · Post by **SteveBaker** » Tue Sep 19, 2006 2:01 am

Broadphase is likely very bad for GPU because it requires random access. How do you want to solve this on a GPU? Wouldn't it require N passes for N objects?

What I would do to compare 16000 AABB's with 16000 AABB's would be to test 1 AABB against 16000 of them in parallel. That could be done by drawing a single 128x128 pixel polygon. To test 16000 against 16000 would require drawing 16000 polygons - but because it's all data driven, all of that looping can happen on the GPU and we just issue one command and then fetch the results.

So, we'd store the AABB coordinates for all of the objects in a texture - we already have the rotations and translations stored that way. Now, we draw one polygon (128x128 pixels) for every one of those 16000 AABBs. This sounds expensive - but 16,000 polygons on hardware that can probably draw 32 million per second isn't really a lot. OK - so we pass the system the AABB for one 'probe' polygon. The parallel rendering box can then use awesome parallelism to compare that AABB with every single other AABB in one step. The result is 16000 collision results (stored in another texture). By progressively merging the results of each 'probe' AABB in turn, you'd end up with a texture that would contain a list of all of the 'probe' AABB's that collided with each of the 'target' AABB's.

The only snag is that we only have four numbers we can write to on the output (Red, Green, Blue and Alpha) - so recording more than four collisions against any one particular AABB is a bit tricky. So more than one pass might be required if lots of AABB's are likely to hit one particular AABB.
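The probe-versus-all scheme and the four-output limit can be emulated on the CPU as a rough sketch. Everything here is an assumption made for illustration (the `AABB` layout, the `HitTexel` record standing in for one RGBA output pixel, and the `overflow` flag that signals when another pass would be needed); it is not code from the GPUphysics demo.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct AABB { float min[3]; float max[3]; };

// Two boxes overlap only if their extents overlap on all three axes.
bool overlaps(const AABB& a, const AABB& b) {
    for (int axis = 0; axis < 3; ++axis) {
        if (a.max[axis] < b.min[axis] || b.max[axis] < a.min[axis]) return false;
    }
    return true;
}

constexpr int kSlots = 4;   // one slot per output channel (R, G, B, A)
constexpr int kEmpty = -1;  // marks an unused slot

struct HitTexel {
    std::array<int, kSlots> probe;  // indices of probes that hit this target
    bool overflow;                  // more than kSlots hits: needs another pass
};

std::vector<HitTexel> bruteForceBroadphase(const std::vector<AABB>& boxes) {
    // Probe indices are stored as int, so the box count must fit in one.
    assert(boxes.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));

    HitTexel empty;
    empty.probe.fill(kEmpty);
    empty.overflow = false;
    std::vector<HitTexel> result(boxes.size(), empty);

    for (std::size_t probe = 0; probe < boxes.size(); ++probe) {         // one polygon per probe
        for (std::size_t target = 0; target < boxes.size(); ++target) {  // one pixel per target
            if (probe == target || !overlaps(boxes[probe], boxes[target])) continue;
            HitTexel& texel = result[target];
            int slot = 0;
            while (slot < kSlots && texel.probe[slot] != kEmpty) ++slot;
            if (slot < kSlots) texel.probe[slot] = static_cast<int>(probe);
            else               texel.overflow = true;
        }
    }
    return result;
}
```

On the GPU the inner loop is what runs in parallel across the 128x128 pixels; here it is simply serial, so the sketch shows the data flow rather than the performance.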

Brute force - but not slow....I hope!

Once I nail down our GPU API so it actually works for you as well as me - I'll try to add cube-on-cube collisions into my demo.

On the CPU you just use Sweep and Prune, which is an incremental sort. It is already implemented in Bullet, so that work is done. This is not brute-force N*N but almost constant time, when there is a lot of coherence. With less coherence it becomes more expensive, say linear time. When we experience extremely chaotic motion, we might need to add some pre-sorting like radix sort, as Pierre Terdiman described, or we can use a hash grid.

Yeah - so you can take advantage of frame to frame coherence...OK.

I guess that worst-case, that can be pretty terrible - but best case could be more efficient than what I propose. <shrug>

For most scenes each rigid body has a very small number of overlapping neighbours. So N objects lead to O(N) narrowphase units, with a small constant. If we approximate objects with simple shapes (sphere or multi-sphere) the narrowphase is simple. The first goal would be to get a prototype working with approximated shapes. We can approximate moving objects by a voxelization of spheres, call this a compound. Then for two compounds, the midphase selects the touching spheres. Those will be passed to the narrowphase.

In collision detection/physics we typically try to avoid the complex cases like concave-versus-concave and working with moving triangles. Moving objects are best represented by (compounds of) convex objects with volume. This helps contact generation and penetration depth estimation.
So working with moving tetrahedra might be a better start to add detail, once the sphere cases are working. We could use either separating axis based (SAT) tetrahedron-versus-tetrahedron, or GJK-based with inline support mapping for a tetrahedron. But let's focus on the sphere case first.

Sphere/Sphere is certainly very easy - but I wouldn't want to start heading down a path that suddenly drops off a cliff when we discover that (say) cube/triangle cases are just too complex for the teeny-tiny GPU computers. The processors in these things are designed to run programs with one or two hundred machine code instructions - most shader programs are maybe 10 lines long.

You really have to think differently in shader-land though. Take an example I'm familiar with - triangle/triangle intersections: One good first trick on the CPU is to substitute the vertices of one triangle into the plane equation of the other - if they all come out with the same sign - then the triangles don't intersect and you can early-out. Since this is a common case, it's well worth testing for.
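The plane-equation early-out described here can be sketched as below. The `Vec3` and `Triangle` types and the function names are assumptions for the example; vertices lying exactly on the plane are treated as "not clearly on one side", so the test only rejects pairs that certainly cannot intersect.

```cpp
#include <cassert>

struct Vec3     { float x, y, z; };
struct Triangle { Vec3 v[3]; };

Vec3 sub(const Vec3& a, const Vec3& b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(const Vec3& a, const Vec3& b) {
    return Vec3{a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Substitutes p into the plane equation of t. The result is the signed
// distance from the plane scaled by the (unnormalised) normal length,
// which is enough because only the sign is used.
float planeSide(const Triangle& t, const Vec3& p) {
    const Vec3 n = cross(sub(t.v[1], t.v[0]), sub(t.v[2], t.v[0]));
    return dot(n, sub(p, t.v[0]));
}

// True when all three vertices of b lie strictly on the same side of the
// plane of a, in which case the triangles cannot intersect.
bool allOnOneSide(const Triangle& a, const Triangle& b) {
    assert(&a != &b);  // testing a triangle against itself is meaningless here
    const float d0 = planeSide(a, b.v[0]);
    const float d1 = planeSide(a, b.v[1]);
    const float d2 = planeSide(a, b.v[2]);
    return (d0 > 0.0f && d1 > 0.0f && d2 > 0.0f) ||
           (d0 < 0.0f && d1 < 0.0f && d2 < 0.0f);
}
```

On a CPU, a caller would return "no intersection" as soon as `allOnOneSide` is true and only run the expensive test otherwise.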

But on the GPU, all 48 or so processors have to execute the exact same instruction stream. So if even one of the 48 triangles we are testing against fails this test, all 48 processors have to idle along while that one processor goes the long, slow way through the code.

What this means is that the cost of the early-out test actually slowed everyone down - it would have been faster to have not bothered with the early out test and just always gone the "slow" way!

It's really counter-intuitive.

GPU's are best at mindless brute-force attacks on problems. Any attempt at subtlety usually runs more slowly than the K.I.S.S approach...generally.

Erwin Coumans · Post by **Erwin Coumans** » Tue Sep 19, 2006 3:09 am

SteveBaker wrote:
Broadphase is likely very bad for GPU because it requires random access. How do you want to solve this on a GPU? Wouldn't it require N passes for N objects?

The only snag is that we only have four numbers we can write to on the output (Red, Green, Blue and Alpha) - so recording more than four collisions against any one particular AABB is a bit tricky. So more than one pass might be required if lots of AABB's are likely to hit one particular AABB.

Brute force - but not slow....I hope!

Ok, we can compare the CPU Sweep and Prune with your GPU broadphase. In practice, there is always a lot of coherence. If your world is big enough, and the timestep small (60 hertz), objects don't move much each timestep. But perhaps debris behaves differently.

But on the GPU, all 48 or so processors have to execute the exact same instruction stream. So if even one of the 48 triangles we are testing against fails this test, all 48 processors have to idle along while that one processor goes the long, slow way through the code.

Yeah, optimizing for next-gen parallel processors is challenging but very interesting and rewarding. On the Playstation 3 SPU's it is fairly common to calculate all results, and get rid of the conditional if's with a 'select' instruction.
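The "calculate all results, then select" pattern can be illustrated in portable C++ as a sketch. Real SPU or GPU code would use the hardware's own select instruction or intrinsic; the `selectFloat`, `lessThanMask` and `depthOrZero` helpers below are hypothetical stand-ins that only show the shape of the technique, and whether a given compiler emits them without a branch is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

// Returns a when mask is all ones and b when mask is all zeros,
// using only bitwise operations on the float bit patterns.
inline float selectFloat(float a, float b, std::uint32_t mask) {
    std::uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    const std::uint32_t ur = (ua & mask) | (ub & ~mask);
    float r;
    std::memcpy(&r, &ur, sizeof r);
    return r;
}

// All-ones mask when x < y, all-zeros otherwise.
inline std::uint32_t lessThanMask(float x, float y) {
    return static_cast<std::uint32_t>(0) - static_cast<std::uint32_t>(x < y);
}

// Penetration depth of two spheres, or 0 when they are separated.
// The depth is always computed; the comparison only picks the result.
inline float depthOrZero(float radiusSum, float distance) {
    assert(distance >= 0.0f);
    const float depth = radiusSum - distance;
    return selectFloat(depth, 0.0f, lessThanMask(distance, radiusSum));
}
```

Every lane does the same work regardless of the outcome, which is the property that matters when many elements execute in lockstep.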

I just added some not-so-brute-force optimization in Bullet. This allows running large scenes a bit better. See these 3000 object scenes, they run a few frames a second. One with spheres, other with cubes:

I'm planning to use VTune 8 in the coming weeks to optimize this nicely.
Erwin

SteveBaker · Post by **SteveBaker** » Tue Sep 19, 2006 4:01 am

Erwin Coumans wrote: I just added some not-so-brute-force optimization in Bullet. This allows running large scenes a bit better. See these 3000 object scenes, they run a few frames a second. One with spheres, other with cubes:

Wow! That's impressive.

So how do you think the time is consumed? Do you have a rough idea of how many milliseconds are spent in broadphase, mid and narrow? What fraction of the frame is spent doing graphics?

It would be nice to know because then I'd have a number to shoot for!

SteveBaker · Post by **SteveBaker** » Tue Sep 19, 2006 4:52 am

ANNOUNCE 0.5 is out!
~~~~~~~~~~~~~~~~~~~~

http://www.sjbaker.org/tmp/GPUphysics-0.5.tgz

This version carefully avoids reading and writing the same texture at the same time.
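A common way to avoid reading and writing the same buffer in one pass is "ping-pong" double buffering: read from one copy, write to the other, then swap. The sketch below shows the pattern with plain arrays standing in for textures; the `PingPong` type is invented for this illustration and is not how GPUphysics 0.5 is actually structured.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Two equally sized buffers: `front` is only read during a step,
// `back` is only written, and the two swap roles afterwards.
struct PingPong {
    std::vector<float> front;
    std::vector<float> back;

    explicit PingPong(std::vector<float> initial)
        : front(std::move(initial)), back(front.size()) {}

    // kernel(source, i) computes the new value of element i from the
    // old state, like a fragment shader sampling its input texture.
    template <typename Kernel>
    void step(Kernel kernel) {
        assert(back.size() == front.size());
        for (std::size_t i = 0; i < front.size(); ++i) {
            back[i] = kernel(front, i);
        }
        std::swap(front, back);  // the freshly written data becomes readable
    }
};
```

Because every element reads only the old state, the result does not depend on the order in which elements are processed, which is what a parallel device needs.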

I suspect that may be why we get grey screens in Windows/nVidia setups.

I also found that nVidia's 76.76 driver revision has a bug that prevents reading back textures that have (??recently??) been written to as an FBO. It appears that the driver keeps a 'backup' copy of the texture in main memory and if you read back the map - you just get what you wrote from the CPU - not what the shaders subsequently changed. That was evidently a bug in that version of the driver because I don't see it on any other revisions. Upgrading the driver on that machine to driver rev 87.74 fixed the problem.

So:

* On ATI hardware (under any OS), the 0.4 and 0.5 versions should make everything work.

* On Windows systems with nVidia hardware, I suspect version 0.4 fails but 0.5 works. If not - then the Windows 'FBO' code is broken and I'm out of ideas!

* On Linux systems with nVidia hardware, everything continues to work well - except for that one '76.76' driver - which works OK so long as you DON'T use the '-v' option!

Urgh! It doesn't pay to be on the cutting edge here! We're using a whole bunch of pretty unusual OpenGL constructs - and in combinations that seem really unusual - so I guess it's not entirely surprising that we're hitting all of the bugs and odd-ball missing features.

I'm going to check out the subversion setup - hopefully, future changes can just go into SVN instead of having these constant announcements and new loads.

DevO · Post by **DevO** » Tue Sep 19, 2006 11:40 am

Physics on the GPU is very interesting stuff to me.

I have read this thread and tried to compile GPUphysics-0.5 with VC++ 2005.
After solving some problems, including an fSize() problem, the console always shows "WARNING: Incomplete FBO setup" and then on exit "OpenGL Error: invalid operation".
The problem seems to be with

Code:

mass  = new FrameBufferObject ( TEX_SIZE, TEX_SIZE, 1, FBO_FLOAT ) ;

Only with vertex textures disabled does it seem to work, but without forces.

It was tested on Windows XP with a GeForce 7800 and the 91.47 driver.

SteveBaker · Post by **SteveBaker** » Tue Sep 19, 2006 1:10 pm

DevO wrote:After solving some problems and fSize() problem...

Can you let me know what those were - I'll make sure they are fixed in the next revision. I can't easily get to a Windows machine to do testing (both work and home are 100% Linux shops!)

the console always show "WARNING: Incomplete FBO setup"

So does mine - it's because I don't use the stencil buffer and don't bind a graphics context to it. I believe this is benign - but I need to address it sometime.

and then on exit "OpenGL Error: invalid operation". The problem seems to be with
Code:
mass  = new FrameBufferObject ( TEX_SIZE, TEX_SIZE, 1, FBO_FLOAT ) ;

So you are saying that it dies with that message?

That's odd because to get to that line it had to have gone through half a dozen nearly identical operations without error. I'll stick in some more OpenGL error tests around that part of the code.

Only with vertex textures disabled does it seem to work, but without forces.

Right - with -f, that line of code never gets executed. OK - I'll see if I can narrow this down a bit.

Thanks!

Dragonlord · Post by **Dragonlord** » Tue Sep 19, 2006 3:39 pm

SteveBaker wrote:
the console always show "WARNING: Incomplete FBO setup"
So does mine - it's because I don't use the stencil buffer and don't bind a graphics context to it. I believe this is benign - but I need to address it sometime.

In fact it is not. If the FBO is incomplete (which is what the warning means), the OpenGL driver is not required to do what you expect, and every driver will behave differently (hence the problems you have been seeing for some time). Setting up a complete FBO is tricky, but working with an incomplete one will cause more trouble than it is worth. I have wanted to poke at the code for a couple of days to find the source of this problem, but I haven't found the time.
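One way to act on this is to decode the completeness status and stop as soon as it is anything other than complete. The sketch below is a standalone decoder so that it compiles without OpenGL headers: the numeric values are copied from the EXT_framebuffer_object extension as I understand it and should be checked against `glext.h`, and the `FboStatus` enum and function names are made up for this example.

```cpp
#include <cassert>

// Status codes of EXT_framebuffer_object, duplicated here (an assumption to
// verify against glext.h) so the sketch builds without any GL headers.
enum FboStatus : unsigned int {
    kFboComplete           = 0x8CD5,
    kFboIncompleteAttach   = 0x8CD6,
    kFboMissingAttachment  = 0x8CD7,
    kFboIncompleteDims     = 0x8CD9,
    kFboIncompleteFormats  = 0x8CDA,
    kFboIncompleteDrawBuf  = 0x8CDB,
    kFboIncompleteReadBuf  = 0x8CDC,
    kFboUnsupported        = 0x8CDD
};

bool isFboComplete(unsigned int status) { return status == kFboComplete; }

// Turns a status code into a short human-readable reason.
const char* describeFboStatus(unsigned int status) {
    switch (status) {
        case kFboComplete:          return "complete";
        case kFboIncompleteAttach:  return "an attachment is incomplete";
        case kFboMissingAttachment: return "no image is attached";
        case kFboIncompleteDims:    return "attachments differ in size";
        case kFboIncompleteFormats: return "colour attachments differ in format";
        case kFboIncompleteDrawBuf: return "draw buffer has no attachment";
        case kFboIncompleteReadBuf: return "read buffer has no attachment";
        case kFboUnsupported:       return "unsupported combination of formats";
        default:                    return "unknown status";
    }
}

// Debug-build guard: fail where the FBO is created rather than later.
void assertFboComplete(unsigned int status) {
    assert(isFboComplete(status) && "incomplete FBO");
}
```

In a real program the status would come from `glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT)` after all attachments have been made, with the FBO bound.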

SteveBaker · Post by **SteveBaker** » Tue Sep 19, 2006 4:08 pm

Dragonlord wrote:
SteveBaker wrote:
the console always show "WARNING: Incomplete FBO setup"
So does mine - it's because I don't use the stencil buffer and don't bind a graphics context to it. I believe this is benign - but I need to address it sometime.
In fact it is not. If the FBO is incomplete ( aka the warning ) then the OpenGL driver is not required to deliver the actions you expect. In fact any driver will behave differently ( the problems you are seeing since some time ). Setting up a complete FBO is tricky but working with an incomplete one will give more troubles than it is worth. I wanted to poke at the code since a couple of days to check out the source of this problem but I didn't find the time :cry:

"Tricky" aint the word!

Making the FBO come out 'complete' seems to be well-nigh impossible for portable code. The same hardware with different driver revisions, the same hardware AND driver under different OS's - or different hardware with the same driver and OS....all of these seem to demand different combinations of non-obvious settings. There was a program someone posted on one of the OpenGL group forums that exhaustively tries all combinations of texture depth and type and the list of what works is small, seemingly arbitrary and very changeable from one setup to another.

On my Linux/nVidia box, things always seem to work no matter whether I get an incomplete FBO or not - and I'm beginning to suspect that this is related to the stencil buffer. I had initially simply not provided one...that sometimes works (no 'INCOMPLETE' error on Linux/76.76 drivers/6800GT for example), sometimes produces a warning but works anyway (Linux/81.78/6800 Ultra for example), and sometimes produces the message and doesn't work (Windows/??/6800?? for example). I tried providing one - and that swaps around what works and what doesn't. What seems to be working is to set glStencilMask(0);glDisable(GL_STENCIL_TEST); and not provide one...but only time will tell if that is the issue. However, even that only gets rid of the message for full colour FBO's. For a monochrome FBO, I still get complaints...but it works anyway?!?

I could understand it complaining if something isn't supported - but when it actually works despite the message - why is it bitching at me? I could understand if it were telling me "yes, it works here - but it's not necessarily going to work elsewhere" - but even when it doesn't say 'INCOMPLETE_XXX' (so it's working!) - it won't always work elsewhere.

What's worse is that I took my Mark I version by cutting and pasting code from the OpenGL extension specification document! If that doesn't work, we're doomed!

So basically, it's just a mess!

Anyway - I'll try putting out a version for test that disables stencil (and depth for good measure) - and which always uses a full colour FBO - even when a monochrome one would do the job.

Dragonlord · Post by **Dragonlord** » Tue Sep 19, 2006 4:58 pm

SteveBaker wrote:"Tricky" ain't the word!

Making the FBO come out 'complete' seems to be well-nigh impossible for portable code.

If the drivers hold true to the OpenGL specification, then it is well defined what is complete and what is not. I once thought this was not the case too, until I was taught better by somebody over at GameDev.net. Unless you do things the spec does not list as valid, the FBO comes out complete; if it does not, you did something wrong (and yep, I did things wrong that I didn't consider wrong).

On my Linux/nVidia box, things always seem to work no matter whether I get an incomplete FBO or not - and I'm beginning to suspect that this is related to the stencil buffer.

The problem here is that the Linux drivers are rather "lax" in how they interpret the specs in some places. I had similar troubles moving my code to Windows until I found out that the Linux drivers (ATI/nVidia) are sometimes too lax or don't check for certain kinds of incompleteness that the (unfortunately more developed) Windows drivers do check. Hence it is understandable that under Linux you won't run into the troubles you have on Windows. That is the reason I bought myself a second machine dedicated to Windows, to cope with these annoyances.

I could understand it complaining if something isn't supported - but when it actually works despite the message - why is it bitching at me?

Because there are certain "bugs" in the Linux drivers. For example, ATI drivers tell you that there is support for format XYZ, and if you try it the FBO is incomplete because the format is in fact not supported. It took me a long time of swearing to find out where the problem was (sometimes the driver even lists ARB extensions that do not actually work; for example, Rectangle ARB on ATI doesn't work on all cards although it is listed by all).

I could understand if it were telling me "yes, it works here - but it's not necessarily going to work elsewhere" - but even when it doesn't say 'INCOMPLETE_XXX' (so it's working!) - it won't always work elsewhere.

OpenGL is nice but has one big flaw: worse error messages than Windows in general has. If it tells you that X is incomplete, then that only means that X is at the top of the error chain, while the true culprit is somewhere further down the tree. I had a couple of sleepless nights due to this :/

What's worse is that I took my Mark I version by cutting and pasting code from the OpenGL extension specification document! If that doesn't work, we're doomed!

I am not sure about this one, but I think there are some bugs in the examples in the specs. I did the same thing and hit a wall until I had some talks with people at GameDev.net and fiddled around with stuff I never expected to be the problem. That's the only thing I really hate about the specs: the examples do not always work.

So basically, it's just a mess!

Yes, until you manage to get a working version. But where is this not the case?

I'll try to look at the code tomorrow if possible. I am just curious if I stumble across some troubles that I know.

- Plüss Roland

Erwin Coumans · Post by **Erwin Coumans** » Tue Sep 19, 2006 5:08 pm

Thanks Steven,

The Subversion version in Extras/GPUphysics is already updated with the latest 0.5 version, including Windows fixes. Project files will be autogenerated soon; then it will compile out-of-the-box for Visual Studio 6, 7, 7.1 and 8.
The issues were:

Code:

// Portable way of detecting the file size: seek to the end, read the
// offset, then rewind. fd and fname come from the surrounding function.
int size = 0;
/* If any of the three calls fails, the size cannot be determined. */
if (fseek(fd, 0, SEEK_END) || (size = ftell(fd)) == -1 || fseek(fd, 0, SEEK_SET))
{
	printf("Error: cannot get filesize from %s\n", fname);
	exit (1);
}

Code:

//don't add "ra", but just "r"
FILE *fd = fopen ( fname, "r" ) ;

Code:

//don't use random(), use rand() instead

A quick test on an Nvidia 6800 gives:

-s and -p work fine, display lots of tiny cubes, and return

Code:

INFO: This hardware supports at most:
   4 vert  texture samplers
  16 frag  texture samplers
  16 total texture samplers

Without arguments, or with argument -c, the screen is grey and the console gives

Code:

INFO: This hardware supports at most:
   4 vert  texture samplers
  16 frag  texture samplers
  16 total texture samplers
WARNING: Incomplete FBO setup.

Passing argument -f gives a grey screen, without an error (same console output as -s and -p).

Hope this helps,
Erwin

Dragonlord · Post by **Dragonlord** » Tue Sep 19, 2006 5:38 pm

Pushed myself to get it. Compiling works without a problem. So far I have tested it only on my old machine (old card, ATI Radeon 9600), as my DevStation is currently in use.

./GPU_physics_demo:
complains about incomplete FBO but otherwise renders everything. There is no vertex texture support on this machine, but allow me the question: "why do we need this anyway?!"

./GPU_physics_demo -s
SEGF:
0x0804adc9 in drawCubesTheHardWay ()
(gdb) bt
#0 0x0804adc9 in drawCubesTheHardWay ()
#1 0x0804b3e5 in drawCubes ()
#2 0x0804b6f3 in display ()
#3 0xb7f7c0bc in __glutRegisterEventParser () from /usr/lib/libglut.so.3
#4 0x08994ab0 in ?? ()
#5 0x000001e0 in ?? ()
#6 0x00000000 in ?? ()

./GPU_physics_demo -p:
no FBO complaint, but
WARNING: If nothing seems to be working, you may
have an old version of the nVidia driver.
Version 76.76 is known to be bad.
Woot, I have an ATI box. What business do nVidia drivers have there?

./GPU_physics_demo -f:
no complaint about incompleteness

./GPU_physics_demo -c:
complains about incompleteness

Concerning the code, I just poked at it very quickly and have not dug deeper yet, as GLUT is involved. GLUT always gives me headaches, so I need to look at this when I have more time at hand, as there are many more unknowns than with self-written code.

- Plüss Roland

DevO · Post by **DevO** » Tue Sep 19, 2006 5:40 pm

These are all the changes I made to get it to compile with VC 2005.
I will try to test it with Xcode later on a Mac.

The "WARNING: Incomplete FBO setup." appears because there is a problem with the 1 in this line

Code:

mass  = new FrameBufferObject ( TEX_SIZE, TEX_SIZE, 1, FBO_FLOAT ) ;

FrameBufferObject() does not seem to work well if you want an alpha-only (single-layer) texture, but it works with RGB (3 layers).

Tons of "OpenGL Error: invalid operation" messages are always there.
If you enable the NVIDIA define then you will get "OpenGL Error: <null>".