Broadphase is likely very bad for GPU because it requires random access. How do you want to solve this on a GPU? Wouldn't it require N passes for N objects?
What I would do to compare 16000 AABBs with 16000 AABBs would be to test one AABB against all 16000 of them in parallel. That could be done by drawing a single 128x128-pixel polygon (16384 pixels - one per target). To test 16000 against 16000 would require drawing 16000 polygons - but because it's all data-driven, all of that looping can happen on the GPU and we just issue one command and then fetch the results.
So, we'd store the AABB coordinates for all of the objects in a texture - we already have the rotations and translations stored that way. Now, we draw one 128x128-pixel polygon for each of those 16000 objects. This sounds expensive - but 16,000 polygons on hardware that can probably draw 32 million per second isn't really a lot. OK - so we pass the system the AABB for one 'probe' object. The parallel rendering box can then use awesome parallelism to compare that AABB with every single other AABB in one step. The result is 16000 collision results (stored in another texture). By progressively merging the results of each 'probe' AABB in turn, you'd end up with a texture containing, for each 'target' AABB, a list of all of the 'probe' AABBs that collided with it.
The only snag is that we only have four numbers we can write to the output (Red, Green, Blue and Alpha) - so recording more than four collisions against any one particular AABB is a bit tricky. More than one pass might be required if lots of AABBs are likely to hit one particular AABB.
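To make the per-probe pass concrete, here's a minimal CPU sketch of the work each fragment would do in parallel - one probe AABB tested against a flat array of targets, one 0/1 flag written per target. The function names and the (min_x, min_y, min_z, max_x, max_y, max_z) layout are illustrative, not from any real GPU API.

```python
def aabb_overlap(a, b):
    """Standard axis-aligned box test: the boxes miss if they are
    separated along any one of the three axes."""
    return not (a[3] < b[0] or b[3] < a[0] or
                a[4] < b[1] or b[4] < a[1] or
                a[5] < b[2] or b[5] < a[2])

def probe_pass(probe, targets):
    """One 'draw call': one collision flag per target AABB, the
    equivalent of one pixel written into the result texture."""
    return [1 if aabb_overlap(probe, t) else 0 for t in targets]

targets = [(0, 0, 0, 1, 1, 1),          # overlaps the probe
           (2, 2, 2, 3, 3, 3),          # far away
           (0.5, 0.5, 0.5, 2, 2, 2)]    # overlaps the probe
probe = (0.8, 0.8, 0.8, 1.2, 1.2, 1.2)
print(probe_pass(probe, targets))  # → [1, 0, 1]
```

Note there is no early-out anywhere - every target gets the same fixed amount of work, which is exactly the shape of computation the GPU likes.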
Brute force - but not slow....I hope!
Once I nail down our GPU API so it actually works for you as well as me - I'll try to add cube-on-cube collisions into my demo.
On the CPU you just use Sweep and Prune, which is an incremental sort. It is already implemented in Bullet, so that work is done. This is not brute-force N*N but almost constant time per frame when there is a lot of coherence. With less coherence it becomes more expensive - say, linear time. If we experience extremely chaotic motion, we might need to add some pre-sorting, like the radix sort Pierre Terdiman described, or we can use a hash grid.
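For readers unfamiliar with the idea, here's a stripped-down one-axis sweep-and-prune sketch (not Bullet's actual implementation, which keeps the endpoint lists sorted incrementally across frames): sort box endpoints along x, sweep once, and only boxes whose x-intervals overlap become candidate pairs. With frame-to-frame coherence the list is already nearly sorted, so an insertion sort makes the per-frame cost close to linear.

```python
def sweep_and_prune_x(boxes):
    """boxes: list of (min_x, max_x) intervals.
    Returns candidate index pairs whose x-intervals overlap."""
    # Flatten each box into two (value, box_index, is_start) endpoints.
    endpoints = sorted(
        [(b[0], i, True) for i, b in enumerate(boxes)] +
        [(b[1], i, False) for i, b in enumerate(boxes)])
    active, pairs = set(), []
    for _, i, is_start in endpoints:
        if is_start:
            # Every currently-open interval overlaps this one.
            pairs.extend((min(i, j), max(i, j)) for j in active)
            active.add(i)
        else:
            active.discard(i)
    return pairs

print(sweep_and_prune_x([(0, 2), (1, 3), (5, 6)]))  # → [(0, 1)]
```

Real implementations sweep all three axes and report only pairs that overlap on every axis; this shows just the core sort-and-sweep step.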
Yeah - so you can take advantage of frame to frame coherence...OK.
I guess that worst-case, that can be pretty terrible - but best case could be more efficient than what I propose. <shrug>
For most scenes each rigid body has a very small number of overlapping neighbours, so those N objects translate into O(N) narrowphase units, with a small constant. If we approximate objects with simple shapes (sphere or multi-sphere), the narrowphase is simple. The first goal would be to get a prototype working with approximated shapes. We can approximate moving objects by a voxelization of spheres - call this a compound. Then, for two compounds, the midphase selects the touching spheres, and those are passed to the narrowphase.
In collision detection/physics we typically try to avoid the complex cases like concave-versus-concave and working with moving triangles. Moving objects are best represented by (compounds of) convex objects with volume. This helps contact generation and penetration depth estimation.
So working with moving tetrahedra might be a better way to add detail, once the sphere cases are working. We could use either a separating-axis (SAT) based tetrahedron-versus-tetrahedron test, or a GJK-based one with an inline support mapping for the tetrahedron. But let's focus on the sphere case first.
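The "inline support mapping" mentioned above is cheap for a tetrahedron: GJK only ever asks a shape for its farthest vertex along a query direction, and with four vertices that's four dot products and a max. A sketch:

```python
def tetra_support(verts, d):
    """verts: four (x, y, z) vertices of a tetrahedron; d: query
    direction. Returns the vertex farthest along d - the support
    mapping GJK needs from each shape."""
    return max(verts, key=lambda v: v[0]*d[0] + v[1]*d[1] + v[2]*d[2])

tet = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(tetra_support(tet, (1, 0, 0)))  # → (1, 0, 0)
print(tetra_support(tet, (0, 0, -1)))  # → (0, 0, 0)
```

The fixed, branch-light cost per query is what makes this plausible in a short shader program.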
Sphere/Sphere is certainly very easy - but I wouldn't want to start heading down a path that suddenly drops off a cliff when we discover that (say) cube/triangle cases are just too complex for the teeny-tiny GPU computers. The processors in these things are designed to run programs with one or two hundred machine code instructions - most shader programs are maybe 10 lines long.
You really have to think differently in shader-land though. Take an example I'm familiar with - triangle/triangle intersections: One good first trick on the CPU is to substitute the vertices of one triangle into the plane equation of the other - if they all come out with the same sign - then the triangles don't intersect and you can early-out. Since this is a common case, it's well worth testing for.
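That early-out is easy to state in code: substitute one triangle's vertices into the other's plane equation and check whether all three signed distances share a sign. A minimal sketch of just that test:

```python
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def all_one_side(tri_a, tri_b):
    """True if every vertex of tri_b lies strictly on one side of
    tri_a's plane - in which case the triangles can't intersect
    and a CPU implementation would early-out here."""
    n = cross(sub(tri_a[1], tri_a[0]), sub(tri_a[2], tri_a[0]))
    dists = [dot(n, sub(v, tri_a[0])) for v in tri_b]
    return all(d > 0 for d in dists) or all(d < 0 for d in dists)

tri_a = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]   # lies in the z = 0 plane
tri_b = [(0, 0, 1), (1, 0, 1), (0, 1, 2)]   # entirely above it
print(all_one_side(tri_a, tri_b))  # → True
```

On the CPU a `True` here lets you skip the expensive full intersection test; the next paragraph is about why that shortcut backfires on the GPU.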
But on the GPU, all 48 or so processors have to execute the exact same instruction stream. So if even one of the 48 triangles we are testing against fails this test, all 48 processors have to idle along while that one processor goes the long, slow way through the code.
What this means is that the cost of the early-out test actually slowed everyone down - it would have been faster to have not bothered with the early out test and just always gone the "slow" way!
It's really counter-intuitive.
GPUs are best at mindless brute-force attacks on problems. Any attempt at subtlety usually runs more slowly than the K.I.S.S. approach...generally.