Here is a screenshot of what the new version of the demo looks like:
bullet-example-browser.jpg
Here is the profile view in single-threaded mode, followed by multithreaded mode:
bullet-example-browser-single-threaded.jpg
bullet-example-browser-tbb.jpg
My machine has a quad-core CPU with hyperthreading, so by default I'm using 8 threads, but I wouldn't expect to get more than a 4x speed improvement.
So solveGroup (the constraint solver) went from 132.9ms to 32.6ms (roughly 4 times faster). That's probably the most significant change right there. You'll also notice that in single-threaded mode it gets called 96 times, while in the multithreaded run it is only called 12 times. That's because each of the 8 threads calls it 12 times, but the profiler only records timing for the main thread.
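To make the call-count discrepancy concrete, here is a minimal sketch of the effect. It is not Bullet's profiler; countSolverCalls and CallCounts are hypothetical names, and the "profiler" is just a counter that ignores any call not made on the calling (main) thread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical result type: what a main-thread-only profiler would report
// versus how many calls actually happened across all threads.
struct CallCounts {
    int recordedByProfiler;  // calls observed on the main thread only
    int actualTotal;         // calls made by every thread combined
};

// Runs callsPerThread "solver" calls on each of numThreads threads
// (the calling thread counts as one of them) and tallies both views.
CallCounts countSolverCalls(int numThreads, int callsPerThread) {
    const std::thread::id mainId = std::this_thread::get_id();
    std::atomic<int> total{0};
    std::atomic<int> recorded{0};

    auto work = [&]() {
        for (int i = 0; i < callsPerThread; ++i) {
            total.fetch_add(1);  // the call really happened
            if (std::this_thread::get_id() == mainId) {
                recorded.fetch_add(1);  // only the main thread is profiled
            }
        }
    };

    std::vector<std::thread> workers;
    for (int t = 1; t < numThreads; ++t) {
        workers.emplace_back(work);
    }
    work();  // the main thread does its own share of the calls
    for (auto& w : workers) {
        w.join();
    }
    return {recorded.load(), total.load()};
}
```

Calling countSolverCalls(8, 12) gives 12 recorded calls against 96 actual calls, which matches the two numbers in the profile views.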
Another area that shows significant speedup is dispatchAllCollisionPairs (narrowphase collision detection). That went from 33.6ms to 8.3ms (also about a 4x speedup).
Other areas that show a speedup are predictUnconstraintMotion (3.07ms to 0.83ms), createPredictiveContacts (2.05ms to 0.41ms), and integrateTransforms (2.90ms to 1.00ms).
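For anyone who wants the ratios side by side, here is a small sketch that computes the speedup for each stage from the timings quoted above. The StageTiming struct and the function names are just illustrative; only the stage names and millisecond values come from the profile.

```cpp
#include <cstddef>
#include <cstdio>

// One profiled stage: time in single-threaded and multithreaded runs.
struct StageTiming {
    const char* name;
    double singleMs;
    double multiMs;
};

// Timings as quoted in this post.
const StageTiming kStages[] = {
    {"solveGroup",                132.9, 32.6},
    {"dispatchAllCollisionPairs",  33.6,  8.3},
    {"predictUnconstraintMotion",  3.07, 0.83},
    {"createPredictiveContacts",   2.05, 0.41},
    {"integrateTransforms",        2.90, 1.00},
};
const std::size_t kNumStages = sizeof(kStages) / sizeof(kStages[0]);

// Speedup is simply single-threaded time divided by multithreaded time.
double speedup(const StageTiming& s) {
    return s.singleMs / s.multiMs;
}

// Prints one line per stage with its speedup factor.
void printSpeedups() {
    for (std::size_t i = 0; i < kNumStages; ++i) {
        std::printf("%-28s %7.2fms -> %6.2fms  (%.2fx)\n",
                    kStages[i].name, kStages[i].singleMs,
                    kStages[i].multiMs, speedup(kStages[i]));
    }
}
```

Calling printSpeedups() from main lists each stage with its ratio. The two big stages come out at just over 4x, and the smaller ones range from about 2.9x to 5x.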
A few areas that are not parallelized at all are updateAabbs (the broadphase collision), calculateSimulationIslands (the process of generating the simulation islands), synchronizeMotionStates, and a few others.
The net effect in this case is that stepSimulation goes from 184.7ms to 55.2ms, an overall speedup of around 3.3x. These are the same performance results as before; the main difference is that the built-in profiling can now be used to see them.