
Performance tips? (OpenCL/multi-threading/etc)

Posted: Sat Jun 10, 2017 9:54 pm
by ivanisavich
I'm migrating my physics code for an application I'm working on from PhysX to Bullet, to do some performance testing. My application uses rigidbodies, constraints and cloth.

Initially I was under the impression that Bullet supports OpenCL computations, which greatly improve simulation performance...but digging a little more on the forum I found some posts that implied OpenCL in bullet hasn't been developed in years and is very much experimental. Has anyone implemented it recently? Is it stable? Does it work with cloth and constraints or only rigidbodies?

Also, I enabled bullet multithreading in my app and get moderate performance gains (about 10-15% better performance when using 12 threads instead of 1, on my 16 core machine)....but the simulations are still pretty slow.

As a comparison, when simulating 10k spheres dropping onto a ground plane, my PhysX sim runs more than twice as fast as Bullet.

Are there any other obvious ways to speed up rigidbody simulations? Any hidden precompiler flags I can enable, etc? I'm pretty new to bullet and so I don't have a good grasp of all possible tips and tricks yet, and would appreciate any advice!

Re: Performance tips? (OpenCL/multi-threading/etc)

Posted: Tue Jun 13, 2017 12:10 am
by lunkhound
ivanisavich wrote:I'm migrating my physics code for an application I'm working on from PhysX to Bullet, to do some performance testing. My application uses rigidbodies, constraints and cloth.

Initially I was under the impression that Bullet supports OpenCL computations, which greatly improve simulation performance...but digging a little more on the forum I found some posts that implied OpenCL in bullet hasn't been developed in years and is very much experimental. Has anyone implemented it recently? Is it stable? Does it work with cloth and constraints or only rigidbodies?
I don't know about the OpenCL stuff.
ivanisavich wrote:Also, I enabled bullet multithreading in my app and get moderate performance gains (about 10-15% better performance when using 12 threads instead of 1, on my 16 core machine)....but the simulations are still pretty slow.
Is this a recent version of bullet?
If so:
- are you using the DiscreteDynamicsWorldMt as your physics world?
- which task scheduler are you using?
- are most of the bodies in your simulation world bunched together? or are they spread out?
ivanisavich wrote:As a comparison, when simulating 10k spheres dropping onto a ground plane, my PhysX sim runs more than twice as fast as Bullet.

Are there any other obvious ways to speed up rigidbody simulations? Any hidden precompiler flags I can enable, etc? I'm pretty new to bullet and so I don't have a good grasp of all possible tips and tricks yet, and would appreciate any advice!
Check out the example browser in the current Bullet and look at the "MultiThreadedDemo" under the "experiments" section. It is kind of a best-case scenario for showing what kind of speedup is possible with multithreading in Bullet right now.
The demo has 48 separate stacks of boxes (which can each be solved in parallel). You can adjust the number of stacks by adjusting the "Stack rows" and "Stack columns" sliders.
In this demo you can tweak various options, like changing the task scheduler, the thread count, the solver flags, and see the effect on performance.
There is a built-in profiler:
- if you click on the "Display solver info" button, you should get a summary of profile stats relating to multithreading which can give some insight into where the time is going
- if you click on "View -> Profiler" you'll get a window with real-time detailed profiling stats for the main thread only
- if you press 'p' and hold it down for a half-second or so, a file called "timings_0.json" will be written to a folder somewhere (on Windows/Cmake/MSVC it is the same folder as the App_Examplebrowser.vcxproj file)

If you use the chrome web browser, and put "about://tracing" into the address bar, you can use it to load the json file to view timings for each thread.
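
For anyone who wants to feed their own timings into the same viewer, here is a minimal sketch, assuming the commonly used Trace Event JSON layout (a "traceEvents" array of complete events with "ph":"X" and times in microseconds). The TimedScope struct and toTraceJson function are hypothetical names for illustration, not part of Bullet, and the file Bullet writes may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one timed scope; not a Bullet type.
struct TimedScope {
    std::string name;         // label shown in the viewer (assumed to need no JSON escaping)
    int threadId;             // one row per thread id in the viewer
    std::int64_t startUs;     // start time in microseconds
    std::int64_t durationUs;  // duration in microseconds
};

// Serialize scopes as "complete" events (ph = "X") inside a traceEvents array.
std::string toTraceJson(const std::vector<TimedScope>& scopes) {
    std::ostringstream out;
    out << "{\"traceEvents\":[";
    for (std::size_t i = 0; i < scopes.size(); ++i) {
        const TimedScope& s = scopes[i];
        assert(s.durationUs >= 0);
        if (i > 0) out << ",";
        out << "{\"name\":\"" << s.name << "\",\"ph\":\"X\",\"pid\":1"
            << ",\"tid\":" << s.threadId
            << ",\"ts\":" << s.startUs
            << ",\"dur\":" << s.durationUs << "}";
    }
    out << "]}";
    return out.str();
}
```

Writing the returned string to a .json file should give something the tracing page can load, with one row per thread id.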

Re: Performance tips? (OpenCL/multi-threading/etc)

Posted: Thu Jun 13, 2019 7:04 am
by SnapperTT
First of all, thank you lunkhound for your work on implementing MP Bullet.

Like the OP I am also experiencing very modest performance boosts - looking at the MultiThreaded Demo in the example browser, with 12 threads (6 cores) I am getting a ~20% boost in performance for about a 400% increase in total CPU time.
ThreadSupport, No MT: 32ms
ThreadSupport, MT enabled w/ 12 threads: ~26ms
OpenMP, MT enabled: 38ms (!!)
Running with Bullet 2.88. This ratio is roughly consistent even when cranking the box counts up.
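
To put those timings in perspective, here is a small sketch of the arithmetic. The speedup and efficiency functions are plain ratios; the last function uses Amdahl's law, which is a simplifying model brought in here for illustration and not something measured in this thread. All function names are made up for the example.

```cpp
#include <cassert>

// Speedup = single-threaded frame time / multi-threaded frame time.
double speedup(double baselineMs, double parallelMs) {
    assert(baselineMs > 0.0 && parallelMs > 0.0);
    return baselineMs / parallelMs;
}

// Parallel efficiency = speedup per thread (1.0 would be perfect scaling).
double parallelEfficiency(double s, int threads) {
    assert(threads > 0);
    return s / threads;
}

// Amdahl's law: S = 1 / ((1 - p) + p / n). Solving for p gives the
// fraction of the frame that would have to run in parallel to see speedup S.
double amdahlParallelFraction(double s, int threads) {
    assert(s > 0.0 && threads > 1);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / threads);
}
```

With the numbers above (32ms vs ~26ms on 12 threads) this gives a speedup of about 1.23x, an efficiency of about 10%, and, under Amdahl's model, only about 20% of the frame time running in parallel.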

I'm curious because I'm considering my own implementation of multithreading: dividing the simulation into many btDynamicsWorld instances and giving each its own thread. If the simulation can be sensibly divided up so that there are few overlapping regions and objects stay within their worlds, there should be a linear (or better) speedup with world count. For instance, in a city simulation, the objects in each neighborhood can be grouped into their own btDynamicsWorld, as objects rarely transition between neighborhoods.
But from reading this thread (viewtopic.php?f=9&t=10232) it appears that this is basically what you are already doing within one world, creating the regions on the fly.
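
A rough sketch of the one-world-per-thread idea, using only the C++ standard library. ToyWorld is a hypothetical stand-in for an independent physics world (it is not a Bullet class and does no collision detection), and stepAllWorlds is a made-up helper; the point is only that worlds sharing no data can be stepped concurrently without locks.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one independent world; not a Bullet class.
struct ToyWorld {
    std::vector<double> heights;     // one entry per body
    std::vector<double> velocities;  // one entry per body

    // Advance every body by one step under gravity.
    void step(double dt) {
        assert(heights.size() == velocities.size());
        for (std::size_t i = 0; i < heights.size(); ++i) {
            velocities[i] += -9.81 * dt;
            heights[i] += velocities[i] * dt;
        }
    }
};

// Step each world on its own thread, then wait for all of them.
// This is safe only because the worlds share no data.
void stepAllWorlds(std::vector<ToyWorld>& worlds, double dt) {
    std::vector<std::thread> workers;
    workers.reserve(worlds.size());
    for (ToyWorld& w : worlds) {
        workers.emplace_back([&w, dt] { w.step(dt); });
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

Spawning threads every step is wasteful, so a real version would probably keep a persistent worker pool. Handing objects over between worlds when they cross a boundary is the hard part and is not shown here.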

Do you have any thoughts on what can be done to get the kind of perf boosts with DiscreteDynamicsWorldMt reported in that thread, or on whether my approach is sensible?

Re: Performance tips? (OpenCL/multi-threading/etc)

Posted: Tue Oct 08, 2019 9:14 am
by Dundo
SnapperTT wrote:
Thu Jun 13, 2019 7:04 am
Like the OP I am also experiencing very modest performance boosts - looking at the MultiThreaded Demo in the example browser, with 12 threads (6 cores) I am getting a ~20% boost in performance for about a 400% increase in total CPU time.
ThreadSupport, No MT: 32ms
ThreadSupport, MT enabled w/ 12 threads: ~26ms
OpenMP, MT enabled: 38ms (!!)
Running with Bullet 2.88. This ratio is roughly consistent even when cranking the box counts up.
Hello, I have the exact same results.
Interestingly, OpenMP is much slower than the single-threaded setup only when the GUI is not rendered (renderGui set to false in OpenGLExampleBrowser.cpp).
It is worth noting that with the GUI disabled there is a much bigger difference between MT and non-MT: I can gain ~100%!
Here is a screenshot of the tests that I've done.
You can read the legend this way:
Each label is two tokens separated by _; the first token is SOLVER_TYPE, the second is TASK_SCHEDULER.
Solver Types:
SIMT: SequentialImpulseMT
SI: SequentialImpulse
NNCG: NNCG

Task Scheduler:
TS: ThreadSupport (btCreateDefaultTaskScheduler())
OMP: OpenMP (btGetOpenMPTaskScheduler())
PPL: PPL (btGetPPLTaskScheduler())
ST: SingleThreaded (MULTITHREADED_WORLD_ENABLE is false)
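
For reading the labels programmatically, the legend above can be turned into a small lookup. This is only a sketch: decodeLegend is a made-up helper, and the tables contain just the abbreviations listed in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Split a label such as "SIMT_TS" at the underscore and expand both tokens.
// Returns {solver name, task scheduler name}; unknown tokens are reported as such.
std::pair<std::string, std::string> decodeLegend(const std::string& label) {
    static const std::map<std::string, std::string> solvers = {
        {"SIMT", "SequentialImpulseMT"},
        {"SI", "SequentialImpulse"},
        {"NNCG", "NNCG"},
    };
    static const std::map<std::string, std::string> schedulers = {
        {"TS", "ThreadSupport"},
        {"OMP", "OpenMP"},
        {"PPL", "PPL"},
        {"ST", "SingleThreaded"},
    };

    const std::size_t sep = label.find('_');
    assert(sep != std::string::npos);  // labels are expected to look like SOLVER_SCHEDULER

    const auto solver = solvers.find(label.substr(0, sep));
    const auto scheduler = schedulers.find(label.substr(sep + 1));
    return {
        solver != solvers.end() ? solver->second : std::string("unknown solver"),
        scheduler != schedulers.end() ? scheduler->second : std::string("unknown scheduler"),
    };
}
```

For example, decodeLegend("SIMT_TS") expands to SequentialImpulseMT with the ThreadSupport scheduler.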

All tests were done by adjusting the parameters in the "Benchmarks/Prim vs Mesh" demo and commenting out the "sCurrentDemo->renderScene()" call in OpenGLExampleBrowser.cpp to disable rendering.

Image