Basroil wrote:
lunkhound wrote:
https://github.com/bulletphysics/bullet3/issues/126
I'd like to get Bullet 2 running with CPU multithreading but it has been removed from recent versions of the repository. I looked up the commit that removed it (f06312c632492200ba1478ddc84c5ee474f74e54, comment is "remove most clutter (todo)"), but it doesn't explain why BulletMultithreaded was dropped. Anybody know why it was dropped?
1) It didn't work all that well
2) Bullet3 mostly replaces the functionality
You can still download an older commit easily if you want the code, though it might be better to develop a multithreaded dispatcher and keep the per-island calculations single-threaded for accuracy.
I had a look at the old BulletMultiThreaded code and I can see now why it was dropped. Basically it was designed around the Playstation 3 (asymmetric multiprocessing) architecture, and then retrofitted to work on SMP (symmetric multiprocessing) architectures. It really is quite a mess: it looks like the constraint solver and collision dispatcher were copied wholesale from Bullet and then heavily modified to work on the memory-limited SPUs. In short, it is not a good starting point for CPU multithreading. Too many things about it are locked in by the PS3-centric design.
So a few days ago I started developing a multithreaded Bullet based on the current Bullet 2 code on GitHub. I have a multithreaded dispatcher working already. The speedup on a 4-core system for the dispatchAllPairs call is about 2.2x. This is using a patched-up version of the MultithreadedDemo, with Intel TBB (Threading Building Blocks) as the task scheduler.
I'm trying to keep the changes to Bullet to a minimum, and incur no performance loss for the single-threaded version. Basically I'm throwing a few locks in at certain places that only get compiled in if you opt for it.
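To illustrate the opt-in locking idea, here is a minimal sketch. It is not the code from my fork: the macro EXAMPLE_USE_THREAD_LOCKS and the classes ExampleLock and ExampleManifoldPool are made-up names for illustration, and std::mutex stands in for whatever lock primitive is actually used. The point is only that the single-threaded build gets an empty lock the compiler can optimize away.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical macro standing in for a build option; off by default.
#ifndef EXAMPLE_USE_THREAD_LOCKS
#define EXAMPLE_USE_THREAD_LOCKS 0
#endif

#if EXAMPLE_USE_THREAD_LOCKS
// Threaded build: the lock wraps a real mutex.
class ExampleLock {
public:
    void lock() { m_mutex.lock(); }
    void unlock() { m_mutex.unlock(); }
private:
    std::mutex m_mutex;
};
#else
// Single-threaded build: empty inline functions, so no locking cost.
class ExampleLock {
public:
    void lock() {}
    void unlock() {}
};
#endif

// Hypothetical shared structure that several collision tasks might touch.
class ExampleManifoldPool {
public:
    static const int kCapacity = 1024;

    // Hands out the next free slot index; guarded only in the threaded build.
    int allocate() {
        std::lock_guard<ExampleLock> guard(m_lock);
        assert(m_count < kCapacity);
        return m_count++;
    }

    int count() const { return m_count; }

private:
    ExampleLock m_lock;
    int m_count = 0;
};
```

Compiling with -DEXAMPLE_USE_THREAD_LOCKS=1 switches in the real mutex; without it, allocate() behaves exactly like unguarded code.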
Also I'm not making Bullet depend on any specific threading library -- the launching of tasks all happens outside of Bullet -- so it can be used with any task scheduler. The demo app depends on TBB, but Bullet libraries do not.
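A rough sketch of what "launching tasks outside the library" can look like, under my own assumptions: the library side only sees a callable that runs a body over an index range, and the application decides how that callable is implemented. All names here (ExampleRangeBody, ExampleParallelFor, exampleDispatchAllPairs and so on) are hypothetical and not the API in my fork, and plain std::thread stands in for a real task scheduler such as TBB.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical interface: body(begin, end) processes indices in [begin, end).
using ExampleRangeBody = std::function<void(int, int)>;
using ExampleParallelFor = std::function<void(int, int, const ExampleRangeBody&)>;

// "Library" side: knows nothing about threads, only about the callable it was given.
void exampleDispatchAllPairs(std::vector<int>& pairResults,
                             const ExampleParallelFor& parallelFor) {
    const int n = static_cast<int>(pairResults.size());
    parallelFor(0, n, [&pairResults](int begin, int end) {
        for (int i = begin; i < end; ++i) {
            pairResults[i] = i * 2;  // stand-in for per-pair collision work
        }
    });
}

// "Application" side, option 1: run everything on the calling thread.
void exampleSerialParallelFor(int begin, int end, const ExampleRangeBody& body) {
    body(begin, end);
}

// "Application" side, option 2: split the range into chunks, one thread per chunk.
void exampleThreadedParallelFor(int begin, int end, const ExampleRangeBody& body) {
    assert(begin <= end);
    const int numThreads = 4;
    const int total = end - begin;
    const int chunk = (total + numThreads - 1) / numThreads;  // ceiling division
    std::vector<std::thread> workers;
    for (int start = begin; start < end; start += chunk) {
        const int stop = (start + chunk < end) ? start + chunk : end;
        workers.emplace_back([&body, start, stop] { body(start, stop); });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
}
```

Swapping schedulers is then a one-argument change at the call site: exampleDispatchAllPairs(results, exampleSerialParallelFor) versus exampleDispatchAllPairs(results, exampleThreadedParallelFor). Each chunk writes to a disjoint set of elements, so this particular body needs no locks.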
Code is here:
https://github.com/lunkhound/bullet3
Instructions:
- install TBB 4.3 (build if using the open source version)
- download my bullet3 fork
- run CMake on bullet
- look for and enable the CMake option called BULLET2_USE_THREAD_LOCKS
- look for and enable the CMake option called BULLET2_MULTITHREADED_TBB_DEMO
- do a CMake configure (new options should appear)
- set the option called BULLET2_TBB_INCLUDE_DIR to the path to the TBB includes directory
- set the option called BULLET2_TBB_LIB_DIR to the path to the TBB .lib files (needs tbb.lib and tbbmalloc.lib)
- do a CMake generate
- open up the resulting solution in Visual Studio (there should be a project called "AppMultiThreadedDemo")
- build the MultiThreadedDemo project in "Release"
- find the appropriate TBB .dlls (tbb.dll and tbbmalloc.dll) for your version of Visual Studio and manually copy them into the same directory as AppMultiThreadedDemo.exe
Now you should be able to run the demo. Pay attention to the numbers after "collision detection time" on screen. Those are the numbers that should be improved compared to a single-threaded version. To compile a single-threaded version for comparison, edit MultiThreadedDemo.cpp, and change USE_PARALLEL_DISPATCHER to 0.
Oh, and keep in mind that in this particular demo, most of the time is going towards solving constraints, and that part is still completely single threaded.
Other caveats:
- I've only tested this on Windows 7 with MSVC 2013. There is some pthreads code in btThreads.cpp for supporting OSX and Linux but I haven't even tried to compile it.
- Bullet's built-in profiling isn't thread-safe, so it gets automatically disabled (with BT_NO_PROFILE) by CMake when you enable the "USE_THREAD_LOCKS" option.
- The CMake TBB integration is pretty rough. For the debug build, the demo should link with tbb_debug.lib and tbbmalloc_debug.lib rather than tbb.lib and tbbmalloc.lib, but I couldn't figure out how to do that via CMake.
- I also hate that you have to manually copy the TBB DLLs. I'm not sure how to do that in CMake either.
I've also been working on threading the constraint solver and part of it is now working (sequential impulse solver, ConvertContacts). The code for that part isn't up on GitHub just yet.
I don't think one task per island is the best approach, however: the performance improvement would depend heavily on the composition of the islands. For example, if everything is clumped into one large island, it would drop to single-threaded performance. So I'm trying to make the tasks finer-grained than one per island.
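To put rough numbers on that, here is a small sketch that estimates the speedup when every island is a single indivisible task. The function name, the per-island "cost" values, and the greedy scheduling rule (largest island first, onto the least-loaded thread) are my own illustrative assumptions, not measurements and not how any real scheduler is guaranteed to behave.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Estimates speedup for one-task-per-island scheduling.
// islandCosts: hypothetical per-island solve costs (e.g. constraint counts).
// Returns total work divided by the work of the busiest thread.
double exampleIslandSpeedup(std::vector<int> islandCosts, int numThreads) {
    assert(numThreads > 0);
    // Largest islands first, so the greedy assignment balances better.
    std::sort(islandCosts.begin(), islandCosts.end(), std::greater<int>());
    std::vector<long long> load(numThreads, 0);
    for (int cost : islandCosts) {
        // Give the next island to whichever thread currently has the least work.
        *std::min_element(load.begin(), load.end()) += cost;
    }
    const long long total = std::accumulate(load.begin(), load.end(), 0LL);
    const long long busiest = *std::max_element(load.begin(), load.end());
    return busiest > 0 ? static_cast<double>(total) / busiest : 1.0;
}
```

With 4 threads, exampleIslandSpeedup({250, 250, 250, 250}, 4) gives 4.0, while exampleIslandSpeedup({1000}, 4) gives 1.0: the same total work, but a single big island leaves three threads idle. A lopsided mix such as {700, 100, 100, 100} is capped by the 700-cost island no matter how many threads there are.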