CPU multithreading is working!

Mako_energy02
Posts: 170
Joined: Sun Jan 17, 2010 4:47 am

Re: CPU multithreading is working!

Post by Mako_energy02 » Tue Aug 22, 2017 5:44 pm

A few quick clarifications:

This only arguably counts, but I did make sure everything worked with the SequentialTaskScheduler prior to writing my own scheduler. I can confirm everything works fine with it. But...it's not multi-threaded. Given that, these issues appear to be race conditions. I'll probably try OpenMP for testing purposes, but I don't consider that a long-term solution for my project.
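For what it's worth, swapping schedulers for that kind of testing is only a line or two. A minimal sketch, assuming a build with BT_THREADSAFE (and BT_USE_OPENMP for the OpenMP case) and the accessors declared in LinearMath/btThreads.h; the OpenMP accessor returns a null scheduler if that backend wasn't compiled in:

```cpp
#include "LinearMath/btThreads.h"

// Pick a scheduler before creating/stepping the multithreaded world.
// The sequential scheduler runs every parallelFor on the calling thread,
// which makes race conditions easy to rule out.
void selectScheduler(bool useOpenMP)
{
    btITaskScheduler* scheduler = useOpenMP ? btGetOpenMPTaskScheduler()
                                            : btGetSequentialTaskScheduler();
    btSetTaskScheduler(scheduler);
}
```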

Wrapping around would make sense; however, I was observing issues on the first or second call to StepSimulation. Assuming the counter only increments once on thread creation and starts at zero (which I believe is what I saw), not enough threads are being created before the error for a wrap-around to occur. I could debug further.

I am aware that the BvhTriangleMeshShape can't collide with itself (that was one of the first things I tried doing with Bullet years ago). The colliding shapes are boxes and spheres, with most being convex hulls. All of the bodies are RigidBodies. When those collisions occur, I get the infinite loop. As I said above, the loop doesn't occur with the SequentialTaskScheduler.

Edit:
I've debugged further and found that I was focusing on the wrong counter. In the getNext() method of the ThreadSafeCounter class, you check whether it has exceeded the maximum supported number of threads and assert if it has. That was the error I was hitting.
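For anyone else hitting this, the check has roughly this shape as I understand it (an assumed sketch, not Bullet's actual source; the 64-thread limit is the one confirmed in the reply below):

```cpp
#include <atomic>
#include <cassert>

// Assumed sketch: each thread entering Bullet for the first time asks this
// counter for an index, and the assert fires once more unique threads have
// asked than the library supports.
struct ThreadSafeCounterSketch
{
    static const int kMaxSupportedThreads = 64;  // the limit I was running into
    std::atomic<int> m_count{0};

    int getNext()
    {
        int index = m_count.fetch_add(1);
        assert(index < kMaxSupportedThreads && "too many unique threads used for Bullet tasks");
        return index;
    }
};
```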

lunkhound
Posts: 89
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound » Thu Aug 24, 2017 7:26 pm

If you are hitting that assertion, then it means more than 64 unique threads have been used for doing tasks in bullet. Your task scheduler should only be creating/using one thread per CPU core (actually one less, because the main thread already exists). The task scheduler should not be creating threads on an ongoing basis -- that would be a red flag right there.
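For illustration, a minimal sketch of that pattern in plain C++11 (not Bullet's btITaskScheduler code): the workers are created exactly once, sized to the core count minus one, and every parallelFor call reuses them instead of spawning new threads.

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class SimpleParallelFor
{
public:
    SimpleParallelFor()
    {
        // One worker per core, minus one: the calling thread also does work.
        unsigned cores = std::thread::hardware_concurrency();
        unsigned workers = cores > 1 ? cores - 1 : 0;
        for (unsigned i = 0; i < workers; ++i)
            m_threads.emplace_back([this] { workerLoop(); });
    }

    ~SimpleParallelFor()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_quit = true;
        }
        m_wake.notify_all();
        for (std::thread& t : m_threads)
            t.join();
    }

    // Splits [begin, end) into grain-sized chunks pulled from a shared counter.
    void parallelFor(int begin, int end, int grain, const std::function<void(int, int)>& body)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_end = end;
            m_grain = grain;
            m_body = &body;
            m_next.store(begin);
            m_remaining = (int)m_threads.size();
            ++m_generation;  // tells the workers a new batch is ready
        }
        m_wake.notify_all();
        runChunks();  // the calling thread participates too
        std::unique_lock<std::mutex> lock(m_mutex);
        m_done.wait(lock, [this] { return m_remaining == 0; });
    }

private:
    void runChunks()
    {
        for (;;)
        {
            int i = m_next.fetch_add(m_grain);
            if (i >= m_end)
                break;
            (*m_body)(i, std::min(i + m_grain, m_end));
        }
    }

    void workerLoop()
    {
        unsigned long long seen = 0;
        for (;;)
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_wake.wait(lock, [&] { return m_quit || m_generation != seen; });
            if (m_quit)
                return;
            seen = m_generation;
            lock.unlock();
            runChunks();
            lock.lock();
            if (--m_remaining == 0)
                m_done.notify_all();
        }
    }

    std::vector<std::thread> m_threads;
    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::condition_variable m_done;
    std::atomic<int> m_next{0};
    const std::function<void(int, int)>* m_body = nullptr;
    int m_end = 0;
    int m_grain = 1;
    int m_remaining = 0;
    unsigned long long m_generation = 0;
    bool m_quit = false;
};
```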

Mako_energy02
Posts: 170
Joined: Sun Jan 17, 2010 4:47 am

Re: CPU multithreading is working!

Post by Mako_energy02 » Fri Sep 15, 2017 12:18 am

The message in the assert was self-explanatory when I found it in the source. And yes, clearly creating that many threads that often is bad. I do create and destroy threads elsewhere in my engine, but I had greatly underestimated how often btParallelFor is called with this new multi-threaded system. Depending on the amount of activity, I recorded between 2 and 25 calls per frame. I had naively assumed it would be called twice, once for the narrowphase and once for the solver. Looking more in depth, I see it's called in 5 places, so I guess either some are called multiple times per frame, or simulation sub-steps are involved.
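For context, a call site looks roughly like this (btParallelFor and btIParallelForBody as declared in LinearMath/btThreads.h; the loop body here is made up purely for illustration, not Bullet's actual narrowphase code):

```cpp
#include "LinearMath/btThreads.h"

// Hypothetical loop body, for illustration only.
struct ExampleLoopBody : public btIParallelForBody
{
    void forLoop(int iBegin, int iEnd) const override
    {
        for (int i = iBegin; i < iEnd; ++i)
        {
            // do per-object work for index i
        }
    }
};

void runExample(int numObjects)
{
    ExampleLoopBody body;
    // Every call like this is forwarded to whatever scheduler was installed
    // with btSetTaskScheduler().
    btParallelFor(0, numObjects, 64, body);
}
```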

As I said in a previous post, I've been working on this little by little. I have found none of the existing TaskSchedulers suitable for me, so I tried to make my own. I've gone through more bug fixing and alternative approaches than I've listed here, and I won't list them all. I focused on C++ threading primitives where I could, because I don't share Bullet's allergy to the C++ standard library. Ultimately, I couldn't get acceptable performance with any of my attempts.

I feel that, ultimately, having a system/interface to distribute parallelFors is a poor design that forces more poor designs on anyone who uses it. More intelligent systems exist and can be deployed with Bullet, but that requires the ability to freely change the Bullet API, which I understand wasn't something that could be done with this refactor. I feel that is a requirement for good, long-lasting CPU multi-threading support in Bullet to be realized.

Lunkhound, thank you for your effort. I'm sure others have found it useful and are using it, but I can't. So now I have to decide whether the other improvements in the newest version are worth a very significant feature regression (CPU multi-threading) relative to the version I've been using (2.83).

lunkhound
Posts: 89
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound » Sat Sep 16, 2017 7:23 am

Mako_energy02, thanks for contributing to this thread and sharing your experiences. I appreciate it.

Yes, parallelFor is called typically 5 times per simulation sub-step. In the work I am currently doing where I'm parallelizing the constraint solver, I'm leaning much harder on the task scheduler, and parallelFor gets called potentially hundreds of times per constraint solver. This is not a problem for a well-implemented task scheduler.

Here is a profile graph of my work-in-progress parallel solver using the btDefaultTaskScheduler on a 4-core (very old) computer of mine.
This example compares solving constraints for a single stack of 961 boxes with 7890 contact manifolds (more than 8 manifolds per body) on a single thread and also with my new parallel solver.
In this example, the parallel solver is calling parallelFor more than 300 times.
[Attachment: bullet-profile-bigstack-comparison.png]
The single threaded solver (SequentialImpulseConstraintSolver) took 21.4ms, whereas the new parallel solver took 12.7ms. So roughly a 1.68X speedup.
However, this is still a work in progress, and I think I can get better results. In the graphic I put red circles showing where threads are sitting idle.
With better batching of constraints I should be able to reduce that idle time and improve performance more.

But it is true that the task scheduler is key to getting good performance. It is extremely easy for a bad task scheduler to cause terrible performance.
I REALLY don't recommend trying to write one from scratch; I suggest using one of the ones provided.

Did you have trouble getting the default task scheduler working?

Mako_energy02
Posts: 170
Joined: Sun Jan 17, 2010 4:47 am

Re: CPU multithreading is working!

Post by Mako_energy02 » Sat Sep 16, 2017 11:24 pm

Did you have trouble getting the default task scheduler working?
I didn't try. I reviewed the code itself and recalled comments about it that you had made. A combination of needing to move 6+ files, concerns over the interface, the lack of inclusion in the core library, and references to Bullet 3 code (I'm still very much using Bullet 2) all eliminated it as a prospect for me. If the default task scheduler has moved to the core lib, then I'll do an update when I eventually revisit this (no idea when) and try that.

Opinionated Rant incoming:

While what you've shown me is faster than what I had achieved, a less-than-2x speedup over 4 cores is not what I would have expected. Combining that with the strong recommendation to use a pre-packaged solution -- both coming from the author of the system -- is not as encouraging as one would hope. Personally, I feel that either the work is generic enough that overrides and customization can take place, or it's not generic enough and Bullet should lock down the implementation. Currently, the implementation seems stuck between the two ideologies.

For example, Ogre3D had this issue: in 1.7-2.0 there was a WorkQueue interface that took a similar approach to the btITaskScheduler. They provided some implementations, such as a TBB-based one, that people could use. One major difference is that the Ogre WorkQueue didn't focus on parallelFors; it was more generic, so it could take work handed to it by Ogre's terrain system at the time. Regardless, it was large, difficult to maintain, and made assumptions about how Ogre was to be used in relation to the rest of the engine. Custom WorkQueues made by users were impractical. It ended up getting scrapped in favor of an entirely internal, much simpler, more focused threading solution that directly addressed Ogre's needs.

I can't say the current design is just like the Ogre WorkQueue, but I do see a number of similarities. If the threading needs of Bullet are so specific that only the author can write implementations, why bother with the interface? Why even try to appeal to the use of other threading solutions? I know I'm making it sound like I'm going back on wanting to change the Bullet API to better accommodate threading solutions, so I'll state directly that either could work. You'd likely have to change the API a little bit regardless due to the modular nature of Bullet.

Then again, maybe I'm just a special snowflake. I am only one person complaining, and the threading model I use in my engine is different from nearly every other threading model I've seen, so I have some opinionated views on threading that I know many others don't share. Take what I say with a grain of salt.

Regardless of my opinions on design, I eagerly look forward to future updates to the system.

lunkhound
Posts: 89
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound » Sun Sep 17, 2017 10:35 pm

Mako_energy02 wrote:I didn't try. I reviewed the code itself and recalled comments about it that you had made. A combination of needing to move 6+ files, concerns over the interface, the lack of inclusion in the core library, and references to Bullet 3 code (I'm still very much using Bullet 2) all eliminated it as a prospect for me. If the default task scheduler has moved to the core lib, then I'll do an update when I eventually revisit this (no idea when) and try that.
I agree it should be moved to the core lib; however, it should be quite easy to copy those 8 files (btTaskScheduler.h/cpp, b3ThreadSupportInterface.h/cpp, b3Win32ThreadSupport.h/cpp, b3PosixThreadSupport.h/cpp) into your project and get it working. There are no significant dependencies on Bullet 3 libraries (despite the "b3" prefix on some things). You should be able to easily remove all references to Bullet 3 headers with 3 edits:
- in btTaskScheduler.cpp, delete the include of b3Clock.h -- it's not needed
- in b3ThreadSupportInterface.h, replace the b3Scalar.h include with LinearMath/btScalar.h
- replace B3_ATTRIBUTE_ALIGNED16 with BT_ATTRIBUTE_ALIGNED16 (I only see one occurrence of it)
That's all it should take to use the default task scheduler now. Hopefully in the future it will be part of the core lib and those steps won't be needed.
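A minimal usage sketch once those files are in your project (assuming the factory is spelled btCreateDefaultTaskScheduler(), as in trees where the task scheduler was merged into LinearMath; the copied example code may name it slightly differently):

```cpp
#include "LinearMath/btThreads.h"
#include "btTaskScheduler.h"  // the copied file

void initBulletThreading()
{
    btITaskScheduler* scheduler = btCreateDefaultTaskScheduler();
    scheduler->setNumThreads(scheduler->getMaxNumThreads());
    btSetTaskScheduler(scheduler);  // install before creating/stepping the MT world
}
```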
Mako_energy02 wrote:While what you've shown me is faster than what I had achieved, a less than 2x speedup over 4 cores is not what I would have expected.
I think you've missed the point I was trying to make there. In the official version of Bullet, the constraint solver is single-threaded. If you have many simulation islands of bodies, it is able to run the constraint solvers in parallel on different threads, but each solver only uses a single thread to do its work. In a best-case scenario (many equal-sized simulation islands) you should get close to a 4x speedup on 4 cores for the solving of the constraints (assuming a reasonably performant task scheduler).
In a worst-case scenario, where there is only a single simulation island (all bodies in a single big stack), there is no speedup at all because it is single threaded.
For solving constraints it is much more difficult to extract parallelism from a single simulation island because of data dependencies -- it is not an "embarrassingly parallel" problem like the narrowphase collision, or like solving independent simulation islands in parallel.

The point I was trying to make is that there is nothing wrong with the performance of "parallelFor" as you seemed to be implying. In the example I posted, parallelFor was being invoked over 300 times (far far more than the 25 times you were concerned about), and the performance issues were caused by the batching algorithm of the new (work in progress) parallel constraint solver -- NOT the task scheduler.
If you look closely at the profile graph you can see that the task scheduler is working exactly as it should -- tasks are being quickly dispatched to the worker threads, and the parallelFor finishes very shortly after the last task completes.
Mako_energy02 wrote:Combining that with the strong recommendation to use a pre-packaged solution, both of these from the author of the system is not as encouraging one would hope. Personally, I feel that either the work is generic enough that overrides and customization can take place, or it's not generic enough and Bullet should lock down the implementation. Currently, the implementation seems stuck in between the two ideologies.
My feeling is that if Bullet multithreading is to be useful to the broadest range of projects, it must not dictate a particular task scheduler. Many (most?) game projects already have a custom task scheduler, my own included. I do not want to deal with multiple different task schedulers, each having a set of worker threads, and the added complexity and risk of thread oversubscription issues it would bring.
At the same time, it should just work out of the box by default, for people who don't mind letting Bullet manage threads on its own. That's what the default task scheduler is supposed to be for.
Mako_energy02 wrote:If the threading needs of Bullet are so specific that only the author can write implementations, why bother with the interface? Why even try to appeal to the use of other threading solutions?
ParallelFor is the most generic, highest level threading abstraction I know of. The TBB, OpenMP and PPL implementations all work and are reasonably performant.
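To illustrate how thin that abstraction is, an OpenMP-backed parallelFor is essentially a pragma around the chunked loop (a rough sketch assuming the btIParallelForBody interface from btThreads.h, not the exact library source):

```cpp
#include <algorithm>
#include "LinearMath/btThreads.h"

// Forwards a parallelFor range to OpenMP in grain-sized chunks.
void parallelForOpenMP(int iBegin, int iEnd, int grainSize, const btIParallelForBody& body)
{
#pragma omp parallel for schedule(static, 1)
    for (int i = iBegin; i < iEnd; i += grainSize)
    {
        body.forLoop(i, (std::min)(i + grainSize, iEnd));
    }
}
```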
Mako_energy02 wrote:Regardless of my opinions on design, I eagerly look forward to future updates to the system.
Stay tuned!

quant
Posts: 3
Joined: Thu Mar 02, 2017 8:44 pm

Re: CPU multithreading is working!

Post by quant » Wed Jan 10, 2018 12:37 am

Thank you for all your work on this.

I'm seeing between 2-4x performance increase with thousands of simulated bodies. Running an i7-7700.

quant
Posts: 3
Joined: Thu Mar 02, 2017 8:44 pm

Re: CPU multithreading is working!

Post by quant » Fri Jan 19, 2018 4:24 am

I've run into a problem when using constraints with kinematic bodies (simulated bodies work fine).

btSimulationIslandManagerMt calls buildAndProcessIslands(), which calls addConstraintsToIslands(), which calls btGetConstraintIslandId(), which returns -1.

The next line, "Island* island = getIsland(islandId);", asserts that the id is >= 0.

No problems with the non-MT build. Any idea how to fix this?
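For reference, the helper in question typically has this shape (a sketch based on the non-MT island manager; the btSimulationIslandManagerMt version may differ), which is why a constraint can come back with id -1 when neither constrained body carries a valid island tag:

```cpp
#include "BulletDynamics/ConstraintSolver/btTypedConstraint.h"

// Sketch: a constraint's island id is taken from whichever constrained body
// has a valid (>= 0) island tag; static and kinematic bodies keep a tag of -1.
static int getConstraintIslandIdSketch(const btTypedConstraint* constraint)
{
    const btCollisionObject& bodyA = constraint->getRigidBodyA();
    const btCollisionObject& bodyB = constraint->getRigidBodyB();
    return bodyA.getIslandTag() >= 0 ? bodyA.getIslandTag() : bodyB.getIslandTag();
}
```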
