CPU multithreading is working!

Basroil · Post by **Basroil** » Thu Jan 08, 2015 5:39 pm

lunkhound wrote: These tweaks you mention, are those the fixes for the MLCP solver to not use static data?

Yup, just that (and also avoiding possible issues with callback functions by using lambdas rather than the static/global functions used in the demos). Been pretty swamped these last few days, but I'll put something up soon. Still need to check the individual MLCP solver types to make sure I didn't miss anything, but at least Dantzig is working quite nicely.

Currently that code isn't using collisions though, so I can't really say if the dbvt issue is there.

Granyte · Post by **Granyte** » Sat Jan 10, 2015 5:54 am

OK, so I have nailed down what could cause a crash.

When BT_NO_PROFILE is not defined, two colliding GImpact meshes will sometimes send the profiler into infinite recursion.

Since I enabled BT_NO_PROFILE the issue has gone away.

lunkhound · Post by **lunkhound** » Sat Jan 10, 2015 7:52 pm

Granyte wrote: OK, so I have nailed down what could cause a crash.

When BT_NO_PROFILE is not defined, two colliding GImpact meshes will sometimes send the profiler into infinite recursion.

Since I enabled BT_NO_PROFILE the issue has gone away.

Thanks for the update.

What's odd is that CMake should have taken care of that for you. When the BULLET2_USE_THREAD_LOCKS option is enabled in CMake, it should have defined BT_THREADSAFE as well as BT_NO_PROFILE. I'm not sure how you could have had profiling enabled.

Granyte · Post by **Granyte** » Thu Jan 15, 2015 1:37 pm

Raytesting against a GImpact mesh does not seem thread-safe. I'm getting a crash at

btConvexPolyhedron::testContainment

during a call to malloc. That's all the information I have for now.

lunkhound · Post by **lunkhound** » Fri Jan 16, 2015 11:17 am

Granyte wrote: Raytesting against a GImpact mesh does not seem thread-safe. I'm getting a crash at

btConvexPolyhedron::testContainment
during a call to malloc. That's all the information I have for now.

Thanks for the info.

However, I haven't been able to reproduce it. I also looked at the code for btConvexPolyhedron but I didn't see anything suspicious. I didn't even see any connection between gimpact or raytesting and btConvexPolyhedron.

A callstack would be helpful in tracking this down. Do you have a reliable way to reproduce this crash?

[edit] One thing you might want to try is adding a line at the beginning of btConvexPolyhedron::testContainment:

    btAssert( !btThreadsAreRunning() );

and you'll need to add this include line near the top of the file:

#include "LinearMath/btThreads.h"

Assuming you are bracketing your threaded sections with btPushThreadsAreRunning()/btPopThreadsAreRunning() calls, the assert should fire (in debug builds only!) if anything tries to call that code while the worker threads are going. I use that technique all the time to find threading problems.

I've already tried this locally, and I'm doing tons of raytests with a large static gimpact mesh and a smaller dynamic gimpact mesh. The assert doesn't fire for me which tells me that that function isn't even being called while the worker threads are going.
OK, tried setting a breakpoint, and that function isn't ever being called at all.
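
The push/pop bracketing described above can be sketched in standalone form. This is only an illustration of the idea, not the code from the branch: the `sketch` namespace, the counter, and every function name below are hypothetical stand-ins for btPushThreadsAreRunning(), btPopThreadsAreRunning() and btThreadsAreRunning(), whose real implementation may differ.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for the Bullet helpers mentioned in the post;
// the names and the counter-based implementation are assumptions.
namespace sketch
{
    // Greater than zero while a threaded section is active.
    std::atomic<int> gThreadsRunningCounter{0};

    inline void pushThreadsAreRunning() { gThreadsRunningCounter.fetch_add(1); }
    inline void popThreadsAreRunning()  { gThreadsRunningCounter.fetch_sub(1); }
    inline bool threadsAreRunning()     { return gThreadsRunningCounter.load() > 0; }

    // RAII helper so the pop cannot be forgotten on an early return.
    struct ThreadedSection
    {
        ThreadedSection()  { pushThreadsAreRunning(); }
        ~ThreadedSection() { popThreadsAreRunning(); }
    };

    // A function that is only safe to call from single-threaded code.
    inline int notThreadSafe(int x)
    {
        // Fires in debug builds if called while a threaded section is active.
        assert(!threadsAreRunning());
        return x + 1;
    }
}
```

Placing a `sketch::ThreadedSection` object at the top of the scope that launches the worker threads marks everything inside it as "threads are running", so any call to a guarded function from inside that scope trips the assert in a debug build.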

Granyte · Post by **Granyte** » Sat Jan 17, 2015 4:10 am

I'll test more; my debugger doesn't tell me anything more at the moment.

I only get a call stack without any more data; it won't even show the method's source when I crash.
EDIT: Also, I was wondering, what does schedule(static, 50) mean in OpenMP?

I have been converting your classes to the MSVC concurrency namespace, and I don't know what this means, so I cannot convert it:

    #pragma omp parallel for schedule(static, 50)
    for ( int i = 0; i < m_nonStaticRigidBodies.size(); i++ )
    {
        btRigidBody* body = m_nonStaticRigidBodies[i];
        if ( !body->isStaticOrKinematicObject() )
        {
            // don't integrate/update velocities here, it happens in the constraint solver
            body->applyDamping( timeStep );
            body->predictIntegratedTransform( timeStep, body->getInterpolationWorldTransform() );
        }
    }

Flix · Post by **Flix** » Sat Jan 17, 2015 10:03 am

About schedule(static, 50):

This is how OpenMP partitions the loop iterations among the threads:
50 should mean that each thread takes 50 iterations at a time.
static means that the partition is made ahead of time (less time to partition, but some threads may finish early and have nothing left to do).

Alternatives are dynamic (= each thread grabs a new chunk of 50 iterations whenever it has finished its previous one -> slower, but threads that have nothing to do are reused), and guided (= AFAIK it is similar to dynamic, but the chunks start large and shrink towards the end, when it's more likely that some thread has nothing to do).

This is just what I remember about it. Please refer to proper OpenMP docs for more reliable info.

P.S. Also note that in the code, AFAIR, 50 refers to the number of rigid bodies: that means that a simulation with fewer than 50 rigid bodies is still using a single thread (well, that's what I understand about it). We could try lowering this number and see what happens with fewer than 50 bodies.
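
To make the static case concrete, here is a small sketch that computes which thread would own each iteration under schedule(static, chunk). It relies on the OpenMP rule that static chunks are handed to threads round-robin in thread order; the function names are hypothetical and no OpenMP runtime is involved, so it only models the assignment.

```cpp
#include <cassert>
#include <vector>

// Thread that executes iteration i under schedule(static, chunk) with
// numThreads threads: chunks of `chunk` consecutive iterations are
// assigned round-robin, starting with thread 0.
inline int staticScheduleOwner(int i, int chunk, int numThreads)
{
    assert(i >= 0 && chunk > 0 && numThreads > 0);
    return (i / chunk) % numThreads;
}

// Number of iterations each thread receives for a loop of n iterations.
inline std::vector<int> staticScheduleCounts(int n, int chunk, int numThreads)
{
    std::vector<int> counts(numThreads, 0);
    for (int i = 0; i < n; ++i)
        ++counts[staticScheduleOwner(i, chunk, numThreads)];
    return counts;
}
```

For example, `staticScheduleCounts(40, 50, 4)` puts all 40 iterations on thread 0, which is the "fewer than 50 bodies stays single-threaded" effect, while `staticScheduleCounts(120, 50, 4)` gives threads 0 to 3 the counts 50, 50, 20 and 0.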

lunkhound · Post by **lunkhound** » Sat Jan 17, 2015 8:05 pm

If you are converting to the concurrency PPL (Parallel Patterns Library) in MSVC, you'd be better off looking at the TBB code since PPL is a virtual carbon copy of TBB.

The "schedule(static, 50)" part corresponds to the "partitioner" and grainsize in TBB/PPL terms. They are just for performance tuning -- there is no effect on program logic or correctness of execution.

For the TBB code, I ended up using the simple_partitioner everywhere. I tried the auto_partitioner, but it seemed to cause greater fluctuations in performance, and the performance didn't seem any better overall than the simple_partitioner. When I translated it to OpenMP I didn't see anything corresponding to the simple_partitioner so I just put "static" everywhere. I think when I was testing OpenMP I didn't have the thread count set correctly, so it may be better to use "guided" or something. The demo with OpenMP doesn't perform as well as TBB when there are too many worker threads.

And about the grainsize (the "50" parameter), Flix is right. You want to tune that number so that it will avoid all the overhead of partitioning the work and synchronizing threads if it is faster to just do the work single-threaded. If the grainsize is too small, you incur excessive overhead from the task scheduler; if the grainsize is too large, then the tasks won't be spread evenly amongst the threads and some threads will spend too much time sitting idle.

[edit] Oh and the grainsize is the number of iterations each task will do. So for some loops each iteration is a rigid body, for others it's an overlapping pair of bodies, and for others it's a simulation island. I think in the quoted example it was rigid bodies, but that's not always the case.
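
The grainsize trade-off can be illustrated with a minimal parallel-for built on std::thread. This is a sketch under stated assumptions, not how TBB, PPL or OpenMP implement their schedulers: `parallelForSketch` is a hypothetical name, chunks are claimed through a shared atomic counter, and threads are created on every call rather than taken from a pool.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Runs body(begin, end) over [0, n) in tasks of at most `grainSize`
// iterations and returns the number of tasks that were executed.
inline int parallelForSketch(int n, int grainSize, int numThreads,
                             const std::function<void(int, int)>& body)
{
    assert(n >= 0 && grainSize > 0 && numThreads > 0);
    if (n == 0)
        return 0;
    if (n <= grainSize || numThreads == 1)
    {
        body(0, n); // too little work to be worth splitting
        return 1;
    }
    const int numTasks = (n + grainSize - 1) / grainSize;
    std::atomic<int> nextTask{0};
    auto worker = [&]()
    {
        // Each worker repeatedly claims the next unprocessed chunk.
        for (int t = nextTask.fetch_add(1); t < numTasks; t = nextTask.fetch_add(1))
        {
            const int begin = t * grainSize;
            body(begin, std::min(begin + grainSize, n));
        }
    };
    std::vector<std::thread> threads;
    const int extra = std::min(numThreads, numTasks) - 1;
    for (int i = 0; i < extra; ++i)
        threads.emplace_back(worker);
    worker(); // the calling thread participates too
    for (std::thread& th : threads)
        th.join();
    return numTasks;
}
```

With a grainsize of 50, a loop over 40 bodies runs as a single task on the calling thread, while a loop over 230 bodies is split into 5 tasks. The body must only touch data belonging to its own range, as the per-body loop quoted earlier does.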

lunkhound · Post by **lunkhound** » Thu May 21, 2015 1:02 am

I just put up a new version of CPU multithreading for Bullet 2 on github. This version is based on the latest version of Bullet 2.83 on github.

Changes compared to previous:

- The Multithreaded Demo is now built into the example browser (it's under "experimental").
- Microsoft PPL support added (only available for MSVC 2010 or later)
- Demo now allows switching between OpenMP, TBB, PPL and single-threaded at runtime
- Built-in bullet profiling doesn't need to be disabled, it has been fixed so that it only gathers timings from the main thread
- Added CMake options to enable OpenMP and PPL support
- Demo code is cleaned up quite a bit.

In the new demo, there are around 11000 boxes in various piles, and they have had sleeping disabled. If you run the demo long enough some of the stacks of boxes will fall down, but they should stay standing long enough to use the profile viewer and switch between the different threading options to get an idea of how performance is affected.

In order to build the demo, make sure to enable BULLET2_USE_THREAD_LOCKS and also BUILD_BULLET2_DEMOS in CMake. Then do a "configure" and 3 new options should appear for the multithreaded demo in 3 flavors: OpenMP, TBB, and PPL. Select your desired flavors. For TBB you'll need to install it separately, and then point CMake to the include and lib directories.
OpenMP should be available on recent versions of GCC, Clang and MSVC.

lunkhound · Post by **lunkhound** » Thu May 21, 2015 9:18 pm

Here is a screenshot of what the new version of the demo looks like:

bullet-example-browser.jpg

Here is the profile view in single threaded mode, followed by multi-threaded:

bullet-example-browser-single-threaded.jpg

bullet-example-browser-tbb.jpg

My machine has a quad-core CPU with hyperthreading, so by default I'm using 8 threads, but I wouldn't expect to get more than a 4x speed improvement.
So solveGroup (the constraint solver) went from 132.9ms to 32.6ms (roughly 4 times faster). That's probably the most significant change right there. In fact you'll notice that in single-threaded mode it gets called 96 times, while in the multithreaded run it is only called 12 times. That's because each of the 8 threads is calling it 12 times but the profiler is only recording timing for the main thread.
Another area that shows significant speedup is dispatchAllCollisionPairs (narrowphase collision detection). That went from 33.6ms to 8.3ms (also a 4x speedup).
Other areas that show a speedup are predictUnconstraintMotion (3.07ms to 0.83ms), createPredictiveContacts (2.05ms to 0.41ms), and integrateTransforms (2.90ms to 1.00ms).

A few areas that are not parallelized at all are updateAabbs (the broadphase collision), calculateSimulationIslands (the process of generating the simulation islands), synchronizeMotionStates, and a few others.

The net effect in this case is that the stepSimulation goes from 184.7ms to 55.2ms, an overall speedup of around 3.3x. These are the same performance results as before, the main difference is that now the built-in profiling can be used to see it.
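
As a rough worked check of these figures, the speedups follow directly from the quoted timings. The last line is an assumption layered on top: it applies Amdahl's law with N = 4 physical cores to estimate what fraction p of the single-threaded step is effectively parallelized, which is only a back-of-the-envelope estimate and not something measured in the demo.

```latex
S_{\text{stepSimulation}} = \frac{184.7\ \text{ms}}{55.2\ \text{ms}} \approx 3.35
\qquad
S_{\text{solveGroup}} = \frac{132.9\ \text{ms}}{32.6\ \text{ms}} \approx 4.08
\qquad
S_{\text{dispatchAllCollisionPairs}} = \frac{33.6\ \text{ms}}{8.3\ \text{ms}} \approx 4.05

\frac{1}{S} = (1 - p) + \frac{p}{N}
\;\Longrightarrow\;
p = \frac{1 - 1/S}{1 - 1/N} = \frac{1 - 1/3.35}{1 - 1/4} \approx 0.93
```

Under that assumption, roughly 7% of the single-threaded step time behaves as serial work, which is consistent with a handful of unparallelized stages holding the overall speedup below 4x.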

gdlk · Post by **gdlk** » Wed May 27, 2015 7:08 pm

Nice!!

A question about MLCP: to use it in the multithreaded demo example, in the class MyConstraintSolverPool, with this change

        
-            m_solverType = BT_SEQUENTIAL_IMPULSE_SOLVER;
+            m_solverType = BT_MLCP_SOLVER;

-            btConstraintSolver* solver = new btSequentialImpulseConstraintSolver(); 
+            btDantzigSolver* mlcp = new btDantzigSolver();
+            btConstraintSolver* solver = new btMLCPSolver(mlcp);

is that enough? (I tried it, but got lower performance than single-threaded D= (with a complex scene with too many objects)). What am I missing??

Thanks!! (and great work!! =D )

lunkhound · Post by **lunkhound** » Wed May 27, 2015 11:02 pm

The MLCP solver is not threadsafe at the moment. The issue is that there are a number of scratch working structures that are declared "static". Open up btMLCPSolver.cpp and search for "static" and you'll see them. With multiple threads all trying to read and write to the same scratch data simultaneously, it goes off the rails. On my machine it just crashed.

It's pretty easily fixed. As a quick and dirty fix, just make all those variables non-static. That way each thread has its own copy. However, you may be incurring some extra memory allocations (that's the only reason those variables are static -- to avoid allocating and deallocating them with every call).

A slightly more involved but still simple fix is to move those variables into the class as protected members. That avoids the memory allocations, and each thread will get its own copy because each thread has its own solver instance.
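
The difference between the static scratch data and the member-variable fix can be sketched with two toy solvers. Both classes are hypothetical and stand in for the pattern only; they are not btMLCPSolver, and the scratch buffer here is just a vector of floats.

```cpp
#include <cassert>
#include <vector>

// Pattern that is NOT thread-safe: the function-local static buffer is
// shared by every instance and every thread that calls solve().
struct SharedScratchSolver
{
    const std::vector<float>* solve(int n)
    {
        assert(n >= 0);
        static std::vector<float> scratch; // single copy for the whole program
        scratch.assign(n, 0.0f);
        return &scratch;
    }
};

// Fix described above: the buffer is a protected member, so each solver
// instance (one per thread) owns its own copy and still reuses its
// allocation from call to call.
class MemberScratchSolver
{
public:
    const std::vector<float>* solve(int n)
    {
        assert(n >= 0);
        m_scratch.assign(n, 0.0f);
        return &m_scratch;
    }

protected:
    std::vector<float> m_scratch;
};
```

Two `SharedScratchSolver` instances hand back the same buffer, so concurrent calls would overwrite each other's data, whereas two `MemberScratchSolver` instances return distinct buffers with independent contents.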

I tried this and was able to get it working, however the performance of the MLCP solver was *really* bad compared to the sequential impulse solver. I had to drastically reduce the number of bodies to be able to interact with the demo at all. The MLCP solver was slower with ~400 bodies than sequential impulse is with ~11000 bodies.

Thanks for the feedback!

gdlk · Post by **gdlk** » Thu May 28, 2015 12:01 am

Thanks! I will give those tips a try when I can =D

Yep, MLCP is toooo slow, and because of that I think it could benefit more from a multithreaded approach (a collision between 5 objects plus some constraints easily takes over 50ms).

Regards!!

lunkhound · Post by **lunkhound** » Thu May 28, 2015 5:51 am

I updated my branch in github with the fix to make the MLCP solver threadsafe.

ai-music · Post by **ai-music** » Wed Jun 10, 2015 2:57 pm

Nice work! Thanks. Really good initiative!

But check CCD; it is working incorrectly. The trouble looks like this: https://code.google.com/p/bullet/issues/detail?id=356
video - http://www.youtube.com/watch?v=Q17MnAMujTI

And when I tested your code in my application (multithreading mode with kinematic characters, dynamic rigid bodies, static concave meshes) I noticed that the physics system does not run smoothly. Sometimes stepSimulation takes too long. It is not noticeable in a simple example with cubes on a static plane.

Regards.

Real-Time Physics Simulation Forum
