CPU multithreading is working!

kingchurch
Posts: 28
Joined: Sun May 13, 2012 7:14 am

Re: CPU multithreading is working!

Post by kingchurch »

Awesome work! Will the multithreading optimization work on iOS/NEON processors?
lunkhound
Posts: 99
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound »

ai-music wrote: Nice work! Thanks. Really good initiative!

But please check CCD; it is working incorrectly. The problem looks like this: https://code.google.com/p/bullet/issues/detail?id=356
Video: http://www.youtube.com/watch?v=Q17MnAMujTI

When I tested your code in my application (multithreading mode with kinematic characters, dynamic rigid bodies, and static concave meshes), I noticed that the physics simulation does not run smoothly. Sometimes the simulation step takes too long. This is not noticeable in a simple example with cubes on a static plane.

Regards.
It might help to know some additional details about your setup.
Which constraint solver are you using? Was it the MLCP one?
If it is the MLCP solver, can you try it with the sequential impulse solver and see if the problem persists?

Which task scheduler are you using?
Another thing to try is reducing the number of threads. Some CPUs will report 2 hardware threads per core (hyperthreading) which may lead to too many threads. In that case you may get better performance with just one thread per core.
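
The one-thread-per-core suggestion can be sketched as follows. This is a minimal illustration, not part of the patch: `workerCountForCores` and `suggestedWorkerCount` are hypothetical helpers, and the threads-per-core value is an assumption supplied by the caller, since standard C++ reports only the logical thread count. It also assumes a C++11 compiler with `<thread>`, which older toolchains such as MSVC 2010 may lack.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: estimate a worker count of one thread per physical core.
// threadsPerCore is an assumption made by the caller (2 on typical
// hyperthreaded CPUs); the standard library cannot query it.
inline unsigned workerCountForCores(unsigned logicalThreads, unsigned threadsPerCore)
{
    assert(threadsPerCore > 0);
    // hardware_concurrency() may return 0 when the value is unknown; fall back to 1.
    if (logicalThreads == 0)
        return 1u;
    return (std::max)(1u, logicalThreads / threadsPerCore);
}

// Same estimate, using the logical thread count reported by the runtime.
inline unsigned suggestedWorkerCount(unsigned threadsPerCore)
{
    return workerCountForCores(std::thread::hardware_concurrency(), threadsPerCore);
}
```

For example, on a CPU that reports 8 logical threads with 2 per core, `workerCountForCores(8, 2)` yields 4. That value could then be handed to the scheduler (for OpenMP, via `omp_set_num_threads`).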
lunkhound
Posts: 99
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound »

kingchurch wrote: Awesome work! Will the multithreading optimization work on iOS/NEON processors?
In theory it should work on iOS (assuming TBB or OpenMP is available). However, I don't know if anyone has built it for iOS. Some CMake tweaks might be needed.

If you try it on iOS, please report the results back here.
ai-music
Posts: 8
Joined: Wed Jun 10, 2015 2:41 pm

Re: CPU multithreading is working!

Post by ai-music »

lunkhound wrote: Which constraint solver are you using? Was it the MLCP one?
If it is the MLCP solver, can you try it with the sequential impulse solver and see if the problem persists?
My setup is the same as your sample (MultiThreadingDemo): sequential impulse solver, etc. CCD also works incorrectly without multithreading (but a clean bullet-3-master (2.83) works correctly). You may be able to see this bug if you shoot a CCD rigid body, such as a bullet, at the example boxes. Maybe createPredictiveContact() or the functions near it do not work properly.
lunkhound wrote: Which task scheduler are you using?
OpenMP for MSVC 2010.
lunkhound wrote: Another thing to try is reducing the number of threads. Some CPUs will report 2 hardware threads per core (hyperthreading) which may lead to too many threads. In that case you may get better performance with just one thread per core.
The test processors are an AMD FX quad-core and an AMD Athlon 64 dual-core. I tried both reducing and increasing the number of threads, but I got the same result on a complex physics scene (concave static, convex kinematic, and dynamic rigid bodies)...

UPDATE: I will try to fix the OpenMP version for MSVC 2010 like this: http://stackoverflow.com/questions/4738 ... 8-and-2010

Regards.
lunkhound
Posts: 99
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound »

ai-music wrote: My setup is the same as your sample (MultiThreadingDemo): sequential impulse solver, etc. CCD also works incorrectly without multithreading (but a clean bullet-3-master (2.83) works correctly). You may be able to see this bug if you shoot a CCD rigid body, such as a bullet, at the example boxes. Maybe createPredictiveContact() or the functions near it do not work properly.
Can you post a snippet of code showing how you create the CCD rigid body?
ai-music
Posts: 8
Joined: Wed Jun 10, 2015 2:41 pm

Re: CPU multithreading is working!

Post by ai-music »

lunkhound wrote: Can you post a snippet of code showing how you create the CCD rigid body?

Code: Select all

// CCD example
// CCD is enabled by default (world->getDispatchInfo().m_useContinuous == true).
// When creating the dynamic rigid body for the bullet (sphere shape, radius 1.f, mass 1.f), use this:
body->setCcdSweptSphereRadius(0.5f); // max 1.f
body->setCcdMotionThreshold(1.f);
// To shoot it, use this:
btVector3 dir(0.f, 0.f, 1.f); // any direction
dir *= 250.f;                 // any scale factor
body->applyCentralImpulse(dir);
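
To see why CCD matters for this test, it helps to work out how far the sphere travels per step. With a mass of 1, an impulse of 250 gives a speed of 250 m/s. Assuming a 1/60 s time step (an assumption; the thread does not state the step size), that is about 4.17 m per step, more than the sphere's diameter of 2. The sketch below is a simplified stand-alone model of the threshold comparison with hypothetical helper names; it does not call Bullet and is not Bullet's actual code.

```cpp
#include <cassert>

// Distance a body moving at constant speed covers in one simulation step.
inline float motionPerStep(float speed, float timeStep)
{
    assert(timeStep > 0.f);
    return speed * timeStep;
}

// Simplified model of the CCD trigger: continuous detection is considered only
// when a non-zero threshold is set and one step's motion exceeds it.
inline bool exceedsCcdThreshold(float speed, float timeStep, float ccdMotionThreshold)
{
    if (ccdMotionThreshold <= 0.f)
        return false;  // a threshold of 0 leaves CCD off for the body
    const float motion = motionPerStep(speed, timeStep);
    // Compare squared values, which avoids a square root in a real engine.
    return motion * motion > ccdMotionThreshold * ccdMotionThreshold;
}
```

Under these assumptions the bullet at 250 m/s exceeds the threshold of 1, while a body moving at 30 m/s (0.5 m per step) does not.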
lunkhound
Posts: 99
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound »

I tried to reproduce the Ccd problem you mentioned. Here is what I did.

Go to the file bullet3/examples/Benchmarks/BenchmarkDemo.cpp, line 116 (right after the resetCamera() method of the BenchmarkDemo class). Add the following method:

Code: Select all

    virtual bool keyboardCallback( int key, int state )
    {
        bool handled = false;
        if ( state )
        {
            if ( key == 'n' )
            {
                // Ccd ball
                btTransform sphereTrans;
                sphereTrans.setIdentity();
                sphereTrans.setOrigin( btVector3( -20.f, 200.f, -20.f ) );
                btSphereShape* ball = new btSphereShape( 1.f );
                m_collisionShapes.push_back( ball );
                btRigidBody* ballBody = createRigidBody( 1.f, sphereTrans, ball );
                ballBody->setCcdMotionThreshold( 1.f );
                ballBody->setCcdSweptSphereRadius( 0.5f );
                ballBody->setLinearVelocity( btVector3( 0.f, -250.f, 0.f ) );
                m_guiHelper->createCollisionShapeGraphicsObject( ball );
                m_guiHelper->createCollisionObjectGraphicsObject( ballBody, btVector3( 1.f, 1.f, 0.f ) );
                handled = true;
            }
        }

        return handled;
    }
Compile in Release (benchmark demos won't appear otherwise). When you run any of the benchmark demos in the example browser, pressing the 'n' key will launch a sphere downwards at 250 meters per second. I tried it with "1000 stack" (adjusted the start location to impact the pyramid of boxes), as well as the "Convex stack", "prim vs mesh", and "convex vs mesh".

I applied this change on top of my patch as well as before my patch. I couldn't see any differences in behavior, and Ccd appeared to be working fine.

I also pasted that code (with slight modifications) into the MultithreadedDemo, and it seemed to work there as well.
If you can reproduce the problem in the example browser, I'll take another look.
ai-music
Posts: 8
Joined: Wed Jun 10, 2015 2:41 pm

Re: CPU multithreading is working!

Post by ai-music »

lunkhound wrote: If you can reproduce the problem in the example browser, I'll take another look.
Try this code:
MultiThreadedDemo.cpp
(press and hold the 'n' key for a while) and you will see the same effect as in this video: https://youtu.be/A8SPOrGukcw
lunkhound
Posts: 99
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound »

ai-music wrote:
lunkhound wrote: If you can reproduce the problem in the example browser, I'll take another look.
Try this code:
MultiThreadedDemo.cpp
(press and hold the 'n' key for a while) and you will see the same effect as in this video: https://youtu.be/A8SPOrGukcw
Pretty sure I found the bug. I updated my repo with the fix.

Thanks for your help in tracking that down!
ai-music
Posts: 8
Joined: Wed Jun 10, 2015 2:41 pm

Re: CPU multithreading is working!

Post by ai-music »

Thanks for the answers. I'll try to find out why OpenMP breaks with MSVC 2010.
lunkhound
Posts: 99
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound »

ai-music wrote: Thanks for the answers. I'll try to find out why OpenMP breaks with MSVC 2010.
If the multithreadedDemo is crashing when you launch spheres into the scene, it may be running out of persistent manifolds.

In that case, remove this line:

Code: Select all

    m_dispatcher->setDispatcherFlags( btCollisionDispatcher::CD_DISABLE_CONTACTPOOL_DYNAMIC_ALLOCATION );
That demo is pretty close to the limit of 32767 manifolds at the outset. When you start shooting spheres in, it can very easily hit that limit. When that happens and that flag is set, Bullet will crash with a null pointer dereference.
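
The failure mode described above can be modeled in a few lines. This is a toy sketch under stated assumptions: `FixedPool`, `AllocResult`, and `allocateManifold` are hypothetical names, and the logic only mirrors the behavior described in this post (pool first, heap fallback unless the flag disables it). It is not Bullet's allocator.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a fixed-capacity pool; not Bullet's allocator.
struct FixedPool
{
    std::size_t capacity;
    std::size_t used;

    // Returns true while a slot is free, false once the pool is exhausted.
    bool tryReserve()
    {
        assert(used <= capacity);
        if (used >= capacity)
            return false;
        ++used;
        return true;
    }
};

enum AllocResult { FromPool, FromHeap, Failed };

// Hypothetical model of the choice made when a new manifold is requested.
inline AllocResult allocateManifold(FixedPool& pool, bool dynamicAllocationDisabled)
{
    if (pool.tryReserve())
        return FromPool;
    // Pool exhausted: fall back to the heap unless the flag forbids it.
    if (!dynamicAllocationDisabled)
        return FromHeap;
    return Failed;  // the caller would receive a null manifold here
}
```

With the flag set, the first request past the pool's capacity yields `Failed`, which corresponds to the null pointer the caller later dereferences. With the flag cleared, the same request falls back to the heap instead.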

Try this version (use 'y' key to launch spheres):
MultiThreadedDemo.cpp
ai-music
Posts: 8
Joined: Wed Jun 10, 2015 2:41 pm

Re: CPU multithreading is working!

Post by ai-music »

The same result: https://youtu.be/faK1yXDL6fM
I think the bug is somewhere in the functions called from internalSimulationStep()... I need to compare with a clean 2.83.

UPDATE: for better OpenMP performance with all versions of MSVC, set this environment variable:

Code: Select all

#ifdef _WIN32
// Needs <stdlib.h>; set it as early as possible during startup.
_putenv_s("OMP_WAIT_POLICY", "PASSIVE");
#endif
ref.: http://stackoverflow.com/questions/2074 ... controlled
lunkhound
Posts: 99
Joined: Thu Nov 21, 2013 8:57 pm

Re: CPU multithreading is working!

Post by lunkhound »

ai-music wrote:The same result: https://youtu.be/faK1yXDL6fM
I think the bug is somewhere in the functions called from internalSimulationStep()... I need to compare with a clean 2.83.
Did you apply the bugfix? That video looks exactly like the bug that I fixed.
ai-music
Posts: 8
Joined: Wed Jun 10, 2015 2:41 pm

Re: CPU multithreading is working!

Post by ai-music »

lunkhound wrote: Did you apply the bugfix? That video looks exactly like the bug that I fixed.
Oh, thanks. It works. I'll test multithreaded mode and let you know if anything fails.
ai-music
Posts: 8
Joined: Wed Jun 10, 2015 2:41 pm

Re: CPU multithreading is working!

Post by ai-music »

Finally, after testing, I changed the scheduler to PPL (with MSVC 2010, OpenMP does not run smoothly).
But for MSVC 2010 the PPL code in ParallelFor.h needs to be changed (because the partitioner argument is not supported by parallel_for), like this:

Code: Select all

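#if USE_PPL
#include <algorithm>  // std::min

// Adapter for the stepped parallel_for overload: each call handles one chunk
// of up to i_grain iterations, starting at index i.
template <class TBody>
struct PplBodyAdapter
{
    int i_grain;          // chunk size (the loop step)
    int i_end;            // one past the last index
    const TBody* mBody;

    void operator()( int i ) const
    {
        // Clamp so the last chunk does not run past i_end.
        mBody->forLoop( i, (std::min)( i + i_grain, i_end ) );
    }
};
#endif // #if USE_PPL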

Code: Select all

//ParallelFor function
#if USE_PPL
    if ( gTaskApi == apiPpl )
    {
        // PPL dispatch
        PplBodyAdapter<TBody> pplBody;
        pplBody.mBody = &body;
        pplBody.i_grain = grainSize;
        pplBody.i_end = iEnd;
        Concurrency::parallel_for( iBegin,
                                   iEnd,
                                   grainSize,
                                   pplBody );
        return;
    }
#endif //#if USE_PPL
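
The chunking that this step-based loop relies on can be checked without PPL. The sketch below is a serial stand-in, assuming the stepped loop visits iBegin, iBegin + grainSize, and so on; `chunkRanges` is a hypothetical helper, not part of ParallelFor.h. It records the half-open range that each visit would pass to forLoop().

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Serial stand-in for a stepped parallel loop: visits iBegin, iBegin + grainSize, ...
// and records the half-open [start, end) range each visit would hand to forLoop().
inline std::vector<std::pair<int, int> > chunkRanges(int iBegin, int iEnd, int grainSize)
{
    assert(grainSize > 0);
    std::vector<std::pair<int, int> > ranges;
    for (int i = iBegin; i < iEnd; i += grainSize)
    {
        // Clamp the last chunk so it never runs past iEnd.
        ranges.push_back(std::make_pair(i, (std::min)(i + grainSize, iEnd)));
    }
    return ranges;
}
```

With iBegin = 0, iEnd = 100, and grainSize = 40, this yields [0, 40), [40, 80), and [80, 100), so every index is covered exactly once and the final chunk is shortened by the clamp.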
Also, when only one of the schedulers (TBB or PPL) is enabled, initTaskScheduler() leaves the API set to apiNone. A small fix:

Code: Select all

static void initTaskScheduler()
{
#ifdef USE_PPL
    setTaskApi( apiPpl );
#endif
#ifdef USE_TBB
    setTaskApi( apiTbb );
#endif
#ifdef USE_OPENMP
    setTaskApi( apiOpenMP );
#endif
}
UPDATE: a fix for Debug mode (an exception occurs when the scene is empty):

Code: Select all

    virtual void dispatchAllCollisionPairs( btOverlappingPairCache* pairCache, const btDispatcherInfo& info, btDispatcher* dispatcher ) BT_OVERRIDE
    {
        int grainSize = 40;  // iterations per task
        int pairCount = pairCache->getNumOverlappingPairs();
        if ( pairCount > 0 )  // ADDED: skip the dispatch when there are no overlapping pairs
        {
            Updater updater;
            updater.mCallback = getNearCallback();
            // Without the guard above, the exception (null pointer access) was raised here.
            updater.mPairArray = pairCache->getOverlappingPairArrayPtr();
            updater.mDispatcher = this;
            updater.mInfo = &info;

            btPushThreadsAreRunning();
            parallelFor( 0, pairCount, grainSize, updater );
            btPopThreadsAreRunning();
        }
    }
And now my game engine runs even faster. Thanks a lot for your work! :)

PS: Erwin should know about this and maybe add MT support officially.