rayTest from multiple threads (debug vs. release performance)

B_old
Posts: 79
Joined: Tue Sep 26, 2006 1:10 pm

Post by B_old »

Hello,
I asked a similar question in the second post of my other thread, but I am opening a new thread because the other topic wasn't really appropriate any longer.
I'm simulating several particle systems in parallel and the performance usually increases with the number of threads. Since I added rayTest() to the simulation, performance no longer scales with the number of threads! What could be the reason for this?
One strange thing I noticed is that performance still scales when using the debug build of bullet. On my machine (3 cores) the simulation is actually faster with the debug build when I use 3 threads, because it really runs in parallel.
Not really sure what I should start looking for. Hints are greatly appreciated.
Thanks!
B_old
Posts: 79
Joined: Tue Sep 26, 2006 1:10 pm

Post by B_old »

Ok, I think I can give a slightly better explanation.
I have several particle systems going, with many calls to rayTest(). Let's say one system takes D ms to simulate with the debug build of bullet and R ms with the release build. Using 3 threads on my 3-core machine I can simulate 3 systems in D ms in debug, but it takes 3*R ms in release mode, because the simulation of one (1) system suddenly takes 3*R ms.
This is with bullet 2.75. With 2.74 I eventually get a crash during the internal profiling.

I see different behavior just by switching between the debug and release builds of bullet. But that doesn't guarantee that the error originates in the bullet code, right?

I would be very thankful for any kind of comment that can give me an idea about what's going on, because right now I'm at a loss. I don't really know where I should start looking.
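
One way to put numbers on the D versus 3*R observation is to time the same workload on one thread and on several. The following is a minimal sketch using only the C++ standard library; `timeParallelRuns` and the `work` callable are hypothetical names standing in for "simulate one particle system", not anything from Bullet.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs `work` once on each of `threadCount` threads and returns the
// elapsed wall-clock time in milliseconds.
// `work` is a hypothetical stand-in for simulating one particle system;
// it receives the index of the thread that runs it.
double timeParallelRuns(int threadCount, const std::function<void(int)>& work)
{
    assert(threadCount > 0);
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(threadCount));
    for (int i = 0; i < threadCount; ++i)
        threads.emplace_back(work, i);  // each thread gets its own copy of `work`
    for (std::thread& t : threads)
        t.join();

    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

If the time for 3 threads stays close to the time for 1 thread, the systems really run in parallel (the debug case described above). If it is roughly three times as long, something is serializing the threads (the release case).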
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

btCollisionWorld::rayTest might not be thread-safe / re-entrant (I've not tested/reviewed this feature).

Can you re-create a Bullet demo with the multithreaded ray cast, and zip its source?
Thanks,
Erwin
B_old
Posts: 79
Joined: Tue Sep 26, 2006 1:10 pm

Post by B_old »

Thanks for the answer!
I will try to reproduce the problem in a demo, but it might take a while.

While searching for a solution I stumbled over this http://software.intel.com/en-us/article ... n-game-ai/, where it seems to work.
I'm generally not confident enough to solely blame bullet right now, as all seems to work as expected when using the debug-build. Could it be a problem with how I am compiling/linking the library? What could potentially cause rayTest() to not be reentrant?
B_old
Posts: 79
Joined: Tue Sep 26, 2006 1:10 pm

Post by B_old »

A small update.

Choosing disableRaycastAccelerator = true solves the problem for me. The RayCastAccelerator isn't reentrant on my system because of malloc, I think.

In my test case (3 x 4096 raytests per frame) performance increased dramatically when disabling the raycast accelerator, because I can perform the tests concurrently (4096 per thread, 3 threads).

Unfortunately I have no idea how it affects performance in a more realistic scenario where I might only use a few raytests. How much does bullet use rayTest() internally?
Is there an official way to have the best of both worlds? Accelerator for internal use, no accelerator for my concurrent calls?

EDIT 1: I have to say that my test has very few collision objects, so that is probably a reason why the test only benefits from disabling the accelerator. Again, I can imagine the drawback being more severe in a more realistic setting.

EDIT 2: Seems that concurrently calling rayTest() will not always end well with compound shapes in the scene. :(
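
The 3 x 4096 split described above can be sketched as follows. This is only an illustration with standard C++ threads: `Ray`, `castRaysInParallel`, and the `castRay` callable are hypothetical stand-ins, and the sketch assumes that whatever `castRay` calls is itself safe to call from several threads, which is exactly what is in question in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Ray { float from[3]; float to[3]; };  // hypothetical stand-in, not a Bullet type

// Splits `rays` into contiguous chunks, one per thread, and stores one
// result per ray. `castRay` stands in for a single ray test and must be
// safe to call concurrently.
std::vector<int> castRaysInParallel(const std::vector<Ray>& rays,
                                    std::size_t threadCount,
                                    const std::function<int(const Ray&)>& castRay)
{
    assert(threadCount > 0);
    // int rather than bool: std::vector<bool> packs bits, so neighbouring
    // elements cannot be written safely from different threads.
    std::vector<int> results(rays.size(), 0);
    const std::size_t chunk = (rays.size() + threadCount - 1) / threadCount;

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < threadCount; ++t)
    {
        const std::size_t begin = std::min(t * chunk, rays.size());
        const std::size_t end = std::min(begin + chunk, rays.size());
        threads.emplace_back([&rays, &results, &castRay, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                results[i] = castRay(rays[i]);  // each thread writes only its own slots
        });
    }
    for (std::thread& th : threads)
        th.join();
    return results;
}
```

Because every thread writes to a disjoint range of `results`, the result storage needs no locking; any remaining contention would come from inside the ray test itself.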
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

If you implement a thread-safe allocator you might be able to make the raycast accelerator thread-safe.

///rayTest is a re-entrant ray test, and can be called in parallel as long as the btAlignedAlloc is thread-safe (uses locking etc)

Can you try to add a locking mechanism to btAlignedAlloc and see if that helps?
Thanks,
Erwin
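
For readers wondering what "adding a locking mechanism" to the allocator could look like, here is a minimal sketch: a malloc/free pair guarded by one global mutex. The names `lockedAlloc`, `lockedFree`, and `g_allocMutex` are hypothetical. Recent Bullet versions declare btAlignedAllocSetCustom(allocFunc, freeFunc) in btAlignedAllocator.h for plugging in such functions; treat its availability in 2.75 as an assumption to check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

// One global lock serializes every allocation and every free.
static std::mutex g_allocMutex;

// Same shape as a plain malloc: void*(size_t).
void* lockedAlloc(std::size_t size)
{
    assert(size > 0);
    std::lock_guard<std::mutex> guard(g_allocMutex);  // released when guard goes out of scope
    return std::malloc(size);
}

// Same shape as a plain free: void(void*). Passing a null pointer is allowed.
void lockedFree(void* block)
{
    std::lock_guard<std::mutex> guard(g_allocMutex);
    std::free(block);
}
```

Note that a global lock makes the allocator safe but not fast: threads that allocate often will queue up behind each other. Many platform malloc implementations already lock internally, in which case an extra lock adds cost without adding safety.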
B_old
Posts: 79
Joined: Tue Sep 26, 2006 1:10 pm

Post by B_old »

Erwin Coumans wrote: Can you try to add a locking mechanism to btAlignedAlloc and see if that helps?
I would think malloc, and therefore btAlignedAlloc, is already thread-safe with vc/win32? That would explain the performance issue: if every allocation takes a lock, the threads end up waiting on each other.
What really confuses me now is that I get a crash because of compound shapes, and it seems to be in stepSimulation() and not during my parallel raytests.

EDIT: I trigger btAssert (colObj->getCollisionShape()->isCompound()); sometime during stepSimulation() after the concurrent raytests. It seems to happen only with compound shapes in the scene, and the rayTestAccelerator decreases the chance of a crash, probably because the crash only happens when a collision object is hit by several rays concurrently.
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

A ray test or collision check against a btCompoundShape is not thread-safe: it temporarily overwrites some internal data.

Code:

collisionObject->internalSetTemporaryCollisionShape((btCollisionShape*)childCollisionShape);
You could try to add locking to avoid this, search for internalSetTemporaryCollisionShape in btCollisionWorld.cpp
Thanks,
Erwin
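
To make the "temporarily overwrites some internal data" point concrete, here is a small single-threaded sketch of the save / overwrite / restore pattern. `Shape`, `Object`, and `testChildWithTemporaryShape` are hypothetical stand-ins, not the Bullet classes; the sketch only shows what an observer sees while the child test is running.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins, not the Bullet classes.
struct Shape { bool compound; };
struct Object { const Shape* shape; };

// Mimics the save / overwrite / restore pattern around a child ray test.
// `testChild` stands in for the nested ray test and its result callback.
void testChildWithTemporaryShape(Object& object, const Shape& child,
                                 const std::function<void(const Object&)>& testChild)
{
    assert(object.shape != nullptr);
    const Shape* saved = object.shape;  // save the compound shape
    object.shape = &child;              // the shared object now reports the child shape
    testChild(object);                  // anyone reading object.shape here sees the child
    object.shape = saved;               // restore the compound shape
}
```

With two threads the window is worse than a brief wrong reading. If thread B saves the pointer while thread A has the child installed, B later "restores" the child pointer after A has already restored the compound one, and the object keeps the child shape for good. That would be consistent with an isCompound() assertion firing later, outside the ray tests.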
B_old
Posts: 79
Joined: Tue Sep 26, 2006 1:10 pm

Post by B_old »

Thanks for the hint.

My rayTestSingle() now looks like this:

Code:

//...
// Replace the collision shape so that the callback can determine the triangle.
// Only the outermost (non-recursive) call takes the lock.
if (!recursive) s_lock->Lock();
btCollisionShape* saveCollisionShape = collisionObject->getCollisionShape();
collisionObject->internalSetTemporaryCollisionShape((btCollisionShape*)childCollisionShape);

rayTestSingle(rayFromTrans, rayToTrans,
              collisionObject,
              childCollisionShape,
              childWorldTrans,
              resultCallback,
              true);  // recursive call: the lock is already held

// Restore the original shape.
collisionObject->internalSetTemporaryCollisionShape(saveCollisionShape);
if (!recursive) s_lock->Unlock();
//...
That certainly gets rid of any crash, but it does end up in some kind of deadlock after a while, basically crawling to a halt.

Any idea what that could be about?

EDIT 1: The deadlock problem seems to be lessened by disabling the rayTestAccelerator, but generally the lock is quite bad for performance.
Not really sure yet how I will handle this. I'm reluctant to give up parallel processing of my particle systems altogether. I guess I'll make a thread-safe wrapper around rayTest() on the application level, which I'll use from inside the particle systems.
Do you see any change coming in that area? Maybe I'm missing a detail that explains why this will never be a good idea. :)
Thanks for the feedback so far!
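
An application-level wrapper of the kind mentioned above could look roughly like this. It is a minimal sketch: `SerializedRayTester` and its `RayTestFn` signature are hypothetical, and the wrapped callable stands in for the real call into the physics world.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Serializes calls to a ray-test function that is not safe to call
// concurrently. The wrapped callable is a hypothetical stand-in for the
// real call into the physics world.
class SerializedRayTester
{
public:
    using RayTestFn = std::function<bool(int rayId)>;

    explicit SerializedRayTester(RayTestFn fn) : m_fn(std::move(fn))
    {
        assert(m_fn != nullptr);
    }

    // Only one thread at a time gets past the lock. The guard unlocks on
    // every exit path, including exceptions thrown by the wrapped call.
    bool rayTest(int rayId)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        return m_fn(rayId);
    }

private:
    std::mutex m_mutex;
    RayTestFn m_fn;
};
```

This trades parallelism for safety: all ray tests run one at a time, so it only pays off when ray tests are a small part of each particle system's update.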

EDIT 2: In my case the easiest and fastest solution turns out to be creating a temporary collision object on the stack. No need for locking and at least 2x as fast in my test.

Code:

// Work on a copy so that the shared collision object is never modified.
btCollisionObject stackObject = *collisionObject;

const btCompoundShape* compoundShape = static_cast<const btCompoundShape*>(collisionShape);
for (int i = 0; i < compoundShape->getNumChildShapes(); i++)
{
    btTransform childTrans = compoundShape->getChildTransform(i);
    const btCollisionShape* childCollisionShape = compoundShape->getChildShape(i);
    btTransform childWorldTrans = colObjWorldTransform * childTrans;

    stackObject.internalSetTemporaryCollisionShape((btCollisionShape*)childCollisionShape);

    rayTestSingle(rayFromTrans, rayToTrans,
                  &stackObject,
                  childCollisionShape,
                  childWorldTrans,
                  resultCallback);
}
I still think I'll end up with what I described above, though.
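
The reason the stack-copy approach needs no lock can be shown with stand-in types. In this sketch `Shape`, `Object`, and `testChildrenOnCopy` are hypothetical, and `testChild` stands in for rayTestSingle plus the result callback.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins, not the Bullet classes.
struct Shape { int id; };
struct Object { const Shape* shape; };

// Tests every child against a private copy of the object, so the shared
// object is only read, never written. Returns the number of children for
// which `testChild` reported a hit.
template <typename TestChild>
int testChildrenOnCopy(const Object& shared, const std::vector<Shape>& children,
                       TestChild testChild)
{
    assert(shared.shape != nullptr);
    Object local = shared;     // per-call copy on the calling thread's stack
    int hits = 0;
    for (const Shape& child : children)
    {
        local.shape = &child;  // only the private copy is modified
        if (testChild(local))
            ++hits;
    }
    return hits;
}
```

The catch is that the callback receives the address of the temporary copy. Anything that stores that pointer, or compares it with the address of the original object, will misbehave once the call returns.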

EDIT 3: I can only assume that internalSetTemporaryCollisionShape is needed for the result callback. Maybe it would help if object and shape were separated there as well, like in rayTestSingle. Or am I missing something?
B_old
Posts: 79
Joined: Tue Sep 26, 2006 1:10 pm

Post by B_old »

Just a little update, in case anybody stumbles over a similar problem.

Creating an object on the stack and passing its address maybe isn't the smartest idea, but I didn't really understand what is done with it later anyway. For now I just ignore (comment out) internalSetTemporaryCollisionShape() etc. and it seems to work perfectly.
Any idea what I could be breaking?

Is rayTest() used by the simulation for anything?

Regarding the performance of the rayTestAccelerator, I conclude that it starts to pay off even with a relatively small number of collision objects and several threads.