Nasty bug in resolveSplitPenetrationSIMD

Laurent Coulon · Post by **Laurent Coulon** » Tue Oct 11, 2011 10:43 pm

We've just found a really bad bug in resolveSplitPenetrationSIMD, since the SIMD version is enabled by default we thought we'd share it right away.
This line:

Code: Select all

        c.m_appliedImpulse = _mm_or_ps( _mm_and_ps(resultLowerLess, lowerLimit1), _mm_andnot_ps(resultLowerLess, sum) );

should actually be:

Code: Select all

        c.m_appliedPushImpulse= _mm_or_ps( _mm_and_ps(resultLowerLess, lowerLimit1), _mm_andnot_ps(resultLowerLess, sum) );

The bug leads to quite a number of bad behaviors: Bogus rotations, warmstarting getting messed up and of course push impulses not really pushing objects out which kinds of defeats the purpose of solving for penetrations.

Incidentally this is not the only bug in the SIMD version of the constraint solver, the SIMD constraint algorithms are broken for all constraint types, not just push impulses. The code accumulates rotational impulses in static objects as well as dynamics, thus messing up the deltaImpulse computation for the next iteration. Since we found that the SIMD implementation runs actually slower that the C++ version we did not bother writing a fix though.

Erwin Coumans · Post by **Erwin Coumans** » Wed Oct 12, 2011 8:22 pm

Thanks for the report, it is fixed in latest trunk. Split impulse isn't enabled by default and not well tested. We should add some unit tests.

SIMD constraint algorithms are broken for all constraint types [...] The code accumulates rotational impulses in static objects as well as dynamics

Static objects with mass zero should not accumulate any impulse, as c.m_angularComponentA should be the zero vector.

It is not clear what the issue is exactly. Can you explain more in detail, or provide a reproduction case?

Since we found that the SIMD implementation runs actually slower that the C++ version

I'm surprised you don't see any speedup, but slowdown. Our benchmark shows good speedup for the SIMD version, but it depends on the usage of course.

What compiler/platform are you using?
Thanks,
Erwin

Laurent Coulon · Post by **Laurent Coulon** » Thu Oct 13, 2011 6:13 pm

Static objects with mass zero should not accumulate any impulse, as c.m_angularComponentA should be the zero vector.

That should be correct. What we observed was that for an identical input, resolveSingleConstraintRowLowerLimit did not yield the same result when going through the SIMD version. When looking over the code the only difference we could notice was that this line of the C++ code

Code: Select all

if (m_inverseMass)

Code: Select all

	SIMD_FORCE_INLINE void internalApplyImpulse(const btVector3& linearComponent, const btVector3& angularComponent,const btScalar impulseMagnitude)
	{
		if (m_inverseMass)
		{
			m_deltaLinearVelocity += linearComponent * impulseMagnitude;
			m_deltaAngularVelocity += angularComponent * impulseMagnitude;
		}
	}

was not present in the SIMD version. So we assumed the static object in the pair was getting a non-zero deltaAngularVelocity which affected the next iteration's deltaVel1Dotn.
We could very well have been wrong about this deduction, like I said we did not explore the SIMD version much since we found it slower than the C++ one.

I'm surprised you don't see any speedup, but slowdown. Our benchmark shows good speedup for the SIMD version, but it depends on the usage of course.

What compiler/platform are you using?

We are compiling on Dev Studio 2008 and we tested on a WinXP AMD PC. I'm not sure how the SIMD version is expected to yield a speed up since it is used strictly to do scalar math with a single value in the vector. Essentially you get all the overhead of setting up the SIMD computation and none of the benefits of being able to do math on 4 values at a time.
On that topic, it seems that SIMD would be a good candidate to have the solver solve the constraints by block (one block per contact manifold).

Dirk Gregorius · Post by **Dirk Gregorius** » Thu Oct 13, 2011 10:00 pm

On that topic, it seems that SIMD would be a good candidate to have the solver solve the constraints by block (one block per contact manifold).

Well, that would move the solver into the direction of a Jacobi like stepper then. It might be a good academic project to test a parallized and SIMD Jacobi style solver and compare it to the Gauss-Seidel version. The convergence of Jacobi is worse than GS so it will need significantly more iterations. I wonder if a highly optimized Jacobi solver can catch to Gauss-Seidel at some point. Next we could maybe interleave Jacobi with Gaus-Seidel iterations. An example would be to solve the four non-penetration constraints in a contact with SIMD as Laurent points out. This might gain some speed-up, but will also reduce the quality. As a nice side effect we would not get artificial rotations for a falling box which lands exactly on a face.

Laurent Coulon · Post by **Laurent Coulon** » Thu Oct 13, 2011 11:38 pm

What I had in mind is not really a Jacobi solver. It would still be a relaxation system but solve each manifold in a direct manner at each iteration (I believe Erin Catto implemented something similar in Box2D a while ago). If anything I would expect the convergence to be faster.
The stable stacking case was definitely a big motivation for the suggestion.

We have been experimenting with some changes in the order of the simulation steps in Bullet to improve resting objects stability. We have obtained some very positive results so far. Most significantly, we were able to get rid of the lifetime variable in the contact points (a major cause of jitter in Bullet). I will be out of the country for the next two weeks but I will be happy to share some of our results after I come back.

Dirk Gregorius · Post by **Dirk Gregorius** » Fri Oct 14, 2011 12:20 am

Box2D solves two non-penetration constraints as a block. Actually I implemented this together with Erin in Box2D. Solving four non-penetration constraints using Direct Enumeration is quite expensive even with SIMD. Of course you would get better quality, but it will be significantly slower. We could solve diagonal contact points as a block which yields two 2x2 LCP as a compromise. This might add some stability since you remove some artificial torques. I implemeted this some years ago and I didn't get much from with full manifolds. I can imagine that this adds some stability to incremental manifolds though.

peterk · Post by **peterk** » Fri Nov 11, 2011 2:21 pm

Just to add another data point, on Linux/gcc I'm seeing a decent performance boost when enabling SIMD on contact heavy sims.

Also, I've achieved another decent improvement by replacing the _vmathVfDot3 method with the SSE4 _mm_dp_ps and also replacing the following construct

_mm_or_ps( _mm_and_ps( mask, a ), _mm_andnotps( mask, b ) )

with the SSE4 _mm_blendv_ps

I've made the above into a SIMD_SELECT macro and have a define to enable the SSE4 variants. I've also written an SSE4 version of btConvexShape.cpp::convexHullSupport which gives a good speed up for shapes with a larger number of points.

Are these worth submitting a patch for?

Real-Time Physics Simulation Forum

Nasty bug in resolveSplitPenetrationSIMD

Nasty bug in resolveSplitPenetrationSIMD

Re: Nasty bug in resolveSplitPenetrationSIMD

Re: Nasty bug in resolveSplitPenetrationSIMD

Re: Nasty bug in resolveSplitPenetrationSIMD

Re: Nasty bug in resolveSplitPenetrationSIMD

Re: Nasty bug in resolveSplitPenetrationSIMD

Re: Nasty bug in resolveSplitPenetrationSIMD