- Erwin Coumans
- Site Admin
- Posts: 4221
- Joined: Sun Jun 26, 2005 6:43 pm
- Location: California, USA
- Contact:
Bullet 2.73 has been updated to Bullet 2.73 SP1.
It contains recent bug fixes and a new optimized x86 SIMD SSE inner loop for the constraint solver.
Please download the latest version from Google Code:
http://code.google.com/p/bullet/downloads/list
Enjoy,
Erwin
- Dragonlord
- Posts: 198
- Joined: Mon Sep 04, 2006 5:31 pm
- Location: Switzerland
- Contact:
Re: Bullet 2.73 has been updated to Bullet 2.73 SP1.
Do you really have to make this optimization ( SSE ) on your own? Should GNU/GCC not do this already for you when possible?
- Erwin Coumans
Re: Bullet 2.73 has been updated to Bullet 2.73 SP1.
Dragonlord wrote: Do you really have to make this optimization ( SSE ) on your own? Should GNU/GCC not do this already for you when possible?
Can you try it out?
I'm curious to see the results, but I doubt GCC removes the branches automatically and performs auto-vectorization as well as manual work. Visual Studio 2008 doesn't perform auto-vectorization and the code has several branches. Is there some MSVC setting that needs to be enabled for auto-vectorization?
Manual SSE vectorization and removal of the branches is a big performance improvement of around 40%. Given that the constraint solver typically takes 45% of the total time, this can amount to almost 20% (0.40 x 0.45 = 18%) of overall physics performance. Currently, only contact and friction constraints and the generic 6dof constraint use this SSE code, but my colleague Roman is helping to make sure all constraint types use it.
The manually optimized SSE version is totally branchless and doesn't use the FPU; see the assembly code below (x87 FPU version on the left, SIMD/SSE version on the right).
Thanks,
Erwin
Code:
00468005 fld dword ptr [ecx+94h] 0041B4B8 fld dword ptr [ecx+60h]
0046800B fld dword ptr [ecx+98h] 0041B4BB fstp dword ptr [esp+1Ch]
00468011 fmul dword ptr [ecx+60h] 0041B4BF movss xmm0,dword ptr [esp+1Ch]
00468014 fsubp st(1),st 0041B4C5 movaps xmm2,xmmword ptr [edx+10h]
00468016 fstp dword ptr [esp+18h] 0041B4C9 movaps xmm1,xmmword ptr [ecx]
0046801A fld dword ptr [esp+18h] 0041B4CC movaps xmm3,xmmword ptr [ecx+10h]
0046801E fld dword ptr [edx+14h] 0041B4D0 movaps xmm4,xmmword ptr [edx]
00468021 fmul dword ptr [ecx+4] 0041B4D3 movaps xmm6,xmmword ptr [eax+10h]
00468024 fld dword ptr [ecx] 0041B4D7 mulps xmm1,xmm2
00468026 fmul dword ptr [edx+10h] 0041B4DA movaps xmm2,xmm3
00468029 faddp st(1),st 0041B4DD mulps xmm2,xmm4
0046802B fld dword ptr [edx+18h] 0041B4E0 movaps xmm4,xmmword ptr [eax]
0046802E fmul dword ptr [ecx+8] 0041B4E3 mulps xmm3,xmm4
00468031 faddp st(1),st 0041B4E6 movaps xmm4,xmmword ptr [ecx+20h]
00468033 fstp dword ptr [esp+18h] 0041B4EA mulps xmm4,xmm6
00468037 fld dword ptr [esp+18h] 0041B4ED movss xmm6,dword ptr [ecx+80h]
0046803B fld dword ptr [ecx+14h] 0041B4F5 movaps xmm7,xmm2
0046803E fmul dword ptr [edx+4] 0041B4F8 shufps xmm7,xmm2,0AAh
00468041 fld dword ptr [ecx+10h] 0041B4FC shufps xmm6,xmm6,0
00468044 fmul dword ptr [edx] 0041B500 movaps xmmword ptr [esp+180h],xmm6
00468046 faddp st(1),st 0041B508 movaps xmm6,xmm2
00468048 fld dword ptr [ecx+18h] 0041B50B shufps xmm6,xmm2,55h
0046804B fmul dword ptr [edx+8] 0041B50F shufps xmm2,xmm2,0
0046804E faddp st(1),st 0041B513 addps xmm7,xmm6
00468050 fstp dword ptr [esp+18h] 0041B516 addps xmm7,xmm2
00468054 fadd dword ptr [esp+18h] 0041B519 movss xmm5,dword ptr [ecx+0A0h]
00468058 fstp dword ptr [esp+18h] 0041B521 movaps xmm2,xmm1
0046805C fld dword ptr [esp+18h] 0041B524 shufps xmm2,xmm1,0AAh
00468060 fmul dword ptr [ecx+80h] 0041B528 movaps xmm6,xmm1
00468066 fsubp st(1),st 0041B52B shufps xmm6,xmm1,55h
00468068 fstp dword ptr [esp+18h] 0041B52F addps xmm2,xmm6
0046806C fld dword ptr [esp+18h] 0041B532 shufps xmm1,xmm1,0
00468070 fld dword ptr [edi+14h] 0041B536 addps xmm2,xmm1
00468073 fmul dword ptr [ecx+24h] 0041B539 movss xmm1,dword ptr [ecx+94h]
00468076 fld dword ptr [edi+10h] 0041B541 addps xmm7,xmm2
00468079 fmul dword ptr [ecx+20h] 0041B544 mulps xmm7,xmmword ptr [esp+180h]
0046807C faddp st(1),st 0041B54C movss xmm2,dword ptr [ecx+98h]
0046807E fld dword ptr [edi+18h] 0041B554 shufps xmm2,xmm2,0
00468081 fmul dword ptr [ecx+28h] 0041B558 shufps xmm0,xmm0,0
00468084 faddp st(1),st 0041B55C movaps xmmword ptr [esp+30h],xmm0
00468086 fstp dword ptr [esp+18h] 0041B561 movaps xmm6,xmmword ptr [esp+30h]
0046808A fld dword ptr [esp+18h] 0041B566 movss xmm0,dword ptr [ecx+9Ch]
0046808E fld dword ptr [ecx+14h] 0041B56E mulps xmm2,xmm6
00468091 fmul dword ptr [edi+4] 0041B571 shufps xmm1,xmm1,0
00468094 fld dword ptr [edi] 0041B575 subps xmm1,xmm2
00468096 fmul dword ptr [ecx+10h] 0041B578 subps xmm1,xmm7
00468099 faddp st(1),st 0041B57B movaps xmm2,xmm4
0046809B fld dword ptr [ecx+18h] 0041B57E shufps xmm2,xmm4,0AAh
0046809E fmul dword ptr [edi+8] 0041B582 movaps xmm7,xmm4
004680A1 faddp st(1),st 0041B585 shufps xmm7,xmm4,55h
004680A3 fstp dword ptr [esp+18h] 0041B589 addps xmm2,xmm7
004680A7 fsub dword ptr [esp+18h] 0041B58C shufps xmm4,xmm4,0
004680AB fstp dword ptr [esp+18h] 0041B590 addps xmm2,xmm4
004680AF fld dword ptr [esp+18h] 0041B593 movaps xmm4,xmm3
004680B3 fmul dword ptr [ecx+80h] 0041B596 shufps xmm4,xmm3,0AAh
004680B9 fsubp st(1),st 0041B59A movaps xmm7,xmm3
004680BB fstp dword ptr [esp+20h] 0041B59D shufps xmm7,xmm3,55h
004680BF fld dword ptr [esp+20h] 0041B5A1 shufps xmm3,xmm3,0
004680C3 fld st(0) 0041B5A5 addps xmm4,xmm7
004680C5 fadd dword ptr [ecx+60h] 0041B5A8 addps xmm4,xmm3
004680C8 fstp dword ptr [esp+18h] 0041B5AB subps xmm2,xmm4
004680CC fld dword ptr [esp+18h] 0041B5AE mulps xmm2,xmmword ptr [esp+180h]
004680D0 fld dword ptr [ecx+9Ch] 0041B5B6 subps xmm1,xmm2
004680D6 fcomp st(1) 0041B5B9 movaps xmm7,xmm1
004680D8 fnstsw ax 0041B5BC movaps xmm3,xmm7
004680DA test ah,41h 0041B5BF shufps xmm0,xmm0,0
004680DD jne 004680FE 0041B5C3 shufps xmm5,xmm5,0
004680DF movss xmm0,dword ptr [ecx+9Ch] 0041B5C7 addps xmm3,xmm6
004680E7 fstp st(1) 0041B5CA movaps xmm1,xmm3
004680E9 fstp st(0) 0041B5CD cmpltps xmm1,xmm0
004680EB fld dword ptr [ecx+9Ch] 0041B5D1 movaps xmm4,xmm1
004680F1 fsub dword ptr [ecx+60h] 0041B5D4 andnps xmm4,xmm3
004680F4 fstp dword ptr [esp+20h] 0041B5D7 movaps xmm2,xmm3
004680F8 fld dword ptr [esp+20h] 0041B5DA movaps xmm3,xmm1
004680FC jmp 00468130 0041B5DD andps xmm3,xmm0
004680FE fld dword ptr [ecx+0A0h] 0041B5E0 orps xmm4,xmm3
00468104 fcompp 0041B5E3 movups xmmword ptr [ecx+60h],xmm4
00468106 fnstsw ax 0041B5E7 movaps xmm4,xmmword ptr [ecx+60h]
00468108 test ah,5 0041B5EB cmpltps xmm2,xmm5
0046810B jp 0046812A 0041B5EF movaps xmm3,xmm2
0046810D movss xmm0,dword ptr [ecx+0A0h] 0041B5F2 andps xmm3,xmm4
00468115 fstp st(0) 0041B5F5 movaps xmm4,xmm2
00468117 fld dword ptr [ecx+0A0h] 0041B5F8 subps xmm0,xmm6
0046811D fsub dword ptr [ecx+60h] 0041B5FB andps xmm0,xmm1
00468120 fstp dword ptr [esp+20h] 0041B5FE andnps xmm1,xmm7
00468124 fld dword ptr [esp+20h] 0041B601 orps xmm0,xmm1
00468128 jmp 00468130 0041B604 andps xmm0,xmm2
0046812A movss xmm0,dword ptr [esp+18h] 0041B607 andnps xmm4,xmm5
00468130 shufps xmm0,xmm0,0 0041B60A orps xmm3,xmm4
00468134 fld st(1) 0041B60D movaps xmm4,xmmword ptr [ecx+10h]
00468136 movups xmmword ptr [ecx+60h],xmm0 0041B611 movups xmmword ptr [ecx+60h],xmm3
0046813A fcomp dword ptr [edx+24h] 0041B615 movss xmm1,dword ptr [edx+24h]
0046813D fnstsw ax 0041B61A movss xmm3,dword ptr [eax+24h]
0046813F test ah,44h 0041B61F shufps xmm1,xmm1,0
00468142 jnp 0046821A 0041B623 mulps xmm1,xmm4
00468148 fld dword ptr [edx+24h] 0041B626 subps xmm5,xmm6
0046814B fmul dword ptr [ecx+10h] 0041B629 andnps xmm2,xmm5
0046814E fstp dword ptr [esp+1F0h] 0041B62C orps xmm0,xmm2
00468155 fld dword ptr [ecx+14h] 0041B62F movaps xmm2,xmmword ptr [edx]
00468158 fmul dword ptr [edx+24h] 0041B632 mulps xmm1,xmm0
0046815B fstp dword ptr [esp+1F4h] 0041B635 addps xmm1,xmm2
00468162 fld dword ptr [ecx+18h] 0041B638 movaps xmm2,xmmword ptr [edx+10h]
00468165 fmul dword ptr [edx+24h] 0041B63C movaps xmmword ptr [edx],xmm1
00468168 fstp dword ptr [esp+1F8h] 0041B63F movaps xmm1,xmmword ptr [ecx+30h]
0046816F fld dword ptr [esp+1F0h] 0041B643 mulps xmm1,xmm0
00468176 fmul st,st(1) 0041B646 addps xmm1,xmm2
00468178 fstp dword ptr [esp+130h] 0041B649 movaps xmmword ptr [edx+10h],xmm1
0046817F fld st(0) 0041B64D movaps xmm2,xmmword ptr [eax]
00468181 fmul dword ptr [esp+1F4h] 0041B650 shufps xmm3,xmm3,0
00468188 fstp dword ptr [esp+134h] 0041B654 movaps xmm1,xmm0
0046818F fld st(0) 0041B657 mulps xmm3,xmm4
00468191 fmul dword ptr [esp+1F8h] 0041B65A mulps xmm1,xmm3
00468198 fstp dword ptr [esp+138h] 0041B65D subps xmm2,xmm1
0046819F fld dword ptr [esp+130h] 0041B660 movaps xmmword ptr [eax],xmm2
004681A6 fadd dword ptr [edx] 0041B663 movaps xmm1,xmmword ptr [ecx+40h]
004681A8 fstp dword ptr [edx] 0041B667 mulps xmm1,xmm0
004681AA fld dword ptr [edx+4] 0041B66A movaps xmm0,xmmword ptr [eax+10h]
004681AD fadd dword ptr [esp+134h] 0041B66E addps xmm1,xmm0
004681B4 fstp dword ptr [edx+4] 0041B671 movaps xmmword ptr [eax+10h],xmm1
004681B7 fld dword ptr [esp+138h]
004681BE fadd dword ptr [edx+8]
004681C1 fstp dword ptr [edx+8]
004681C4 fld dword ptr [edx+20h]
004681C7 fmul st,st(1)
004681C9 fstp dword ptr [esp+18h]
004681CD fld dword ptr [esp+18h]
004681D1 fld st(0)
004681D3 fmul dword ptr [ecx+30h]
004681D6 fstp dword ptr [esp+290h]
004681DD fld dword ptr [ecx+34h]
004681E0 fmul st,st(1)
004681E2 fstp dword ptr [esp+294h]
004681E9 fmul dword ptr [ecx+38h]
004681EC fstp dword ptr [esp+298h]
004681F3 fld dword ptr [esp+290h]
004681FA fadd dword ptr [edx+10h]
004681FD fstp dword ptr [edx+10h]
00468200 fld dword ptr [edx+14h]
00468203 fadd dword ptr [esp+294h]
0046820A fstp dword ptr [edx+14h]
0046820D fld dword ptr [esp+298h]
00468214 fadd dword ptr [edx+18h]
00468217 fstp dword ptr [edx+18h]
0046821A fld st(1)
0046821C fcomp dword ptr [edi+24h]
0046821F fnstsw ax
00468221 test ah,44h
00468224 jnp 0046831A
0046822A fld dword ptr [ecx+10h]
0046822D fchs
0046822F fstp dword ptr [esp+30h]
00468233 fld dword ptr [ecx+14h]
00468236 fchs
00468238 fstp dword ptr [esp+34h]
0046823C fld dword ptr [ecx+18h]
0046823F fchs
00468241 fstp dword ptr [esp+38h]
00468245 fld dword ptr [esp+30h]
00468249 fmul dword ptr [edi+24h]
0046824C fstp dword ptr [esp+210h]
00468253 fld dword ptr [esp+34h]
00468257 fmul dword ptr [edi+24h]
0046825A fstp dword ptr [esp+214h]
00468261 fld dword ptr [esp+38h]
00468265 fmul dword ptr [edi+24h]
00468268 fstp dword ptr [esp+218h]
0046826F fld dword ptr [esp+210h]
00468276 fmul st,st(1)
00468278 fstp dword ptr [esp+170h]
0046827F fld st(0)
00468281 fmul dword ptr [esp+214h]
00468288 fstp dword ptr [esp+174h]
0046828F fld st(0)
00468291 fmul dword ptr [esp+218h]
00468298 fstp dword ptr [esp+178h]
0046829F fld dword ptr [edi]
004682A1 fadd dword ptr [esp+170h]
004682A8 fstp dword ptr [edi]
004682AA fld dword ptr [esp+174h]
004682B1 fadd dword ptr [edi+4]
004682B4 fstp dword ptr [edi+4]
004682B7 fld dword ptr [edi+8]
004682BA fadd dword ptr [esp+178h]
004682C1 fstp dword ptr [edi+8]
004682C4 fmul dword ptr [edi+20h]
004682C7 fstp dword ptr [esp+18h]
004682CB fld dword ptr [esp+18h]
004682CF fld st(0)
004682D1 fmul dword ptr [ecx+40h]
004682D4 fstp dword ptr [esp+270h]
004682DB fld dword ptr [ecx+44h]
004682DE fmul st,st(1)
004682E0 fstp dword ptr [esp+274h]
004682E7 fmul dword ptr [ecx+48h]
004682EA fstp dword ptr [esp+278h]
004682F1 fld dword ptr [esp+270h]
004682F8 fadd dword ptr [edi+10h]
004682FB fstp dword ptr [edi+10h]
004682FE fld dword ptr [edi+14h]
00468301 fadd dword ptr [esp+274h]
00468308 fstp dword ptr [edi+14h]
0046830B fld dword ptr [esp+278h]
00468312 fadd dword ptr [edi+18h]
00468315 fstp dword ptr [edi+18h]
00468318 jmp 0046831C
0046831A fstp st(0)
Code:
// Project Gauss Seidel or the equivalent Sequential Impulse
SIMD_FORCE_INLINE void btSequentialImpulseConstraintSolver::resolveSingleConstraintRowGenericSIMD(btSolverBody& body1,btSolverBody& body2,const btSolverConstraint& c)
{
#ifdef USE_SIMD
    _asm int 3;
    _asm int 3;
    __m128 cpAppliedImp = _mm_set1_ps(c.m_appliedImpulse);
    __m128 lowerLimit1 = _mm_set1_ps(c.m_lowerLimit);
    __m128 upperLimit1 = _mm_set1_ps(c.m_upperLimit);
    __m128 deltaImpulse = _mm_sub_ps(_mm_set1_ps(c.m_rhs), _mm_mul_ps(_mm_set1_ps(c.m_appliedImpulse),_mm_set1_ps(c.m_cfm)));
    __m128 deltaVel1Dotn = _mm_add_ps(_vmathVfDot3(c.m_contactNormal.mVec128,body1.m_deltaLinearVelocity.mVec128), _vmathVfDot3(c.m_relpos1CrossNormal.mVec128,body1.m_deltaAngularVelocity.mVec128));
    __m128 deltaVel2Dotn = _mm_sub_ps(_vmathVfDot3(c.m_relpos2CrossNormal.mVec128,body2.m_deltaAngularVelocity.mVec128),_vmathVfDot3((c.m_contactNormal).mVec128,body2.m_deltaLinearVelocity.mVec128));
    deltaImpulse = _mm_sub_ps(deltaImpulse,_mm_mul_ps(deltaVel1Dotn,_mm_set1_ps(c.m_jacDiagABInv)));
    deltaImpulse = _mm_sub_ps(deltaImpulse,_mm_mul_ps(deltaVel2Dotn,_mm_set1_ps(c.m_jacDiagABInv)));
    btSimdScalar sum = _mm_add_ps(cpAppliedImp,deltaImpulse);
    btSimdScalar resultLowerLess,resultUpperLess;
    resultLowerLess = _mm_cmplt_ps(sum,lowerLimit1);
    resultUpperLess = _mm_cmplt_ps(sum,upperLimit1);
    __m128 lowMinApplied = _mm_sub_ps(lowerLimit1,cpAppliedImp);
    deltaImpulse = _mm_or_ps( _mm_and_ps(resultLowerLess, lowMinApplied), _mm_andnot_ps(resultLowerLess, deltaImpulse) );
    c.m_appliedImpulse = _mm_or_ps( _mm_and_ps(resultLowerLess, lowerLimit1), _mm_andnot_ps(resultLowerLess, sum) );
    __m128 upperMinApplied = _mm_sub_ps(upperLimit1,cpAppliedImp);
    deltaImpulse = _mm_or_ps( _mm_and_ps(resultUpperLess, deltaImpulse), _mm_andnot_ps(resultUpperLess, upperMinApplied) );
    c.m_appliedImpulse = _mm_or_ps( _mm_and_ps(resultUpperLess, c.m_appliedImpulse), _mm_andnot_ps(resultUpperLess, upperLimit1) );
    __m128 linearComponentA = _mm_mul_ps(c.m_contactNormal.mVec128,_mm_set1_ps(body1.m_invMass));
    __m128 linearComponentB = _mm_mul_ps((c.m_contactNormal).mVec128,_mm_set1_ps(body2.m_invMass));
    __m128 impulseMagnitude = deltaImpulse;
    body1.m_deltaLinearVelocity.mVec128 = _mm_add_ps(body1.m_deltaLinearVelocity.mVec128,_mm_mul_ps(linearComponentA,impulseMagnitude));
    body1.m_deltaAngularVelocity.mVec128 = _mm_add_ps(body1.m_deltaAngularVelocity.mVec128 ,_mm_mul_ps(c.m_angularComponentA.mVec128,impulseMagnitude));
    body2.m_deltaLinearVelocity.mVec128 = _mm_sub_ps(body2.m_deltaLinearVelocity.mVec128,_mm_mul_ps(linearComponentB,impulseMagnitude));
    body2.m_deltaAngularVelocity.mVec128 = _mm_add_ps(body2.m_deltaAngularVelocity.mVec128 ,_mm_mul_ps(c.m_angularComponentB.mVec128,impulseMagnitude));
    _asm int 3;
    _asm int 3;
#else
    resolveSingleConstraintRowGeneric(body1,body2,c);
#endif
}
// Project Gauss Seidel or the equivalent Sequential Impulse
SIMD_FORCE_INLINE void btSequentialImpulseConstraintSolver::resolveSingleConstraintRowGeneric(btSolverBody& body1,btSolverBody& body2,const btSolverConstraint& c)
{
    btScalar deltaImpulse = c.m_rhs-btScalar(c.m_appliedImpulse)*c.m_cfm;
    const btScalar deltaVel1Dotn = c.m_contactNormal.dot(body1.m_deltaLinearVelocity) + c.m_relpos1CrossNormal.dot(body1.m_deltaAngularVelocity);
    const btScalar deltaVel2Dotn = -c.m_contactNormal.dot(body2.m_deltaLinearVelocity) + c.m_relpos2CrossNormal.dot(body2.m_deltaAngularVelocity);
    const btScalar delta_rel_vel = deltaVel1Dotn-deltaVel2Dotn;
    deltaImpulse -= deltaVel1Dotn*c.m_jacDiagABInv;
    deltaImpulse -= deltaVel2Dotn*c.m_jacDiagABInv;
    const btScalar sum = btScalar(c.m_appliedImpulse) + deltaImpulse;
    if (sum < c.m_lowerLimit)
    {
        deltaImpulse = c.m_lowerLimit-c.m_appliedImpulse;
        c.m_appliedImpulse = c.m_lowerLimit;
    }
    else if (sum > c.m_upperLimit)
    {
        deltaImpulse = c.m_upperLimit-c.m_appliedImpulse;
        c.m_appliedImpulse = c.m_upperLimit;
    }
    else
    {
        c.m_appliedImpulse = sum;
    }
    if (body1.m_invMass)
        body1.applyImpulse(c.m_contactNormal*body1.m_invMass,c.m_angularComponentA,deltaImpulse);
    if (body2.m_invMass)
        body2.applyImpulse(-c.m_contactNormal*body2.m_invMass,c.m_angularComponentB,deltaImpulse);
}
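The branch removal in the SIMD version above rests on one pattern: compute a comparison mask, then blend two candidate values with and / andnot / or. The following is a minimal standalone sketch of that pattern, not Bullet code. The names `select_ps`, `clampBranchy` and `clampBranchless` are made up for illustration, and the sketch only clamps a single value instead of solving a constraint row.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_* functions)

// Hypothetical helper: per lane, returns a where the mask bits are set and b elsewhere.
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Scalar reference with branches, shaped like the if / else-if / else in the generic solver row.
float clampBranchy(float sum, float lower, float upper)
{
    assert(lower <= upper);
    if (sum < lower)
        return lower;
    else if (sum > upper)
        return upper;
    return sum;
}

// Same result without any branch: both comparisons are always evaluated
// and the masks pick the value that survives.
float clampBranchless(float sum, float lower, float upper)
{
    assert(lower <= upper);
    const __m128 s  = _mm_set1_ps(sum);
    const __m128 lo = _mm_set1_ps(lower);
    const __m128 hi = _mm_set1_ps(upper);
    const __m128 belowLower = _mm_cmplt_ps(s, lo);  // all ones where s < lo
    const __m128 belowUpper = _mm_cmplt_ps(s, hi);  // all ones where s < hi
    __m128 r = select_ps(belowLower, lo, s);        // s < lo ? lo : s
    r = select_ps(belowUpper, r, hi);               // s < hi ? r  : hi
    return _mm_cvtss_f32(r);                        // extract lane 0
}
```

For ordinary (non-NaN) inputs with `lower <= upper`, both functions return the same value; the branchless one trades a few extra arithmetic instructions for the absence of conditional jumps.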
- Dragonlord
Re: Bullet 2.73 has been updated to Bullet 2.73 SP1.
I have not tried this out yet. I just read about this in various places dealing with the question of how to do SSE with GCC, and the main answer was not to do it by hand since the optimizer is already aggressive. A comparison would be interesting, though. I'm not sure whether the Bullet distribution currently contains a suitable demo app that could be modified to compare the manually optimized code directly with code optimized by GCC alone. I can try to see how it compares, but most probably not in the next few days. Furthermore, I only have gcc-4.1.2 for testing here, although gcc-4.3.2 is the most recent one ( all masked in Portage so far ). The SSE abilities ( -ftree-vectorize and company ) already exist there, though. It would be interesting to see how the different compilers fare with the different code paths.
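A comparison of this kind can be set up outside Bullet with a pair of functions that compute the same thing, one as a plain loop left to the compiler and one with hand-written intrinsics. The sketch below is an assumption-laden stand-in, not the solver: `axpyScalar` and `axpySse` are hypothetical names, and the kernel is a simple `y += a * x` update. With GCC, the plain loop would be built with something like `-O3 -msse2 -ftree-vectorize` to let the vectorizer try it.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Plain loop: y[i] += a * x[i]. Left as-is so the compiler's
// auto-vectorizer can decide whether to use SSE for it.
void axpyScalar(float a, const float* x, float* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Hand-written SSE version processing four floats per iteration.
// Assumes n is a multiple of 4; uses unaligned loads and stores so the
// arrays need no special alignment.
void axpySse(float a, const float* x, float* y, std::size_t n)
{
    assert(n % 4 == 0);
    const __m128 va = _mm_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 4)
    {
        const __m128 vx = _mm_loadu_ps(x + i);
        const __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
    }
}
```

Timing both functions over the same large arrays, and checking the generated assembly for packed instructions such as `mulps` and `addps`, would show whether the compiler matched the manual version on a given toolchain. A kernel with branches, like the constraint row in this thread, is a harder case for the vectorizer than this loop.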
Re: Bullet 2.73 has been updated to Bullet 2.73 SP1.
Erwin Coumans wrote: Is there some MSVC setting that needs to be enabled for auto-vectorization?
With VS 2005 (and I'm sure it's the same for 2008, and probably 2003), use this:
/arch:SSE OR /arch:SSE2
AND
/fp:fast (instead of /fp:precise)
If you don't set /fp:fast then it will block some of the SSE/SSE2 optimizations.
These settings cause the compiler to use SSE in a lot of cases; however, it is still not as good at vectorizing as hand-written code.
One bonus is that float/int conversions use SSE instructions which are faster.
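As a small illustration of the conversion point, here is a sketch of a float-to-int loop together with a build line that uses the flags named above. The file name `convert.cpp` and the function `floatsToInts` are hypothetical. The expectation that the cast becomes a single SSE conversion instruction (such as `cvttss2si`) under these flags is typical behaviour rather than a guarantee, so the generated assembly is worth checking.

```cpp
// Possible MSVC build line (flags from the post above, file name assumed):
//   cl /O2 /arch:SSE2 /fp:fast convert.cpp
#include <cassert>
#include <cstddef>

// Converts each float to int by truncation toward zero, which is what a
// C++ static_cast does. With SSE code generation enabled, compilers can
// typically map this cast to an SSE conversion instruction instead of an
// x87 sequence.
void floatsToInts(const float* in, int* out, std::size_t n)
{
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<int>(in[i]);
}
```

The result of the cast is the same with or without these flags for values that fit in an `int`; only the instructions used to compute it change.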