Is SIMD math worth the effort??

projectileman
Posts: 109
Joined: Thu Dec 14, 2006 4:27 pm
Location: Colombia

Is SIMD math worth the effort??

Post by projectileman »

Hi.

I was comparing performance on intensive vector/matrix operations between Bullet's linear math and Sony's SIMD Vectormath library, which comes in the Extras folder.

On my machine (a 32-bit Pentium 4) with the SSE configuration, I've found that Bullet's linear math is often faster (1.5 - 2 times faster) than Sony's SIMD Vectormath library, even though the latter claims to use SIMD.

Here is the code of the test demo:

Code: Select all


///Testfile to test differences between vectormath and Bullet LinearMath


#include "vectormath_aos.h"


#include "LinearMath/btTransform.h"
#include <stdio.h>
#include <stdlib.h>

#include "GL/glfw.h"

#define NUM_TESTS 100
#define NUM_OPERATIONS 10000

//In Bullet, a btVector3 can be used for both points and vectors.
//It is up to the user/developer to use the right multiplication: btTransform for points, and btQuaternion or btMatrix3x3 for vectors.
void	BulletTest()
{
	btTransform	tr;
	tr.setIdentity();

	tr.setOrigin(btVector3(10,0,0));
	//initialization
	btVector3	pointA(0,0,0);
	btVector3	pointB,pointC,pointD(1,2,3),pointE(4,5,6); //initialized so dot/length/normalize work on real data
		
	btScalar	x;
	int i = NUM_OPERATIONS;
	while(i--)
	{		
		//transform over tr
		pointB = tr(pointA);
	
		//inverse transform
		pointC = tr.inverse() * pointA;

		//dot product
		x = pointD.dot(pointE);
		//square length
		x = pointD.length2();
		//length
		x = pointD.length();		
		//in-place normalize pointD
		pointD.normalize();
	}
	
}


//vectormath makes a difference between point and vector.
void	VectormathTest()
{

	Vectormath::Aos::Transform3 tr = Vectormath::Aos::Transform3::identity();
	tr.setTranslation(Vectormath::Aos::Vector3(10,0,0));
	//initialization
	Vectormath::Aos::Point3	pointA(0,0,0);
	Vectormath::Aos::Point3	pointB,pointC,pointE(4,5,6);
	Vectormath::Aos::Vector3 pointD(1,2,3); //initialized to match the Bullet test
	
	btScalar	x;
	int i = NUM_OPERATIONS;
	while(i--)
	{

		//transform over tr
		pointB = tr * pointA;

		//transform over tr	
		//inverse transform
		pointC = Vectormath::Aos::inverse(tr) * pointA;

		//dot product
		x = Vectormath::Aos::dot(Vectormath::Aos::Vector3(pointD),Vectormath::Aos::Vector3(pointE));
		//square length
		x = Vectormath::Aos::lengthSqr(Vectormath::Aos::Vector3(pointD));
		//length
		x = Vectormath::Aos::length(Vectormath::Aos::Vector3(pointD));
		
		//in-place normalize pointD
		pointD = Vectormath::Aos::normalize(Vectormath::Aos::Vector3(pointD));
	}	
}



int main()
{
	glfwInit( );
	
	
	double start_time;
	double end_time;
	int i;

	{
		printf("\n \n Vectormath\n");
		start_time = glfwGetTime();
		i = NUM_TESTS;
		while(i--)
		{
			VectormathTest();		
		}	
		end_time = glfwGetTime();
	}

	printf("Elapsed time : %f s\n",end_time-start_time); //glfwGetTime() returns seconds


	{
		printf("\n \n Bullet Linearmath\n");
		start_time = glfwGetTime();
		i = NUM_TESTS;
		while(i--)
		{
			BulletTest();		
		}	
		end_time = glfwGetTime();
	}
	printf("Elapsed time : %f s\n",end_time-start_time);


	getchar();

	glfwTerminate( );

	return 0;
}
To compile it, you need GLFW for the timing routines.


I really don't know if I'm doing something wrong, but these simple tests (which only include vector multiplications and matrix transformations) suggest that writing SIMD code isn't worth the effort!

I have done this kind of test before, some months ago, with other SIMD libraries such as nvec and SIMDx86 and with my own macro-based implementation, only to find that those SIMD libraries are just bloated code.

With this I can verify that the KISS rule (keep it simple, stupid!) applies to performance-critical applications.

I once heard a very wise proverb from a programmer: "The compiler is always smarter than you... just TRUST IT and don't do anything stupid!" The compiler optimizes in a smarter way, and I could be sure it will emit SIMD code even if I didn't ask for it.
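For what it's worth, the kind of loop a compiler will auto-vectorize on its own is quite restricted. A minimal sketch (the function name is mine, not from any library): a flat array loop like this is a typical candidate for auto-vectorization with SSE enabled, while the short per-object 3-element vector math in the tests above usually is not.

Code: Select all

```cpp
#include <cstddef>

// A simple, dependency-free array loop: this shape is what compilers
// can often auto-vectorize when SSE code generation is enabled.
// The 3-element vector operations in the benchmark above are much
// harder for them to vectorize automatically.
void AddArrays(const float *a, const float *b, float *out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```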
DevO
Posts: 95
Joined: Fri Mar 31, 2006 7:13 pm

Re: Is SIMD math worth the effort??

Post by DevO »

Hi.

Are you using Visual C++ 2005?
Have you enabled "Streaming SIMD Extensions 2 (/arch:SSE2)" in the project options?

Well, I have tried SIMD optimization too, and in some cases it can make your code run twice as fast.
But sometimes it is even slower.
SIMD is not good for everything.

Things like Dot, Length, and Normalize cannot be coded well with SSE1 or SSE2 and can even be slower than the plain version.

Code: Select all

//No SSE 
inline float Dot(const Vector &v0, const Vector &v1)	
{ 
	return (v0.x*v1.x + v0.y*v1.y + v0.z*v1.z); 
}

// SSE2
inline __m128 DotSSE2(const VectorSIMD &v0,const VectorSIMD &v1)	
{
	__m128 a = _mm_mul_ps(v0.m128, v1.m128);
	return _mm_add_ss(_mm_shuffle_ps(a, a, _MM_SHUFFLE(0,0,0,0)), _mm_add_ss(_mm_shuffle_ps(a, a, _MM_SHUFFLE(1,1,1,1)), _mm_shuffle_ps(a, a, _MM_SHUFFLE(2,2,2,2))));
}

// SSE3
inline __m128 DotSSE3(const VectorSIMD &v0,const VectorSIMD &v1)	
{
	__m128 vec1;
	vec1 = _mm_mul_ps(v0.m128, v1.m128);
	vec1 = _mm_hadd_ps(vec1, vec1);
	vec1 = _mm_hadd_ps(vec1, vec1);	
	return vec1;
}
Probably only the SSE3 version is a bit faster than no SSE at all.

Another thing that must be avoided is mixing SIMD (__m128) and float, as you do with btScalar x; in your tests.
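To illustrate that mixing penalty, here is a hedged sketch (VectorSIMD is the same hypothetical aligned wrapper as in the snippets above, and the function names are mine): keep the dot product result splatted in an __m128 if it feeds more SSE math, and only extract it to a plain float at the very end.

Code: Select all

```cpp
#include <xmmintrin.h>

// Hypothetical 16-byte-aligned wrapper, matching the snippets above.
struct VectorSIMD { __m128 m128; };

// Dot product whose result stays splatted across an XMM register,
// so it can feed further SSE arithmetic without leaving the pipeline.
inline __m128 DotSplat(const VectorSIMD &v0, const VectorSIMD &v1)
{
    __m128 a = _mm_mul_ps(v0.m128, v1.m128);
    __m128 x = _mm_shuffle_ps(a, a, _MM_SHUFFLE(0,0,0,0));
    __m128 y = _mm_shuffle_ps(a, a, _MM_SHUFFLE(1,1,1,1));
    __m128 z = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2,2,2,2));
    return _mm_add_ps(x, _mm_add_ps(y, z));
}

// Extracting to a plain float is the SIMD-to-scalar transition that
// costs performance when it happens inside a hot loop.
inline float DotScalar(const VectorSIMD &v0, const VectorSIMD &v1)
{
    return _mm_cvtss_f32(DotSplat(v0, v1));
}
```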
User avatar
projectileman
Posts: 109
Joined: Thu Dec 14, 2006 4:27 pm
Location: Colombia

Re: Is SIMD math worth the effort??

Post by projectileman »

I've run that code with Visual C++ 2005 and tried all the options (/arch:SSE2, /arch:SSE, and no instruction set), and in all cases the Bullet math library remains stable and faster, while Vectormath runs slower and fluctuates.

Maybe mixing SIMD (__m128) and unaligned data is causing that poor performance, as you said. However, I've tried replacing that btScalar x variable with a SIMD data type, as shown in this code:

Code: Select all

//vectormath makes a difference between point and vector.
void	VectormathTest()
{
	int i = NUM_OPERATIONS;

	Vectormath::Aos::Transform3 tr = Vectormath::Aos::Transform3::identity();
	tr.setTranslation(Vectormath::Aos::Vector3(10,0,0));
	//initialization
	Vectormath::Aos::Point3	pointA(0,0,0);
	Vectormath::Aos::Point3	pointB,pointC,pointE(4,5,6);
	Vectormath::Aos::Vector3 pointD(1,2,3); //initialized so the operations work on real data

	Vectormath::Aos::Point3	x;

	while(i--)
	{

		//transform over tr
		pointB = tr * pointA;

		//transform over tr
		//inverse transform
		pointC = Vectormath::Aos::inverse(tr) * pointA;

		//dot product
		x[0] = Vectormath::Aos::dot(Vectormath::Aos::Vector3(pointD),Vectormath::Aos::Vector3(pointE));
		//square length
		x[0] = Vectormath::Aos::lengthSqr(Vectormath::Aos::Vector3(pointD));
		//length
		x[0] = Vectormath::Aos::length(Vectormath::Aos::Vector3(pointD));

		//in-place normalize pointD
		pointD = Vectormath::Aos::normalize(Vectormath::Aos::Vector3(pointD));
	}
}
Just to get the same poor results.

I don't have SSE3 hardware, so I can't determine whether those instructions have a positive effect.

It seems that SIMD is a feature that is not well suited to programmers like me. I really don't know how to get any advantage from it, and the x86 architecture doesn't offer a reliable solution here, compared with the PowerPC on the Mac and the SPUs in the PS3: x86 doesn't support structure-of-arrays processing, which could be more efficient.

It's better not to try coding SIMD if you don't really know exactly what's happening in your machine.
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Re: Is SIMD math worth the effort??

Post by Erwin Coumans »

It's better not to try coding SIMD if you don't really know exactly what's happening in your machine.
That is true; you need to understand things at the assembly/intrinsics/register level to be able to benefit from SIMD.
x86 doesn't support structure-of-arrays processing, which could be more efficient.
'Structure of arrays' and 'array of structures' is a choice made by the developer, not the hardware. You can reorganize the data and the algorithm to process the data in an SoA or AoS way. This can be very difficult, so if you are still interested, I can recommend this paper based on SIMD experience by Jan Paul van Waveren of id Software. It explains how they optimized their animation code for SIMD:
http://www.intel.com/cd/ids/developer/a ... 293451.htm

A few other tips here:
- please try 'floatInVec' to store the scalar result; that keeps it in a SIMD register.
- it is best to optimize a bigger part of the code/loop entirely in SIMD, do a lot of profiling, review the assembly the compiler generates, modify the code if necessary, and repeat. Whenever you convert from SIMD to scalar you lose performance.
- the dot product requires rewriting the code to be faster, because the horizontal add is not SIMD friendly. It can be better to do 4 or 16 dot products at the same time, instead of a single dot product.

Hope this helps,
Erwin
DevO
Posts: 95
Joined: Fri Mar 31, 2006 7:13 pm

Re: Is SIMD math worth the effort??

Post by DevO »

Well, SIMD programming is a bit different; you need to know where it makes sense to use SIMD and where it doesn't.

Using SSE only for dot(), length(), and normalize() does not work well, because those operations cannot be done in parallel.
But SSE is good for many vector additions, multiplications, and similar operations.
Of course, the best speedup you will get is for 4D vectors.

Here is a simple test I made in one of the Bullet demos.

Code: Select all

///SIMD test --------------------------
///SIMD test --------------------------
	m_idle = true;

#define NUM_OPERATIONS 10000000

/// SSE
	{
		Vector3 sum = Vector3(0.0,0.0,0.0);

		Vector3 v1 = Vector3(1.11,2.12,3.88);
		Vector3 v2 = Vector3(3.11,2.12,1.88);
		Vector3 v3 = Vector3(0.2,0.4,0.5);

		m_clock.reset();
		int i = NUM_OPERATIONS / 4;
		//int i = NUM_OPERATIONS;
		while(i--)
		{
			sum += v1 - v2 + v3;
			sum += v1 - v2 + v3;
			sum += v1 - v2 + v3;
			sum += v1 - v2 + v3;
		}
		unsigned long int time = m_clock.getTimeMicroseconds();
		
		printf("res %f, %f, %f \n",float(sum.getX()),float(sum.getY()),float(sum.getZ()));
		printf("SSE time %lu us \n",time); //%lu matches unsigned long
	}


/// BT
	{
		btVector3 sum = btVector3(0.0,0.0,0.0);

		btVector3 v1 = btVector3(1.11,2.12,3.88);
		btVector3 v2 = btVector3(3.11,2.12,1.88);
		btVector3 v3 = btVector3(0.2,0.4,0.5);

		m_clock.reset();
		int i = NUM_OPERATIONS / 4;
		//int i = NUM_OPERATIONS;
		while(i--)
		{
			sum += v1 - v2 + v3;
			sum += v1 - v2 + v3;
			sum += v1 - v2 + v3;
			sum += v1 - v2 + v3;
		}
		unsigned long int time = m_clock.getTimeMicroseconds();
		printf("res %f, %f, %f \n",sum.x(),sum.y(),sum.z());
		printf("Bt time %lu us \n",time);
	}

///SIMD test --------------------------
On my 2.40 GHz Core 2 Duo I am getting the following results:
SSE: 15,715 microseconds
Bullet: 37,842 microseconds

So you can get about a 2.4x speedup in this very simple case.
And do not forget about manual loop unrolling :)
At least the Visual C++ 2005 compiler cannot do it for you; I'm not sure about GCC.

P.S.:
Can you please comment on this thread about GIMPACT3?
http://www.bulletphysics.com/Bullet/php ... f=9&t=1529

P.P.S.:
I also made one small mistake at first: I was using vectormath_aos.h without defining __SSE__.
projectileman
Posts: 109
Joined: Thu Dec 14, 2006 4:27 pm
Location: Colombia

Re: Is SIMD math worth the effort??

Post by projectileman »

Hi. Guess what... I've fixed the problem.

I don't know what was happening before, but now Sony's Vector Math library runs faster than Bullet's linear math.

My apologies, I was wrong; I still have to check what I was doing badly. This time I've written a more organized and specialized test application, which guarantees that vector data is allocated in aligned memory. It now uses the btAlignedObjectArray class to allocate a vector set.

In my testbed, Sony's Vector Math library runs more stably and faster than Bullet's linear math, and its performance doesn't degrade; even in horizontal operations (like dot product, length, cross product) it is noticeably faster than Bullet, and it is exceptionally fast at vector normalization.

It also works fine when mixing single scalar values with SIMD vector registers, so I cannot determine what was causing the earlier slowdown.

I'm sending you my testbed application code so you can draw your own conclusions.
main_vmtest.zip
vectormath_testbed_msvc_binary.zip

I still have to do more tests with quaternions, matrices, and conditional operations (like Min, Max, Clamp). I'm planning to use the Vector Math library in the next GIMPACT version, so I will adopt it once I'm sure it increases performance in most cases.
DevO
Posts: 95
Joined: Fri Mar 31, 2006 7:13 pm

Re: Is SIMD math worth the effort??

Post by DevO »

Hi,

thanks!
Have you done any Win64 tests?
In my tests with your code, Bullet math in 32-bit mode is 2x slower than in 64-bit mode.
Sony's Vector Math is equally fast in both 32-bit and 64-bit builds!
Why this difference???

Test number 4 in 64-bit mode: Sony's Vector Math is about 3.6x faster than Bullet math.
Test number 4 in 32-bit mode: Sony's Vector Math is about 8.0x faster than Bullet math.

But with the floating-point model set to fast (/fp:fast), 32-bit Bullet math is as fast as the 64-bit build.
There is no speed difference from /fp:fast in 64-bit mode.
All this is really strange.

But SIMD does seem to be faster!
How hard would it be to use Sony's Vector Math in Bullet instead of its own math library?
bone
Posts: 231
Joined: Tue Feb 20, 2007 4:56 pm

Re: Is SIMD math worth the effort??

Post by bone »

I spent a couple weeks trying to improve the performance of my math library using SSE2. In the end, I simply gave up because the loads into the XMM registers took so long they usually killed any performance benefit. My 3-float vectors matched or beat aligned 4-float vectors in almost all cases (yes I was comparing disassembly to verify nothing unusual like unnecessary shuffling was happening). Now this is with my Core 2 (E6600) ... my understanding is that the loading is faster in later chips, so in the near future I think we'll start seeing the supposed benefits of all this SSE stuff.

I'll agree that batch operations can be significantly sped up with SSE2 right now, but if you just need to compute a quick general DotProduct at an arbitrary place in your code, you're not going to see much if any improvement.
projectileman
Posts: 109
Joined: Thu Dec 14, 2006 4:27 pm
Location: Colombia

Re: Is SIMD math worth the effort??

Post by projectileman »

I spent a couple weeks trying to improve the performance of my math library using SSE2. In the end, I simply gave up because the loads into the XMM registers took so long they usually killed any performance benefit
I'll agree that batch operations can be significantly sped up with SSE2 right now, but if you just need to compute a quick general DotProduct at an arbitrary place in your code, you're not going to see much if any improvement.
Maybe you're right. I've also noticed that SSE doesn't give the extraordinary performance I expected.
However, I get at least a 10%-20% performance gain in most cases, and with operations such as square roots and normalization I can get a 250%-400% speedup.

When working with SSE it is very important to keep everything in 128-bit registers, otherwise you'll experience a serious slowdown.
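For reference, the large normalization speedups usually come from _mm_rsqrt_ps. A sketch under my own naming (this is not Sony's implementation): estimate 1/sqrt with the hardware approximation, then sharpen it with one Newton-Raphson step, trading a little precision for a lot of speed compared to a full divide and square root.

Code: Select all

```cpp
#include <xmmintrin.h>

// Approximate 3D normalization using _mm_rsqrt_ps plus one
// Newton-Raphson refinement step. The raw estimate is only ~12 bits
// accurate; the refinement brings it close to single precision.
inline __m128 NormalizeApprox(__m128 v)
{
    __m128 d = _mm_mul_ps(v, v);
    // horizontal add of the x, y, z squares, splatted into every lane
    __m128 t = _mm_add_ps(_mm_shuffle_ps(d, d, _MM_SHUFFLE(0,0,0,0)),
              _mm_add_ps(_mm_shuffle_ps(d, d, _MM_SHUFFLE(1,1,1,1)),
                         _mm_shuffle_ps(d, d, _MM_SHUFFLE(2,2,2,2))));
    __m128 r = _mm_rsqrt_ps(t);                 // rough 1/sqrt estimate
    // one Newton-Raphson step: r = 0.5 * r * (3 - t * r * r)
    r = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), r),
                   _mm_sub_ps(_mm_set1_ps(3.0f),
                              _mm_mul_ps(t, _mm_mul_ps(r, r))));
    return _mm_mul_ps(v, r);
}
```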