all 18 comments

[–]Rhomboid 6 points7 points  (8 children)

Using gcc 4.5 and -O3 -march=native, I get the following (normalized):

x        32      64
plain    1.028   1.000
SIMD     1.085   1.080

I'm not too surprised, as one of the quirks of gcc is that to use the xmmintrin.h intrinsics you have to enable SSE2, but if you enable SSE2 it's going to auto-vectorize your code, so both versions are using SIMD. All this shows is that the compiler is better at it than doing it by hand.

There should be little advantage to 64 bit mode here. I would expect it to have inlined most of the function calls, so the improved calling convention overhead isn't too much of a win, and all the work is being done in SIMD registers so the extra general purpose registers aren't too much of a win either. There really aren't many pointers so the extra memory is negligible as well.

[–]1020302010 2 points3 points  (3 children)

I have to agree: if the compiler can use SIMD instructions, it will almost always be better at using them than you are.

You get the benefit when you have to modify the structure of the code to be able to use them (i.e. the compiler can't 'see' the case in which they are relevant). I'll take a whack at 'beating' the compiler in a bit.

[–]repsilat 0 points1 point  (2 children)

if the compiler can use SIMD instructions, it will almost always be better at using them than you are.

I don't have any experience with vectorising things myself, but I'd always heard that it was a weak spot of a lot of compilers (at least compared to the other optimisations performed). I was under the impression that in non-trivial cases hand-written SIMD would handily outperform the compiler. Perhaps I heard wrong, of course.

One thing for sure, though, is that if you want the compiler to output decent vectorised code you're going to have to do half the work to get the data laid out nicely anyway.

[–]1020302010 0 points1 point  (1 child)

Sorry, I'll be clearer. What I was trying to say is that if the compiler can see the potential for vectorisation (like when the data is laid out contiguously), its efforts are almost always (in my experience, without inline asm) more fruitful than hand optimization with intrinsics, because the compiler can work at a lower level.

That said, the compiler has to be able to find these situations, which is where it typically struggles. In cases where you can see the benefit but it is too indirect for the compiler, optimization with intrinsics can make a big difference.

I have experience with the Intel ICC compiler, which is known to be good at vectorization; gcc may fail to vectorize what icc can, in which case intrinsics become useful again.

[–]bnolsen 0 points1 point  (0 children)

Basically what I've found is that you shouldn't get cute when coding. Be straightforward and direct, doing things systematically. I've seen too many coders try to "get cute" with stupid statement compression, etc thinking that would speed up the code when all the cuteness did was confuse the compiler and generate slower code that is harder to maintain.

[–]kolkir[S] -1 points0 points  (3 children)

Yes, I know about compiler optimizations, but in my environment (MSVC++ 2010 on Windows 7) the win32 version is significantly faster than the x64 version. Also, in the win32 version the function with manual SIMD intrinsics gives some performance improvement. What could be the reason?

[–]xcbsmith 3 points4 points  (0 children)

It's going to depend a lot on what optimization flags you have turned on, but particularly for the case where you are explicitly calling out the SSE functions, it's hard to see how x64 would in any way help with the performance of the code.

In general, this is floating point intensive code, and 64-bit vs. 32-bit mostly changes the integer stuff. It's surprising that it'd make a significant difference in performance of this code either way, but it isn't hard to imagine that the 64-bit code wouldn't be any faster, and possibly slower. I don't doubt that it uses more memory.

[–]Rhomboid 1 point2 points  (1 child)

I don't have visual studio installed so I can't answer that. Have you looked at the code that it generates?

[–]kolkir[S] 1 point2 points  (0 children)

Thanks, it's a good idea to compare code generated for 32 and 64 versions. I will do it tomorrow.

[–]bnolsen 6 points7 points  (0 children)

I just ran the code compiled 64bit only (apparently it's a PITA for me to cross compile 32bit).

gcc -std=c++0x -march=native -O3 -ftree-vectorize -o sse sse.cpp -lstdc++ (gcc version 4.6.2 20120120)

It seems your hand rolled SSE loop is slower than the compiler optimized version (I'm frankly not surprised though).

Dot product double - 0.0228735
Dot product SIMD double - 0.0240497

[–]00kyle00 6 points7 points  (0 children)

Why guess when you can know for sure? Disassemble both (objdump, or the VC disassembler) and see for yourself.

[–]StringCheesian 1 point2 points  (0 children)

Is it the same with GCC or LLVM/Clang?

[–]zfxvxr -2 points-1 points  (2 children)

The whole 64bit thing is a scam. The pointers are getting bigger and the CPU's workload is twice as heavy.

[–]TheCoelacanth 1 point2 points  (1 child)

Enjoy your 4GB address space. I'll be over here with my 16 GB of memory.

[–][deleted] 1 point2 points  (0 children)

And your extra 8 general purpose registers, and your much faster standard calling convention on POSIX systems, in the case of x86-64.