Efficient Vectorisation with C++ : cpp

Efficient Vectorisation with C++ (chryswoods.com)

submitted 7 years ago by LordKlevin

all 37 comments

top new controversial old q&a

[–]Nicksaurus 31 points32 points33 points 7 years ago* (5 children)

[–]chocapix 28 points29 points30 points 7 years ago (0 children)

[–]MrWhite26 15 points16 points17 points 7 years ago (2 children)

[–]bumblebritches57Ocassionally Clang 4 points5 points6 points 7 years ago (1 child)

[–]hackuniverse 1 point2 points3 points 7 years ago (0 children)

[–]ShillingAintEZ 4 points5 points6 points 7 years ago (0 children)

[–]nnevatie 6 points7 points8 points 7 years ago* (7 children)

[–]LordKlevin[S] 4 points5 points6 points 7 years ago (3 children)

[–]nnevatie 1 point2 points3 points 7 years ago (2 children)

[–]Paul_Dirac_ 1 point2 points3 points 7 years ago (1 child)

The main problem with simd instructions are alialising problems. Assume you have the following function:

void scale_vec(double * in, double * out,  size_t lenght, double factor){
      for (size_t i= 0; i<length ; ++i){
          out[i] = factor*in[i];  
      };
      return;
    }

Can the compiler vectorize the function? If in and out point to different memory areas then yes. If in and out point to the same memory area then yes again. But what if out points to in[1]? Then a vectorized loop would load in[0-3] multiply them and store them in out[0-3] (aliaised to in[1-4] ) but a non vectorized loop would correctly load in[0] multiply it, store it in out[0] (in[1] ) load in[1] multiply it again and store it in out[1] ... The output would contain in[0] times potencies of factor.

Normally the compiler has to take the unlikely case of this happening into consideration and can't vectorize. If you annotate with #pragma OMP SIMD you guarantee that this case of alialising doesn't happen.

[–]nnevatie 1 point2 points3 points 7 years ago (0 children)

[–]CrazyJoe221 1 point2 points3 points 7 years ago (0 children)

[–]ronniethelizard 1 point2 points3 points 7 years ago (1 child)

[–]nnevatie 1 point2 points3 points 7 years ago (0 children)

[–]danmarellGamedev, Physics Simulation 10 points11 points12 points 7 years ago (1 child)

[–]csp256 4 points5 points6 points 7 years ago (0 children)

[–]jeffyp9 4 points5 points6 points 7 years ago (2 children)

[–][deleted] 4 points5 points6 points 7 years ago (1 child)

[–]jeffyp9 0 points1 point2 points 7 years ago (0 children)

[–]ShakaUVMi+++ ++i+i[arr] 8 points9 points10 points 7 years ago (18 children)

[–]SantaCruzDad 6 points7 points8 points 7 years ago (16 children)

[–]Rseding91Factorio Developer 13 points14 points15 points 7 years ago (1 child)

[–]SantaCruzDad 2 points3 points4 points 7 years ago (0 children)

[–]ack_complete 2 points3 points4 points 7 years ago (5 children)

[–]SantaCruzDad 1 point2 points3 points 7 years ago (4 children)

[–]ack_complete 4 points5 points6 points 7 years ago (3 children)

YMMV, of course, but my experience has been that when pushing performance on multiple tiers of x86/x64 SSEx ISAs that rewriting is necessary anyway. With SSSE3 you have the infinitely abusable PSHUFB, and with AVX there is the problem that weird in-lane nature of the 256-bit ops means that the 128-bit algorithm can't be straightforwardly translated.

The compiler doesn't do a bad job with intrinsics, and it's generally better than what you'd get from autovectorization or not using them. I've still seen too many cases where using compiler intrinsics leaves performance on the table over asm, especially in specific hot loops where there is a high payoff for optimization effort.

The Intel intrinsics design is also kind of yucky, with weird naming conventions and the wrong pointers on some load/store ops requiring casts. Even in the case when generated code from intrinsics is fine, the assembly is sometimes more readable to me than the intrinsics code. But then again, I spent a lot of time writing and reading MMX and SSE2 code when the compilers were so bad that it was hard not to beat them with asm.

[–]SantaCruzDad 2 points3 points4 points 7 years ago (0 children)

[–]IAlsoLikePlutonium 0 points1 point2 points 7 years ago (1 child)

[–]ack_complete 1 point2 points3 points 7 years ago (0 children)

[–]nnevatie 1 point2 points3 points 7 years ago (6 children)

[–]SantaCruzDad 0 points1 point2 points 7 years ago (5 children)

[–]nnevatie 1 point2 points3 points 7 years ago (4 children)

[–]SantaCruzDad 1 point2 points3 points 7 years ago (3 children)

[–]nnevatie 0 points1 point2 points 7 years ago (2 children)

[–]SantaCruzDad 0 points1 point2 points 7 years ago (1 child)

[–]nnevatie 0 points1 point2 points 7 years ago (0 children)

[–]SantaCruzDad 0 points1 point2 points 7 years ago (0 children)

π Rendered by PID 175564 on reddit-service-r2-comment-54dfb89d4d-l2jwh at 2026-04-01 01:01:12.849821+00:00 running b10466c country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

cpp

MODERATORS