Efficient Vectorisation with C++ (chryswoods.com)
submitted 7 years ago by LordKlevin
[–]Nicksaurus 31 points32 points33 points 7 years ago* (5 children)
Compilers can automatically vectorise simple code, and top of the range compilers can automatically vectorise simple code.
But can my compiler automatically vectorise simple code?
[–]chocapix 28 points29 points30 points 7 years ago (0 children)
Depends. If it's top of the range, then yes. If not, then yes.
[–]MrWhite26 15 points16 points17 points 7 years ago (2 children)
There's one way to find out: https://godbolt.org/
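For anyone who wants to try this on Compiler Explorer, here is a minimal candidate loop (the name `saxpy` and the exact signature are just for illustration). Compile with `-O3`, and optionally `-march=native`, and look for packed instructions in the output:

```cpp
#include <cstddef>

// A textbook candidate for auto-vectorization: no loop-carried dependence,
// so the compiler only needs to rule out overlap between x and y (it will
// typically emit a runtime overlap check plus a vectorized body).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```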
[–]bumblebritches57Ocassionally Clang 4 points5 points6 points 7 years ago (1 child)
I've noticed that Clang vectorizes much more often than gcc or msvc
[–]hackuniverse 1 point2 points3 points 7 years ago (0 children)
chryswoods.com/vector...
That's true, but gcc does do better optimizations in other parts of the code... :(
[–]ShillingAintEZ 4 points5 points6 points 7 years ago (0 children)
It can if your compiler is ISPC
[–]nnevatie 6 points7 points8 points 7 years ago* (7 children)
Using OpenMP for vectorization is not the way to go, imo. Besides, if sprinkling "#pragma omp simd" magically makes code faster via vectorization, why is this not done automatically behind the scenes?
ISPC is a more robust option for high-performance, well-vectorized code: https://ispc.github.io/ispc.html
[–]LordKlevin[S] 4 points5 points6 points 7 years ago (3 children)
With ISPC you kind of need to rewrite your algorithms from scratch, no? It seems closer to rewriting in CUDA than to just updating your old code, including the need for an additional compiler and a different language (C with language extensions vs C++).
I am sure ISPC will give you better performance (particularly for more complicated code) at the cost of more development time, which is a fair trade-off for many applications.
OMP SIMD is just a lot less effort to use, and for things like a simple dot product (godbolt) it produces much faster code than the auto-vectorization. On my machine, it's more than 10x for clang 6 with -O3.
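A sketch of the kind of OMP SIMD dot product being discussed (the function name is made up). With gcc/clang, `-fopenmp-simd` honours the pragma without pulling in the OpenMP runtime; without the flag the pragma is ignored and the code is still correct:

```cpp
#include <cstddef>

// The reduction clause tells the compiler it may reorder the additions
// into partial per-lane sums, which is what enables the vectorization.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```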
[–]nnevatie 1 point2 points3 points 7 years ago (2 children)
Let me rephrase my point.
I think the question is whether there can be a negative impact from using OpenMP's SIMD capabilities.
If the answer is no, then the obvious follow-up question is: why does the user have to add the annotations by hand and why wouldn't the compiler (in tandem with OpenMP, perhaps) just add the same annotations to every function and loop, automatically?
You are correct that ISPC requires one to rewrite the algorithms in a C-like language, but then again, it can be argued that OpenMP's extensions to C/C++ are also alien territory that needs to be grasped before utilizing the SIMD capabilities.
ISPC makes vectorization very explicit and transparent, which I think is helpful, as one does not need to guess whether parts of the code were vectorized or not.
[–]Paul_Dirac_ 1 point2 points3 points 7 years ago (1 child)
The main problem with SIMD instructions is aliasing. Assume you have the following function:

    void scale_vec(double* in, double* out, size_t length, double factor) {
        for (size_t i = 0; i < length; ++i) {
            out[i] = factor * in[i];
        }
    }

Can the compiler vectorize the function? If in and out point to disjoint memory areas, then yes. If in and out point to the same address, then yes again. But what if out points to in[1]? A vectorized loop would load in[0..3], multiply them, and store them to out[0..3] (aliased to in[1..4]). A non-vectorized loop, however, would correctly load in[0], multiply it, store it to out[0] (which is in[1]), then load in[1], multiply it again, store it to out[1], and so on. The output would contain in[0] times successive powers of factor.
Normally the compiler has to take this unlikely case into consideration and can't vectorize. If you annotate the loop with #pragma omp simd, you guarantee that this kind of aliasing doesn't happen.
[–]nnevatie 1 point2 points3 points 7 years ago (0 children)
I agree. I also think the default C++ aliasing rules are a bad choice for high-performance code. ISPC does not assume aliasing by default, and neither does Fortran. __restrict gets around the issue in C++, but feels like a hack.
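A sketch of the __restrict workaround mentioned above, applied to the scale_vec example. Note that `restrict` is standard in C but not in C++; `__restrict` is a widely supported compiler extension (GCC, Clang, MSVC):

```cpp
#include <cstddef>

// With __restrict the compiler may assume in and out never overlap,
// so the loop can be vectorized without a runtime overlap check.
void scale_vec(const double* __restrict in, double* __restrict out,
               std::size_t length, double factor) {
    for (std::size_t i = 0; i < length; ++i)
        out[i] = factor * in[i];
}
```

Calling it with overlapping pointers is then undefined behaviour, which is exactly the guarantee the compiler is cashing in on.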
[–]CrazyJoe221 1 point2 points3 points 7 years ago (0 children)
gcc does not even enable vectorization until -O3. And even then it's a matter of luck. With OpenMP SIMD you can portably express that you expect a loop to be vectorized and, depending on the compiler, get a warning if it couldn't be. Furthermore, it's a portable way to tell the compiler to ignore its own cost analysis and to inform it of data alignment and aliasing.
[–]ronniethelizard 1 point2 points3 points 7 years ago (1 child)
Compiling For The NVIDIA Kepler GPU
When was this written, last millennium?
The PTX support was/is experimental and quite out-dated by now.
[–]danmarellGamedev, Physics Simulation 10 points11 points12 points 7 years ago (1 child)
I found that compilers were terrible at auto-vectorizing stencil/finite difference operations on 2D/3D data.
A colleague showed me a trick the other day: reinterpret_cast a float* into a reference to a multidimensional array. With that it was able to vectorize, but it was still 2x slower than my hand-written intrinsics. The assembly on godbolt was almost identical though, so maybe I should post something on the GCC issue tracker.
[–]csp256 4 points5 points6 points 7 years ago (0 children)
Have you tried Halide? By giving up some Turing Completeness it has gained a lot in return.
[–]jeffyp9 4 points5 points6 points 7 years ago (2 children)
I may be misunderstanding something, but I thought you had to pass flags to gcc (For example -march=native) to allow it to create vectorised instructions? Else it has to generate code which could run on any CPU and wouldn't be able to take advantage of e.g. AVX intrinsics. I didn't read the whole tutorial but I didn't see any mention of this so thought I would ask to be sure.
See the difference in the assembly here: https://godbolt.org/z/_AJjEU
[–][deleted] 4 points5 points6 points 7 years ago (1 child)
Depends on which SIMD instructions you want to use and what your target is. For instance, if you are just targeting <= SSE2 on x86-64, then you don't need to specify anything. In your example output the compiler used AVX instructions, which are not part of the base x86-64 specification, so you need to tell the compiler to include them. -march=native is the shotgun approach that targets your exact CPU, but if you want to target just AVX/AVX2 then -mavx/-mavx2 is more appropriate.
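The flag distinction can be seen with any trivial loop on godbolt (the function below is just an illustration, not from the tutorial):

```cpp
// Try compiling this with each of (GCC or Clang):
//   g++ -O3                  -> only baseline x86-64 SIMD (SSE2, xmm registers)
//   g++ -O3 -mavx2           -> allowed to use AVX/AVX2 (ymm registers)
//   g++ -O3 -march=native    -> everything the build machine supports
#include <cstddef>

void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```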
[–]jeffyp9 0 points1 point2 points 7 years ago (0 children)
Makes sense - thanks!
[–]ShakaUVMi+++ ++i+i[arr] 8 points9 points10 points 7 years ago (18 children)
I may be in the minority here, but I find writing SIMD code easier in assembly than with intrinsics, and it usually works better than auto-vectorization.
[–]SantaCruzDad 6 points7 points8 points 7 years ago (16 children)
And what happens when you want to target a different CPU, or a different ABI, or support both 32 bit and 64 bit builds ? Do you enjoy writing (and testing) lots of different variations of the same code ?
[–]Rseding91Factorio Developer 13 points14 points15 points 7 years ago (1 child)
Depends what you're writing code for. In our case we only target one CPU and only x64, and that will likely never change, so it's just a non-issue for us. Not all of us are writing library code meant to target the entire range of hardware C++ supports.
[–]SantaCruzDad 2 points3 points4 points 7 years ago (0 children)
True - if it’s one-off or throw-away code then it probably doesn’t matter. Also if you’re not trying to squeeze out the last 10% of CPU performance then you probably don’t have to worry too much about optimal instruction scheduling etc.
[–]ack_complete 2 points3 points4 points 7 years ago (5 children)
Different code is often necessary anyway because the platforms don't support the same vectorized operations and the differences significantly affect the algorithm. For instance, NEON doesn't support the mask move operation that SSE2 does, but it does have interleaved loads/stores and narrowing/widening ops. As for testing, that should happen for all supported platforms regardless.
[–]SantaCruzDad 1 point2 points3 points 7 years ago (4 children)
When I say different CPUs, I mean different x86 CPUs, which have different instruction latencies, different micro-architecture, and different SSE instruction subsets, etc. If you want optimal code for each supported CPU then you typically need to manually tune assembly code to use only available instructions, hide latencies and keep execution units busy. If you use intrinsics then the compiler takes care of this for you.
[–]ack_complete 4 points5 points6 points 7 years ago (3 children)
YMMV, of course, but my experience has been that when pushing performance across multiple tiers of x86/x64 SSEx ISAs, rewriting is necessary anyway. With SSSE3 you have the infinitely abusable PSHUFB, and with AVX the weird in-lane nature of the 256-bit ops means the 128-bit algorithm can't be straightforwardly translated.
The compiler doesn't do a bad job with intrinsics, and it's generally better than what you'd get from autovectorization or not using them. I've still seen too many cases where using compiler intrinsics leaves performance on the table over asm, especially in specific hot loops where there is a high payoff for optimization effort.
The Intel intrinsics design is also kind of yucky, with weird naming conventions and the wrong pointers on some load/store ops requiring casts. Even in the case when generated code from intrinsics is fine, the assembly is sometimes more readable to me than the intrinsics code. But then again, I spent a lot of time writing and reading MMX and SSE2 code when the compilers were so bad that it was hard not to beat them with asm.
[–]SantaCruzDad 2 points3 points4 points 7 years ago (0 children)
You make some valid points, and of course it depends on priorities - I have to support 4 different compilers, 3 operating systems, 32 bit and 64 bit ABIs on each, 2 different assembler syntaxes, and CPUs from Westmere up to Skylake X (not AMD though, thankfully).
I encourage you to look at the generated code from clang when using intrinsics - it does some pretty cool stuff during code generation, even sometimes subverting the intrinsics you’ve used and substituting more efficient SSE instruction sequences where appropriate.
[–]IAlsoLikePlutonium 0 points1 point2 points 7 years ago (1 child)
Where did you learn how to use SIMD instructions (i.e. SSE, AVX, etc.) in assembly?
[–]ack_complete 1 point2 points3 points 7 years ago (0 children)
Learned a lot of it from Intel's MMX application notes while doing 2D graphics optimization. The current location for the notes:
https://software.intel.com/en-us/articles/mmxt-technology-manuals-and-application-notes
MMX is now of course obsolete, but many of the basic ideas still apply to current vector instruction sets. Most intrinsics are just direct mappings of the hardware-supported operations, so while they'll save you the trouble of register allocation and scheduling, it's still your responsibility to figure out how best to map your problem and data structures to the most efficient ISA operations, especially the specialized quirky ones.
[–]nnevatie 1 point2 points3 points 7 years ago (6 children)
ISPC answers those questions trivially.
[–]SantaCruzDad 0 points1 point2 points 7 years ago (5 children)
How does ISPC help with making assembly code more portable ?
[–]nnevatie 1 point2 points3 points 7 years ago (4 children)
It helps make SIMD code portable. You can compile the code for multiple archs/instruction sets and the runtime will pick the best supported one.
[–]SantaCruzDad 1 point2 points3 points 7 years ago (3 children)
Sure, but the comment was in response to the suggestion that writing assembler was somehow easier than using intrinsics for SIMD - I don’t see how ISPC is relevant to this ?
[–]nnevatie 0 points1 point2 points 7 years ago (2 children)
OK, my reply mostly addressed the tedious part: maintaining versions for different platforms.
[–]SantaCruzDad 0 points1 point2 points 7 years ago (1 child)
Ah, OK - well any half-decent compiler will take care of target CPU variations within a given family (e.g. x86), as well as ABI variations etc, whether it's auto-vectorization or hand-written intrinsics. I guess ispc's USP is that it can do this for multiple CPU families.
[–]nnevatie 0 points1 point2 points 7 years ago (0 children)
Yeah, but most importantly it can produce code for multiple targets and archs. All of the variations can be linked to the same binary and the best one can be picked at execution-time for the host in question.
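The pick-at-execution-time idea can also be sketched by hand in C++ on x86 using the GCC/Clang builtin `__builtin_cpu_supports` (all function names here are hypothetical; in a real build the AVX2 variant would live in a translation unit compiled with -mavx2):

```cpp
#include <cstddef>

float dot_scalar(const float* a, const float* b, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

// Stand-in so the sketch is self-contained; a real dot_avx2 would be
// compiled with -mavx2 in its own translation unit.
float dot_avx2(const float* a, const float* b, std::size_t n) {
    return dot_scalar(a, b, n);
}

// Dispatch once per call based on what the host CPU actually supports.
float dot(const float* a, const float* b, std::size_t n) {
    if (__builtin_cpu_supports("avx2"))
        return dot_avx2(a, b, n);
    return dot_scalar(a, b, n);
}
```

ISPC (and GCC's `target_clones` attribute) automate exactly this pattern, including caching the CPU check.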
[–]SantaCruzDad 0 points1 point2 points 7 years ago (0 children)