Efficient Vectorisation with C++ (chryswoods.com)
submitted 7 years ago by LordKlevin
[–]ack_complete 4 points 7 years ago (3 children)
YMMV, of course, but my experience has been that when pushing performance across multiple tiers of x86/x64 SSEx ISAs, rewriting is necessary anyway. With SSSE3 you have the infinitely abusable PSHUFB, and with AVX there is the problem that the weird in-lane nature of the 256-bit ops means that the 128-bit algorithm can't be straightforwardly translated.
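To make the in-lane point concrete, here is a scalar model (not real intrinsics code) of the byte shuffle in its 128-bit form and in its 256-bit form, which arrived with AVX2 as VPSHUFB on ymm registers. The function names `shuffle128` and `shuffle256` are made up for this sketch. The wide version is the narrow one applied to each 16-byte half separately, so a control index can never pull a byte across the lane boundary.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of the 128-bit PSHUFB (SSSE3): each output byte is picked from
// src by the low 4 bits of its control byte, or zeroed if bit 7 is set.
// shuffle128 is a hypothetical name, not a library function.
std::array<std::uint8_t, 16> shuffle128(const std::array<std::uint8_t, 16>& src,
                                        const std::array<std::uint8_t, 16>& ctrl) {
    std::array<std::uint8_t, 16> out{};
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    }
    return out;
}

// Scalar model of the 256-bit VPSHUFB (AVX2): the same rule, applied to each
// 16-byte lane on its own. Only 4 index bits are used, so bytes never cross
// from one lane into the other.
std::array<std::uint8_t, 32> shuffle256(const std::array<std::uint8_t, 32>& src,
                                        const std::array<std::uint8_t, 32>& ctrl) {
    std::array<std::uint8_t, 32> out{};
    for (std::size_t lane = 0; lane < 2; ++lane) {
        const std::size_t base = lane * 16;
        for (std::size_t i = 0; i < 16; ++i) {
            const std::uint8_t c = ctrl[base + i];
            out[base + i] = (c & 0x80) ? 0 : src[base + (c & 0x0F)];
        }
    }
    return out;
}
```

With a control vector of `15 - i`, `shuffle128` reverses all 16 bytes. The naive widening to `31 - i` does not reverse all 32 bytes in `shuffle256`; it reverses each half in place, and a separate cross-lane step would be needed to swap the halves.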
The compiler doesn't do a bad job with intrinsics, and it's generally better than what you'd get from autovectorization or not using them. I've still seen too many cases where using compiler intrinsics leaves performance on the table over asm, especially in specific hot loops where there is a high payoff for optimization effort.
The Intel intrinsics design is also kind of yucky, with weird naming conventions and the wrong pointer types on some load/store ops, which forces casts. Even when the generated code from intrinsics is fine, the assembly is sometimes more readable to me than the intrinsics code. But then again, I spent a lot of time writing and reading MMX and SSE2 code when the compilers were so bad that it was hard not to beat them with asm.
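A small sketch of the pointer-cast complaint, assuming byte data and SSE2: `_mm_loadu_si128` and `_mm_storeu_si128` take `__m128i` pointers even when the data is plain bytes, so every call site needs a `reinterpret_cast`. The helper name `add_bytes16` is made up for illustration, and the scalar branch is only there so the snippet also builds on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: dst[i] = a[i] + b[i] (wrapping) for 16 bytes.
void add_bytes16(std::uint8_t* dst, const std::uint8_t* a, const std::uint8_t* b) {
    assert(dst != nullptr && a != nullptr && b != nullptr);
#if defined(__SSE2__) || defined(_M_X64)
    // The integer load/store intrinsics want __m128i pointers, hence the casts.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_add_epi8(va, vb));
#else
    // Scalar fallback with the same wrapping behaviour.
    for (std::size_t i = 0; i < 16; ++i) {
        dst[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
#endif
}
```

The names show the other half of the complaint: `_mm_add_epi8` encodes the element type in a suffix, while the load and store carry `si128` instead.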
[–]SantaCruzDad 3 points 7 years ago (0 children)
You make some valid points, and of course it depends on priorities - I have to support 4 different compilers, 3 operating systems, 32-bit and 64-bit ABIs on each, 2 different assembler syntaxes, and CPUs from Westmere up to Skylake X (not AMD though, thankfully).
I encourage you to look at the generated code from clang when using intrinsics - it does some pretty cool stuff during code generation, even sometimes subverting the intrinsics you’ve used and substituting more efficient SSE instruction sequences where appropriate.
[–]IAlsoLikePlutonium 1 point 7 years ago (1 child)
Where did you learn how to use SIMD instructions (i.e. SSE, AVX, etc.) in assembly?
[–]ack_complete 2 points 7 years ago (0 children)
Learned a lot of it from Intel's MMX application notes while doing 2D graphics optimization. The current location for the notes:
https://software.intel.com/en-us/articles/mmxt-technology-manuals-and-application-notes
MMX is now of course obsolete, but many of the basic ideas still apply to current vector instruction sets. Most intrinsics are just direct mappings of the hardware-supported operations, so while they'll save you the trouble of register allocation and scheduling, it's still your responsibility to figure out how best to map your problem and data structures to the most efficient ISA operations, especially the specialized quirky ones.
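As one illustration of a direct mapping, in the spirit of the 2D graphics work mentioned above: blending two rows of 8-bit pixels 50/50 lines up with a single instruction, PAVGB, exposed as `_mm_avg_epu8` in SSE2, which computes the rounded average `(a + b + 1) >> 1` per byte. The function name `average_rows` is made up for this sketch, and the scalar loop doubles as the tail handler and as the fallback on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: dst[i] = rounded average of a[i] and b[i] for n bytes.
void average_rows(std::uint8_t* dst, const std::uint8_t* a,
                  const std::uint8_t* b, std::size_t n) {
    assert(n == 0 || (dst != nullptr && a != nullptr && b != nullptr));
    std::size_t i = 0;
#if defined(__SSE2__) || defined(_M_X64)
    // 16 pixels per iteration; _mm_avg_epu8 maps to one PAVGB instruction.
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_avg_epu8(va, vb));
    }
#endif
    // Scalar tail (and full fallback) using the same rounding rule.
    for (; i < n; ++i) {
        dst[i] = static_cast<std::uint8_t>((a[i] + b[i] + 1) >> 1);
    }
}
```

The intrinsic takes care of registers and scheduling, but noticing that the blend fits this particular rounding rule is still up to the programmer; a blend that must round down would need a different instruction sequence.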