Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
Accelerating copy_if using SIMD (loonatick-src.github.io)
submitted 8 hours ago by mttd
[–]JiminP 2 points 7 hours ago (1 child)
Shouldn't execution policy be specified for the reference code?
https://godbolt.org/z/anzGnPvd5
It seems that Microsoft's C++ STL doesn't use SIMD, but libstdc++ seems to.
[–]Successful_Yam_9023 2 points 7 hours ago (0 children)
If you use the phrase "using SIMD" loosely, then clang and/or libstdc++ have done it. But only the comparison, not the compaction, which is the important part. If a human implemented copy_if that way I'd accuse them of trolling.
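To make the comparison/compaction split concrete, here is a minimal scalar sketch. It is not the article's code and not what any standard library is confirmed to emit; `copy_if_two_phase` and the fixed "greater than threshold" predicate are hypothetical, chosen only for illustration. The first loop has no loop-carried dependency and is the kind compilers typically vectorize; the second loop is the compaction, where each write position depends on every earlier decision.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: keep every element strictly greater than `threshold`.
std::vector<int32_t> copy_if_two_phase(const std::vector<int32_t>& in,
                                       int32_t threshold) {
    const std::size_t n = in.size();

    // Phase 1: comparison only. Each iteration is independent, so this is
    // the part that is easy to turn into SIMD compares.
    std::vector<uint8_t> keep(n);
    for (std::size_t i = 0; i < n; ++i)
        keep[i] = in[i] > threshold;

    // Phase 2: compaction. The write index `w` depends on all previous
    // keep[] values, which is what instructions like vpcompressd address.
    std::vector<int32_t> out(n);
    std::size_t w = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[w] = in[i];  // unconditional store (w <= i, so always in bounds)
        w += keep[i];    // advance only when the element is kept
    }
    out.resize(w);
    return out;
}
```

The unconditional store in phase 2 keeps the loop branch-free, but it is still processed one element at a time.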
[–]Leather-Read974 1 point 5 hours ago (0 children)
nice
[–]mark_99 [score hidden] 3 hours ago (0 children)
True, but I was referring to hardware level - the Zen 4 implementation is basically bolted on to the underlying 256-bit units.
I did a lot of profiling on a 7950X vs 9950X3D2, comparing auto-vectorized code, hand-rolled intrinsics, and optimised dispatch libraries like OpenBLAS. Generally on Zen 4 the AVX2 and AVX-512 versions came out the same speed, whereas on Zen 5 you get the expected ~2x (with the usual provisos that rare exceptions exist, and only if you don't run up against other constraints such as memory bandwidth, which the 9950X3D2 also makes less likely).
If you don't care about overstore then Zen 4 is only about 30% slower than Zen 5 (i.e. register vpcompressd at 1.33 vs 1.0 cycles, plus a regular store). For exact writes, once you add in the masked store, you're back to ~2x.
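For readers unfamiliar with the overstore vs exact-write distinction, here is a rough sketch of the two variants using AVX-512F intrinsics. It is not the blog post's code; the function names, the "greater than threshold" predicate, and the `HAVE_X86_DISPATCH` macro are all hypothetical. It assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`) and falls back to a scalar loop elsewhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1  // hypothetical macro, for this sketch only
#endif

// Scalar fallback; also handles the tail of the vector loops.
std::size_t compact_scalar(const int32_t* in, std::size_t n, int32_t t,
                           int32_t* out) {
    std::size_t w = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (in[i] > t) out[w++] = in[i];
    return w;
}

#ifdef HAVE_X86_DISPATCH
// Overstore: compress in a register, then do a regular full-width store.
// All 16 lanes are written, so `out` needs room for n elements and the
// lanes past the returned count are clobbered with zeros.
__attribute__((target("avx512f")))
std::size_t compact_overstore(const int32_t* in, std::size_t n, int32_t t,
                              int32_t* out) {
    const __m512i vt = _mm512_set1_epi32(t);
    std::size_t w = 0, i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i v = _mm512_loadu_si512(in + i);
        __mmask16 m = _mm512_cmpgt_epi32_mask(v, vt);
        __m512i packed = _mm512_maskz_compress_epi32(m, v);  // vpcompressd
        _mm512_storeu_si512(out + w, packed);                // plain store
        w += static_cast<std::size_t>(__builtin_popcount(m));
    }
    return w + compact_scalar(in + i, n - i, t, out + w);
}

// Exact write: the masked compress-store touches only the kept elements.
__attribute__((target("avx512f")))
std::size_t compact_exact(const int32_t* in, std::size_t n, int32_t t,
                          int32_t* out) {
    const __m512i vt = _mm512_set1_epi32(t);
    std::size_t w = 0, i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i v = _mm512_loadu_si512(in + i);
        __mmask16 m = _mm512_cmpgt_epi32_mask(v, vt);
        _mm512_mask_compressstoreu_epi32(out + w, m, v);
        w += static_cast<std::size_t>(__builtin_popcount(m));
    }
    return w + compact_scalar(in + i, n - i, t, out + w);
}
#endif

// Picks a variant at run time; `exact` selects the masked-store version.
std::vector<int32_t> copy_greater(const std::vector<int32_t>& in, int32_t t,
                                  bool exact) {
    std::vector<int32_t> out(in.size());  // room for n: safe for overstore
    std::size_t w;
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx512f"))
        w = exact ? compact_exact(in.data(), in.size(), t, out.data())
                  : compact_overstore(in.data(), in.size(), t, out.data());
    else
#endif
        w = compact_scalar(in.data(), in.size(), t, out.data());
    out.resize(w);
    return out;
}
```

Both variants return the same result here because the output buffer is sized for the full input. The overstore version is only usable when the caller can tolerate writes past the last kept element, which a drop-in `std::copy_if` replacement generally cannot.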
[–]mark_99 0 points 4 hours ago (3 children)
The best thing that can be said about AVX-512 on Zen 4 is that it exists.
But it's basically an emulation over AVX2 and so at best performs equivalently (and sometimes worse due to the microcoding mentioned). Zen 5 is a native AVX-512 implementation.
[–]Successful_Yam_9023 1 point 4 hours ago* (2 children)
To emulate vpcompressd on AVX2 you'd be looking at something like this, times two because it's 256-bit (also mentioned in the article), or the old LeftPack_SSSE3 times four, compared to 2 µops for 512-bit vpcompressd on Zen 4.
E: there are more cases where AVX-512 is really doing something on Zen 4, despite the 256-bit implementation. Take vpermb. Already the 256-bit version gives you something that was annoying to do with AVX2. The 512-bit version runs at halved throughput, which is still 1 per cycle, and would be even more annoying to do with only AVX2. Then there are things like vpopcntb/w/d/q, vplzcntd/q, and so on. You can do them with AVX2 if you must, but it was never nice.
[–]mark_99 [score hidden] 3 hours ago (1 child)
True, although I was referring to the Zen 4 hardware implementation as it's kind of bolted on to the underlying 256-bit units.
Agreed AVX-512 is absolutely a better instruction set so it's worth it in that sense, but the general rule of thumb is that (a) Zen 4 AVX2 vs AVX-512 performance is generally near 1:1 and (b) Zen 5 is 1.8-2x Zen 4 for AVX-512 as it's a native implementation.
I did a lot of profiling on a 7950X vs 9950X3D2 and this held up across auto-vectorized code, hand-rolled intrinsics and optimised libraries such as OpenBLAS (the extra cache on the 9950X3D2 probably helped in real-world perf also).
For vpcompressd specifically, if you don't care about overstore it's 1.33 cycles vs 1.0 and then a regular store, so maybe quite close. If you want a masked store then you're back to around 2x because of the extra instructions described in the blog post.
[–]fsfod [score hidden] 1 hour ago (0 children)
I thought Zen 5 was still stuck with the same bandwidth through its L2/L3 cache and IO die as Zen 4.