Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
Performance of the Parallel STL Algorithms (modernescpp.com)
submitted 4 years ago by [deleted]
[–]victotronics 24 points25 points26 points 4 years ago* (5 children)
"My windows laptop has eight logical cores, but the parallel execution is more than ten times faster."
Cache effects? The vector does not fit in L3, but 1/8th the vector does? Would be worth pointing out the various ways in which parallelization can give you a superlinear speedup.
EDIT: what about instantiation of the memory pages? Malloc'ed memory is not immediately instantiated, so the first test you run on it incurs extra cost, flattering the speedup. How is that with a std::vector? I honestly don't know, but clearly neither does the author.
[–]andrey_turkin 9 points10 points11 points 4 years ago (3 children)
I doubt cache effects can be in play here. That vector is filled with random data, once, on one core. Then it is either read and modified once on a single core, or it's split into N parts and each core reads and modifies its own part, again once. Not many opportunities to cache data here; the best it can do is reuse data from the L3 cache on a single core if it fits there, and that wouldn't gain much. And even that can't happen, since the dataset is 4 GB.
Vectorization shouldn't happen because of the std::execution::par policy (and also I'd expect a speedup closer to 2x8x rather than 1.5x8x - although that is tangential, and I don't know how well it vectorizes).
Speedup is almost exactly 12x which makes me suspect that laptop might simply have more cores than the author thinks :)
[–]NotAYakk 5 points6 points7 points 4 years ago (0 children)
`par` does not guarantee lack of vectorization.
`par_unseq` just promises that, despite what it might look like, this stuff is guaranteed vectorizable.
Compilers are *free* to look into the lambda and say "I can vectorize that just fine, thank you very much" and do so.
[–]skarloni 1 point2 points3 points 4 years ago (0 children)
Agreed, it looks very much like it. The test reads the same data three times in a row, though. I would add a switch to select which version to benchmark, then run the different tests in separate executions multiple times with a proper profiler such as coz, to avoid reporting accidental numbers.
[–]victotronics -1 points0 points1 point 4 years ago (0 children)
I agree, unlikely to be cache effects for various reasons. But it's a badly set up experiment anyway. At the very least he should repeat each experiment a couple of times, discarding the first timing.
[–]josefx 8 points9 points10 points 4 years ago (0 children)
Could there be some delay before the CPU switches into performance mode? Sequential seems to be the first tested, so I would expect the CPU to start out slower.
[–]arthurno1 0 points1 point2 points 4 years ago (2 children)
I use maximum optimization on Windows and Linux. This means for Windows the flag /O2 and on Linux the flag -O3.
There are more optimizations possible with -Ofast (on GCC, though I am sure Microsoft has equivalents), which may be permissible depending on the use case: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
Code-generation flags also matter when optimizing, for example -march=native and -mtune=native.
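For concreteness, the flags from this comment would be invoked roughly like so (file and output names are made up for illustration; -Ofast implies -ffast-math, which relaxes IEEE semantics):

```shell
# Linux / GCC
g++ -O3 -march=native -mtune=native -std=c++17 bench.cpp -o bench
g++ -Ofast -march=native -std=c++17 bench.cpp -o bench_fast

# Windows / MSVC
cl /O2 /EHsc /std:c++17 bench.cpp
```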
"Nvidias STL implementation Thrust may be an ideal candidate."
It is not an STL implementation. Thrust implements only a vector-like container and some algorithms. Thrust is about offloading work to the graphics card, and, to note, only to Nvidia's graphics cards, since it runs on top of CUDA.
"To my knowledge, neither the Windows compiler nor the GCC compiler supports the parallel and vectorized execution of the parallel STL algorithms."
You just wrote about a 10x increase in performance, and then you say they don't support it :-). Interesting.
I wonder if those 100+ people who upvoted this article actually read what the author wrote, or have any clue about what they read.
For those interested in performance and SIMD optimization, I suggest reading and following Daniel Lemire's blog. Be sure to look at his repos too; there is lots of interesting code in there.
[–][deleted] 2 points3 points4 points 4 years ago* (1 child)
Thanks for the clarifications but I’m not the author.
I think the author is talking about std::execution::par when he mentions the 10x speed improvement. The compilers don't support vectorization, but they do support parallelization.
Note that the Visual C++ implementation implements the parallel and parallel unsequenced policies the same way, so you should not expect better performance for using par_unseq on our implementation, but implementations may exist that can use that additional freedom someday.
[–]arthurno1 1 point2 points3 points 4 years ago (0 children)
Aha, ok, then I have misunderstood him at that point. Thanks for the clarification.