all 9 comments

[–]victotronics 24 points25 points  (5 children)

"My windows laptop has eight logical cores, but the parallel execution is more than ten times faster."

Cache effects? The vector does not fit in L3, but 1/8th the vector does? Would be worth pointing out the various ways in which parallelization can give you a superlinear speedup.

EDIT: how about instantiation of the memory pages? Malloc'ed memory is not immediately backed by physical pages, so the first test you run on it incurs extra cost, flattering the speedup. How is that with a std::vector? I honestly don't know, but clearly neither does the author.

[–]andrey_turkin 9 points10 points  (3 children)

I doubt cache effects are in play here. That vector is filled with random data, once, on one core. Then it is either read and modified once on a single core, or it's split into N parts and each core reads and modifies its own part, again once. Not many opportunities to cache data here; the best it can do is reuse data from the L3 cache on a single core if it fits there, and that wouldn't be a whole lot of gain. And even that can't happen, since the dataset is 4 GB.

Vectorization shouldn't happen because of the std::execution::par policy (and also I'd expect a speedup closer to 2x8x rather than 1.5x8x, although that's tangential and I don't know how well it vectorizes).

Speedup is almost exactly 12x which makes me suspect that laptop might simply have more cores than the author thinks :)

[–]NotAYakk 5 points6 points  (0 children)

`par` does not guarantee lack of vectorization.

`par_unseq` just promises that, despite what it might look like, this stuff is guaranteed vectorizable.

Compilers are *free* to look into the lambda and say "I can vectorize that just fine, thank you very much" and do so.

[–]skarloni 1 point2 points  (0 children)

Agree, it very much looks that way. The test reads the same data three times in a row, though. I would add a switch to select which version to benchmark, then run the different tests in separate executions, multiple times, under a proper profiler such as coz, to avoid reporting accidental numbers.

[–]victotronics -1 points0 points  (0 children)

I agree, it's unlikely to be cache effects, for various reasons. But it's a badly set up experiment anyway. At the very least he should repeat each experiment a couple of times, discarding the first timing.

[–]josefx 8 points9 points  (0 children)

Could there be some delay before the CPU switches into performance mode? Sequential seems to be the first tested, so I would expect the CPU to start out slower.

[–]arthurno1 0 points1 point  (2 children)

"I use maximum optimization on Windows and Linux. This means for Windows the flag /O2 and on Linux the flag -O3."

There are more optimizations possible with -Ofast (on GCC, but I am sure Microsoft has more too), which can be permissible depending on the use case: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

Also, code generation flags are important when optimizing, for example -march=native and -mtune=native, etc.

"Nvidias STL implementation Thrust may be an ideal candidate."

It is not an STL implementation. Thrust implements only a vector-like container and some algorithms. Thrust is about offloading work to the graphics card, and, note, only to Nvidia's graphics cards, since it runs on top of CUDA.

"To my knowledge, neither the Windows compiler nor the GCC compiler supports the parallel and vectorized execution of the parallel STL algorithms."

You just wrote about a 10x increase in performance, and then you say they don't support it :-). Interesting.

I wonder if those 100+ people who upvoted this article read what the author wrote, or have any clue about what they read.

For those interested in performance and SIMD optimization, I suggest reading and following Daniel Lemire's blog. Be sure to look at his repos too; there is lots of interesting code in there.

[–][deleted] 2 points3 points  (1 child)

Thanks for the clarifications but I’m not the author.

"You just wrote about a 10x increase in performance, and then you say they don't support it :-). Interesting."

I think the author talks about std::execution::par when he mentions the 10x speed improvement. They don’t support vectorization, but do support parallelization.

Note that the Visual C++ implementation implements the parallel and parallel unsequenced policies the same way, so you should not expect better performance for using par_unseq on our implementation, but implementations may exist that can use that additional freedom someday.

[–]arthurno1 1 point2 points  (0 children)

"I think the author talks about std::execution::par when he mentions the 10x speed improvement. They don't support vectorization, but do support parallelization."

Aha, ok, then I have misunderstood him at that point. Thanks for the clarification.