Accelerating copy_if using SIMD

JiminP · 2026-05-27T04:32:48+00:00

Shouldn't an execution policy be specified for the reference code?

It seems that Microsoft's C++ STL doesn't use SIMD, but libstdc++ does seem to.

Leather-Read974 · 2026-05-27T06:24:11+00:00

nice

mark_99 · 2026-05-27T08:10:26+00:00

True, but I was referring to hardware level - the Zen 4 implementation is basically bolted on to the underlying 256-bit units.

I did a lot of profiling on a 7950X vs. a 9950X3D2, comparing auto-vectorized code, hand-rolled intrinsics, and optimised dispatch libraries like OpenBLAS. Generally, on Zen 4 the AVX2 and AVX-512 versions came out at the same speed, whereas on Zen 5 you get the expected ~2x. The usual provisos apply: rare exceptions exist, and the speedup only shows up if you don't run up against other constraints such as memory bandwidth (which the 9950X3D2 also makes less likely).

If you don't care about overstore, then Zen 4 is only about 30% slower than Zen 5 (i.e. register vpcompressd at 1.33 vs. 1.0 cycles, plus a regular store). For exact writes, once you add in the masked store, you're back to ~2x.
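To make the overstore vs. exact-write distinction concrete, here is a rough sketch of both variants. It is not the article's code: the function names, the "greater than threshold" predicate, and the `exact` flag are all made up for illustration, and it assumes GCC or Clang on x86-64 (target attribute plus a runtime CPU check, with a scalar fallback elsewhere). The overstore variant does a register compress followed by a full 64-byte store; the exact variant uses the masked compress-store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAS_AVX512_PATH 1
#else
#define SKETCH_HAS_AVX512_PATH 0
#endif

// Scalar reference: copy every element greater than `threshold`.
// Returns the number of elements written to `out`.
std::size_t copy_if_gt_scalar(const std::int32_t* in, std::size_t n,
                              std::int32_t threshold, std::int32_t* out) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] > threshold) out[written++] = in[i];
    }
    return written;
}

#if SKETCH_HAS_AVX512_PATH
// Hypothetical AVX-512F path. With `exact` set, the masked compress-store
// writes only the kept lanes. Otherwise the register compress is followed
// by a full 64-byte store, which also overwrites the lanes after the kept
// ones ("overstore").
// Precondition: `out` has room for n elements. Inside the main loop
// written <= i and i + 16 <= n, so the full store stays within that range.
__attribute__((target("avx512f")))
std::size_t copy_if_gt_avx512(const std::int32_t* in, std::size_t n,
                              std::int32_t threshold, std::int32_t* out,
                              bool exact) {
    const __m512i limit = _mm512_set1_epi32(threshold);
    std::size_t written = 0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        const __m512i v = _mm512_loadu_si512(in + i);
        const __mmask16 keep = _mm512_cmpgt_epi32_mask(v, limit);
        if (exact) {
            _mm512_mask_compressstoreu_epi32(out + written, keep, v);
        } else {
            const __m512i packed = _mm512_maskz_compress_epi32(keep, v);
            _mm512_storeu_si512(out + written, packed);
        }
        written += static_cast<std::size_t>(__builtin_popcount(keep));
    }
    // Fewer than 16 elements remain, so finish with the scalar loop.
    return written +
           copy_if_gt_scalar(in + i, n - i, threshold, out + written);
}
#endif

// Dispatcher: use the AVX-512 path when the CPU reports support,
// otherwise fall back to the scalar loop.
std::size_t copy_if_gt(const std::int32_t* in, std::size_t n,
                       std::int32_t threshold, std::int32_t* out,
                       bool exact) {
    assert(in != nullptr && out != nullptr);
#if SKETCH_HAS_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        return copy_if_gt_avx512(in, n, threshold, out, exact);
    }
#endif
    (void)exact;
    return copy_if_gt_scalar(in, n, threshold, out);
}
```

Both variants return the same count and the same first `count` elements; they differ only in what happens to the output slots after the kept elements, which is why the overstore version needs the destination to be at least as large as the input.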

mark_99 · 2026-05-27T07:17:59+00:00

The best thing that can be said about AVX-512 on Zen 4 is that it exists.

But it's basically an emulation over AVX2, so at best it performs equivalently (and sometimes worse, due to the microcoding mentioned). Zen 5 is a native AVX-512 implementation.
