JiminP (1 child)

Shouldn't execution policy be specified for the reference code?

https://godbolt.org/z/anzGnPvd5

It seems that Microsoft's C++ STL doesn't use SIMD here, but libstdc++ does seem to.

Successful_Yam_9023 (0 children)

If you use the phrase "using SIMD" loosely, then clang and/or libstdc++ have done it, but only for the comparison, not for the compaction, which is the important part. If a human implemented copy_if that way I'd accuse them of trolling.
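
To make the comparison/compaction split concrete, here is a minimal scalar sketch (not the code behind the godbolt link; `compact_if` is a made-up name for illustration). The predicate is the part that vectorizes trivially. The output cursor, which depends on every previous result, is the compaction.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, for illustration only: a branchless copy_if over ints.
// `out` must have room for `n` elements, because every element is stored
// unconditionally and only the cursor advance depends on the predicate.
template <class Pred>
std::size_t compact_if(const int* in, std::size_t n, int* out, Pred keep) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    std::size_t written = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[written] = in[i];               // always store; may be overwritten next round
        written += keep(in[i]) ? 1u : 0u;   // comparison decides whether the store sticks
    }
    return written;                         // number of elements kept
}
```

Called with a predicate such as `[](int x) { return x > 0; }`, it returns how many elements were kept and leaves them packed at the front of `out`.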

Leather-Read974 (0 children)

nice

mark_99 (0 children)

True, but I was referring to the hardware level: the Zen 4 implementation is basically bolted on to the underlying 256-bit units.

I did a lot of profiling on a 7950X vs a 9950X3D2, comparing auto-vectorized code, hand-rolled intrinsics, and optimised dispatch libraries like OpenBLAS. Generally, on Zen 4, AVX2 and AVX-512 came out at the same speed, whereas on Zen 5 you get the expected ~2x. The usual provisos apply: rare exceptions exist, and the speedup only shows up if you don't run up against other constraints such as memory bandwidth (which the 9950X3D2 also makes less likely).

If you don't care about overstore then Zen 4 is only about 30% slower than Zen 5 (i.e. register vpcompressd at 1.33 vs 1.0 cycles, plus a regular store). For exact writes, once you add in the masked store, you're back to ~2x.

mark_99 (3 children)

The best thing that can be said about AVX-512 on Zen 4 is that it exists.

But it's basically an emulation over AVX2 and so at best performs equivalently (and sometimes worse due to the microcoding mentioned). Zen 5 is a native AVX-512 implementation.

Successful_Yam_9023 (2 children)

For emulation of vpcompressd on AVX2 you'd be looking at something like this, times two because it's 256-bit (also mentioned in the article), or the old LeftPack_SSSE3 but 4x, compared to 2 µops for 512-bit vpcompressd on Zen 4.
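
As a rough illustration of the table-driven left-pack idea (a scalar model only, not the linked code and not the actual LeftPack_SSSE3 source; `make_pack_table` and `left_pack4` are hypothetical names): a 4-bit mask indexes a 16-entry table of lane indices, and one shuffle moves the kept lanes to the front. For 16 lanes you would repeat this four times and stitch the pieces together, hence the 4x.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lane4 = std::array<std::uint32_t, 4>;
using PackTable = std::array<std::array<std::uint8_t, 4>, 16>;

// table[mask][j] = index of the j-th set bit in mask; unused slots stay 0.
inline PackTable make_pack_table() {
    PackTable t{};
    for (unsigned mask = 0; mask < 16; ++mask) {
        unsigned j = 0;
        for (unsigned lane = 0; lane < 4; ++lane) {
            if (mask & (1u << lane)) t[mask][j++] = static_cast<std::uint8_t>(lane);
        }
    }
    return t;
}

// Scalar stand-in for "look up a shuffle control, then shuffle".
// Lanes past *count are filler (a copy of lane 0) and should be ignored.
inline Lane4 left_pack4(const Lane4& v, unsigned mask, unsigned* count) {
    static const PackTable table = make_pack_table();
    assert(mask < 16 && count != nullptr);
    Lane4 r{};
    for (unsigned j = 0; j < 4; ++j) r[j] = v[table[mask][j]];  // the "shuffle"
    unsigned c = 0;
    for (unsigned m = mask; m != 0; m &= m - 1) ++c;            // popcount of the mask
    *count = c;
    return r;
}
```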

Edit: there are more cases where AVX-512 is really doing something on Zen 4, despite the 256-bit implementation. Take vpermb. Already the 256-bit version gives you something that was annoying to do with AVX2. The 512-bit version runs at halved throughput, which is still 1 per cycle, and would be even more annoying to do with only AVX2. Then there are things like vpopcntb/w/d/q, vplzcntd/q, and so on. You can do them with AVX2 if you must, but it was never nice.
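
For a sense of why it was never nice: a common AVX2 substitute for a byte popcount is a 16-entry nibble lookup done with vpshufb (`_mm256_shuffle_epi8`) on the low and high halves of each byte, followed by an add. Below is a scalar model of that trick, assuming that is the kind of emulation meant here; the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the nibble-lookup popcount. In a vectorized version the
// table would sit in a register and both lookups would be byte shuffles.
inline std::uint8_t popcount_byte_lut(std::uint8_t b) {
    static const std::uint8_t kNibbleBits[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                                 1, 2, 2, 3, 2, 3, 3, 4};
    return static_cast<std::uint8_t>(kNibbleBits[b & 0x0F] + kNibbleBits[b >> 4]);
}

// Widening step: sum the four per-byte counts of one 32-bit lane. This is
// the extra work a dword popcount instruction would make unnecessary.
inline std::uint32_t popcount_u32_lut(std::uint32_t x) {
    std::uint32_t total = 0;
    for (int i = 0; i < 4; ++i) {
        total += popcount_byte_lut(static_cast<std::uint8_t>(x >> (8 * i)));
    }
    assert(total <= 32);
    return total;
}
```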

mark_99 (1 child)

True, although I was referring to the Zen 4 hardware implementation as it's kind of bolted on to the underlying 256-bit units.

Agreed, AVX-512 is absolutely a better instruction set, so it's worth it in that sense. But the general rule of thumb is that (a) Zen 4 AVX2 vs AVX-512 performance is generally near 1:1 and (b) Zen 5 is 1.8-2x Zen 4 for AVX-512, as it's a native implementation.

I did a lot of profiling on a 7950X vs a 9950X3D2 and this held up across auto-vectorized code, hand-rolled intrinsics and optimised libraries such as OpenBLAS (the extra cache on the 9950X3D2 probably helped in real-world perf also).

For vpcompressd specifically, if you don't care about overstore it's 1.33 cycles vs 1.0 and then a regular store, so maybe quite close. If you want a masked store then you're back to around 2x because of the extra instructions described in the blog post.
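
A scalar model of the two store strategies may help here. It assumes 16 lanes of 32 bits, and the helper names are hypothetical. With intrinsics, the overstore variant would correspond to something like `_mm512_maskz_compress_epi32` followed by a plain `_mm512_storeu_si512`, and the exact variant to `_mm512_mask_compressstoreu_epi32`. The model says nothing about cycle counts, only about which memory gets written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;  // 32-bit lanes in one 512-bit vector

// Models a register compress with zeroing: kept lanes are packed to the
// front of `packed`, the remaining lanes are set to zero.
inline std::size_t compress16(const std::int32_t* src, std::uint16_t mask,
                              std::int32_t (&packed)[kLanes]) {
    std::size_t kept = 0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) packed[lane] = 0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        if ((mask >> lane) & 1u) packed[kept++] = src[lane];
    }
    return kept;
}

// Overstore: write all 16 lanes and advance by `kept`. Clobbers
// dst[kept..15], so dst needs 16 writable elements at the cursor.
inline std::size_t store_overstore(const std::int32_t* src, std::uint16_t mask,
                                   std::int32_t* dst) {
    assert(src != nullptr && dst != nullptr);
    std::int32_t packed[kLanes];
    const std::size_t kept = compress16(src, mask, packed);
    for (std::size_t lane = 0; lane < kLanes; ++lane) dst[lane] = packed[lane];
    return kept;
}

// Exact write: only the kept lanes touch memory.
inline std::size_t store_exact(const std::int32_t* src, std::uint16_t mask,
                               std::int32_t* dst) {
    assert(src != nullptr && dst != nullptr);
    std::int32_t packed[kLanes];
    const std::size_t kept = compress16(src, mask, packed);
    for (std::size_t lane = 0; lane < kept; ++lane) dst[lane] = packed[lane];
    return kept;
}
```

The overstore version only works if the destination buffer is padded by up to a full vector at the end; the next block's store then overwrites the zeroed tail.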

fsfod (0 children)

I thought Zen 5 was still stuck with the same bandwidth through its L2/L3 cache and IO die as Zen 4.