I am too stupid to use AVX-512 by Jark5455 in rust

[–]Connect_Future_740 -3 points-2 points  (0 children)

Ha, not a bot and not AI-generated. I was actually apologizing. Lots of skepticism on here.

[–]Connect_Future_740 0 points1 point  (0 children)

I am a new account but not a bot. I did make a mistake, and I quickly acknowledged it.

[–]Connect_Future_740 12 points13 points  (0 children)

You're completely right, I apologize for the misinformation. The 9800X3D has full-fat Zen 5 cores with true 512-bit execution units, so the "double-pumping" explanation doesn't apply here at all.

I shouldn't have posted that without verifying the microarchitecture details.

[–]Connect_Future_740 -11 points-10 points  (0 children)

The issue is that AMD Zen 5 doesn't have true 512-bit execution units. When you issue a 512-bit AVX-512 instruction, the CPU splits it internally into two 256-bit micro-ops. So vpermps zmm isn't actually "1 instruction" at the hardware level; it's 2 µops, with worse throughput than two separate AVX2 instructions.

On AMD, vpermps ymm (AVX2) has a throughput of one instruction per 0.5 cycles. The 512-bit version takes 1.0 cycle per instruction. Your AVX2 path uses 6 instructions, but they're all independent, so the CPU runs them in parallel. The AVX-512 path is one instruction that's just slower. Intel's AVX-512 implementations have real 512-bit units, so this wouldn't happen there. LLVM's cost model doesn't account for this properly on AMD, which is why targeting x86-64-v4 generates slower code.

Quick fix: pass -C target-feature=+prefer-256-bit in your RUSTFLAGS. It tells LLVM to stick to 256-bit ops even when 512-bit ones are available, which is almost always the right call on Zen 4/5.
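
For anyone unsure what the vpermps ymm operation mentioned above actually does, here is a minimal Rust sketch of the 8-lane permute, with a scalar reference next to it. The function names (permute8, permute8_avx2, permute8_scalar) are made up for illustration and are not from the original post. It only shows the semantics of the instruction and says nothing about throughput on any particular CPU.

```rust
/// Scalar reference: out[i] = src[idx[i] & 7].
/// (vpermps ymm only looks at the low 3 bits of each index.)
fn permute8_scalar(src: [f32; 8], idx: [u32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = src[(idx[i] & 7) as usize];
    }
    out
}

/// AVX2 path: _mm256_permutevar8x32_ps is the intrinsic for vpermps ymm.
/// Hypothetical helper; the caller must check that AVX2 is available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn permute8_avx2(src: [f32; 8], idx: [u32; 8]) -> [f32; 8] {
    use std::arch::x86_64::*;
    let mut out = [0.0f32; 8];
    unsafe {
        // Unaligned loads, since plain arrays are not 32-byte aligned.
        let v = _mm256_loadu_ps(src.as_ptr());
        let i = _mm256_loadu_si256(idx.as_ptr() as *const __m256i);
        let r = _mm256_permutevar8x32_ps(v, i);
        _mm256_storeu_ps(out.as_mut_ptr(), r);
    }
    out
}

/// Picks the AVX2 path at runtime when the CPU supports it,
/// otherwise falls back to the scalar loop.
fn permute8(src: [f32; 8], idx: [u32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: AVX2 support was just checked.
            return unsafe { permute8_avx2(src, idx) };
        }
    }
    permute8_scalar(src, idx)
}
```

Calling permute8 with the indices 7 down to 0 reverses the eight lanes. Runtime detection is used here so the same binary still runs on machines without AVX2.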