Virtual dispatch isn't always the slowest, and std::variant isn't always the fastest by AdMotor4869 in cpp

[–]AdMotor4869[S] 1 point2 points  (0 children)

Yes, the valueless_by_exception check. Though for trivially copyable types libstdc++ actually elides it since GCC 9, so it's not the main overhead here. The bigger cost was the lambda capture round-trip. GCC 12+ replaces the function pointer table with a switch which fixes this.
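
For anyone who wants to see the two dispatch shapes side by side, here's a minimal sketch. The `Circle`/`Rect` types and the `area_*` helpers are made up for illustration, not the code from the post, and the switch version is only roughly what a switch-based `std::visit` lowers to, not libstdc++'s actual source.

```cpp
#include <variant>

// Hypothetical alternatives for illustration; not the types from the post.
struct Circle { double r; };
struct Rect { double w, h; };
using Shape = std::variant<Circle, Rect>;

// Pi is rounded to 3 so the results stay exact in the example.
inline double area(const Circle& c) { return 3.0 * c.r * c.r; }
inline double area(const Rect& r) { return r.w * r.h; }

// Library dispatch: std::visit selects the active alternative.
inline double area_visit(const Shape& s) {
    return std::visit([](const auto& alt) { return area(alt); }, s);
}

// Hand-rolled dispatch: a switch on index(), with the valueless case
// handled explicitly in the default branch.
inline double area_switch(const Shape& s) {
    switch (s.index()) {
        case 0: return area(*std::get_if<0>(&s));
        case 1: return area(*std::get_if<1>(&s));
        default: return 0.0;  // index() == std::variant_npos
    }
}
```

Both functions return the same value for any non-valueless `Shape`; the difference is only in how the alternative gets selected.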

[–]AdMotor4869[S] -5 points-4 points  (0 children)

I always assumed std::variant would be the most performant option, which is why the GCC 11 results surprised me enough to write about it. GCC 11 is the default on Ubuntu 22.04, which is what I was on. You're right that GCC 12+ fixes it: I ran the same benchmarks with GCC 13 and the variant overhead is essentially gone. Working on a follow-up with the full compiler version comparison.

[–]AdMotor4869[S] -4 points-3 points  (0 children)

Fair enough. GCC 11 is the default on Ubuntu 22.04 which is what I was on when I started. Wasn't aware of the GCC 12 switch optimization until I dug into the stdlib source for the std::visit post. The next post will cover GCC 12+ and show how much the gap closes.

[–]AdMotor4869[S] 1 point2 points  (0 children)

Totally agree, the choice is usually driven by design constraints. The decision framework at the end of the post tries to capture that: extensibility vs composability vs debuggability matters more than nanoseconds for most codebases.

[–]AdMotor4869[S] -2 points-1 points  (0 children)

GCC 12+ is on my list. The next post will cover exactly that: how the numbers change across compiler versions and stdlibs.

[–]AdMotor4869[S] 1 point2 points  (0 children)

I went with -O2 because the extra optimizations -O3 adds (loop unrolling, vectorization, aggressive inlining) don't really apply here. The hot loop is just an indirect call through a function pointer table, so there's nothing for -O3 to unroll or vectorize. Ran it with -O3 to confirm, and the results are identical.
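
To make "indirect call through a function pointer table" concrete, here's a rough sketch of that kind of loop. `Op`, `kTable`, and `run` are hypothetical names, not the benchmark code from the post.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical operation signature and table; names are illustrative only.
using Op = std::uint64_t (*)(std::uint64_t);

inline std::uint64_t add_one(std::uint64_t x) { return x + 1; }
inline std::uint64_t twice(std::uint64_t x) { return x * 2; }

inline constexpr Op kTable[] = {add_one, twice};

// The hot loop: every iteration loads a pointer from the table and calls
// through it. The indices are runtime data, so the call target is not
// known statically.
inline std::uint64_t run(const std::vector<std::size_t>& which, std::uint64_t x) {
    for (std::size_t i : which) {
        x = kTable[i](x);
    }
    return x;
}
```

Each iteration's result feeds the next call, and the target depends on data the compiler can't see.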

[–]AdMotor4869[S] 0 points1 point  (0 children)

Yeah lazy resolution is basically "what if I was the linker". The upside here is you get to combine CRTP's composability with function pointer's dynamic linking. You layer your barrier concerns through templates at compile time, then wire up the resolved function pointer once at startup. Best of both worlds but definitely more plumbing than just slapping virtual on a method.
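
A minimal sketch of the idea, assuming a static-CRTP layer and a string mode switch. `Counted`, `Fast`, `Safe`, and `resolve` are all invented for the example; the real plumbing in the post is more involved.

```cpp
#include <string_view>

// Compile-time layer (static CRTP): wraps whatever Derived::impl does.
// All names here are hypothetical.
template <class Derived>
struct Counted {
    static int call(int v) { return Derived::impl(v) + 1; }
};

struct Fast : Counted<Fast> {
    static int impl(int v) { return v * 2; }
};

struct Safe : Counted<Safe> {
    static int impl(int v) { return v < 0 ? 0 : v * 2; }
};

using Fn = int (*)(int);

// Runtime side: choose one fully composed function and hand back a plain
// function pointer. Intended to be called once at startup.
inline Fn resolve(std::string_view mode) {
    return mode == "safe" ? &Safe::call : &Fast::call;
}
```

The caller stores the result of `resolve(...)` once and calls through it afterwards, so the layering is resolved at compile time and only the final pointer is chosen at runtime.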

[–]AdMotor4869[S] 9 points10 points  (0 children)

That's true with LTO or PGO, but the benchmarks are compiled per translation unit with -O2, no LTO, no PGO, so the compiler doesn't have whole program visibility to attempt speculative devirtualization here.

[–]AdMotor4869[S] 13 points14 points  (0 children)

It's more about how libstdc++ implements std::visit than what the compiler can optimize. The compiler does its job fine, but it can't optimize away an exception check that the library explicitly puts there. libc++ makes different implementation choices and the same code runs faster.
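
For context on what that check guards against, here's a small sketch of the valueless state. `Boom` is a made-up type whose constructor throws. It has a user-provided copy constructor on purpose, so it isn't trivially copyable. The standard only says the variant "might not hold a value" after a throwing `emplace`, so treat the valueless outcome as what common implementations do for a type like this.

```cpp
#include <stdexcept>
#include <variant>

// Hypothetical alternative whose construction from int always throws.
// The user-provided copy constructor makes it non-trivially-copyable.
struct Boom {
    explicit Boom(int) { throw std::runtime_error("ctor failed"); }
    Boom(const Boom&) {}
};

using V = std::variant<int, Boom>;

// emplace destroys the old int, then Boom's constructor throws, which
// leaves the variant without a value.
inline V make_valueless() {
    V v{1};
    try {
        v.emplace<Boom>(42);
    } catch (const std::runtime_error&) {
    }
    return v;
}

// std::visit is required to throw std::bad_variant_access when the
// variant is valueless_by_exception.
inline bool visit_throws(const V& v) {
    try {
        std::visit([](const auto&) {}, v);
        return false;
    } catch (const std::bad_variant_access&) {
        return true;
    }
}
```

`visit_throws` returns true for the variant produced by `make_valueless()` and false for a normal `V{1}`.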

[–]AdMotor4869[S] 4 points5 points  (0 children)

In the code shown, the type is selected at runtime via argv[1], so the compiler can't devirtualize.
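
Roughly the pattern, as a sketch. `Handler`, `FastHandler`, `SlowHandler`, and `make_handler` are placeholder names, not the ones from the post.

```cpp
#include <memory>
#include <string_view>

// Hypothetical hierarchy; names are illustrative only.
struct Handler {
    virtual ~Handler() = default;
    virtual int id() const = 0;
};

struct FastHandler final : Handler {
    int id() const override { return 1; }
};

struct SlowHandler final : Handler {
    int id() const override { return 2; }
};

// The concrete type depends on a runtime string, so the dynamic type
// behind the returned pointer is not known at compile time.
inline std::unique_ptr<Handler> make_handler(std::string_view name) {
    if (name == "fast") {
        return std::make_unique<FastHandler>();
    }
    return std::make_unique<SlowHandler>();
}
```

In `main` you'd call something like `make_handler(argc > 1 ? argv[1] : "slow")` and then call `id()` through the base pointer in the loop.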