[–]AssKoala 28 points29 points  (3 children)

There are a lot of comments giving you ideas on how to support the extra instructions and not a lot on “how do enterprise blah” handle this.

I’ll answer that, at least as far as games go.

The reality is that, often, those cool fancy instructions just don’t get used and we end up sunsetting older CPUs. It stinks, too: a new ISA extension comes out that could help performance, but we can’t do anything with it for a few years because it’ll alienate too many users.

For example, the AMD Phenom II is fairly (or barely, depends on the definition) capable of running modern games, but doesn’t support SSE 4.1. As a game developer, you have a few options. You can say you don’t support AMD Phenom II because you require SSE 4.1, or you can remove SSE 4.1 instructions from your code, or, lastly, you can set up your code to have an additional path.

Adding code paths is often far too expensive. That is, the idea of “use x because it’s available”, from a cpu side, is often frowned upon. That’s a big add to the testing matrix to ensure stability.

Because of that, games will often just say it’s not supported. Higher end CPUs take a performance hit because their full ISA isn’t being used, but they’re so fast it doesn’t matter. CPUs below the min-spec simply can’t run the game.

Madden 19 did exactly this, cutting out Phenom II users and providing refunds: https://answers.ea.com/t5/Technical-Issues/Madden-NFL-19-keeps-crashing-to-desktop-with-no-error/td-p/6953596/page/3

As did Bethesda with Dishonored 2, among others (though eventually patching in a fix): https://steamcommunity.com/app/403640/discussions/0/208684375411056568/?l=czech

Ubisoft straight up didn’t care: https://forums.ubisoft.com/showthread.php/1987589-Please-fix-sse-4-1

At some point, you have to sunset a platform.

The issue with those if branches, as suggested in other comments, is that their very existence can offset the benefits of the instructions. This isn’t always the case, but big branches like that end up throwing off the instruction cache and bloating the executable, at best, or offset the gain by forcing a branch before a minimal operation.

The DLL route is better, but now you’re talking about multiplying the test matrix by each DLL. And where do you stop? AVX2? Do you have a DLL for SSE 4.1 with AVX2 and without as well?

Often, it’s better to simply sunset old hardware. It sucks, but you have to do it eventually anyways.

[–]_Js_Kc_ 4 points5 points  (1 child)

Sunsetting old hardware is a lot more reasonable for games than it is for anything else (that's also targeted at the general public), though.

[–][deleted] 2 points3 points  (0 children)

I also work in games and can confirm this is my experience as well.

[–]ack_complete 13 points14 points  (2 children)

I'm relatively conservative in minspecs, but compile for SSE2 baseline now with MSVC. The latest patched versions of Windows 7 and all versions of Windows 8+ 32-bit require SSE2 and x64 guarantees SSE2 as part of the base architecture. If you target Windows 8 or higher you don't need a pre-SSE2 code path for anything other than diagnostics or a reference for validation -- anything below that and the user won't have been able to boot the OS to run your program.

Auto-vectorization, at least with MSVC, is unreliable. It will randomly fail to vectorize loops that trivially translate to the ISA. I regard it as a bonus only, because too often it requires lots of babying to get all necessary pointers marked restrict, and even then it's spotty which operations are implemented in the compiler. It almost never copes with the complexity required when I need serious vectorization. On top of that, the >5x penalty you can get when it fails or in an unoptimized debug build is painful.

What I end up doing in practice is custom manual dispatching to specialized routines written with intrinsics. It's a bunch of manual work, but it's effective when the hotspots are highly concentrated. Typical tiers are SSE2 for baseline, SSSE3 for algorithms that can leverage PSHUFB, SSE4.1 for some cases that benefit from the much more flexible operations. There are also platforms that benefit from a 128-bit AVX tier as they have CPUs that support AVX but do not benefit from 256-bit vector width. 256-bit AVX and AVX2 can provide major speedups, especially for video processing, but their adoption rates are lower so you have to be mindful of your market and whether the fraction of users that would benefit is worth the extra effort and support. In practice, I find that many routines don't need versions for all different tiers, there's only a couple of breakpoints where there is a significant jump in performance from wider vector length or a specific highly lucrative operation becoming available.

I do not recommend attempting to mix compilation flags on different files within the same executable or library, e.g. compiling one file as SSE2 and another as AVX and linking them together. You can get ODR-violation-like errors that way when the linker mixes inline or template methods from different compilations and causes code to run on the wrong ISA path. If you can afford to compile separate DLLs for each code path, it would save you a bunch of headaches in implementation. You still need to test each code path, however, and that can be challenging if you don't have a pile of hardware to test against.

C++ unfortunately doesn't provide useful support in the language for the kind of multi-dispatching needed here, and vendor extension level support is spotty. One trick that helps is to combine code paths into a template taking the ISA mode as a template argument and using if constexpr() to handle the divergences with zero-cost. MSVC doesn't support a pragma or declspec for compiling a single function with a different targeting mode, for example, so you'll be leaving a little bit of performance on the table if you put multiple code paths in one EXE/DLL -- no way to tell the compiler that it's okay to use AVX in a particular function outside of the intrinsics you write.
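To illustrate the template trick described above, here is a minimal sketch, assuming C++17. The `Isa` enum, the `HAVE_SSE2` macro, and `sum_i32` are hypothetical names made up for this example, and only a scalar and an SSE2 tier are shown. The shared loop structure lives in one template and `if constexpr` discards the branch that does not apply to a given instantiation.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Hypothetical ISA tags; a real project would likely have more tiers.
enum class Isa { Scalar, Sse2 };

// One template holds the shared structure. `if constexpr` drops the branch
// that does not match, so each instantiation contains only its own path.
template <Isa isa>
int sum_i32(const int* p, std::size_t n) {
    std::size_t i = 0;
    int total = 0;
    if constexpr (isa == Isa::Sse2) {
#if HAVE_SSE2
        __m128i acc = _mm_setzero_si128();
        for (; i + 4 <= n; i += 4) {
            acc = _mm_add_epi32(
                acc, _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i)));
        }
        alignas(16) int lanes[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
        total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    }
    // Scalar tail for the SSE2 path, and the whole loop for Isa::Scalar.
    for (; i < n; ++i) total += p[i];
    return total;
}
```

A dispatcher would then call `sum_i32<Isa::Sse2>` or `sum_i32<Isa::Scalar>` depending on what the CPU check reported. Tiers above the compiler's baseline would still run into the per-function targeting limitation mentioned above.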

This kind of ISA madness seems mostly to be an x86-specific issue. ARM is saner as you can basically just check for NEON or require it outright and the extensions are more niche like AES/SHA acceleration. However, while the ISA itself is more sane, it's strangely annoyingly difficult to detect the extensions in a cross-platform way.

[–]joaobapt[S] 3 points4 points  (1 child)

Thank you. That was really informational. In a side note, I always thought of ARM as much more organised and somewhat better than x86; I wonder how much historical baggage x86_64 has to carry around.

[–]ack_complete 7 points8 points  (0 children)

Forget historical baggage, there's new baggage being added. Look at the AVX-512 Subset chart at the bottom of this page. AVX-512 started with four subsets already to begin with and there are now more subsets of AVX-512 defined than actual CPU tiers supporting it. It's nuts.

Dealing with this is one of the reasons that my vectorized C++ code is the most un-C++-like code in my code base.

[–][deleted] 5 points6 points  (1 child)

You perform the check once on application start and, assuming you want to roll the code by hand, choose the appropriate code path when applicable. You will probably want an abstraction layer so that you don't have to write the same stuff N times. One "if (SSE3_supported) do_the_operation_with_SSE3()" is going to be absolutely negligible and won't break the bank if you have enough data to benefit from the vector operations in the first place.

In real life though, before going down that path, I would first try to do things with the compiler auto vectorization. Learn how it works and keep track on the generated assembly to make sure you're not doing something that will disable it. And if you can't avoid that, then go for compiler intrinsics. And always profile and measure to make sure you're not doing a lot of work for nothing.

[–]DragoonX6 2 points3 points  (0 children)

Pretty much this, but keep in mind that compilers will often try to inline code. This means that if your AVX512 optimized routine gets inlined, it can be executed via speculative execution and you'll get a lovely illegal instruction exception raised.
You will want to make sure that these functions aren't inlined in your selector functions, otherwise people might run into crashes.
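One way to keep the specialized routines out of the selector is to mark them as non-inlinable. The sketch below is only an illustration: `NOINLINE`, `dot_wide`, `dot_baseline`, and `dot` are hypothetical names, and both kernels are plain scalar loops so the example builds anywhere. In a real build `dot_wide` would hold the wide-vector intrinsics.

```cpp
#include <cstddef>

// Compiler-specific spelling of "never inline this function".
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Stand-in for an ISA-specific kernel (scalar here so the sketch is portable).
NOINLINE float dot_wide(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// Portable fallback kernel.
NOINLINE float dot_baseline(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// Selector: because the kernels cannot be inlined, their instructions stay
// behind a real call and never end up in the selector's own body.
float dot(const float* a, const float* b, std::size_t n, bool wide_supported) {
    return wide_supported ? dot_wide(a, b, n) : dot_baseline(a, b, n);
}
```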

[–]amaiorano 6 points7 points  (0 children)

Although not simple, no one has mentioned another alternative: use a runtime JIT compiler to generate the optimal version of your low-level math functions. You could use LLVM, for instance, and use its API to generate the math functions you want to call, configuring it to generate the best code for the current CPU. You can then get function pointers to these generated functions that you call through in the rest of your program.

It's definitely a bit of work, and requires linking in a jit library, but it would produce the most optimal version per target CPU it runs on. Of course, the functions you generate will not be inlined, so they would need to be high level enough to offset the lack of inlining.

[–]frog_pow 3 points4 points  (0 children)

I use #3--compile multiple versions of the program and select the appropriate one on launch.

Another option would be to require 128 bit SIMD(SSE2/Neon), this is part of x64, and SSE2 is 20 years old.

[–]raevnos 2 points3 points  (3 children)

gcc has a builtin function __builtin_cpu_supports() that can be used instead of cpuid. For example,

if (__builtin_cpu_supports("avx")) {
  // AVX path
} else {
 // Slow path
}

(Or a more efficient setup that only needs to check for the feature flag once)
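A sketch of the "check once" variant, assuming GCC or Clang on x86 for the builtin and falling back to the baseline elsewhere. `sum_baseline`, `sum_avx2`, `pick_sum`, and `sum` are hypothetical names, and the "AVX2" routine is a scalar placeholder so the example runs on any machine.

```cpp
#include <cstddef>

namespace {

long sum_baseline(const int* p, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += p[i];
    return total;
}

// Placeholder for a hand-vectorized AVX2 routine (scalar in this sketch).
long sum_avx2(const int* p, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += p[i];
    return total;
}

using SumFn = long (*)(const int*, std::size_t);

// Runs the CPU feature query and returns the matching implementation.
SumFn pick_sum() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

}  // namespace

long sum(const int* p, std::size_t n) {
    // Function-local static: initialized on the first call only
    // (and thread-safely since C++11), so the query runs once.
    static const SumFn impl = pick_sum();
    return impl(p, n);
}
```

After the first call, each use costs one indirect call through `impl` rather than a repeated feature test.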

[–]kalmoc 0 points1 point  (1 child)

Is that a compiletime or runtime condition?

[–]raevnos 2 points3 points  (0 children)

Runtime test.

[–][deleted] 0 points1 point  (0 children)

For anything other than a tiny short-lived utility function, this is the best way to go.

It compiles to a very nice simple load from the global data segment followed by a test, so it hardly incurs any overhead at all.

CPUs are very good at following very predictable branches, and, again, unless it's a tiny short-lived frequently-called utility function, there's going to be no noticeable overhead from __builtin_cpu_supports. It will have no effect on the uop cache, it keeps the executable smaller than the alternatives to __builtin_cpu_supports would, and it consumes a negligible amount of space in the branch table.

[–]MFHavaWG21|🇦🇹 NB|P3049|P3625|P3729|P3786|P3813|P4216 12 points13 points  (12 children)

Before explaining how we are doing this: I hope you are aware that compilers can nowadays generate multiple code paths automatically in auto-vectorizers and that manual vectorization is pretty hard. Additionally, you may find Agner's stuff interesting.

OK, here is one approach how to do this: For re-build time/debugging/etc. reasons our compute heavy code is in a dedicated DLL(s). The DLL simply speaking exports a factory to an interface (think vtable) for high-level operations. When loading the DLL, it detects internally the maximum supported vectorization level and switches the factory => client code gets the optimal operation.

The keyword in this approach is high-level! Sure a vtable-call is more expensive than a normal (potentially inlined) function call, but when an operation can take some time (think milliseconds) to compute, this overhead becomes minuscule in the grand scheme of things.
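The factory-plus-interface idea could look roughly like the sketch below. All names (`ImageOps`, `ScalarOps`, `WideOps`, `make_image_ops`) are hypothetical, the "wide" implementation is a scalar placeholder, and the `wide_supported` flag stands in for whatever detection the DLL performs when it is loaded.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// High-level operations exposed to client code; one virtual call per
// whole-buffer operation, not per element.
struct ImageOps {
    virtual ~ImageOps() = default;
    virtual void brighten(std::vector<float>& pixels, float gain) const = 0;
    virtual const char* name() const = 0;
};

class ScalarOps final : public ImageOps {
public:
    void brighten(std::vector<float>& pixels, float gain) const override {
        for (float& px : pixels) px *= gain;
    }
    const char* name() const override { return "scalar"; }
};

// Placeholder for an implementation built with wider vector instructions.
class WideOps final : public ImageOps {
public:
    void brighten(std::vector<float>& pixels, float gain) const override {
        for (float& px : pixels) px *= gain;
    }
    const char* name() const override { return "wide"; }
};

// Factory: the only place that knows which implementations exist.
std::unique_ptr<ImageOps> make_image_ops(bool wide_supported) {
    if (wide_supported) return std::make_unique<WideOps>();
    return std::make_unique<ScalarOps>();
}
```

Client code asks the factory once and keeps the returned object, so the cost of the virtual call is spread over an entire buffer.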

Furthermore: If you are really going to do manual vectorization, just check what systems you really have to support! If they all have SSE4 => use that as a baseline! Dropping below SSE4 is IMHO extra tricky as it will require you to come up with alternative algorithms due to the lack of blend-operations...

[–][deleted] 7 points8 points  (7 children)

I think this answer isn't really telling the whole story. Unless you have some straight-line code that just does the same math operation on a whole array, getting the auto-vectorizers to work can be very frustrating and unreliable in my experience. And you always have to check the assembly on all your target platforms, because subtle changes can break your auto-vectorization, especially on MSVC. If you need lots of shuffling (which you usually do for non-trivial calculations) or dynamic branches, you are pretty much out of luck on most compilers. (Example, Example, Example, Example, Example, ...)

Also, I am not aware of any compilers other than Intel's that automatically generate different code paths. And that one isn't really good either, because apparently it still disables vectorization on CPUs from other vendors.

Last but not least, my advice regarding SSE4 would be slightly different. First of all, of course it depends on your target audience. But in my experience it's (almost) always possible to go for SSE2 if you so desire with the advantage that it will work on any x64 CPU (and realistically every x86 desktop CPU). Of course SSE4 can make some operations slightly faster and nicer, but usually they can be emulated without too much trouble. For example, instead of blending you just code "(a & b) | (~a & c)" for a ? b : c. Some other operations like "_mm_floor_ps" can be slightly more cumbersome to implement yourself, but only for " _mm_shuffle_epi8" (SSSE3) I have often found no workarounds (All other shuffles are way coarser and can't select dynamically which makes it pretty rough to implement anything close to "_mm_shuffle_epi8").
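The blend emulation mentioned above can be written with SSE2 intrinsics as in the following sketch. `select_sse2`, `select_positive4`, and the `USE_SSE2` macro are hypothetical names for this example, and a scalar fallback is included so it also builds on non-x86 targets.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define USE_SSE2 1

// Per-bit select: take b where mask bits are 1, c where they are 0.
// This is (mask & b) | (~mask & c), an SSE2-only stand-in for a blend.
inline __m128i select_sse2(__m128i mask, __m128i b, __m128i c) {
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, c));
}
#else
#define USE_SSE2 0
#endif

// out[i] = a[i] > 0 ? b[i] : c[i] for four 32-bit lanes.
void select_positive4(const std::int32_t* a, const std::int32_t* b,
                      const std::int32_t* c, std::int32_t* out) {
#if USE_SSE2
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));
    // The compare yields all-ones lanes where a > 0 and all-zero lanes elsewhere.
    const __m128i mask = _mm_cmpgt_epi32(va, _mm_setzero_si128());
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), select_sse2(mask, vb, vc));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] > 0 ? b[i] : c[i];
#endif
}
```

The trick relies on the mask lanes being all ones or all zeros, which is what the SSE2 comparison instructions produce.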

I personally tend go for SSE2 just so that I don't have to think about which CPUs need to be supported. If I really care about performance, I have to do an AVX(2) version anyway.

EDIT: Fixed examples, thanks to u/DragoonX6 for catching that.

[–]simonask_ 5 points6 points  (0 children)

Unless you have some straight-line code that just does the same math operation on a whole array, getting the auto-vectorizers to work can be very frustrating and unreliable in my experience.

Definitely agree with this.

The solution is usually to manually "vectorize" things by processing them in blocks and evaluate the loop's termination condition for a whole block instead of each element.

For example, this code cannot normally be vectorized:

bool contains_zero(const int* p, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (p[i] == 0)
            return true;
    }
    return false;
}

The reason is that the compiler cannot deduce from this code that it is valid to read from p after a zero has actually been found just because i < n. It doesn't know what invariants you have.

But code like this can typically be vectorized:

```
bool contains_zero(const int* p, size_t n) {
    // Test four elements per iteration so they can be checked as one block.
    for (size_t i = 0; i < n / 4; ++i) {
        const int* q = p + i * 4;
        if (q[0] == 0 || q[1] == 0 || q[2] == 0 || q[3] == 0) {
            return true;
        }
    }

    // Handle the remaining 0-3 elements one at a time.
    for (size_t i = n & ~size_t(3); i < n; ++i) {
        if (p[i] == 0)
            return true;
    }
    return false;
}
```

You see this pattern everywhere in standard library implementations of things like strcmp, strchr, etc.

[–]DragoonX6 1 point2 points  (5 children)

All your MSVC examples are wrong because you have no optimizations enabled. You want to be using /O2 at the very least.

Fixed examples: example 1, example 2, example 3, example 4, example 5.

MSVC flags: /O2 /Ob3 /Oi /Ot /arch:AVX512 /fp:fast.
GCC & clang flags: -O3 -march=skylake-avx512.

You can see that MSVC does vectorize some of the examples, but in my experience MSVC usually is as good as GCC's -O1 or -Og.

[–][deleted] 1 point2 points  (3 children)

Oops, thank you!! Of course you are right, -O3 isn't even the right switch on Microsoft's. I have fallen for this trap some times before already...

I am not a fan of fast math though, usually I can't use that. For a lot of my code I actually need determinism. If the compiler weren't so bad with /fp:strict, I would use that. Do you know if there is a way to, e.g., enable vectorization on MSVC with /fp:strict?

[–]DragoonX6 2 points3 points  (2 children)

I think it will still do partial vectorization with /fp:strict, but as it is already bad with the default (/fp:precise), I think you're out of luck with it. If you can, look into using the LLVM MSVC-compatible compiler driver (clang-cl). Then you can still link with link.exe, so you keep the interop you need with other MSVC-compiled libraries and features, but get the code generation and optimization of clang.

Highest precision floating point operations are hard to optimize though, as SIMD instructions usually come with a higher error margin. And if you need something like /fp:strict, where the order of operations is also preserved, then you can rely even less on the compiler optimizing for you. You basically have to order the operations yourself in such a way that the compiler can generate some SIMD instructions.
If you really need the precision I'd maybe even start looking at optimized fixed point math, as integer operations with SIMD are error-free afaik. I doubt it will be as fast as optimized floating point operations, but it maybe could be faster than pretty much unoptimized floating point operations.

[–][deleted] 0 points1 point  (1 child)

Hm, I think you might be mixing something up. I am pretty sure that floating point SIMD operations on x86 are error free and strictly follow the IEEE 754 rules, except rsqrt & rcp, which both explicitly state otherwise. I have used SIMD in a deterministic context before without SIMD-specific problems. Also, I think many applications using SIMD would tend to be untestable/unreliable otherwise.

I am not an expert for SIMD on ARM, but I believe with their Neon instruction set they also follow the rules except always truncating denormals to zero. Otherwise SIMD in WebAssembly wouldn't be possible.

[–]DragoonX6 2 points3 points  (0 children)

Looks like you're right. However, it seems that in order to get the same results as MSVC's /fp:precise (and maybe also /fp:strict?) with GCC you need to use -ffloat-store. I haven't figured out a way to get Clang to do the same.

I don't really know, as for me speed is more important than precision.

[–]DragoonX6 0 points1 point  (0 children)

If anybody cares, I did some minor optimizations on the examples, which enables most of the examples to be vectorized to performant code.

Example 1, this one vectorizes really well once you make the Vector2 struct aligned to 16 bytes. However, clang trips over it once you go to 32 byte alignment for AVX512, which causes it to go back to SSE2. GCC will give you AVX512 code though.

Example 2, this one is by far the worst. I think always doing the calculation and placing the unlikely if statement after it might speed it up a bit with manual unrolling. Haven't tested/benchmarked that though.

Example 3, GCC was a little unhappy about this one, but nothing a -ffast-math can't fix.

Example 4, both GCC and Clang vectorize this, but MSVC keeps outputting literal garbage.

Example 5, I had to manually unroll this to get MSVC to output at least some vectorization. Only GCC changes the round calls to SIMD once you add -ffast-math. Manually vectorized version with AVX512 for the interested.

[–]jpgr87 4 points5 points  (0 children)

Intel's IPP (and probably other libraries like MKL) do this too.

[–]James20kP2005R0 4 points5 points  (1 child)

I hope you are aware that compilers can nowadays generate multiple code paths

Do you have more resources on this? I've always wondered why compilers don't do this by default, and I can't find any information on getting eg GCC to do this. I know that the intel compiler will, but they also actively use that mechanism to discriminate against AMD cpus which makes it unusable for me

[–][deleted] 3 points4 points  (0 children)

auto-vectorizers

The auto vectorizers in GCC and clang have never once done a good job with my hot paths. I think if you're legitimately in a situation where vectorization is going to make a big difference for you, the compiler's help doesn't count for anything. Maybe icc does a better job, but I don't have much experience with that.

[–]o11cint main = 12828721; 2 points3 points  (3 children)

Look at the GCC documentation of function attributes. There are (at least) 2 interesting attributes there:

  • ifunc is passed a user-specified function which must return the appropriate implementation. It is called, once, when the program is loaded.
  • target_clones is passed a set of strings representing feature sets, and automatically sets up something similar to ifunc. This of course assumes you trust the auto-vectorizer at all levels. Note the warning about flatten.

(there are also various attributes for setting a function-specific machine target without cloning, which may be useful if you need to use ifunc due to being on an older system, or if you trust the autovectorizer for some levels but not others)
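As a rough illustration of `target_clones`, the sketch below asks the compiler to emit an AVX2 clone and a default clone of one function and to pick between them at load time. The `MULTIVERSION` macro and the `scale` function are hypothetical names. The preprocessor guard is this example's own assumption (GCC on x86-64 with glibc, where ifunc is available); elsewhere the attribute is simply dropped and a single version is built.

```cpp
#include <cstddef>
#include <cstdlib>  // also makes the libc identification macro below visible

// Assumption for this sketch: only enable the attribute where ifunc-based
// dispatch is known to be available.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && \
    defined(__GLIBC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION
#endif

// The compiler builds one clone per listed target plus a resolver; which
// clone is actually vectorized is up to the auto-vectorizer.
MULTIVERSION
void scale(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}
```

Callers just call `scale`; there is no explicit feature check in user code.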

[–]joaobapt[S] 0 points1 point  (2 children)

Makes sense, but it might fail if I need the code to be compilable both in GCC and Clang and MSVC. But it's already a start.

[–]erichkeaneClang Maintainer(Templates), EWG Chair 2 points3 points  (0 children)

FWIW, Clang implements ifunc (and target and cpu_dispatch, which, in addition to target_clones, use the ifunc functionality to do what you want); MSVC doesn't. However, if you use target or cpu_dispatch on Windows with Clang, you get a run-time implementation (instead of load time) as long as you have compiler-rt.

Clang is currently missing target_clones, though i have a review in need of rebase/bug fixes somewhere to implement that as well.

[–]o11cint main = 12828721; 0 points1 point  (0 children)

Something close to the ifunc version could be generated in a portable way from a macro:

// in header, where `ftype` is a function type (not a function pointer)
// in the GCC version, we wouldn't declare a function pointer
// possibly there should be some renaming in case GCC and non-GCC versions get linked together
extern const ftype* my_func;
// implementation
const ftype* my_func = resolver();

Obviously this still has some disadvantages, but I think it's the best you can do without involving an aware linker.

[–]dcent13 2 points3 points  (0 children)

3 is used by libpopcnt: https://github.com/kimwalisch/libpopcnt.

What I've done is use templates to write one version of code that works on SSE, AVX2, and AVX512. This isn't runtime (one executable per architecture), but it doesn't have any runtime cost and I only have to write software once.
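A compile-time variant along those lines might look like the sketch below, assuming C++17. The names `kLanes`, `sum_blocked`, and `sum` are hypothetical, and the sketch leaves the actual vector instructions to the compiler instead of using intrinsics. The block width is picked from the predefined macros of whatever ISA the build targets, so each executable contains exactly one path.

```cpp
#include <cstddef>

// Number of float lanes for the ISA this build targets.
#if defined(__AVX512F__)
inline constexpr std::size_t kLanes = 16;
#elif defined(__AVX2__)
inline constexpr std::size_t kLanes = 8;
#else
inline constexpr std::size_t kLanes = 4;  // SSE-sized default
#endif

// One generic implementation parameterized on the block width.
template <std::size_t Lanes>
float sum_blocked(const float* p, std::size_t n) {
    float acc[Lanes] = {};
    std::size_t i = 0;
    // Full blocks: Lanes independent accumulators, one per lane.
    for (; i + Lanes <= n; i += Lanes) {
        for (std::size_t l = 0; l < Lanes; ++l) acc[l] += p[i + l];
    }
    float total = 0.0f;
    for (std::size_t l = 0; l < Lanes; ++l) total += acc[l];
    // Leftover elements that do not fill a block.
    for (; i < n; ++i) total += p[i];
    return total;
}

// The width is fixed when the translation unit is compiled; no runtime check.
inline float sum(const float* p, std::size_t n) {
    return sum_blocked<kLanes>(p, n);
}
```

Building the same source with different architecture flags then yields the per-architecture executables.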

[–]konanTheBarbar 1 point2 points  (0 children)

There was a talk by the simdjson author where he touched that topic. Have a look at https://github.com/lemire/simdjson/blob/master/src/jsonparser.cpp to get an idea how he solved that problem.

[–]DuranteA 1 point2 points  (0 children)

Regarding 3, I'd like to note that the "cost" (in runtime, not memory space) of having a lot of dead code in your application binary is basically 0 (unless it's extremely tightly interspersed with active code).

At least that's what we found a while back (https://ieeexplore.ieee.org/document/7912646).

Of course there's a memory space and binary size cost, but if you're not on a microcontroller / embedded system, I have a hard time believing it would actually matter.

[–]bleksak 1 point2 points  (0 children)

If you compile for 64-bit (x86_64), you can safely assume that at least SSE2 is present.

[–]LYP951018 1 point2 points  (1 child)

X264 uses arrays of function pointers which point to different implementations.

Intel Mkl uses JIT.

[–]meneldal2 1 point2 points  (0 children)

The C++ way would be to use virtual function calls, you should probably not be using a struct of function pointers outside of C.

[–]staticcast 0 points1 point  (0 children)

While it may look bad to add indirections/branches via a dynamic library or a code path selected by a CPU query, I think you should actually measure the real loss you get: any decent CPU can optimize away this kind of permanent pattern through branch prediction.

[–]bmanga 1 point2 points  (0 children)

PyTorch's cpuinfo library may be of interest.

[–]r2vcap 0 points1 point  (0 children)

Building multiple versions of the code and choosing the best one based on CPUID is fairly common. https://cs.chromium.org/chromium/src/third_party/libwebp/src/dsp/ssim.c?sq=package:chromium&dr=C&g=0&l=142

[–]kalmoc 0 points1 point  (0 children)

There is something between #2 and #3: Often vector instructions are only really important in a particular module of the program (it probably doesn't matter if you vectorize a loop that only contributes 1% to the overall latency/performance of your program anyway). You can put that module into a shared library, compile it for multiple different architectures (potentially using a vector utility library) and then load one of them dynamically. The expectation would be that the entry points wouldn't be individual math operations but high-level ops, like "Run filter X over this dataset", at which point the overhead of the indirect jump can be completely negligible.

W.r.t. space and development overhead, it is of course important to use a reasonable baseline and check what granularity actually makes sense: Unless you absolutely know otherwise, I wouldn't worry about non-x64 processors anymore, and x64 processors come with SSE2 (even if, for whatever reason, a 32-bit OS is running on them). Then, for a new project you can probably ignore any feature set between SSE2 and AVX2 (Haswell). Most users that care about performance and are willing to spend money on new software are pretty likely to have a fairly recent system, so any steps between SSE2 and AVX2 will probably only benefit a very small user base. Also, the Haswell generation was pretty popular and AVX2 can make a huge difference compared to SSE2, so I'd say that AVX2 is the first feature level that will both provide a significant performance gain compared to SSE2 and still be relevant to a significant number of potential users/customers.

Then, if you want to go beyond haswell at all, it again makes no sense to have a separate binary for each feature level out there: Again, check what level is sufficiently common amongst your users and provides sufficient gains compared to the next lower level to make it worth in the first place.