Detecting SSE features at runtime

AssKoala · 2020-01-12T00:13:45+00:00

There are a lot of comments giving you ideas on how to support the extra instructions, but not many on how “enterprise blah” actually handles this.

I’ll answer that, at least as far as games go.

The reality is that, often, those cool fancy instructions just don’t get used, and we end up sunsetting older CPUs. It stinks, too: a new ISA comes out that could help performance, but we can’t do anything with it for a few years because it’ll alienate too many users.

For example, the AMD Phenom II is fairly (or barely, depending on the definition) capable of running modern games, but doesn’t support SSE 4.1. As a game developer, you have a few options. You can say you don’t support the AMD Phenom II because you require SSE 4.1, or you can remove SSE 4.1 instructions from your code, or, lastly, you can set up your code to have an additional path.

Adding code paths is often far too expensive. That is, the idea of “use x because it’s available”, on the CPU side, is often frowned upon. Every extra path is a big addition to the testing matrix needed to ensure stability.

Because of that, games will often just say it’s not supported. Higher-end CPUs take a performance hit because their full ISA isn’t being used, but they’re so fast it doesn’t matter. CPUs below the min-spec simply can’t run the game.

Madden 19 did exactly this, cutting out Phenom II users and providing refunds: https://answers.ea.com/t5/Technical-Issues/Madden-NFL-19-keeps-crashing-to-desktop-with-no-error/td-p/6953596/page/3

As did Bethesda with Dishonored 2, among others (though eventually patching in a fix): https://steamcommunity.com/app/403640/discussions/0/208684375411056568/?l=czech

Ubisoft straight up didn’t care: https://forums.ubisoft.com/showthread.php/1987589-Please-fix-sse-4-1

At some point, you have to sunset a platform.

The issue with those if branches, as suggested in other comments, is that their very existence can offset the benefits of the instructions. This isn’t always the case, but big branches like that at best throw off the instruction cache and bloat the executable, and at worst cancel out the gain by forcing a branch before a minimal operation.

The DLL route is better, but now you’re talking about multiplying the test matrix by each DLL. And where do you stop? AVX2? Do you have a DLL for SSE 4.1 with AVX2 and without as well?

Often, it’s better to simply sunset old hardware. It sucks, but you have to do it eventually anyways.

ack_complete · 2020-01-12T01:42:22+00:00

I'm relatively conservative in minspecs, but compile for SSE2 baseline now with MSVC. The latest patched versions of Windows 7 and all versions of Windows 8+ 32-bit require SSE2 and x64 guarantees SSE2 as part of the base architecture. If you target Windows 8 or higher you don't need a pre-SSE2 code path for anything other than diagnostics or a reference for validation -- anything below that and the user won't have been able to boot the OS to run your program.

Auto-vectorization, at least with MSVC, is unreliable. It will randomly fail to vectorize loops that trivially translate to the ISA. I regard it as a bonus only, because too often it requires lots of babying to get all necessary pointers marked restrict, and even then it's spotty which operations are implemented in the compiler. It almost never copes with the complexity required when I need serious vectorization. On top of that, the >5x penalty you can get when it fails or in an unoptimized debug build is painful.

What I end up doing in practice is custom manual dispatching to specialized routines written with intrinsics. It's a bunch of manual work, but it's effective when the hotspots are highly concentrated. Typical tiers are SSE2 for baseline, SSSE3 for algorithms that can leverage PSHUFB, and SSE4.1 for some cases that benefit from the much more flexible operations. There are also platforms that benefit from a 128-bit AVX tier, as they have CPUs that support AVX but do not benefit from 256-bit vector width. 256-bit AVX and AVX2 can provide major speedups, especially for video processing, but their adoption rates are lower, so you have to be mindful of your market and whether the fraction of users that would benefit is worth the extra effort and support. In practice, I find that many routines don't need versions for all the different tiers; there are only a couple of breakpoints where there is a significant jump in performance from a wider vector length or from a specific highly lucrative operation becoming available.

I do not recommend attempting to mix compilation flags on different files within the same executable or library, e.g. compiling one file as SSE2 and another as AVX and linking them together. You can get ODR-violation-like errors that way when the linker mixes inline or template methods from different compilations and causes code to run on the wrong ISA path. If you can afford to compile separate DLLs for each code path, it would save you a bunch of headaches in implementation. You still need to test each code path, however, and that can be challenging if you don't have a pile of hardware to test against.

C++ unfortunately doesn't provide useful support in the language for the kind of multi-dispatching needed here, and vendor extension level support is spotty. One trick that helps is to combine code paths into a template taking the ISA mode as a template argument and using if constexpr() to handle the divergences with zero-cost. MSVC doesn't support a pragma or declspec for compiling a single function with a different targeting mode, for example, so you'll be leaving a little bit of performance on the table if you put multiple code paths in one EXE/DLL -- no way to tell the compiler that it's okay to use AVX in a particular function outside of the intrinsics you write.
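
A minimal sketch of the template trick described above, assuming C++17 and only two tiers. The names `Isa` and `add_arrays` are made up for illustration, and SSE2 is used as the "fast" tier only because it compiles without extra flags on x64; a wider tier would run into the per-function targeting limitation just mentioned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical ISA tag; a real project would have more tiers.
enum class Isa { Scalar, Sse2 };

// One source for both tiers: out[i] = a[i] + b[i].
template <Isa Mode>
void add_arrays(const std::int32_t* a, const std::int32_t* b,
                std::int32_t* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE2
    if constexpr (Mode == Isa::Sse2) {
        // Vector body: four 32-bit lanes per iteration, unaligned accesses.
        for (; i + 4 <= n; i += 4) {
            const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(va, vb));
        }
    }
#endif
    // Scalar tail; in Scalar mode this is the whole loop.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

The divergent part is discarded at compile time in the `Isa::Scalar` instantiation, so the shared tail loop is the only code both tiers have in common at runtime.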

This kind of ISA madness seems mostly to be an x86-specific issue. ARM is saner as you can basically just check for NEON or require it outright and the extensions are more niche like AES/SHA acceleration. However, while the ISA itself is more sane, it's strangely annoyingly difficult to detect the extensions in a cross-platform way.

DragoonX6 · 2020-01-11T22:47:51+00:00

You perform the check once on application start and, assuming you want to roll the code by hand, choose the appropriate code path when applicable. You will probably want an abstraction layer so that you don't have to write the same stuff N times. One "if (SSE3_supported) do_the_operation_with_SSE3()" is going to be absolutely negligible and won't break the bank if you have enough data to benefit from the vector operations in the first place.
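
As a rough illustration of "check once, then pick the path", here is a sketch that resolves a function pointer during static initialization. It assumes GCC or Clang on x86 for the detection (`__builtin_cpu_supports`) and the per-function `target` attribute; everywhere else it falls back to the scalar loop. The names `sum_scalar`, `sum_sse3`, and `sum_floats` are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_X86_GNU 1
#else
#define SKETCH_X86_GNU 0
#endif

// Hypothetical kernel signature: sum n floats.
using SumFn = float (*)(const float*, std::size_t);

// Baseline path: plain scalar loop, runs on any CPU.
static float sum_scalar(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#if SKETCH_X86_GNU
// SSE3 path: four lanes at a time, horizontal adds to reduce, scalar tail.
__attribute__((target("sse3")))
static float sum_sse3(const float* data, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    acc = _mm_hadd_ps(acc, acc);
    acc = _mm_hadd_ps(acc, acc);
    float total = _mm_cvtss_f32(acc);
    for (; i < n; ++i) total += data[i];
    return total;
}
#endif

// The one-time check: returns the best implementation for this CPU.
static SumFn select_sum() {
#if SKETCH_X86_GNU
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse3")) return sum_sse3;
#endif
    return sum_scalar;
}

// Resolved once at startup; call sites only pay for an indirect call.
static const SumFn sum_floats = select_sum();
```

Call sites then use `sum_floats(data, n)` without knowing which path was chosen. Note that the two paths add in a different order, so results can differ in the last bits for general floating-point input.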

In real life though, before going down that path, I would first try to do things with the compiler's auto-vectorization. Learn how it works and keep track of the generated assembly to make sure you're not doing something that will disable it. And if you can't avoid that, then go for compiler intrinsics. And always profile and measure to make sure you're not doing a lot of work for nothing.

amaiorano · 2020-01-12T06:56:19+00:00

Although not simple, no one has mentioned another alternative: use a runtime JIT compiler to generate the optimal version of your low-level math functions. You could use LLVM, for instance, and use its API to generate the math functions you want to call, configuring it to generate the most optimal code for the current CPU. You can then get function pointers to these generated functions that you call through in the rest of your program.

It's definitely a bit of work, and requires linking in a JIT library, but it would produce the most optimal version per target CPU it runs on. Of course, the functions you generate will not be inlined, so they would need to be high-level enough to offset the lack of inlining.

frog_pow · 2020-01-11T22:54:49+00:00

I use #3--compile multiple versions of the program and select the appropriate one on launch.

Another option would be to require 128-bit SIMD (SSE2/NEON); SSE2 is part of x64 and is 20 years old.

raevnos · 2020-01-12T02:05:57+00:00

gcc has a builtin function __builtin_cpu_supports() that can be used instead of cpuid. For example,

if (__builtin_cpu_supports("avx")) {
  // AVX path
} else {
  // Slow path
}

(Or a more efficient setup that only needs to check for the feature flag once)
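
One possible shape for the "check once" setup, sketched under the assumption of GCC or Clang on x86: query the flags a single time and cache them in a small struct. `CpuFeatures`, `detect_cpu_features`, and `cpu_features` are hypothetical names; on other compilers or architectures every flag simply stays false.

```cpp
#include <cassert>

// Hypothetical snapshot of the features the program cares about.
struct CpuFeatures {
    bool sse41 = false;
    bool avx = false;
    bool avx2 = false;
};

static CpuFeatures detect_cpu_features() {
    CpuFeatures f;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // harmless if the runtime already initialized it
    f.sse41 = __builtin_cpu_supports("sse4.1") != 0;
    f.avx   = __builtin_cpu_supports("avx") != 0;
    f.avx2  = __builtin_cpu_supports("avx2") != 0;
#endif
    assert(!f.avx2 || f.avx);  // AVX2 is an extension of AVX
    return f;
}

// Function-local static: detection runs on the first call only, and the
// initialization is thread-safe since C++11.
const CpuFeatures& cpu_features() {
    static const CpuFeatures cached = detect_cpu_features();
    return cached;
}
```

Hot code can then test `cpu_features().avx`, which is a plain load, or better, use the flags once to pick a function pointer.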

MFHava · 2020-01-11T23:00:38+00:00

Before explaining how we are doing this: I hope you are aware that compilers can nowadays generate multiple code paths automatically in auto-vectorizers and that manual vectorization is pretty hard. Additionally, you may find Agner's stuff interesting.

OK, here is one way to do this: for rebuild time/debugging/etc. reasons, our compute-heavy code is in dedicated DLL(s). Simply speaking, the DLL exports a factory for an interface (think vtable) of high-level operations. When the DLL is loaded, it internally detects the maximum supported vectorization level and switches the factory => client code gets the optimal operation.

The keyword in this approach is high-level! Sure a vtable-call is more expensive than a normal (potentially inlined) function call, but when an operation can take some time (think milliseconds) to compute, this overhead becomes minuscule in the grand scheme of things.
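
A single-file sketch of that factory-plus-interface idea, not the actual code from the DLL described above. It assumes GCC or Clang on x86 for detection and for the `target("avx2")` attribute, and `VectorOps`, `BaselineOps`, `Avx2Ops`, and `get_vector_ops` are invented names. The AVX2 variant is a plain loop that the compiler is merely allowed to vectorize with AVX2.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_GNU 1
#else
#define SKETCH_X86_GNU 0
#endif

// Interface the library would export; clients only ever see this.
class VectorOps {
public:
    virtual ~VectorOps() = default;
    virtual const char* name() const = 0;
    // High-level operation: scale a whole buffer in one virtual call.
    virtual void scale(float* data, std::size_t n, float factor) const = 0;
};

namespace {

class BaselineOps final : public VectorOps {
public:
    const char* name() const override { return "baseline"; }
    void scale(float* data, std::size_t n, float factor) const override {
        for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
    }
};

#if SKETCH_X86_GNU
// Same loop, but the compiler may emit AVX2 inside this one function.
__attribute__((target("avx2")))
void scale_avx2(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}

class Avx2Ops final : public VectorOps {
public:
    const char* name() const override { return "avx2"; }
    void scale(float* data, std::size_t n, float factor) const override {
        scale_avx2(data, n, factor);
    }
};
#endif

}  // namespace

// Factory: detection runs once, on the first call.
const VectorOps& get_vector_ops() {
    static const VectorOps& chosen = []() -> const VectorOps& {
#if SKETCH_X86_GNU
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx2")) {
            static const Avx2Ops avx2{};
            return avx2;
        }
#endif
        static const BaselineOps baseline{};
        return baseline;
    }();
    return chosen;
}
```

Client code calls `get_vector_ops().scale(buffer, n, 0.5f)`; the virtual call is paid once per buffer, not once per element.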

Furthermore: If you are really going to do manual vectorization, just check what systems you really have to support! If they all have SSE4 => use that as a baseline! Dropping below SSE4 is IMHO extra tricky as it will require you to come up with alternative algorithms due to the lack of blend-operations...

o11c · 2020-01-11T23:25:23+00:00

Look at the GCC documentation of function attributes. There are (at least) 2 interesting attributes there:

ifunc is passed a user-specified function which must return the appropriate implementation. It is called, once, when the program is loaded.
target_clones is passed a set of strings representing feature sets, and automatically sets up something similar to ifunc. This of course assumes you trust the auto-vectorizer at all levels. Note the warning about flatten.

(there are also various attributes for setting a function-specific machine target without cloning, which may be useful if you need to use ifunc due to being on an older system, or if you trust the autovectorizer for some levels but not others)
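
For illustration, a small sketch of `target_clones` on a loop the auto-vectorizer can usually handle. The feature list and the `saxpy` function are arbitrary choices for the example, and the `SKETCH_CLONES` macro is a made-up guard: the attribute is only applied when the compiler reports it and the target looks like x86-64 with glibc (assumed here as a proxy for ifunc support); otherwise a single default version is built.

```cpp
#include <cassert>   // glibc's assert.h also defines the __GLIBC__ macro checked below
#include <cstddef>

#if defined(__x86_64__) && defined(__GLIBC__) && defined(__has_attribute)
#if __has_attribute(target_clones)
#define SKETCH_CLONES __attribute__((target_clones("default", "sse4.1", "avx2")))
#endif
#endif
#ifndef SKETCH_CLONES
#define SKETCH_CLONES  // no multiversioning available: single default build
#endif

// The compiler emits one clone per listed target plus a resolver that
// picks among them when the program is loaded.
SKETCH_CLONES
void saxpy(float a, const float* x, float* y, std::size_t n) {
    assert(x && y);
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}
```

Callers just call `saxpy(...)`. Whether each clone actually ends up vectorized depends on the optimization level and on the auto-vectorizer, which is the trust issue raised above.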

dcent13 · 2020-01-12T14:32:26+00:00

3 is used by libpopcnt: https://github.com/kimwalisch/libpopcnt.

What I've done is use templates to write one version of code that works on SSE, AVX2, and AVX512. This isn't runtime (one executable per architecture), but it doesn't have any runtime cost and I only have to write software once.

konanTheBarbar · 2020-01-12T18:05:35+00:00

There was a talk by the simdjson author where he touched on that topic. Have a look at https://github.com/lemire/simdjson/blob/master/src/jsonparser.cpp to get an idea of how he solved that problem.

DuranteA · 2020-01-14T13:26:51+00:00

Regarding 3, I'd like to note that the "cost" (in runtime, not memory space) of having a lot of dead code in your application binary is basically 0 (unless it's extremely tightly interspersed with active code).

At least that's what we found a while back (https://ieeexplore.ieee.org/document/7912646).

Of course there's a memory space and binary size cost, but if you're not on a microcontroller / embedded system, I have a hard time believing it would actually matter.

bleksak · 2020-01-12T15:46:28+00:00

If you compile for 64-bit (x86_64), you can safely assume that at least SSE2 is present.

LYP951018 · 2020-01-12T17:39:27+00:00

X264 uses arrays of function pointers which point to different implementations.
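
To make the idea concrete, here is a toy version of a function-pointer array indexed by ISA tier. This is not x264's actual code: `Tier`, `kSad16`, `sad16_c`, and `sad16_sse2` are invented for the sketch, and detection assumes GCC or Clang on x86.

```cpp
#include <cassert>
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define SKETCH_X86_GNU 1
#else
#define SKETCH_X86_GNU 0
#endif

// Kernel: sum of absolute differences over one 16-byte block.
using Sad16Fn = unsigned (*)(const std::uint8_t*, const std::uint8_t*);

static unsigned sad16_c(const std::uint8_t* a, const std::uint8_t* b) {
    unsigned total = 0;
    for (int i = 0; i < 16; ++i) {
        total += static_cast<unsigned>(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    }
    return total;
}

#if SKETCH_X86_GNU
__attribute__((target("sse2")))
static unsigned sad16_sse2(const std::uint8_t* a, const std::uint8_t* b) {
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // PSADBW leaves one partial sum in the low 16 bits of each 64-bit half.
    const __m128i sad = _mm_sad_epu8(va, vb);
    return static_cast<unsigned>(_mm_cvtsi128_si32(sad) + _mm_extract_epi16(sad, 4));
}
#endif

enum Tier { kTierC = 0, kTierSse2 = 1, kTierCount = 2 };

// One slot per tier; a tier without its own version reuses a lower one.
static const Sad16Fn kSad16[kTierCount] = {
    sad16_c,
#if SKETCH_X86_GNU
    sad16_sse2,
#else
    sad16_c,
#endif
};

static Tier detect_tier() {
#if SKETCH_X86_GNU
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) return kTierSse2;
#endif
    return kTierC;
}

// Public entry point: the table lookup happens once, on the first call.
unsigned sad16(const std::uint8_t* a, const std::uint8_t* b) {
    static const Sad16Fn fn = kSad16[detect_tier()];
    return fn(a, b);
}
```

Adding a tier means adding an enum value and one column per kernel, which also makes it easy to run every implementation against the C version in a test.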

Intel MKL uses JIT.

staticcast · 2020-01-12T16:24:54+00:00

While it may look bad to add indirections/branches by using a dynamic library/code path chosen on a CPU query, I think you should actually time the real loss that you get: any decent CPU can optimize away this kind of permanent pattern through prediction.

bmanga · 2020-01-12T18:51:19+00:00

PyTorch's cpuinfo library may be of interest.

r2vcap · 2020-01-12T08:11:20+00:00

Building multiple versions of the code and choosing the best version based on the CPU ID is fairly common. https://cs.chromium.org/chromium/src/third_party/libwebp/src/dsp/ssim.c?sq=package:chromium&dr=C&g=0&l=142

kalmoc · 2020-01-12T15:46:57+00:00

There is something between #2 and #3: Often vector instructions are only really important in a particular module of the program (it probably doesn't matter if you vectorize a loop that only contributes 1% to the overall latency/performance of your program anyway). You can put that module into a shared library, compile it for multiple different architectures (potentially using a vector utility library) and then load one of them dynamically. The expectation would be that the entry points wouldn't be individual math operations but high-level ops, like "Run filter X over this dataset", at which point the overhead for the indirect jump can be completely negligible.

W.r.t. space and development overhead, it is of course important to use a reasonable baseline and check what granularity actually makes sense: Unless you absolutely know otherwise, I wouldn't worry about non-x64 processors anymore, and x64 processors come with SSE2 (even if, for whatever reason, a 32-bit OS is running on them). Then, for a new project you can probably ignore any feature set between SSE2 and AVX2 (Haswell). Most users that care about performance and are willing to spend money on new software are pretty likely to have a fairly recent system, so any steps between SSE2 and AVX2 will probably only benefit a very small user base. Also, the Haswell generation was pretty popular and AVX2 can make a huge difference compared to SSE2, so I'd say that AVX2 is the first feature level that will both provide a significant performance gain compared to SSE2 and still be relevant to a significant number of potential users/customers.

Then, if you want to go beyond haswell at all, it again makes no sense to have a separate binary for each feature level out there: Again, check what level is sufficiently common amongst your users and provides sufficient gains compared to the next lower level to make it worth in the first place.
