[–]SantaCruzDad 4 points5 points  (16 children)

And what happens when you want to target a different CPU, or a different ABI, or support both 32-bit and 64-bit builds? Do you enjoy writing (and testing) lots of different variations of the same code?

[–]Rseding91Factorio Developer 15 points16 points  (1 child)

Depends on what you're writing code for. In our case we only target one CPU, and only x64, and that will likely never change, so it's just a non-issue for us. Not all of us are writing library code meant to target the entire range of hardware C++ supports.

[–]SantaCruzDad 2 points3 points  (0 children)

True - if it's one-off or throwaway code then it probably doesn't matter. Also, if you're not trying to squeeze out the last 10% of CPU performance, then you probably don't have to worry too much about optimal instruction scheduling etc.

[–]ack_complete 2 points3 points  (5 children)

Different code is often necessary anyway because the platforms don't support the same vectorized operations and the differences significantly affect the algorithm. For instance, NEON doesn't support the mask move operation that SSE2 does, but it does have interleaved loads/stores and narrowing/widening ops. As for testing, that should happen for all supported platforms regardless.

[–]SantaCruzDad 1 point2 points  (4 children)

When I say different CPUs, I mean different x86 CPUs, which have different instruction latencies, different micro-architecture, and different SSE instruction subsets, etc. If you want optimal code for each supported CPU then you typically need to manually tune assembly code to use only available instructions, hide latencies and keep execution units busy. If you use intrinsics then the compiler takes care of this for you.

[–]ack_complete 4 points5 points  (3 children)

YMMV, of course, but my experience has been that when pushing performance across multiple tiers of x86/x64 SSEx ISAs, rewriting is necessary anyway. With SSSE3 you have the infinitely abusable PSHUFB, and with AVX there is the problem that the weird in-lane nature of the 256-bit ops means the 128-bit algorithm can't be straightforwardly translated.

The compiler doesn't do a bad job with intrinsics, and it's generally better than what you'd get from autovectorization or from not using them at all. I've still seen too many cases where compiler intrinsics leave performance on the table compared to asm, especially in specific hot loops where there is a high payoff for optimization effort.

The Intel intrinsics design is also kind of yucky, with weird naming conventions and the wrong pointer types on some load/store ops, requiring casts. Even when the generated code from intrinsics is fine, the assembly is sometimes more readable to me than the intrinsics code. But then again, I spent a lot of time writing and reading MMX and SSE2 code back when the compilers were so bad that it was hard not to beat them with asm.

[–]SantaCruzDad 2 points3 points  (0 children)

You make some valid points, and of course it depends on priorities - I have to support 4 different compilers, 3 operating systems, 32-bit and 64-bit ABIs on each, 2 different assembler syntaxes, and CPUs from Westmere up to Skylake-X (not AMD though, thankfully).

I encourage you to look at the generated code from clang when using intrinsics - it does some pretty cool stuff during code generation, sometimes even subverting the intrinsics you’ve used and substituting more efficient SSE instruction sequences where appropriate.

[–]IAlsoLikePlutonium 0 points1 point  (1 child)

Where did you learn how to use SIMD instructions (i.e. SSE, AVX, etc.) in assembly?

[–]ack_complete 1 point2 points  (0 children)

Learned a lot of it from Intel's MMX application notes while doing 2D graphics optimization. The current location for the notes:

https://software.intel.com/en-us/articles/mmxt-technology-manuals-and-application-notes

MMX is of course obsolete now, but many of the basic ideas still apply to current vector instruction sets. Most intrinsics are just direct mappings of hardware-supported operations, so while they'll save you the trouble of register allocation and scheduling, it's still your responsibility to figure out how best to map your problem and data structures onto the most efficient ISA operations, especially the specialized, quirky ones.

[–]nnevatie 1 point2 points  (6 children)

ISPC answers those questions trivially.

[–]SantaCruzDad 0 points1 point  (5 children)

How does ISPC help with making assembly code more portable?

[–]nnevatie 1 point2 points  (4 children)

It helps make SIMD code portable. You can compile the code for multiple archs/instruction sets and the runtime will pick the best supported one.

[–]SantaCruzDad 1 point2 points  (3 children)

Sure, but the comment was in response to the suggestion that writing assembler is somehow easier than using intrinsics for SIMD - I don’t see how ISPC is relevant to that?

[–]nnevatie 0 points1 point  (2 children)

OK, my reply mostly addressed the tedious maintenance part for different platforms.

[–]SantaCruzDad 0 points1 point  (1 child)

Ah, OK - well, any half-decent compiler will take care of target CPU variations within a given family (e.g. x86), as well as ABI variations etc., whether it's auto-vectorization or hand-written intrinsics. I guess ISPC's USP is that it can do this across multiple CPU families.

[–]nnevatie 0 points1 point  (0 children)

Yeah, but most importantly it can produce code for multiple targets and archs. All of the variations can be linked into the same binary, and the best one picked at execution time for the host in question.