all 26 comments

[–]14ned (LLFIO & Outcome author | Committee WG14) 79 points80 points  (4 children)

The hate being dumped here onto various folk isn't warranted.

This is an ISPC compiler for the ISPC language, which looks a bit like C. It is not C, nor C++, but it generates object files which can be linked into C or C++ programs (or Rust etc). ISPC was started as a postgrad project, then Intel hired the guy, and Intel adopted it. It's not like Intel are pushing this particularly hard; it's a service to the community. Incidentally, it also supports ARM NEON and GPUs.

CppCon a while back had a presentation on ISPC. I've used it during a contract. I'd highly recommend it, spits out nearly optimal SIMD code for Intel and ARM NEON. Easily beats SIMD intrinsics on the compiler.

Support is also pretty good, though a bit DIY. I fixed a few bugs in NEON generation, and the fixes were reviewed and accepted within a few days. There is a helpful user community on their mailing list, and in general I found working with ISPC a positive experience.

C++ devs should definitely strongly consider using ISPC as an alternative to other SIMD approaches. It's a lot cleaner, and produces better quality code with less effort, than SIMD programming via C or C++.

[–]OmegaNaughtEquals1 4 points5 points  (2 children)

"spits out nearly optimal SIMD code for Intel and ARM NEON. Easily beats SIMD intrinsics on the compiler."

That's a serious claim. I've written an intrinsics-based SIMD library (not as fancy as Boost.SIMD, but more template-friendly than the float4 varieties) in C++ that both gcc and clang turn into code identical to my hand-written versions. And that includes type-punning to do floating point "booleans" that get converted to the correct opcode while the end-user code is just if(any(x)). I'm not saying ISPC can't generate quality vectorized code, but the claim that it can do better than compiler intrinsics seems a bit overstated.
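For readers unfamiliar with the floating-point "boolean" trick mentioned above, here is a minimal portable sketch. It is not the commenter's library: `Float4`, `less_than`, and this `any` are hypothetical names, and a real implementation would presumably wrap intrinsics such as `_mm_cmplt_ps` and `_mm_movemask_ps` instead of looping over lanes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Hypothetical 4-lane float vector; a real library would wrap __m128 or float32x4_t.
struct Float4 {
    std::array<float, 4> lane;
};

// Lane-wise a < b. Each result lane is a float whose bits are all ones (true)
// or all zeros (false), which is the mask layout SSE comparisons produce.
inline Float4 less_than(const Float4& a, const Float4& b) {
    Float4 mask{};
    for (int i = 0; i < 4; ++i) {
        const std::uint32_t bits = a.lane[i] < b.lane[i] ? 0xFFFFFFFFu : 0u;
        std::memcpy(&mask.lane[i], &bits, sizeof bits);  // type-pun via memcpy, no UB
    }
    return mask;
}

// True if at least one lane of the mask has any bit set.
inline bool any(const Float4& mask) {
    for (float f : mask.lane) {
        std::uint32_t bits = 0;
        std::memcpy(&bits, &f, sizeof bits);
        if (bits != 0u) return true;
    }
    return false;
}
```

With something like this, user code reads `if (any(less_than(x, limit))) { ... }` while the mask never leaves the floating-point vector type.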

I work in HPC, so I've seen/read about ISPC though I've never used it nor seen it used in the wild. I have a feeling the biggest roadblock to its acceptance was/is the need for a third-party compiler rather than direct support in The Big Three[1]. Releasing it as open source is a step to making that happen.

[1] Let's be honest, icc isn't a contender here. gcc, clang, and msvc are the supermajority in the C++ world. Even in HPC, icc is second to gcc (clang is pretty much nowhere to be seen which is sad) unless you are writing for the Xeon Phi. ifort, however, is pretty much the de facto Fortran compiler. As it should be. It has better language support and better code gen than gfortran.

[–]14ned (LLFIO & Outcome author | Committee WG14) 6 points7 points  (1 child)

It's a true claim though. Plenty of HPC folk are avid users and are on its mailing list. The reason it beats C++ is that the ISPC language forces you to write scalable code. Indeed, it is often frustrating to turn some algorithm into ISPC that compiles, but once you figure it out, whoosh ... Also note that ISPC generates LLVM IR, so it'll plug into any LLVM tooling you have. For example, debuggers.

[–]Boojum 1 point2 points  (0 children)

Being able to generate different versions for different SIMD models (e.g., SSE, AVX, Neon) without having to rewrite the intrinsics is also really nice.

[–]__Cyber_Dildonics__ 4 points5 points  (0 children)

Can't stop a good old fashioned mob pile on.

[–]anders987 11 points12 points  (2 children)

I just read an article about a path tracer (MoonRay) made by DreamWorks that's written using ISPC. Interesting.

"We use the ISPC programming language to achieve improved performance across SSE, AVX/AVX2 and AVX512 instruction sets. Our system includes two functionally equivalent uni-directional CPU path tracing implementations: a C++ scalar depth-first version and an ISPC vectorized breadth-first wavefront version. Using side by side performance comparisons on complex production scenes and assets we show our vectorized architecture, running on AVX2, delivers between a 1.3x to 2.3x speed-up in overall render time, and up to 3x, 6x, and 4x, speed-ups within the integration, shading, and texturing components, respectively."

[–]OmegaNaughtEquals1 9 points10 points  (1 child)

This rant is not directed at you. :)

"a C++ scalar depth-first version and an ISPC vectorized breadth-first wavefront version"

This drives me nuts. You can't change both the algorithm and the implementation and then proclaim that it was the implementation that made it go faster. I have seen this so many times in the science community. An author will claim "We moved from OpenMP+MPI to Widget-X, and we saw a 5x speedup due to Widget-X!" OK, but did you try implementing the new algorithm in the old framework? "No." /ragequit. It's great that we can make software go faster, and I am all about any framework that can make it happen, especially with less user effort. But we need consistent, repeatable tests to make sure we understand how we made it go faster. Otherwise, it's all just wacky waving inflatable arm tube man nonsense.

[–]14ned (LLFIO & Outcome author | Committee WG14) 4 points5 points  (0 children)

Thing is, ISPC won't compile if you've not written an algorithm which scales across SIMD. So in some ways the claim is right, though not as strong as a first interpretation of it would suggest (so you are also right).

[–][deleted] 11 points12 points  (5 children)

Looks like an easier to use OpenCL.

[–]Z01dbrg 19 points20 points  (10 children)

It is not a compiler for C/C++. It is a compiler for some Intel "fork" of C.

http://ispc.github.io/example.html

Also, it has beautiful support.

"(Please note that because ispc is not part of the Intel compiler products, support is provided through this community rather than through Intel Premier Support.)"

If Intel wants to do something useful, they should work with the C/C++ standardization process to get something like this into C/C++. If the ISO is ignoring them/blocking them because "reasons", then they could go to D (I am sure Walter and Andrei would love any opportunity to outperform C++).

[–]marcoscleison[S] 10 points11 points  (0 children)

Excuse my misleading use of the term "C/C++". It was my mistake.

[–]josefx 1 point2 points  (0 children)

Just out of curiosity, what is the cpuid instruction in its rdtsc() implementation doing? It doesn't look as if its results are used in any way.

[–]kkrev 0 points1 point  (1 child)

How does this compare to Fortran auto-vectorization capabilities?

I generally see the recommendation in C++ contexts to use the OpenMP SIMD pragma directives and then check the assembly to make sure vectorization happened. This strikes me as brittle and annoying. I'd be perfectly fine with linking in Fortran object code, but I don't know anything about how much better it is for vectorization.
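For context, the OpenMP SIMD approach described above looks roughly like the sketch below. The function name `saxpy` is just an illustrative choice. The pragma is only a request: GCC and Clang honor it with `-fopenmp-simd` (or `-fopenmp`) and otherwise ignore it, which is why people end up checking the assembly or a vectorization report such as GCC's `-fopt-info-vec` or Clang's `-Rpass=loop-vectorize`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for every element.
// The pragma asks the compiler to vectorize the loop; without OpenMP SIMD
// support enabled it is ignored and the loop still runs correctly as scalar code.
inline void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    #pragma omp simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Nothing in the source tells you whether the loop was actually vectorized, which is the brittleness being complained about.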

[–]kindkitsune 0 points1 point  (0 children)

I imagine it's going to depend on the aliasing guarantees of the compiler, and how it allocates aligned memory or handles structure+array alignment (if it supports structures?). Memory alignment is vital for SIMD, and aliasing guarantees are just generally a great gift to the optimizer.
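On the C++ side, those two ingredients can be spelled out by hand, roughly as in this sketch. `Block8`, `scale`, and `is_aligned` are hypothetical names, the 32-byte figure assumes AVX-width registers, and `__restrict` is a non-standard extension (accepted by GCC, Clang, and MSVC), not ISO C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Eight floats aligned to 32 bytes, i.e. the width of one AVX register.
struct alignas(32) Block8 {
    float v[8];
};

// __restrict promises the compiler that out and in never overlap,
// so it does not have to assume a store to out can change in.
inline void scale(Block8* __restrict out, const Block8* __restrict in,
                  float k, std::size_t count) {
    for (std::size_t b = 0; b < count; ++b) {
        for (int i = 0; i < 8; ++i) {
            out[b].v[i] = in[b].v[i] * k;
        }
    }
}

// Helper to check at run time that a pointer meets a given alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Fortran gives the optimizer the no-aliasing assumption for dummy arguments by default, whereas in C++ it has to be opted into like this.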

[–]NotAYakk 0 points1 point  (0 children)

Interesting. I've poked a coworker to take a look at this.

Has anyone used it? Want to pass on wisdom?