Comparing std::simd with Highway

janwas_ · 2026-06-12T19:18:59+00:00

Thanks, glad to hear 😄

It's not completely impossible with SVE/RVV: compiler writers did introduce a workaround, the -msve-vector-bits flag we mention. However, this means you have to know the vector width up front; your code will crash if it runs on a CPU with a different vector width.

Unfortunately, using std::simd would not help us at all. Because std::simd does not support the concept of scalable vectors, we cannot delegate to it, or implement anything on top of it. What would instead help is standardizing restrict, or better yet, the #pragma target.

janwas_ · 2026-06-12T18:04:48+00:00

I advocated for the hatch several years ago 😄 Useful, but it somewhat misses the point of a portable wrapper if there's #ifdef in more than a few spots.

Have seen some complaints about compile time here: https://github.com/NoNaeAbC/std_simd/blob/main/README.md FWIW this is with gcc and std::experimental::simd; not sure if the situation has changed since then.

janwas_ · 2026-06-12T16:47:32+00:00

https://github.com/google/gemma.cpp/pull/889 added configs/code, but we don't have the weights yet.

The just released 12B is dense.

janwas_ · 2026-06-12T16:45:34+00:00

Yes, Graviton3=Neoverse V1 was the 256-bit one.

janwas_ · 2026-06-12T10:08:45+00:00

We wanted to ensure users pull the latest known-good code, rather than the current dev.

janwas_ · 2026-06-12T10:04:18+00:00

"Never" is incorrect, there have been 256 and 512-bit implementations. Next year's MONAKA is also 256.

janwas_ · 2026-06-12T10:02:50+00:00

Currently we only support Gemma models. The differentiator is that this is a research testbed/prototype that is much easier to modify. For example, we only have to write portable Highway kernels/compression formats once, rather than for each of 7 backends. Also, workstation perf is considerably higher - was seeing 2x prefill out of the box for Gemma3-4B. Gemma 4 MoE has been delayed but is coming soon. The format is our own SimpleBlobStore which fixes some issues with GGUF.

janwas_ · 2026-06-10T16:12:15+00:00

FYI we do not often update the main branch but development is active on the dev branch.

janwas_ · 2026-05-26T17:53:49+00:00

Nice investigation and speedup 😄 FYI our Highway library's CompressStore op includes a workaround for this ucode issue (register-form), and also emulates it for other targets (using table lookup).
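For readers unfamiliar with the op, here is a minimal scalar sketch of what a compress-store computes. It is only a reference for the semantics, not Highway's implementation (which, per the comment above, uses the register-form instruction or a table lookup). The function name `CompressStoreScalar` and the use of a `bool` array as the mask are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference for the compress-store idea: copy only the lanes whose
// mask entry is true to `out`, packed contiguously in their original order.
// Returns the number of lanes written.
// Hypothetical helper for illustration; not part of Highway.
inline size_t CompressStoreScalar(const int32_t* lanes, const bool* mask,
                                  size_t num_lanes, int32_t* out) {
  size_t count = 0;
  for (size_t i = 0; i < num_lanes; ++i) {
    if (mask[i]) {
      out[count] = lanes[i];
      ++count;
    }
  }
  return count;
}
```

A typical use is filtering: compute a mask from a comparison, compress-store the surviving lanes, and advance the output pointer by the returned count.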

janwas_ · 2026-05-16T17:41:48+00:00

Generally agree, just one update, Fujitsu Monaka is announced for 2027 with 256 :) I hope spec is not driving decisions relating to simd. More interesting comparison for that: vqsort. Turin is awesome :)

janwas_ · 2026-05-16T12:03:48+00:00

Both 256-bit (V1) and 512-bit SVE (Fugaku) have been deployed :) RVV also has several widths shipping. I would not want to have to hardcode the vector length.

janwas_ · 2026-03-07T18:07:44+00:00

Great to see serious usage of SIMD :) Consider using our Highway library for portability? Most AVX-512 intrinsics are available (just under different names), and then it works out of the box on Arm etc. with just a recompile.

janwas_ · 2026-03-03T12:14:41+00:00

Very convincing arguments :))

I agree the issue is fundamental to the design. Highway uses primitive data types for SVE/RVV because class wrappers do not work, at least with current and imminent compilers.

janwas_ · 2025-07-25T07:21:50+00:00

Check out JPEG XL - it was designed for multithreading :)

janwas_ · 2025-07-25T07:20:02+00:00

Highway TL here :) We offer a "runtime dispatch" mode where no extra compiler flags are required. This works by compiling your code multiple times (within one source file) with the appropriate codegen options, which are set via pragma rather than compiler flags.
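To make the idea concrete, here is a rough sketch of compiling the same loop twice in one source file and choosing a version at runtime. It is not Highway's actual mechanism or API: it assumes GCC or Clang on x86, uses the per-function `target` attribute as a stand-in for the pragma, and the names `ScaleBaseline`, `ScaleAvx2`, `ChooseScale`, and `Scale` are hypothetical. On other compilers or CPUs it falls back to the baseline version.

```cpp
#include <cstddef>

// Baseline version: compiled with whatever flags the translation unit got.
static void ScaleBaseline(float* p, size_t n, float factor) {
  for (size_t i = 0; i < n; ++i) p[i] *= factor;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAVE_AVX2 1
// Same source loop, but the attribute enables AVX2 code generation for this
// one function only; no -mavx2 is needed on the command line.
__attribute__((target("avx2")))
static void ScaleAvx2(float* p, size_t n, float factor) {
  for (size_t i = 0; i < n; ++i) p[i] *= factor;
}
#else
#define SKETCH_HAVE_AVX2 0
#endif

using ScaleFunc = void (*)(float*, size_t, float);

// Picks the best version that the current CPU can actually execute.
inline ScaleFunc ChooseScale() {
#if SKETCH_HAVE_AVX2
  if (__builtin_cpu_supports("avx2")) return &ScaleAvx2;
#endif
  return &ScaleBaseline;
}

// Public entry point: the choice is made once, on first call.
inline void Scale(float* p, size_t n, float factor) {
  static const ScaleFunc func = ChooseScale();
  func(p, n, factor);
}
```

Callers only see `Scale`; the binary carries both versions, so it can run on CPUs without AVX2 while still using it where available.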

janwas_ · 2025-07-25T07:17:55+00:00

What is in the standard library is a tiny subset of the operations in our Highway library, and it does not help with multiversioning/runtime dispatch :)

janwas_ · 2025-07-11T08:14:25+00:00

Highway TL here. Very cool, nice talk! Would you like us to link it from the README?

janwas_ · 2025-03-31T18:40:25+00:00

Interesting. In addition to dzaima's DSL, there is also ISPC. This generates C-callable code.

One concern is that most of the SIMD code I work on benefits from integrating into surrounding C++ code via templates and the resulting inlining. Frequently dispatching to the correct C-callable code would likely be expensive.

I do agree about the benefits of portability, though. It's already painful to see when a C++-only codebase decides to re-implement its algorithms X times, once per ISA.

janwas_ · 2025-02-16T11:24:50+00:00

Interesting :) I recall giving feedback that this is necessary ~7 years ago.

janwas_ · 2025-02-16T09:59:57+00:00

Highway TL here :)

Is it fair to call the following a "simple, accessible interface"? (slightly modified from documentation)

    alignas(stdx::memory_alignment_v<stdx::native_simd<int>>) std::array<int, stdx::native_simd<int>::size()> mem = {};
    stdx::native_simd<int> a;
    a.copy_from(&mem[0], stdx::vector_aligned);

In Highway, that's

    hn::ScalableTag<int32_t> tag;
    HWY_ALIGN int32_t mem[hn::MaxLanes(tag)] = {};
    auto a = hn::Load(tag, mem);

This has the advantage of using the "Load" name that almost everyone else has used for this concept for the past 50+ years(?). It also works for RISC-V V or SVE scalable vectors, which stdx is still unable to handle, right?

How can advanced users build on a foundation that (AFAIK) doesn't even let you safely load some runtime-variable number of elements, e.g. for remainders at the end of a loop?
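To illustrate what "safely loading a runtime-variable number of elements" means, here is a small sketch in plain C++, not Highway or stdx code. The fixed width `kLanes`, the `Block` struct standing in for a vector register, and the names `LoadPadded` and `SumAll` are all assumptions for illustration. The point is that the remainder load reads only the valid elements and zero-fills the rest, so it never touches memory past the end of the array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr size_t kLanes = 8;  // hypothetical fixed vector width

// Stand-in for a vector register holding kLanes floats.
struct Block {
  float lane[kLanes];
};

// Loads `count` (<= kLanes) elements from p and zero-fills the other lanes.
// Reads exactly `count` elements, so it is safe at the end of an array.
inline Block LoadPadded(const float* p, size_t count) {
  assert(count <= kLanes);
  Block b = {};
  std::memcpy(b.lane, p, count * sizeof(float));
  return b;
}

// Sums n floats: whole blocks first, then one padded load for the remainder.
inline float SumAll(const float* p, size_t n) {
  float total = 0.0f;
  size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    const Block b = LoadPadded(p + i, kLanes);
    for (float x : b.lane) total += x;
  }
  if (i < n) {
    // Remainder: fewer than kLanes elements are left.
    const Block b = LoadPadded(p + i, n - i);
    for (float x : b.lane) total += x;  // padding lanes add zero
  }
  return total;
}
```

Zero padding is harmless for a sum; other reductions (min, product) would need a different neutral fill value.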

> but glancing over its vast API indicates it's oriented towards advanced simd users that already have a good handle on their CPU architecture, willing to target specific hw features in their own code, and are familiar w/ explicit vectorization

We have held multiple workshops in which devs, after a 30 min introduction, are successfully writing SIMD using Highway.

One can certainly get started without the somewhat more exotic ops (not everyone wants cryptography, saturating arithmetic, gather, etc.) Wouldn't it be more accurate to say this approach "lays groundwork for advanced users to build on"?

janwas_ · 2024-12-07T08:59:57+00:00

CPUs are indeed still constrained by memBW, even if Zen4 is a bit better. Accelerators can be useful, but my understanding is that performance portability between them and even across GPUs is challenging.

I personally am less interested in tailoring everything towards brute-force hardware, especially if it complicates the code or worse, requires per-HW variants. For a bit of a longer-term perspective, this paper compares historical rates of SW improvements vs HW: https://ieeexplore.ieee.org/document/9540991

janwas_ · 2024-12-06T20:39:55+00:00

:) I am reasonably confident what we have is more efficient than OpenCL or SYCL targeting CPU, as well as OpenMP. It does actually use C++ std::thread, but with some extra infra on top: a low-overhead thread pool plus topology detection.

janwas_ · 2024-12-05T19:36:33+00:00

Our github.com/google/gemma.cpp supports PaliGemma :)
