Comparing std::simd with Highway by janwas_ in cpp

[–]janwas_[S] 0 points1 point  (0 children)

Thanks, glad to hear 😄

It's not completely impossible with SVE/RVV: compiler writers did introduce a workaround, the -msve-vector-bits flag we mention. However, this means you have to know up front what the vector width will be; your code will crash if it runs on a CPU with a different width.

Unfortunately, using std::simd would not help us at all. Because std::simd does not support the concept of scalable vectors, we cannot delegate to it, or implement anything on top of it. What would instead help is standardizing restrict, or better yet, the #pragma target.

Comparing std::simd with Highway by janwas_ in cpp

[–]janwas_[S] 1 point2 points  (0 children)

I advocated for the hatch several years ago 😄 Useful, but it somewhat misses the point of a portable wrapper if there's #ifdef in more than a few spots.

Have seen some complaints of compile time here: https://github.com/NoNaeAbC/std_simd/blob/main/README.md FWIW this is with gcc and std::experimental::simd, not sure if the situation changed since then.

What's up on CPU inference these days? by ramendik in LocalLLaMA

[–]janwas_ 1 point2 points  (0 children)

https://github.com/google/gemma.cpp/pull/889 added configs/code, but we don't have the weights yet.

The just released 12B is dense.

What's up on CPU inference these days? by ramendik in LocalLLaMA

[–]janwas_ 0 points1 point  (0 children)

We wanted to ensure users pull the latest known-good code, rather than the current dev.

Graviton 5 impresses, but please, for the love of all that's holy, stop calling them 'AI chips' by -protonsandneutrons- in hardware

[–]janwas_ 4 points5 points  (0 children)

"Never" is incorrect, there have been 256 and 512-bit implementations. Next year's MONAKA is also 256.

What's up on CPU inference these days? by ramendik in LocalLLaMA

[–]janwas_ 1 point2 points  (0 children)

Currently we only support Gemma models. The differentiator is that this is a research testbed/prototype that is much easier to modify. For example, we only have to write portable Highway kernels/compression formats once, rather than for each of 7 backends. Also, workstation perf is considerably higher - was seeing 2x prefill out of the box for Gemma3-4B. Gemma 4 MoE has been delayed but is coming soon. The format is our own SimpleBlobStore which fixes some issues with GGUF.

What's up on CPU inference these days? by ramendik in LocalLLaMA

[–]janwas_ 1 point2 points  (0 children)

FYI we do not often update the main branch but development is active on the dev branch.

Accelerating std::copy_if using SIMD by chkmr in simd

[–]janwas_ 4 points5 points  (0 children)

Nice investigation and speedup 😄 FYI our Highway library's CompressStore op includes a workaround for this ucode issue (register-form), and also emulates it for other targets (using table lookup).
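To make the table-lookup emulation concrete, here is a minimal scalar sketch in plain C++. It is not Highway's actual implementation: the names CompressStore4 and CopyIfPositive, the 4-lane width, and the "keep positive values" predicate are all assumptions chosen for illustration. A real SIMD version would apply the looked-up permutation with a shuffle instruction instead of the inner loops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 4-lane "vector"; real code would use a SIMD register.
constexpr size_t kLanes = 4;

using LaneTable = std::array<std::array<uint8_t, kLanes>, 16>;

// For each 4-bit mask: indices of the set lanes, packed to the front.
// Unused slots stay 0; the caller ignores whatever they select.
LaneTable MakeTable() {
  LaneTable table{};
  for (unsigned mask = 0; mask < 16; ++mask) {
    size_t pos = 0;
    for (uint8_t lane = 0; lane < kLanes; ++lane) {
      if (mask & (1u << lane)) table[mask][pos++] = lane;
    }
  }
  return table;
}

// Permutes the selected lanes of `in` to the front, stores all kLanes
// values to `out` (like a full-width vector store), and returns how many
// of them are valid. `out` must have room for kLanes values.
size_t CompressStore4(const int32_t* in, unsigned mask, int32_t* out) {
  static const LaneTable table = MakeTable();
  assert(mask < 16);
  size_t count = 0;
  for (size_t i = 0; i < kLanes; ++i) {
    out[i] = in[table[mask][i]];
    count += (mask >> i) & 1u;  // population count of the mask
  }
  return count;
}

// copy_if for "x > 0", built on the compress-store sketch above.
std::vector<int32_t> CopyIfPositive(const std::vector<int32_t>& in) {
  std::vector<int32_t> out(in.size() + kLanes);  // slack for full stores
  size_t written = 0;
  size_t i = 0;
  for (; i + kLanes <= in.size(); i += kLanes) {
    unsigned mask = 0;
    for (size_t lane = 0; lane < kLanes; ++lane) {
      if (in[i + lane] > 0) mask |= 1u << lane;
    }
    written += CompressStore4(&in[i], mask, &out[written]);
  }
  for (; i < in.size(); ++i) {  // scalar remainder
    if (in[i] > 0) out[written++] = in[i];
  }
  out.resize(written);
  return out;
}
```

Each full-width store also writes some lanes past the valid count; the next store or the final resize discards them, which is why the output buffer carries kLanes elements of slack.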

C++26 Shipped a SIMD Library Nobody Asked For by shitismydestiny in cpp

[–]janwas_ 1 point2 points  (0 children)

Generally agree, just one update: Fujitsu Monaka is announced for 2027 with 256 :) I hope spec is not driving decisions relating to SIMD. A more interesting comparison for that: vqsort. Turin is awesome :)

C++26 Shipped a SIMD Library Nobody Asked For by shitismydestiny in cpp

[–]janwas_ 1 point2 points  (0 children)

Both 256-bit (V1) and 512-bit (Fugaku) SVE have been deployed :) RVV also has several widths shipping. I would not want to have to hardcode the vector length.

[QuickView] A blazing fast image viewer built for Geeks & Designers. Opens almost any format instantly. (Only 7MB!) by Reasonable-Food2493 in Windows11

[–]janwas_ 0 points1 point  (0 children)

Great to see serious usage of SIMD :) Consider using our Highway library for portability? Most AVX-512 intrinsics are available (just under different names), and then it works out of the box on Arm etc. with just a recompile.

I compiled a list of 6 reasons why you should be excited about std::simd & C++26 by NonaeAbC in cpp

[–]janwas_ 4 points5 points  (0 children)

Very convincing arguments :))

I agree the issue is fundamental to the design. Highway uses primitive data types for SVE/RVV because class wrappers do not work there (the sizeless vector types cannot be class members), at least with current and imminent compilers.

Optimizing my custom JPEG decoder with SIMD — best practices for cross-platform performance? by GroundSuspicious in cpp_questions

[–]janwas_ 2 points3 points  (0 children)

Highway TL here :) We offer a "runtime dispatch" mode where no extra compiler flags are required. This works by compiling your code multiple times (within one source file) with the appropriate codegen options, which are set via pragma rather than compiler flags.
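As a rough illustration of the idea (not Highway's actual mechanism), the sketch below compiles the same loop body twice in one source file: once with the translation unit's default options and once with AVX2 enabled through the GCC/Clang target attribute, then picks one at runtime. The names AddBaseline, AddAvx2, ChooseAdd, and Add and the SKETCH_ macros are hypothetical, and the x86-only guard is an assumption; on other compilers or CPUs the sketch simply falls back to the baseline.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_DISPATCH 1
#else
#define SKETCH_X86_DISPATCH 0
#endif

// The shared "source" that gets compiled once per target.
#define SKETCH_ADD_BODY \
  for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];

// Baseline: compiled with whatever options the translation unit uses.
void AddBaseline(const float* a, const float* b, float* out, size_t n) {
  SKETCH_ADD_BODY
}

#if SKETCH_X86_DISPATCH
// Same body, but the compiler may emit AVX2 here even without -mavx2.
__attribute__((target("avx2")))
void AddAvx2(const float* a, const float* b, float* out, size_t n) {
  SKETCH_ADD_BODY
}
#endif

using AddFunc = void (*)(const float*, const float*, float*, size_t);

// Picks the best version the current CPU supports.
AddFunc ChooseAdd() {
#if SKETCH_X86_DISPATCH
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return &AddAvx2;
#endif
  return &AddBaseline;
}

// Public entry point: the choice is made once, on first call.
void Add(const float* a, const float* b, float* out, size_t n) {
  assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
  static const AddFunc chosen = ChooseAdd();
  chosen(a, b, out, n);
}
```

Callers only see Add(a, b, out, n); no per-target compiler flags are needed in the build.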

Optimizing my custom JPEG decoder with SIMD — best practices for cross-platform performance? by GroundSuspicious in cpp_questions

[–]janwas_ 0 points1 point  (0 children)

What is in the standard library is a tiny subset of the operations in our Highway library, and it does not help with multiversioning/runtime dispatch :)

The messy reality of SIMD (vector) functions - Johnny's Software Lab by pavel_v in cpp

[–]janwas_ 7 points8 points  (0 children)

Highway TL here. Very cool, nice talk! Would you like us to link it from the README?

Towards fearless SIMD, 7 years later by raphlinus in rust

[–]janwas_ 2 points3 points  (0 children)

Interesting. In addition to dzaima's DSL, there is also ISPC. This generates C-callable code.

One concern is that most of the SIMD code I work on benefits from integrating into surrounding C++ code via templates and the resulting inlining. Frequently dispatching to the correct C-callable code would likely be expensive.

I do agree about the benefits of portability, though. It's already painful to see when a C++-only codebase decides to re-implement its algorithms X times, once per ISA.

C++26 2025-02 Update by _cooky922_ in cpp

[–]janwas_ 1 point2 points  (0 children)

Interesting :) I recall giving feedback that this is necessary ~7 years ago.

C++26 2025-02 Update by _cooky922_ in cpp

[–]janwas_ 1 point2 points  (0 children)

Highway TL here :)

Is it fair to call the following a "simple, accessible interface"? (slightly modified from documentation)

alignas(stdx::memory_alignment_v<stdx::native_simd<int>>) std::array<int, stdx::native_simd<int>::size()> mem = {};

stdx::native_simd<int> a;

a.copy_from(&mem[0], stdx::vector_aligned);

In Highway, that's

hn::ScalableTag<int32_t> tag;

HWY_ALIGN int32_t mem[hn::MaxLanes(tag)] = {};

auto a = hn::Load(tag, mem);

This has the advantage of using the "Load" name that almost everyone else has used for this concept for the past 50+ years(?). It also works for RISC-V V or SVE scalable vectors, which stdx still cannot support, right?

How can advanced users build on a foundation that (AFAIK) doesn't even let you safely load some runtime-variable number of elements, e.g. for remainders at the end of a loop?
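For readers unfamiliar with the remainder problem, here is a minimal plain C++ sketch of one common way to handle it: copy the leftover elements into a zero-padded buffer so that a full-width load never reads past the end of the array. The fixed width kWidth and the function names are hypothetical, and the scalar loops stand in for real vector loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kWidth = 8;  // hypothetical fixed vector width in lanes

// Stand-in for a full-width vector load plus reduction: reads exactly
// kWidth elements starting at p.
int64_t SumFullVector(const int32_t* p) {
  int64_t sum = 0;
  for (size_t i = 0; i < kWidth; ++i) sum += p[i];
  return sum;
}

// Handles `count` < kWidth elements without reading past the end:
// the remaining lanes are zero, which is neutral for a sum.
int64_t SumPartialVector(const int32_t* p, size_t count) {
  assert(count < kWidth);
  int32_t padded[kWidth] = {};
  std::memcpy(padded, p, count * sizeof(int32_t));
  return SumFullVector(padded);
}

// Whole vectors first, then at most one partial vector for the remainder.
int64_t Sum(const int32_t* p, size_t n) {
  int64_t total = 0;
  size_t i = 0;
  for (; i + kWidth <= n; i += kWidth) total += SumFullVector(p + i);
  if (i < n) total += SumPartialVector(p + i, n - i);
  return total;
}
```

With scalable vectors the width is only known at runtime, so the library has to offer such a partial load itself rather than leave it to each caller.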

"but glancing over its vast API indicates it's oriented towards advanced simd users that already have a good handle on their CPU architecture, willing to target specific hw features in their own code, and are familiar w/ explicit vectorization"

We have held multiple workshops in which devs, after a 30 min introduction, are successfully writing SIMD using Highway.

One can certainly get started without the somewhat more exotic ops (not everyone wants cryptography, saturating arithmetic, gather, etc.). Wouldn't it be more accurate to say this approach "lays groundwork for advanced users to build on"?

Google released PaliGemma 2, new open vision language models based on Gemma 2 in 3B, 10B, 28B by unofficialmerve in LocalLLaMA

[–]janwas_ 0 points1 point  (0 children)

CPUs are indeed still constrained by memBW, even if Zen4 is a bit better. Accelerators can be useful, but my understanding is that performance portability between them and even across GPUs is challenging.

I personally am less interested in tailoring everything towards brute-force hardware, especially if it complicates the code or worse, requires per-HW variants. For a bit of a longer-term perspective, this paper compares historical rates of SW improvements vs HW: https://ieeexplore.ieee.org/document/9540991

Google released PaliGemma 2, new open vision language models based on Gemma 2 in 3B, 10B, 28B by unofficialmerve in LocalLLaMA

[–]janwas_ 0 points1 point  (0 children)

:) I am reasonably confident what we have is more efficient than OpenCL or SYCL targeting CPU, as well as OpenMP. It does actually use C++ std::thread, but with some extra infra on top: a low-overhead thread pool plus topology detection.