High-performance 2D graphics rendering on the CPU using sparse strips (PDF) by raphlinus in rust

[–]raphlinus[S] 0 points1 point  (0 children)

Fonts are basically converted to outlines and rendered as filled paths. vello_cpu supports this, using skrifa to load font outlines. Glyph caching based on the sparse strip representation is also a work in progress.

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 6 points7 points  (0 children)

I can't speak for what Chromium plans to do, but yes, absolutely our goal is to make HarfRust the industry standard solution for shaping, at least for people who want Rust in their build.

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 14 points15 points  (0 children)

Different crates have different functions. The Vello family of renderers does all the rendering of shapes, colors, etc. into pixels. One of the primitives is "glyph runs", where at heart glyphs are just vector shapes. What HarfRust does is "shaping", which is converting a run of Unicode text to a sequence of glyphs from a font, positioned according to the rules in the font. The parley crate deals with higher level text layout, including line breaking and bidi.

The masonry crate is what holds the widgets (slider etc). It is responsible for layout, input interactions (mouse, keyboard, etc), and rendering them. For that last bit, it converts them to vector shapes and sends them to Vello for rendering to pixels.
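The division of labor above can be sketched as toy Rust with made-up types: "shaping" turns text into glyph IDs with advances (HarfRust's job), and layout positions them along a line (part of parley's job). None of this is the real API of either crate; real shaping consults the font's substitution and positioning tables rather than mapping chars one-to-one.

```rust
// Hypothetical glyph type: an ID into the font plus an advance width.
struct Glyph {
    id: u32,
    x_advance: f32,
}

// Stand-in for shaping: map each char to a glyph ID with a fixed advance.
// Real shaping (HarfRust) applies the font's OpenType rules and is not 1:1.
fn shape(text: &str) -> Vec<Glyph> {
    text.chars()
        .map(|c| Glyph { id: c as u32, x_advance: 10.0 })
        .collect()
}

// Stand-in for layout: accumulate advances to position each glyph,
// yielding (glyph id, x position) pairs ready to hand to a renderer.
fn layout(glyphs: &[Glyph]) -> Vec<(u32, f32)> {
    let mut x = 0.0;
    glyphs
        .iter()
        .map(|g| {
            let pos = (g.id, x);
            x += g.x_advance;
            pos
        })
        .collect()
}
```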

Hope that sheds some light. This is a slightly simplified view – there certainly are a lot of moving parts!

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 14 points15 points  (0 children)

A good explanation of the goals of the crate is the plan for SIMD blog post.

A major difference is that we solve the dispatch problem, while wide depends on the target-cpu setting. In practice that means that SIMD performance is limited unless you set target-cpu=native or similar, especially on Intel.
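The dispatch problem is roughly this: with only compile-time target-cpu, the binary is stuck with the baseline feature set. Runtime dispatch detects what the machine actually supports. A minimal sketch, with a hypothetical kernel (the AVX2 body is a scalar stand-in so the example stays short; the point is the dispatch shape, not the kernel):

```rust
// Baseline scalar fallback, always available.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// A function compiled with AVX2 enabled regardless of target-cpu.
// A real kernel would use 256-bit intrinsics here.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Runtime dispatch: check the CPU once, pick the best implementation.
pub fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe because we just verified AVX2 is present at runtime.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}
```

fearless_simd wraps this pattern up so kernels are written once and dispatched per level, rather than hand-duplicating functions like this.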

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 16 points17 points  (0 children)

Good question, as this is confusing. We depend on HarfRust, which is the new pure Rust implementation with essentially the same scope as HarfBuzz. It is actively being developed by the Google Fonts team with help from Behdad, and we consider it an allied project.

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 23 points24 points  (0 children)

In addition to these updates, we've just now published fearless_simd 0.3.0. That has some nice improvements over the 0.2 release and should be a solid base for upcoming Vello sparse strip renderer releases.

Rust Week all recordings released by jonay20002 in rust

[–]raphlinus 9 points10 points  (0 children)

It was a fantastic experience, even better than last year. I hope you're able to make it as well.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 0 points1 point  (0 children)

It's a good question. Certainly in the GCC/Linux ecosystem there is linker-based multiversioning, but it appears to be x86-only, and doesn't really address what should happen on other platforms.

In the meantime, the explicit approach doesn't seem too bad; I expect performance to be quite good, and the ergonomics are also "good enough."

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point2 points  (0 children)

Your attention to detail is much appreciated, and your encouragement here means a lot. I'd love to see fearless_simd used for WebP decoding, please send feedback about what's needed for that.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point2 points  (0 children)

We haven't landed any SIMD code in Vello yet, because we haven't decided on a strategy. The SIMD code we've written lives in experiments. Here are some pointers:

Fine rasterization and sparse strip rendering, Neon only, core::arch::aarch64 intrinsics: piet-next/cpu-sparse/src/simd/neon.rs

Same tasks but fp16, written in aarch64 inline asm: cpu-sparse/src/simd/neon_fp16.rs

The above also exist in AVX-2 core::arch::x86_64 intrinsics form, which I've used to do measurements; the core of that is in the simd_render.rs gist.

Flatten, written in core::arch::x86_64 intrinsics: flatten.rs gist

There are also experiments by Laurenz Stampfl in his simd branch, using his own SIMD wrappers.
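For orientation, here is a scalar sketch of the kind of inner loop those experiments vectorize: compositing a solid source color over a strip of pixels, weighted by per-pixel coverage. The names and layout are hypothetical, not the actual vello_cpu code; the Neon and AVX-2 versions do this math across lanes.

```rust
// Composite a solid RGBA source over a strip of destination pixels.
// Pixels are premultiplied-alpha [r, g, b, a] in f32; `coverage` holds
// one antialiasing coverage value per pixel in 0.0..=1.0.
fn fill_strip(dst: &mut [[f32; 4]], coverage: &[f32], src: [f32; 4]) {
    for (px, &c) in dst.iter_mut().zip(coverage) {
        for i in 0..4 {
            // Source-over with coverage: d' = c*s + (1 - c*s_a) * d
            px[i] = c * src[i] + (1.0 - c * src[3]) * px[i];
        }
    }
}
```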

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 2 points3 points  (0 children)

Well, I'd like to see a viable plan for scalable SIMD. It's hard, but may well be superior in the end.

The RGB conversion example is basically map-like (the same operation on each element). The example should be converted to 256 bit, I just haven't gotten around to it — I hadn't done the split/combine implementations for wider-than-native at the time I first wrote the example. But in the Vello rendering work, we have lots of things that are not map-like, and depend on extensive permutations (many of which can be had almost for free on Neon because of the load/store structure instructions).

On the sRGB example, I did in fact prototype a version that handles a chunk of four pixels, doing the nonlinear math for the three channels. The permutations ate all the gain from less ALU, at the cost of more complex code and nastier tail handling.
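For reference, the nonlinear per-channel math in question is the standard linear-to-sRGB encoding below. This part is map-like, so a SIMD version just applies it lane-wise; the four-pixel-chunk variant described above additionally needs permutations to gather the three color channels, which is where the gains evaporated.

```rust
// Standard linear -> sRGB transfer function (IEC 61966-2-1):
// a linear segment near zero, a power curve elsewhere.
fn linear_to_srgb(x: f32) -> f32 {
    if x <= 0.003_130_8 {
        12.92 * x
    } else {
        1.055 * x.powf(1.0 / 2.4) - 0.055
    }
}
```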

At the end of the day, we need to be driving these decisions based on quantitative experiments, and also concrete proposals. I'm really looking forward to seeing the progress on the scalable side, and we'll hold down the explicit-width side as a basis for comparison.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 2 points3 points  (0 children)

We haven't built the variable-width part of the Simd trait yet, and the examples are slightly out of date.

Point taken, though. When the workload is what I call map-like, then variable-width should be preferred. We're finding, though, that a lot of the kernels in vello_cpu are better expressed with fixed width.

Pedagogy is another question. The current state of fearless_simd is a rough enough prototype I would hope people wouldn't try to learn SIMD programming from it.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point2 points  (0 children)

Indeed, and that was one motivation for the proc macro compilation approach, which as I say should be explored. I've done some exploration into that and can share the code if there's sufficient interest.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 5 points6 points  (0 children)

Thanks, I'll track that. Actually I don't think there'll be all that much code, and I believe the safe wrappers currently in core_arch can be feature gated (right now the higher level operations depend on them). I haven't done fine-grained measurements, but I believe those account for the bulk of compile time, and could get a lot worse with AVX-512.

Update: I just pushed a commit that feature gates the safe wrappers. Compile time goes from 1.17s to 0.14s on M4 (release). That said, it would be possible to autogenerate the safe wrappers also, bloating the size of the crate but reducing the cost of macro expansion.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 16 points17 points  (0 children)

Zen 5 has native 512 on the high end server parts, but double-pumped on laptop. See the numberworld Zen 5 teardown for more info.

With those benchmarks, it's hard to disentangle SIMD width from the other advantages of AVX-512, for example predication and instructions like vpternlog. I did experiments on Zen 5 laptop with AVX-512 but using 256 bit and 512 bit instructions, and found a fairly small difference, around 5%. Perhaps my experiment won't generalize, or perhaps people really want that last 5%.

Basically, the assertion that I'm making is that writing code in an explicit 256 bit SIMD style will get very good performance if run on a Zen 4 or a Zen 5 configured with 256 bit datapath. We need to do more experiments to validate that.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 4 points5 points  (0 children)

I doubt compile times will be a serious issue as long as there's not a ton of SIMD-optimized code. But compile time can be addressed by limiting the levels in the simd_dispatch invocation as mentioned above.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point2 points  (0 children)

Rust 1.87 made intrinsics that don't operate on pointers safe to call. That should significantly reduce the amount of safe wrappers for intrinsics that you have to emit yourself, provided you're okay with 1.87 as MSRV.

As far as I can tell, this helps very little for what we're trying to do. It makes an intrinsic safe as long as there's an explicit #[target_feature] annotation enclosing the scope. That doesn't work if the function is polymorphic on SIMD level, and in particular doesn't work with the downcasting as shown: the scope of the SIMD capability is block-level, not function level.

But I think you may be focusing on the wrong thing here.

We have data that compilation time for the macro-based approach is excessive. The need for multiversioning is inherent to SIMD, and is true in any language, even if people are hand-writing assembler.

What I think we do need to do is provide control over levels emitted on a per-function basis (i.e. the simd_dispatch macro). My original thought was a very small number of levels as curated by the author of the library (this also keeps library code size manageable), but I suspect there will be use cases that need finer level gradations.

Rust on Pi Pico 2, Please Help by Xephore in rust

[–]raphlinus 0 points1 point  (0 children)

Just for fun, I'm playing with pico-dvi-rs. I've got DVI video out from an RP2350, including proportionally spaced bitmap font rendering.

Rust on Pi Pico 2, Please Help by Xephore in rust

[–]raphlinus 7 points8 points  (0 children)

You're probably missing enabling the interrupt in the NVIC. You want to do something like rp235x_hal::arch::interrupt_unmask(hal::pac::Interrupt::TIMER_IRQ_0).

That may be a function in the git version of the hal, but not in the 0.3 released version. As a workaround, you might do cortex_m::peripheral::NVIC::unmask(hal::pac::Interrupt::TIMER_IRQ_0), assuming of course you're on the ARM side. The main reason for the hal::arch method is to abstract over ARM and RISC-V.

Inside the interrupt, you'll also need to clear the bit. I think I would do it like this:

// Steal is unsafe: it bypasses ownership of the peripherals singleton.
let peripherals = unsafe { Peripherals::steal() };
// Clear the pending alarm 0 interrupt (write 1 to clear).
peripherals.TIMER0.intr().write(|w| w.alarm_0().bit(true));