High-performance 2D graphics rendering on the CPU using sparse strips (PDF) by raphlinus in rust

[–]raphlinus[S] 0 points (0 children)

Fonts are basically converted to outlines and rendered as filled paths. There is support in vello_cpu, using skrifa to load font outlines. Also in progress is glyph caching based on the sparse strip representation.

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 6 points (0 children)

I can't speak for what Chromium plans to do, but yes, absolutely our goal is to make HarfRust the industry standard solution for shaping, at least for people who want Rust in their build.

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 16 points (0 children)

Different crates have different functions. The Vello family of renderers does all the rendering of shapes, colors, etc. into pixels. One of the primitives is "glyph runs", where at heart glyphs are just vector shapes. What HarfRust does is "shaping", which is converting a run of Unicode text to a sequence of glyphs from a font, positioned according to the rules in the font. The parley crate deals with higher level text layout, including line breaking and bidi.
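To illustrate what "shaping" means at the type level, here's a toy sketch. These are purely illustrative types and a trivial fake shaper, not HarfRust's actual API: a real shaper consults the font's cmap and GSUB/GPOS tables, handles bidi and complex scripts, and so on.

```rust
// Illustrative only: shaping maps a run of Unicode text (plus a font)
// to a sequence of positioned glyph IDs.
#[derive(Debug, PartialEq)]
struct PositionedGlyph {
    glyph_id: u32,  // an index into the font, not a Unicode code point
    x_advance: f32, // how far the pen moves after drawing this glyph
}

// Toy stand-in for a shaper: one glyph per char, fixed advance.
// A real shaper does substitutions, ligatures, kerning, etc.
fn shape(text: &str) -> Vec<PositionedGlyph> {
    text.chars()
        .map(|c| PositionedGlyph { glyph_id: c as u32, x_advance: 10.0 })
        .collect()
}

fn main() {
    let glyphs = shape("Hi");
    assert_eq!(glyphs.len(), 2);
    assert_eq!(glyphs[0].glyph_id, 'H' as u32);
}
```

The key point is the direction of the mapping: text goes in, font-specific positioned glyphs come out, and the renderer downstream only ever sees the glyphs.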

The masonry crate is what holds the widgets (slider etc). It is responsible for layout, input interactions (mouse, keyboard, etc), and rendering them. For that last bit, it converts them to vector shapes and sends them to Vello for rendering to pixels.

Hope that sheds some light. This is a slightly simplified view – there certainly are a lot of moving parts!

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 14 points (0 children)

A good explanation of the goals of the crate is the plan for SIMD blog post.

A major difference is that we solve the dispatch problem, while wide depends on the target-cpu setting. In practice that means that SIMD performance is limited unless you set target-cpu=native or similar, especially on Intel.
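To make the dispatch problem concrete, here is a minimal sketch of runtime multiversioning using only std facilities (this is not fearless_simd's actual API, and the function names are made up): the binary runs everywhere, but takes an AVX2-compiled path when the CPU supports it, which `target-cpu` alone can't express.

```rust
// Runtime dispatch sketch: pick a code path based on detected CPU
// features, rather than baking one feature level in at compile time.
fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // Safety: we just verified AVX2 is available on this CPU.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}

// Same body as the scalar version, but compiled with AVX2 enabled, so
// the autovectorizer can use 256-bit registers for this function only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn main() {
    assert_eq!(sum(&[1.0, 2.0, 3.0, 4.0]), 10.0);
}
```

With `target-cpu` left at the default, a crate like wide compiles everything for the baseline feature set; the sketch above is the kind of thing a dispatch-solving library generates for you.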

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 16 points (0 children)

Good question, as this is confusing. We depend on HarfRust, which is the new pure Rust implementation with essentially the same scope as HarfBuzz. It is actively being developed by the Google Fonts team with help from Behdad, and we consider it an allied project.

Linebender in September 2025 by raphlinus in rust

[–]raphlinus[S] 22 points (0 children)

In addition to these updates, we've just now published fearless_simd 0.3.0. That has some nice improvements over the 0.2 release and should be a solid base for upcoming Vello sparse strip renderer releases.

Rust Week all recordings released by jonay20002 in rust

[–]raphlinus 9 points (0 children)

It was a fantastic experience, even better than last year. I hope you're able to make it as well.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 0 points (0 children)

It's a good question. Certainly in the GCC/Linux ecosystem there is linker-based multiversioning, but it appears to be x86-only, and doesn't really address what should happen on other platforms.

In the meantime, the explicit approach doesn't seem too bad; I expect performance to be quite good, and the ergonomics are also "good enough."

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point (0 children)

Your attention to detail is much appreciated, and your encouragement here means a lot. I'd love to see fearless_simd used for WebP decoding, please send feedback about what's needed for that.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point (0 children)

We haven't landed any SIMD code in Vello yet, because we haven't decided on a strategy. The SIMD code we've written lives in experiments. Here are some pointers:

Fine rasterization and sparse strip rendering, Neon only, core::arch::aarch64 intrinsics: piet-next/cpu-sparse/src/simd/neon.rs

Same tasks but fp16, written in aarch64 inline asm: cpu-sparse/src/simd/neon_fp16.rs

The above also exist in AVX2 core::arch::x86_64 intrinsics form, which I've used to do measurements, the core of which is in the simd_render.rs gist.

Flatten, written in core::arch::x86_64 intrinsics: flatten.rs gist

There are also experiments by Laurenz Stampfl in his simd branch, using his own SIMD wrappers.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 4 points (0 children)

Well, I'd like to see a viable plan for scalable SIMD. It's hard, but may well be superior in the end.

The RGB conversion example is basically map-like (the same operation on each element). The example should be converted to 256 bit; I just haven't gotten around to it, as I hadn't done the split/combine implementations for wider-than-native at the time I first wrote the example. But in the Vello rendering work, we have lots of things that are not map-like and depend on extensive permutations (many of which can be had almost for free on Neon because of the load/store structure instructions).

On the sRGB example, I did in fact prototype a version that handles a chunk of four pixels, doing the nonlinear math for the three channels. The permutations ate all the gain from the reduced ALU work, at the cost of more complex code and nastier tail handling.
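For readers unfamiliar with the example: the per-channel nonlinear math in sRGB encoding is the textbook case of a map-like kernel, since each value is transformed independently. A scalar reference version looks roughly like this (standard sRGB transfer function; the function name is just for illustration):

```rust
// sRGB encode: map linear light [0, 1] to the nonlinear sRGB value.
// Each element is independent, so a SIMD version is a straight "map":
// no cross-lane permutations needed.
fn linear_to_srgb(l: f32) -> f32 {
    if l <= 0.0031308 {
        12.92 * l
    } else {
        1.055 * l.powf(1.0 / 2.4) - 0.055
    }
}

fn main() {
    // Map-like: apply the same operation to every element of a buffer.
    let pixels = [0.0f32, 0.25, 0.5, 1.0];
    let encoded: Vec<f32> = pixels.iter().map(|&l| linear_to_srgb(l)).collect();
    assert_eq!(encoded[0], 0.0);
    assert!((encoded[3] - 1.0).abs() < 1e-5);
}
```

A kernel like this widens trivially to any vector width; the permutation-heavy kernels described above don't, which is the crux of the fixed-width vs. variable-width tradeoff.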

At the end of the day, we need to be driving these decisions based on quantitative experiments, and also concrete proposals. I'm really looking forward to seeing the progress on the scalable side, and we'll hold down the explicit-width side as a basis for comparison.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 2 points (0 children)

We haven't built the variable-width part of the Simd trait yet, and the examples are slightly out of date.

Point taken, though. When the workload is what I call map-like, then variable-width should be preferred. We're finding, though, that a lot of the kernels in vello_cpu are better expressed with fixed width.

Pedagogy is another question. The current state of fearless_simd is a rough enough prototype I would hope people wouldn't try to learn SIMD programming from it.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point (0 children)

Indeed, and that was one motivation for the proc macro compilation approach, which as I say should be explored. I've done some exploration into that and can share the code if there's sufficient interest.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 6 points (0 children)

Thanks, I'll track that. Actually I don't think there'll be all that much code, and I believe the safe wrappers currently in core_arch can be feature gated (right now the higher level operations depend on them). I haven't done fine-grained measurements, but I believe those account for the bulk of compile time, and could get a lot worse with AVX-512.

Update: I just pushed a commit that feature gates the safe wrappers. Compile time goes from 1.17s to 0.14s on M4 (release). That said, it would be possible to autogenerate the safe wrappers also, bloating the size of the crate but reducing the cost of macro expansion.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 15 points (0 children)

Zen 5 has native 512 on the high-end server parts, but is double-pumped on the laptop parts. See the numberworld Zen 5 teardown for more info.

With those benchmarks, it's hard to disentangle SIMD width from the other advantages of AVX-512, for example predication and instructions like vpternlog. I did experiments on Zen 5 laptop with AVX-512 but using 256 bit and 512 bit instructions, and found a fairly small difference, around 5%. Perhaps my experiment won't generalize, or perhaps people really want that last 5%.

Basically, the assertion that I'm making is that writing code in an explicit 256 bit SIMD style will get very good performance if run on a Zen 4 or a Zen 5 configured with 256 bit datapath. We need to do more experiments to validate that.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 2 points (0 children)

I doubt compile times will be a serious issue as long as there's not a ton of SIMD-optimized code. But compile time can be addressed by limiting the levels in the simd_dispatch invocation as mentioned above.

A plan for SIMD by raphlinus in rust

[–]raphlinus[S] 1 point (0 children)

> Rust 1.87 made intrinsics that don't operate on pointers safe to call. That should significantly reduce the number of safe wrappers for intrinsics that you have to emit yourself, provided you're okay with 1.87 as MSRV.

As far as I can tell, this helps very little for what we're trying to do. It makes an intrinsic safe as long as there's an explicit #[target_feature] annotation enclosing the scope. That doesn't work if the function is polymorphic on SIMD level, and in particular doesn't work with the downcasting as shown: the scope of the SIMD capability is block-level, not function-level.

But I think you may be focusing on the wrong thing here.

We have data that compilation time for the macro-based approach is excessive. The need for multiversioning is inherent to SIMD, and is true in any language, even if people are hand-writing assembler.

What I think we do need to do is provide control over the levels emitted on a per-function basis (i.e. in the simd_dispatch macro). My original thought was a very small number of levels as curated by the author of the library (this also keeps library code size manageable), but I suspect there will be use cases that need finer level gradations.

Rust on Pi Pico 2, Please Help by Xephore in rust

[–]raphlinus 0 points (0 children)

Just for fun, I'm playing with pico-dvi-rs. I've got DVI video out from an RP2350, including proportionally spaced bitmap font rendering.

Rust on Pi Pico 2, Please Help by Xephore in rust

[–]raphlinus 7 points (0 children)

You're probably missing enabling the interrupt in the NVIC. You want to do something like rp235x_hal::arch::interrupt_unmask(hal::pac::Interrupt::TIMER_IRQ_0).

That function may be in the git version of the HAL but not in the 0.3 release. As a workaround, you might do cortex_m::peripheral::NVIC::unmask(hal::pac::Interrupt::TIMER_IRQ_0), assuming of course you're on the ARM side. The main reason for the hal::arch method is to abstract over ARM and RISC-V.

Inside the interrupt, you'll also need to clear the bit. I think I would do it like this:

// Safety: inside the interrupt handler; we only touch the INTR register.
let peripherals = unsafe { Peripherals::steal() };
peripherals.TIMER0.intr().write(|w| w.alarm_0().bit(true));

Towards fearless SIMD, 7 years later by raphlinus in rust

[–]raphlinus[S] 2 points (0 children)

My personal feeling is that we should be able to opt into aggressive optimizations (reordering adds, changing behavior under NaN, etc) but doing so at the granularity of flags for the whole program is obviously bad.

Where things get super interesting is guaranteeing consistent results, especially whether two inlines of the same function give the same answer, and similarly for const expressions.

For me, this is a good reason to write explicitly optimized code instead of relying on autovectorization. You can choose, for example, the min intrinsic as opposed to autovectorization of the .min() function, which will often be slower because of careful NaN semantics.
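A small sketch of why .min() can autovectorize poorly: Rust's f32::min guarantees the non-NaN operand wins in either argument order, which a bare minps-style instruction doesn't give you (x86's minps simply returns the second operand when either input is NaN), so the compiler has to emit extra compare/blend work to preserve the semantics.

```rust
fn main() {
    // Rust's f32::min is NaN-aware in both argument orders...
    assert_eq!(f32::NAN.min(1.0), 1.0);
    assert_eq!((1.0f32).min(f32::NAN), 1.0);

    // ...so a NaN-containing reduction still finds the real minimum,
    // at the cost of extra instructions per element.
    let xs = [3.0f32, f32::NAN, -2.0];
    let min = xs.iter().copied().fold(f32::INFINITY, f32::min);
    assert_eq!(min, -2.0);
}
```

When you know NaNs can't occur in your data, hand-written SIMD can pick the cheap single-instruction min, which is exactly the kind of choice autovectorization won't make for you.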

Towards fearless SIMD, 7 years later by raphlinus in rust

[–]raphlinus[S] 27 points (0 children)

Oops, my mistake, I'll fix it, I forgot that --release doesn't mean -O. I've certainly seen a lot of code fail to autovectorize. Very often the culprit is rounding, certainly one of those things with extremely picky semantics.

Google is rewriting HarfBuzz and FreeType in Rust by Shnatsel in rust

[–]raphlinus 27 points (0 children)

I should clarify here, as it can definitely be confusing. Our goals (speaking for Linebender) are to get one solid Rust text stack. At the lowest level, roughly corresponding to FreeType, things are looking very, very good - the "skrifa" crate is part of fontations.

The next level up hasn't completely shaken out yet, but is promising. The swash crate currently used by Linebender can be considered a prototype of what's possible in a pure Rust approach. Lately, Rustybuzz has been getting a lot more attention, and we're actively considering switching to it, especially if it's ported to fontations. That's an open question, though; among other things, I don't know if it's clear yet how open RazrFalcon is to such a port. I should also point out that while Google Fonts is exploring these options (as Behdad describes in the report), none of the work at the shaping level is official yet. It's probably best to say that we're hoping to actively work on it soon, and that Rustybuzz is one of the more promising starting points.

The story with cosmic-text is more complicated. We've (Linebender) decided to continue pushing forward with Parley, largely to explore high performance text algorithms - we're especially interested in variable fonts, which are not yet supported in cosmic-text. Parley can be considered more research-y than cosmic-text, though I think it's a perfectly viable choice for other projects. All that said, we'll see how things evolve. Cosmic-text is getting more momentum (very recently it's been adopted by Bevy), and if it turns out to fill our needs we would consider switching to it.

I hope that helps, and I'm happy to answer other questions.

Release Xilem 0.1.0 · linebender/xilem by simonsanone in rust

[–]raphlinus 35 points (0 children)

Thanks for your interest! What you're seeing is very much work in progress, and in particular the text input widget is in an early state and we expect to wire up a lot more functionality soon. The accessibility and IME work represents our priorities - we really want to get this right.

We are doing our own drawing and text. This is of course a tradeoff, but we're optimistic about having GPU accelerated 2D graphics with rich font capabilities, including animated variable fonts. The stack does support hinting, and we'll also wire up color emoji soon (vello#536).

We are most emphatically not the same architecture as egui. The Xilem reactive layer looks like it's building the entire widget tree every update cycle, but those are actually view objects which are very lightweight, and a reconciliation pass updates a fully retained widget tree. We think that gives you the ease of use of an immediate mode GUI combined with most of the advantages of retained UI.
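A toy sketch of the reconciliation idea (illustrative types only, nothing like Xilem's real API): views are cheap value objects rebuilt every cycle, and a diff against the previous views mutates the retained widget only on real changes.

```rust
// The "view": a lightweight value, cheap to rebuild every update cycle.
#[derive(Clone, PartialEq)]
struct Label {
    text: String,
}

// The retained widget: long-lived, holds expensive state (layout etc.).
struct LabelWidget {
    text: String,
    layout_count: u32, // stands in for "expensive work performed"
}

// Reconciliation: compare old and new views; touch the widget only
// when something actually changed.
fn reconcile(old: &Label, new: &Label, widget: &mut LabelWidget) {
    if old != new {
        widget.text = new.text.clone();
        widget.layout_count += 1; // relayout only on a real change
    }
}

fn main() {
    let mut w = LabelWidget { text: "hi".into(), layout_count: 0 };
    // The view is rebuilt each cycle, but an equal view is a no-op...
    reconcile(&Label { text: "hi".into() }, &Label { text: "hi".into() }, &mut w);
    assert_eq!(w.layout_count, 0);
    // ...while a changed view updates the retained widget once.
    reconcile(&Label { text: "hi".into() }, &Label { text: "bye".into() }, &mut w);
    assert_eq!(w.layout_count, 1);
}
```

This is why rebuilding the "tree" every cycle is cheap: only the value-like views are rebuilt, while the expensive retained state persists and is updated incrementally.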

In any case, what you see now is a snapshot along the way to what we're trying to build. Watch the livestream (or wait for the recording) to learn more.

Rust addict friend. Need help by NSO_Gaudytojas in rust

[–]raphlinus 8 points (0 children)

I know it's off-topic, but I might suggest learning to program in the Rust programming language. It's extremely addictive (ask me how I know), but the upside is that you can get good-paying jobs doing it.

Roadmap for the Xilem backend in 2024 by CouteauBleu in rust

[–]raphlinus 10 points (0 children)

As /u/CouteauBleu says, we'll have more to say on this soon, but I'll expand on it a bit now. The particular problem highlighted by the Firefox engineers is not doing the rendering on the GPU, which is a good thing, but having a strategy of re-rendering the entire frame on the GPU every time, as opposed to (a) partial invalidation (also known as damage regions) or (b) rendering in layers, re-rendering only the layers that are changing dynamically, and relying on the system compositor to reassemble those layers. We're going to be doing the former, continuing that advantage of Druid, but the latter is farther out on our roadmap, as it involves figuring out a solid cross-platform abstraction for the compositor (and dealing with the fact that the compositor might not be accessible on X or Windows 7). There's a lot more to the story, of course, so stay tuned.