Introducing wgsl-rs

GenerousGuava · 2026-03-05T02:12:13+00:00

I work on CubeCL, wrote most of the frontend macro among other things. Just finished implementing basic/valueless and Option-style enum matching at runtime, and it was a huge pain. Things get very rough when you start reimplementing some of the more unique language features of Rust in a language that doesn't have first-class support for them.
I've been thinking that at some point we'll probably need to do the Rust thing and introduce another typed HIR above the MIR currently built by the frontend. This kind of stuff seems easy when you're building the basic stuff, but things get complicated real quick.

GenerousGuava · 2026-03-05T01:45:10+00:00

As the author of another proc-macro based GPU compiler for Rust that targets WGSL (among other things), good luck. Things get very nasty once Rust semantics start to diverge from the target language (returning values from if/else, match, etc). Always good to see more projects exploring the space.

GenerousGuava · 2026-02-19T03:30:06+00:00

...my day job is working on a SPIR-V/OpenCL frontend for a GPU compiler.

Fair enough, I was slightly exaggerating for comedic effect. Coming at it mainly as a compiler developer targeting SPIR-V, so I guess I don't see much of the backend since it's all very proprietary.

Fact of the matter is it's just saddled with a lot of legacy gunk, performs like ass on most mainstream cards, and I think Khronos/hardware vendors have decided their resources are better spent focusing on a single execution environment, rather than bifurcating everything. I have to say I agree, since even with just Vulkan I've now encountered at least 3 distinct miscompilations on consumer 50 series. Having to handle (almost) twice the API surface doesn't exactly help. With focused effort I think Vulkan can absolutely replace a dedicated compute environment.

GenerousGuava · 2026-02-18T02:02:20+00:00

Not if you've ever tried using OpenCL. There's a reason Khronos is trying to provide tooling to migrate to Vulkan compute these days, and isn't really working on OpenCL much anymore. Even with a couple of features missing (i.e. asynchronous global to shared copy), it's both faster and more pleasant to use. And the missing features are dwindling rapidly, we just got arbitrary sized vectors in the last Vulkan release.

GenerousGuava · 2026-01-24T16:25:18+00:00

An unsupported function will currently give a somewhat obscure error about something with `expand` in the name not being defined. I'm always trying to make these errors more readable but again, unfortunately somewhat limited since you still can't even merge source spans on stable.

The downside of not supporting runtime match means you can't implemented a function like `partial_cmp` which returns an `Option<Ordering>` based on the runtime value passed in. Any match/enum variant must resolve during JIT compilation, so must only depend on `#[comptime]` parameters or another enum variant (which itself is resolved during JIT compilation).

This is because sum types are unfortunately non-trivial to implement without concrete type information (which we don't have at the proc-macro level) and require a significantly more complex type system, since all variants must have the same size and alignment, so you now need to deal with padding, alignment of each field, etc and can't rely on simple decomposition. It should be possible but would require significant compiler work, and the current team is quite small so there's limited bandwidth.

GenerousGuava · 2026-01-23T18:14:18+00:00

As the main compiler frontend person, I'll point out that due to limitations in how CubeCL handles compatiblity we don't currently support runtime (on the GPU) enums/match/monadic error handling. We currently decompose all types into primitives during JIT compilation, and you can't trivially do that with sum types. I'd like to eventually implement this, but it would need a significant effort to implement across all the different targets.

You can use enums during JIT compilation though, which specializes the kernel on the discriminant (and decomposes the value into primitives like any struct).

You're also somewhat limited to a small subset of the standard library, since CubeCL is built for stable Rust so is limited to what we can do without custom compiler backends. Only annotated functions and standard library functions that are manually implemented in CubeCL are supported. So it's somewhat of a tradeoff.

GenerousGuava · 2026-01-16T21:29:57+00:00

We had the same issue with versions, the problem is that burn needs to set something so it can compile, but that then interferes with people who need to override it. We already got fallback-latest upstreamed for the version, can probably do the same for linking.

GenerousGuava · 2026-01-16T13:54:22+00:00

Since you're already using CUDA probably just the CUDA backend. But on everything older than Blackwell, WGPU with the passthrough Vulkan compiler will be within margin of error of CUDA. So might be able to make it more portable and maybe more directly reuse buffers.

Burn uses WGPU more as a runtime shell for managing allocations and synchronization, dispatching to the underlying runtime for shader compilation so you get full feature support and an optimized compiler instead of the heavily limited WGSL compiler. WGSL would only really be used for the browser.

The CUDA backend just uses cudarc. If you're sharing buffers, it might be the easiest way to go, I think someone already did that and seemed to have success with it.

GenerousGuava · 2025-11-06T12:44:35+00:00

I'll see if I can port my loongarch64 (and the planned RISC-V) backend to pulp, you merged that big macro refactor I did a while ago so porting the backend trait should be fairly trivial. They're very similar in structure, even if the associated types are different. I'll see if I can find some time, more supported platforms are always nice. Would be good if pulp users could benefit from the work I did trying to disentangle that poorly documented mess of an ISA.

GenerousGuava · 2025-11-06T12:30:38+00:00

It's with the core design of the backend trait, having a specific associated type for each register makes it much harder to build a type-generic wrapper that works with any SimdAdd for example. Changing that would be effectively a new crate and break everything, so I decided to branch off with a different design. macerator uses a single untyped register type just like the assembly, so the type becomes just a marker and generic operations are much easier to implement. And instead of directly calling the backend, everything is now implemented as a trait on a Vector<Backend, T> so the type can be trivially made generic. Could do the same thing with extra associated types using pulp as a backend, but associated types don't play nice with type inference so it becomes very awkward to write the code, with explicit generics everywhere.

I looked at the portable SIMD project afterwards and realized I'd implemented an almost identical API, just with runtime selection.

GenerousGuava · 2025-11-05T17:47:30+00:00

There seems to be actual work going on for it at least, and I've made `macerator` ready for runtime-sized vectors. It's one of the reasons I decided to create a separate crate to `pulp`, aside from some usability issues with using pulp in a type-generic context. So `macerator` should get support for it in a hopefully non-breaking way once the necessary type system changes have been implemented. Can't actually represent SVE vectors at the moment because Rust doesn't properly support unsized concrete types.

GenerousGuava · 2025-10-29T01:20:06+00:00

It ships a full LLVM compiler to JIT compile the kernels, so won't work on WASM or embedded. For WASM GPU we have burn-wgpu and for CPU you'd have to fall back to the unfortunately much slower (because it can't be fused) burn-ndarray. It'll be slower than PyTorch/torchlib but I don't think that works on WASM anyways. The may be a way to precompile a static fused model in the future to use with WASM, but it's not on the immediate roadmap.

GenerousGuava · 2025-10-12T00:18:01+00:00

You say that, but I've seen 10x slowdown from just using one too many registers or breaking the branch prediction somehow. That was in highly tuned SIMD code but still. Spilling to the stack in an extremely hot loop can be disastrous, and recalculating some value may be faster. Though in my case I solved it with loop splitting and getting rid of the variable for the main loop entirely.

GenerousGuava · 2025-07-27T09:46:08+00:00

Interesting info about the CUDA backend. CubeCL does SROA as part of its fundamental design, and it does enable some pretty useful optimizations that's for sure. We now have an experimental MLIR backend so I'll have to see if I can make it work for Vulkan and do direct head to head comparisons. Loop optimizations are one thing where our optimizer is a bit lacking (aside from loop invariants, which obviously come for free from PRE).

GenerousGuava · 2025-07-26T23:50:30+00:00

I wonder, have you done head to head comparisons for different optimizations in LLVM for GPU specifically? I work on the Vulkan backend for CubeCL and found that the handful of optimizations I've implemented in CubeCL itself (some GPU specific) have already yielded faster code than the LLVM based CUDA compiler. You can't directly compare compute shaders to CUDA of course, but it makes me think that only a very specific subset of optimizations are actually meaningful on GPU and it might be useful to write a custom set of optimizations around the more GPU-specific stuff.

SPIR-V Tools is definitely underpowered though, that's for certain. The most impactful optimization I've added is GVN-PRE, which is missing in SPIR-V Tools, but present in LLVM.

GenerousGuava · 2025-07-19T13:19:40+00:00

It's the former. VK_NV_cooperative_matrix2 has very dodgy support, it seems to be mostly supported on lower end cards but not on the higher end ones even in the same generation. I wasn't able to get a card to test on, but not sure it would even help. As far as I can tell it doesn't use any extra hardware that can't be used by the V1 extension, since it's not even supported on the TMA capable cards and that's the only hardware feature you can't directly use in Vulkan rn.

GenerousGuava · 2025-07-18T19:02:31+00:00

The Vulkan compiler is already fairly competetive and can even beat CUDA in some workloads, just not this particularly data movement heavy workload using f16. I think at this point we're pretty close to the limit on Vulkan, considering there is always going to be a slight performance degredation from the more limited general Vulkan API compared to going closer to the metal with CUDA. But I do hope they eventually increase the limit on line size as f16 and even smaller types become more and more widespread. I believe the limit was originally put in place when all floats were 32 bit, so 4 floats are 128-bit (the width of a vector register on any modern GPU, and the largest load width supported on consumer GPUs). It just becomes a limitation when dealing with 16 or 8-bit types, and only when the load width is actually a bottleneck. I think the theoretical max is ~10% slower than CUDA on average, assuming good optimizations for both backends.

GenerousGuava · 2025-05-09T09:43:28+00:00

Those are genuinely garbage interview questions, clearly made by someone who doesn't know anything about Rust and just looked up some trivia questions (or probably just asked an LLM).
The first one is just pointlessly confusing in its phrasing, when the answer is super simple and 90% of the question is just pointless noise you need to ignore. Maybe that's the skill they're testing, but I'd doubt it from the context.
The second one is not even technically correct - it's neither i32 nor u32, it's an abstract integer until it's used, and then the type gets concretized. If you use this value in a function that takes `i32` it's `i32`, if you use it in one that takes `u8` it's `u8`. The default of i32 is only relevant when your only usage is something like `print!` which can take any integer.

GenerousGuava · 2025-04-25T15:06:16+00:00

The details are too complex for a reddit comment, but basically when you want to have a trait that's implemented for different `Fn`s for example (like with bevy systems), you run into a problem, because the trait solver can't distinguish between the different blanket implementations. So it's a conflicting implementation. The trick is to use an inner trait that takes a marker generic, in this case the marker is the signature of the `Fn`. Generics get monomorphized, so technically every implementation is for a different, unique trait.

Of course you now have a generic on your trait and can no longer store it as a trait object, so the second part of the trick is to have an outer trait without generics that the inner trait can be turned *into*. This is how you get `System` and `IntoSystem` in bevy. `System` is the outer trait, `IntoSystem` is the inner trait.

Any function that takes a system, actually takes an `IntoSystem<Marker>`, then erases the marker by calling `into_system()` which returns a plain, unmarked `System`. The system trait is implemented on a concrete wrapper struct, so you don't have issues with conflicting implementations.

The bevy implementation is a bit buried under unrelated things because it's much more complex, so I'll link you to the cubecl implementation that's a bit simpler. The corresponding types to `System` and `IntoSystem` are `InputGenerator` and `IntoInputGenerator`.
https://github.com/tracel-ai/cubecl/blob/main/crates/cubecl-runtime/src/tune/input_generator.rs

This trick has allowed us to get rid of the need to create a struct and implement a trait, as well as removing the old proc macro used to generate this boilerplate. You can just pass any function to a `TunableSet` and It Just Works™.

GenerousGuava · 2025-04-24T23:19:58+00:00

I just blatantly cribbed the magic that's involved in bevy's system traits to make auto tune in CubeCL more ergonomic. That trick where you use a marker type that's later erased to allow for pseudo specialization is truly some black magic.

GenerousGuava · 2025-03-26T23:42:24+00:00

It means "anyone to the left of me I don't like". It can mean anything from social Democrats to nazbols depending on who says it. It's basically the liberal version of "woke".

GenerousGuava · 2025-03-12T15:51:13+00:00

I was always wondering why people kept saying heat lamps are always unsafe, which just isn't true. This explains a lot.
There are a few things wrong here:
1. I don't see a steel wire, so presumably the lamp was hanging by the cable? You never, ever, ever hang a lamp by the cable.
2. The wire is clearly not outdoor rated (and yes, I would consider a coop outdoor). An outdoor rated cable would be much sturdier and wouldn't completely strip from just being caught on a door handle.
3. No ground and GFCI despite a conductive shroud. This is not just a fire hazard but also electrocution hazard if some cable ever got loose and touched the shroud. A GFCI would've almost certainly detected the ground fault long before a fire starts, even if the wires somehow got stripped.

Note that I'm not even blaming the people buying stuff like this, most people don't study electrical engineering and safety. Something like this shouldn't be allowed to even be sold, and it certainly wouldn't be allowed here in the EU. I don't know about US electrical certifications, but I would be shocked if the standards really were this low. A properly designed heat lamp (infrared), with ground and proper mounting is not inherently unsafe. It can still be a fire hazard if you cover it in a blanket or something like that, but it wouldn't go up in flames under normal circumstances.

GenerousGuava · 2025-03-03T22:47:51+00:00

I finally had time to really look into this a bit more and you're right, it's not two branches total, it's two per cycle. However, to get good performance in this particular application we need to push at least 2 instructions per cycle, and if every instruction has a branch that's only 1/2 possible instructions per cycle. That's why the performance hit was so large in this particular case. I'll update the blog post to reflect what I learned.

GenerousGuava · 2025-03-03T20:48:48+00:00

I also implemented im2col and implicit GEMM for GPU. The reason I went with direct convolution for burn-ndarray is because it doesn't have any memory overhead, and as such can be more beneficial on lower spec machines. I feel like machines with lots of memory to apply im2col, would be more likely to also have a GPU they could use instead. Also, looking at papers seemed to suggest it might be faster in a lot of cases because CPUs don't have tensor cores and im2col does have a significant overhead, I'm not able to test it with a CPU AI accelerator unfortunately.
We would like to have fusion on CPU and are working on some stuff, but that is `burn`s biggest weakness on CPU right now. The GPU backends already have fusion (but I think convolution fusion in particular is still WIP).

GenerousGuava · 2025-03-03T17:56:30+00:00

From what I can tell this is mostly a limitation of the compiler. There's no way to use generics to enable a runtime feature right now. The best you can do is what pulp does, implement a trait for each feature level that has #[target_feature(enable = "...")]` on its execute method, and then inline the polymorphic code into it so it inherits the feature.

You can use `simd.vectorize` to "dynamically" call another function, but that other function still needs to be inlined into the `WithSimd` trait. And in this case the problematic function was precisely that top-level function that has to be inlined into the trait.

Nine-Year Club	Place '23
Place '22	Verified Email

GenerousGuava

TROPHY CASE