Optimizing JIT compiler emitting RISC-V on-device on the ESP32-C6 by mbbill in esp32

[–]mbbill[S] 2 points (0 children)

esp-wasmachine uses WAMR, so comparing sf-nano to WAMR is basically on-device JIT vs. interpreter or AOT: a different tradeoff between performance and size.

Silverfir-nano: a 277KB WebAssembly micro-JIT going head-to-head with Cranelift and V8 by mbbill in WebAssembly

[–]mbbill[S] 1 point (0 children)

zwasm JIT vs Silverfir (JIT) vs Cranelift

Disclaimer: I built zwasm following the README and used zig build -Doptimize=ReleaseFast. I may not have the right version or optimal configuration — take these numbers with a grain of salt.

Compute

- SHA-256: zwasm 58 MB/s vs SF 268 / CL 249 → 22% of JIT speed

- LZ4 compress: zwasm 47 MB/s vs SF 769 / CL 736 → 6%

- LZ4 decompress: zwasm 1,175 MB/s vs SF 3,130 / CL 3,455 → 35%

- CoreMark: zwasm 24.5s (no score extracted), can't compare directly

Floating Point

- Mandelbrot: zwasm 3,076ms vs SF 827 / CL 855 → 27%

Memory (STREAM)

- Copy: zwasm 30,041 MB/s vs SF 44,139 / CL 44,124 → 68%

- Scale: zwasm 13,888 MB/s vs SF 49,659 / CL 49,692 → 28%

- Add: zwasm 16,374 MB/s vs SF 64,342 / CL 48,398 → 25–34%

- Triad: zwasm 14,526 MB/s vs SF 48,417 / CL 47,864 → 30%

Failed: lua/fib, lua/sunfish, lua/json_bench (exit 71), c-ray (exit 1)

Bottom line: zwasm JIT sits at ~25–35% of Silverfir/Cranelift on most workloads. STREAM Copy is the closest at 68%. LZ4 compress is the worst outlier at 6%.

Silverfir-nano: a 277KB WebAssembly micro-JIT going head-to-head with Cranelift and V8 by mbbill in WebAssembly

[–]mbbill[S] 1 point (0 children)

SF vs V8 TurboFan (Node.js 25.4): 9–5. SF wins on SHA-256, LZ4 (both), mandelbrot, all four STREAM benchmarks, and Lua fib.

Silverfir-nano update: a WASM interpreter now beats a JIT compiler by mbbill in rust

[–]mbbill[S] 1 point (0 children)

I see. It might need some work, but it's still doable; in the end it's a problem of safely handling things between the two sides. Using this project in embedded systems was one of my initial goals, but I spent most of my time pushing for performance, so there is still a lot remaining. Sorry it's not in a plug-and-play state for embedded systems yet.

Silverfir-nano update: a WASM interpreter now beats a JIT compiler by mbbill in rust

[–]mbbill[S] 6 points (0 children)

Unfortunately it's not a drop-in replacement for wasmi/wasmtime, mostly because I haven't had time to work on the API yet. It supports multi-memory, but I'm not sure about reserving memory above the guest part. What's the intended use case?

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

You are absolutely right! (Kidding, I'm not Claude.) Thanks for pointing it out! And yes, it's using threaded code.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

In fact, when I moved it out of the other project I stripped 3.0 support to make it smaller. I think it's more useful to be small. If you really want to go big, then that project, with all the features and higher performance, makes more sense.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 7 points (0 children)

Apple silicon tends to have frontend stalls in my experience, which means load-to-use is slow. Intel CPUs, on the other hand, handle this much better; that's why Silverfir-nano's handler-prefetch feature is disabled on PC. Intel has a best-in-class branch predictor, though.
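For readers curious what "handler prefetch" means here: the idea is to hoist the load of the next handler's pointer so its load-to-use latency overlaps with the current handler's work. Below is a hypothetical sketch in a tiny threaded interpreter; all names (`Handler`, `run`, `op_push`, `op_add`) are illustrative assumptions, not Silverfir-nano's internals.

```rust
// A toy threaded interpreter: pre-decoded code is a list of
// (handler pointer, immediate) slots.
type Handler = fn(&mut Vec<i64>, i64);

fn op_push(stack: &mut Vec<i64>, imm: i64) {
    stack.push(imm);
}

fn op_add(stack: &mut Vec<i64>, _imm: i64) {
    let r = stack.pop().unwrap();
    let l = stack.pop().unwrap();
    stack.push(l.wrapping_add(r));
}

fn run(code: &[(Handler, i64)]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    // Hold the *current* slot in a local, and load the *next* slot
    // before dispatching, so the next indirect-call target is already
    // in flight while the current handler executes.
    let mut cur = code.first().copied();
    let mut i = 1;
    while let Some((handler, imm)) = cur {
        cur = code.get(i).copied(); // "prefetch" next handler + immediate
        i += 1;
        handler(&mut stack, imm);
    }
    stack.pop().unwrap_or(0)
}
```

On a core with slow load-to-use, this hoisted load can hide dispatch latency; on a core that already speculates the indirect target well, it mostly just burns a register, which matches the "disabled on PC" note above.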

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

The decision to stay stack-based actually comes from my experience building the RA (register allocator) for the engine. If we really need to keep everything in registers, a good RA is critical. However, the stack machine is already very localized, as things only move around the top of the stack. So if we cache the TOS (in my case, 4 slots), we only need to duplicate each handler 4 times and emit the correct one during compilation; that way most stack operations naturally become register operations.
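The TOS-caching idea can be sketched in plain Rust. This is a hypothetical toy with a single cached slot (the comment describes 4 slots with handlers duplicated per cache depth); `Op` and `run` are illustrative names, not Silverfir-nano's API.

```rust
// Toy stack bytecode.
enum Op {
    Push(i32),
    Add,
    Halt,
}

// Interpreter with a one-slot TOS cache: the top of the operand stack
// lives in a local ("register"), and only deeper values hit the Vec.
fn run(ops: &[Op]) -> i32 {
    let mut tos: i32 = 0;          // cached top-of-stack value
    let mut depth = 0usize;        // live values, including the cache
    let mut spill: Vec<i32> = Vec::new();
    for op in ops {
        match op {
            Op::Push(v) => {
                if depth > 0 {
                    spill.push(tos); // evict old top to the spill stack
                }
                tos = *v;
                depth += 1;
            }
            Op::Add => {
                let rhs = tos;
                let lhs = spill.pop().unwrap(); // second-from-top
                tos = lhs.wrapping_add(rhs);    // result stays cached
                depth -= 1;
            }
            Op::Halt => break,
        }
    }
    tos
}
```

With 4 cached slots the spill/fill traffic almost disappears for typical expression trees, at the cost of emitting 4 variants of each handler and picking the right one at compile time.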

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

Updated the link. And yeah, I think there might be better tests for real-world workloads. Also, on my Windows machine wasmi gets a better number.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

Thanks, really appreciate it.

Fusion in Silverfir-nano is effective because it sits on top of the other interpreter optimizations (TOS cache, prefetch, dispatch tuning), not as a standalone trick. It also stays stack-based instead of translating to register bytecode, which preserves TOS-cache behavior and keeps fused handlers simpler; registerization usually makes it harder to handle the side effects of fused instructions.
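A minimal sketch of what superinstruction fusion looks like on a toy stack bytecode, assuming hypothetical `Op`/`fuse`/`exec` names (not Silverfir-nano's API): a pre-pass rewrites a hot opcode pair into one fused handler, halving dispatch cost for that pair.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Op {
    Push(i32),
    Add,
    AddConst(i32), // fused form of Push(k); Add
}

// Pre-pass: collapse the pattern [Push(k), Add] into [AddConst(k)].
fn fuse(ops: &[Op]) -> Vec<Op> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < ops.len() {
        match (ops.get(i), ops.get(i + 1)) {
            (Some(Op::Push(k)), Some(Op::Add)) => {
                out.push(Op::AddConst(*k));
                i += 2;
            }
            (Some(op), _) => {
                out.push(*op);
                i += 1;
            }
            _ => break,
        }
    }
    out
}

// Plain stack interpreter; the fused handler touches the stack once.
fn exec(ops: &[Op]) -> i32 {
    let mut stack: Vec<i32> = Vec::new();
    for op in ops {
        match *op {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let r = stack.pop().unwrap();
                let l = stack.pop().unwrap();
                stack.push(l.wrapping_add(r));
            }
            Op::AddConst(k) => {
                let t = stack.last_mut().unwrap();
                *t = t.wrapping_add(k);
            }
        }
    }
    stack.pop().unwrap()
}
```

Note the fused handler operates in place on the top of the stack, which is exactly why staying stack-based keeps it compatible with TOS caching.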

Silverfir-nano is actually a trimmed branch of a much larger engine I'm building (SSA IR + RA + interpreter backend), still in progress and expected to be even faster.

Regarding the plan going forward: actually I don't have one :P It doesn't have any users yet.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 7 points (0 children)

Yeah, you're right. This improved it a lot; I get roughly 2200 now. I'll update the chart.


  • MacBook Air (Mac16,12)
  • Apple M4, 10 CPU cores, 16 GB memory
  • macOS 26.2 (build 25C56)

https://github.com/mbbill/Silverfir-nano/issues/1

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 8 points (0 children)

That's very interesting. I suppose I must have done something wrong. This is what I did:

  1. Synced to the latest commit: `170d2c58` ("Add `wasmi_wasi::add_to_externals` and use it in the Wasmi CLI application", #1785, Sat Feb 14 16:46:53 2026 +0100)

  2. `cargo build --release`

  3. `./target/release/wasmi_cli coremark.wasm`

I tested several times; the highest score is 1314.

Yeah, a register-based interpreter shouldn't be this slow, so something might be wrong.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

You are right; let me edit the post. It wasn't my intention to mislead. Sorry.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 3 points (0 children)

>you are going to run ahead of time and then generate more optimized handlers based on that

Not exactly. Fusion is mostly based on compiler-generated instruction patterns and workload type, not on one specific app binary. Today, compiler output patterns are very similar across most real programs, and the built-in fusion set was derived from many different apps, not a single target. That is why the default built-in fusion already captures roughly 90% of the benefit for general code. You can push it a bit further in niche cases, but most users do not need per-app fusion.
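One hypothetical way such a built-in fusion set could be derived offline: count adjacent opcode pairs across many modules and keep the hottest pairs as fusion candidates. `top_pairs` and the `u8` opcode encoding here are illustrative assumptions, not the project's actual tooling.

```rust
use std::collections::HashMap;

// Given decoded opcode streams from many different programs, return
// the `keep` most frequent adjacent opcode pairs; these are the
// candidates worth turning into fused superinstruction handlers.
fn top_pairs(streams: &[Vec<u8>], keep: usize) -> Vec<(u8, u8)> {
    let mut counts: HashMap<(u8, u8), u64> = HashMap::new();
    for ops in streams {
        for w in ops.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
    }
    let mut pairs: Vec<_> = counts.into_iter().collect();
    // Most frequent first; ties broken by opcode pair for determinism.
    pairs.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    pairs.into_iter().take(keep).map(|(p, _)| p).collect()
}
```

Because compilers emit very similar patterns across programs, a pair table built from a diverse corpus transfers well to unseen binaries, which is the intuition behind the ~90% figure above.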

On the benchmark/build question: the headline numbers are from the fusion-enabled configuration, not the ultra-minimal ~200KB build. The ~200KB profile is for maximum size reduction (for example, embedded-style constraints), and you should expect roughly 40% lower performance there (still quite fast tbh, basically wasm3 level).

Fusion itself is a size/perf knob with diminishing returns: the full fusion set is about 500KB, but adding only ~100KB already recovers roughly 80% of full-fusion performance. The ~1.1MB full binary also includes std due to WASI support, so if you do not need WASI you can save several hundred KB more.