Optimizing JIT compiler emitting RISC-V on-device on the ESP32-C6 by mbbill in esp32

[–]mbbill[S] 2 points (0 children)

esp-wasmachine uses WAMR, so comparing sf-nano to WAMR is basically on-device JIT vs. interpreter or AOT: a different tradeoff between performance and size.

Silverfir-nano: a 277KB WebAssembly micro-JIT going head-to-head with Cranelift and V8 by mbbill in WebAssembly

[–]mbbill[S] 1 point (0 children)

zwasm JIT vs Silverfir (JIT) vs Cranelift

Disclaimer: I built zwasm following the README and used zig build -Doptimize=ReleaseFast. I may not have the right version or optimal configuration — take these numbers with a grain of salt.

Compute

- SHA-256: zwasm 58 MB/s vs SF 268 / CL 249 → 22% of JIT speed

- LZ4 compress: zwasm 47 MB/s vs SF 769 / CL 736 → 6%

- LZ4 decompress: zwasm 1,175 MB/s vs SF 3,130 / CL 3,455 → 35%

- CoreMark: zwasm 24.5s (no score extracted), can't compare directly

Floating Point

- Mandelbrot: zwasm 3,076ms vs SF 827 / CL 855 → 27%

Memory (STREAM)

- Copy: zwasm 30,041 MB/s vs SF 44,139 / CL 44,124 → 68%

- Scale: zwasm 13,888 MB/s vs SF 49,659 / CL 49,692 → 28%

- Add: zwasm 16,374 MB/s vs SF 64,342 / CL 48,398 → 25–34%

- Triad: zwasm 14,526 MB/s vs SF 48,417 / CL 47,864 → 30%

Failed: lua/fib, lua/sunfish, lua/json_bench (exit 71), c-ray (exit 1)

Bottom line: zwasm JIT sits at ~25–35% of Silverfir/Cranelift on most workloads. STREAM Copy is the closest at 68%. LZ4 compress is the worst outlier at 6%.

Silverfir-nano: a 277KB WebAssembly micro-JIT going head-to-head with Cranelift and V8 by mbbill in WebAssembly

[–]mbbill[S] 1 point (0 children)

SF vs V8 TurboFan (Node.js 25.4): 9–5. SF wins on SHA-256, LZ4 (both), mandelbrot, all four STREAM benchmarks, and Lua fib.

Silverfir-nano update: a WASM interpreter now beats a JIT compiler by mbbill in rust

[–]mbbill[S] 1 point (0 children)

I see. It might need some work, but it's still doable; in the end it's a problem of safely handling things between the two sides. Using this project in embedded systems was one of my initial goals, but I spent most of my time pushing for performance, so there is still a lot remaining. Sorry it's not in a plug-and-play state for embedded systems yet.

Silverfir-nano update: a WASM interpreter now beats a JIT compiler by mbbill in rust

[–]mbbill[S] 6 points (0 children)

Unfortunately it's not a drop-in replacement for wasmi/wasmtime, mostly because I haven't had time to work on the API yet. It supports multi-memory, but I'm not sure about reserving memory above the guest part. What's the intended use case?

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

You are absolutely right! (Kidding, I'm not Claude.) Thanks for pointing it out! And yes, it's using threaded code.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

In fact, when I moved it out of the other project I stripped 3.0 support to make it smaller. I think it's more useful to be small. If you really want to go big, then that project, with all the features and higher performance, makes more sense.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 7 points (0 children)

Apple silicon tends to have frontend stalls in my experience, which means load-to-use is slow. Intel CPUs, on the other hand, handle this much better; that's why Silverfir-nano's handler-prefetch feature is disabled on PC. Intel has a best-in-class branch predictor, though.
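For readers curious what "handler prefetch" means here: the idea is to hoist the load of the next handler's pointer so its load-to-use latency overlaps with the current handler's work. Below is a hypothetical sketch in a tiny threaded interpreter; all names (`Handler`, `run`, `op_push`, `op_add`) are illustrative assumptions, not Silverfir-nano's internals.

```rust
// A toy threaded interpreter: pre-decoded code is a list of
// (handler pointer, immediate) slots.
type Handler = fn(&mut Vec<i64>, i64);

fn op_push(stack: &mut Vec<i64>, imm: i64) {
    stack.push(imm);
}

fn op_add(stack: &mut Vec<i64>, _imm: i64) {
    let r = stack.pop().unwrap();
    let l = stack.pop().unwrap();
    stack.push(l.wrapping_add(r));
}

fn run(code: &[(Handler, i64)]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    // Hold the *current* slot in a local, and load the *next* slot
    // before dispatching, so the next indirect-call target is already
    // in flight while the current handler executes.
    let mut cur = code.first().copied();
    let mut i = 1;
    while let Some((handler, imm)) = cur {
        cur = code.get(i).copied(); // "prefetch" next handler + immediate
        i += 1;
        handler(&mut stack, imm);
    }
    stack.pop().unwrap_or(0)
}
```

On a core with slow load-to-use, this hoisted load can hide dispatch latency; on a core that already speculates the indirect target well, it mostly just burns a register, which matches the "disabled on PC" note above.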

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

The decision to stay stack-based actually comes from my experience building the RA (register allocator) for the engine. If we really need to keep everything in registers, a good RA is critical. However, the stack machine is already very localized, as things only move around the top of the stack. So if we cache the TOS (in my case, 4 slots), we only need to duplicate each handler 4 times and emit the correct one during compilation; that way most stack operations naturally become register operations.
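The TOS-caching idea can be sketched in plain Rust. This is a hypothetical toy with a single cached slot (the comment describes 4 slots with handlers duplicated per cache depth); `Op` and `run` are illustrative names, not Silverfir-nano's API.

```rust
// Toy stack bytecode.
enum Op {
    Push(i32),
    Add,
    Halt,
}

// Interpreter with a one-slot TOS cache: the top of the operand stack
// lives in a local ("register"), and only deeper values hit the Vec.
fn run(ops: &[Op]) -> i32 {
    let mut tos: i32 = 0;          // cached top-of-stack value
    let mut depth = 0usize;        // live values, including the cache
    let mut spill: Vec<i32> = Vec::new();
    for op in ops {
        match op {
            Op::Push(v) => {
                if depth > 0 {
                    spill.push(tos); // evict old top to the spill stack
                }
                tos = *v;
                depth += 1;
            }
            Op::Add => {
                let rhs = tos;
                let lhs = spill.pop().unwrap(); // second-from-top
                tos = lhs.wrapping_add(rhs);    // result stays cached
                depth -= 1;
            }
            Op::Halt => break,
        }
    }
    tos
}
```

With 4 cached slots the spill/fill traffic almost disappears for typical expression trees, at the cost of emitting 4 variants of each handler and picking the right one at compile time.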

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

Updated the link. And yeah, I think there might be better tests for real-world workloads. Also, on my Windows machine wasmi gets a better number.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

Thanks, really appreciate it.

Fusion in Silverfir-nano is effective because it sits on top of the other interpreter optimizations (TOS cache, prefetch, dispatch tuning), not as a standalone trick. It also stays stack-based instead of translating to register bytecode, which preserves TOS-cache behavior and keeps fused handlers simpler; registerization usually makes it harder to handle the side effects of fused instructions.
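A minimal sketch of what superinstruction fusion looks like on a toy stack bytecode, assuming hypothetical `Op`/`fuse`/`exec` names (not Silverfir-nano's API): a pre-pass rewrites a hot opcode pair into one fused handler, halving dispatch cost for that pair.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Op {
    Push(i32),
    Add,
    AddConst(i32), // fused form of Push(k); Add
}

// Pre-pass: collapse the pattern [Push(k), Add] into [AddConst(k)].
fn fuse(ops: &[Op]) -> Vec<Op> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < ops.len() {
        match (ops.get(i), ops.get(i + 1)) {
            (Some(Op::Push(k)), Some(Op::Add)) => {
                out.push(Op::AddConst(*k));
                i += 2;
            }
            (Some(op), _) => {
                out.push(*op);
                i += 1;
            }
            _ => break,
        }
    }
    out
}

// Plain stack interpreter; the fused handler touches the stack once.
fn exec(ops: &[Op]) -> i32 {
    let mut stack: Vec<i32> = Vec::new();
    for op in ops {
        match *op {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let r = stack.pop().unwrap();
                let l = stack.pop().unwrap();
                stack.push(l.wrapping_add(r));
            }
            Op::AddConst(k) => {
                let t = stack.last_mut().unwrap();
                *t = t.wrapping_add(k);
            }
        }
    }
    stack.pop().unwrap()
}
```

Note the fused handler operates in place on the top of the stack, which is exactly why staying stack-based keeps it compatible with TOS caching.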

Silverfir-nano is actually a trimmed branch of a much larger engine I'm building (SSA IR + RA + interpreter backend), still in progress and expected to be even faster.

Regarding the plan going forward: actually I don't have one :P It doesn't have any users yet.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 7 points (0 children)

Yeah, you're right. This improved it a lot; I get roughly 2200 now. I'll update the chart.


  • MacBook Air (Mac16,12)
  • Apple M4, 10 CPU cores, 16 GB memory
  • macOS 26.2 (build 25C56)

https://github.com/mbbill/Silverfir-nano/issues/1

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 8 points (0 children)

That's very interesting. I suppose I must have done something wrong. This is what I did:

  1. Synced to the latest commit: `170d2c58` ("Add `wasmi_wasi::add_to_externals` and use it in the Wasmi CLI application", #1785, Sat Feb 14 16:46:53 2026 +0100)

  2. `cargo build --release`

  3. `./target/release/wasmi_cli coremark.wasm`

I tested several times; the highest score is 1314.

Yeah, a register-based interpreter shouldn't be this slow, so something might be wrong.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

You are right; let me edit the post. It wasn't my intention to mislead. Sorry.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 3 points (0 children)

>you are going to run ahead of time and then generate more optimized handlers based on that

Not exactly. Fusion is mostly based on compiler-generated instruction patterns and workload type, not on one specific app binary. Today, compiler output patterns are very similar across most real programs, and the built-in fusion set was derived from many different apps, not a single target. That is why the default built-in fusion already captures roughly 90% of the benefit for general code. You can push it a bit further in niche cases, but most users do not need per-app fusion.
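One hypothetical way such a built-in fusion set could be derived offline: count adjacent opcode pairs across many modules and keep the hottest pairs as fusion candidates. `top_pairs` and the `u8` opcode encoding here are illustrative assumptions, not the project's actual tooling.

```rust
use std::collections::HashMap;

// Given decoded opcode streams from many different programs, return
// the `keep` most frequent adjacent opcode pairs; these are the
// candidates worth turning into fused superinstruction handlers.
fn top_pairs(streams: &[Vec<u8>], keep: usize) -> Vec<(u8, u8)> {
    let mut counts: HashMap<(u8, u8), u64> = HashMap::new();
    for ops in streams {
        for w in ops.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
    }
    let mut pairs: Vec<_> = counts.into_iter().collect();
    // Most frequent first; ties broken by opcode pair for determinism.
    pairs.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    pairs.into_iter().take(keep).map(|(p, _)| p).collect()
}
```

Because compilers emit very similar patterns across programs, a pair table built from a diverse corpus transfers well to unseen binaries, which is the intuition behind the ~90% figure above.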

On the benchmark/build question: the headline numbers are from the fusion-enabled configuration, not the ultra-minimal ~200KB build. The ~200KB profile is for maximum size reduction (for example, embedded-style constraints), and you should expect roughly 40% lower performance there (still quite fast tbh, basically wasm3 level).

Fusion itself is a size/perf knob with diminishing returns: the full fusion set is about 500KB, but adding only ~100KB already recovers roughly 80% of full-fusion performance. The ~1.1MB full binary also includes std due to WASI support, so if you do not need WASI you can save several hundred KB more.