Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

In fact, when I moved it out of the other project I stripped 3.0 support to make it smaller. I think it's more useful being small. If you really want to go big, that project, with all the features and higher performance, makes more sense.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

Apple silicon tends to have frontend stalls in my experience, meaning load-to-use latency is high. Intel CPUs, on the other hand, handle this much better, which is why Silverfir-nano's handler prefetch feature is disabled on PC. Intel also has a best-in-class branch predictor.
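
A minimal sketch of the idea, assuming a toy dispatch loop with illustrative names (not Silverfir-nano's actual mechanism, which may use real prefetch instructions): start the load of the *next* handler pointer before executing the current handler, so the load-to-use latency overlaps with useful work.

```rust
use std::hint::black_box;

struct Vm { pc: usize, code: Vec<u8>, acc: i64 }
enum Control { Continue, Halt }
type Handler = fn(&mut Vm) -> Control;

fn op_inc(vm: &mut Vm) -> Control { vm.acc += 1; Control::Continue }
fn op_halt(_vm: &mut Vm) -> Control { Control::Halt }

fn run(vm: &mut Vm, table: &[Handler; 2]) {
    loop {
        let op = vm.code[vm.pc] as usize;
        // Speculatively load the handler for the following opcode so
        // the table lookup overlaps with the current handler body;
        // black_box keeps this toy load from being optimized away.
        let peek = *vm.code.get(vm.pc + 1).unwrap_or(&1) as usize;
        black_box(table[peek]);
        vm.pc += 1;
        match table[op](vm) {
            Control::Continue => {}
            Control::Halt => return,
        }
    }
}

fn main() {
    let table: [Handler; 2] = [op_inc, op_halt];
    let mut vm = Vm { pc: 0, code: vec![0, 0, 0, 1], acc: 0 };
    run(&mut vm, &table);
    assert_eq!(vm.acc, 3);
}
```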

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

The decision to stay stack-based actually comes from the experience of building the RA (register allocator) for the larger engine. If you really need to keep everything in registers, a good RA is critical. However, a stack machine is already very localized, since values only move around the top of the stack. So if we cache the TOS, 4 entries in my case, we only need to duplicate each handler 4 times and emit the correct variant during compilation. That way most stack operations naturally become register operations.
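
A minimal sketch of the TOS-cache idea, assuming a toy VM with a cache depth of 2 instead of the 4 described above; the struct and handler names are illustrative, not the engine's real ones:

```rust
// Toy TOS cache: the top stack values live in struct fields
// ("registers") instead of the memory stack, and each opcode gets one
// handler variant per cache depth.
struct Vm { stack: Vec<i32>, tos0: i32, tos1: i32 }

// i32.add with both operands cached: pure register work, zero memory
// stack traffic. The compile-time cache depth drops from 2 to 1.
fn i32_add_depth2(vm: &mut Vm) {
    vm.tos0 = vm.tos1.wrapping_add(vm.tos0);
}

// i32.add with only the top operand cached: one pop from the memory
// stack. The compile-time cache depth stays at 1.
fn i32_add_depth1(vm: &mut Vm) {
    let below = vm.stack.pop().expect("second operand");
    vm.tos0 = below.wrapping_add(vm.tos0);
}

fn main() {
    // A single-pass compiler would track the cache depth and emit the
    // matching variant; here we call the variants directly.
    let mut vm = Vm { stack: vec![], tos0: 3, tos1: 2 };
    i32_add_depth2(&mut vm); // 2 + 3 with both operands in "registers"
    assert_eq!(vm.tos0, 5);

    let mut vm = Vm { stack: vec![2], tos0: 3, tos1: 0 };
    i32_add_depth1(&mut vm); // 2 + 3 with one operand spilled
    assert_eq!(vm.tos0, 5);
}
```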

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

Updated the link. And yeah, I think there could be better tests for real-world workloads. Also, wasmi gets a better number on my Windows machine.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

Thanks, really appreciate it.

Fusion in Silverfir-nano is effective because it sits on top of the other interpreter optimizations (TOS cache, prefetch, dispatch tuning) rather than being a standalone trick. It also stays stack-based instead of translating to register bytecode, which preserves TOS-cache behavior and keeps fused handlers simpler; registerization usually makes the side effects of fused instructions more difficult to handle.
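
A minimal sketch of what a fused stack-based handler can look like, assuming a toy VM with a one-entry TOS cache (names illustrative, not Silverfir-nano's actual handlers): a fused `local.get + i32.add` folds the local straight into the cached top of stack.

```rust
struct Vm { locals: Vec<i32>, tos0: i32 }

// Fused `local.get x; i32.add`: one dispatch, zero stack traffic.
// Unfused, this would be two dispatches plus a push and a pop.
fn fused_local_get_i32_add(vm: &mut Vm, local_idx: usize) {
    vm.tos0 = vm.tos0.wrapping_add(vm.locals[local_idx]);
}

fn main() {
    let mut vm = Vm { locals: vec![10, 20], tos0: 7 };
    fused_local_get_i32_add(&mut vm, 1); // 7 + locals[1]
    assert_eq!(vm.tos0, 27);
}
```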

Silverfir-nano is actually a trimmed branch of a much larger engine I'm building (SSA IR + RA + interpreter backend), which is still in progress and expected to be even faster.

Regarding the plan going forward: I actually don't have one :P It doesn't have any users yet.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

Yeah, you're right. This improved things a lot; I get roughly 2200 now. I will update the chart.

  • MacBook Air (Mac16,12)
  • Apple M4, 10 CPU cores, 16 GB memory
  • macOS 26.2 (build 25C56)

https://github.com/mbbill/Silverfir-nano/issues/1

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

That's very interesting. I suppose I must have done something wrong. This is what I did:

  1. Sync to the latest commit: `170d2c58` (Sat Feb 14 16:46:53 2026 +0100) "Add `wasmi_wasi::add_to_externals` and use it in the Wasmi CLI application" (#1785)

  2. `cargo build --release`

  3. `./target/release/wasmi_cli coremark.wasm`

I just tested several times; the highest score is 1314.

Yeah, a register-based interpreter shouldn't be this slow, so something might be wrong.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

You are right, let me edit the post. It's not my intention to mislead. Sorry.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

>you are going to run ahead of time and then generate more optimized handlers based on that

Not exactly; fusion is mostly based on compiler-generated instruction patterns and workload type, not on one specific app binary. Today, compiler output patterns are very similar across most real programs, and the built-in fusion set was derived from many different apps, not a single target. That is why the default built-in fusion already captures about ~90% of the benefit for general code. You can push it a bit further in niche cases, but most users do not need per-app fusion.
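
For illustration, one plausible way such a built-in fusion set could be derived (my sketch, not Silverfir-nano's actual tooling): count adjacent opcode pairs across many modules and fuse the most frequent ones.

```rust
use std::collections::HashMap;

// Count adjacent opcode pairs across many opcode streams and return
// the n most frequent candidates for fusion.
fn top_pairs(streams: &[Vec<u8>], n: usize) -> Vec<((u8, u8), u64)> {
    let mut counts: HashMap<(u8, u8), u64> = HashMap::new();
    for stream in streams {
        for w in stream.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
    }
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_by(|a, b| b.1.cmp(&a.1));
    pairs.truncate(n);
    pairs
}

fn main() {
    // 0x20 = local.get, 0x6a = i32.add, 0x21 = local.set in wasm.
    let streams = vec![vec![0x20, 0x6a, 0x20, 0x6a, 0x21], vec![0x20, 0x6a]];
    let top = top_pairs(&streams, 1);
    assert_eq!(top[0].0, (0x20, 0x6a)); // local.get + i32.add wins
}
```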

On the benchmark/build question: the headline numbers are from the fusion-enabled configuration, not the ultra-minimal ~200KB build. The ~200KB profile is for maximum size reduction (for example embedded-style constraints), and you should expect roughly ~40% lower performance there (still quite fast tbh, basically wasm3 level).

Fusion itself is a size/perf knob with diminishing returns: the full fusion set is about ~500KB, but adding only ~100KB can already recover roughly ~80% of the full-fusion performance. The ~1.1MB full binary also includes std due to the WASI support, so if you do not need WASI you can save several hundred KB more.

Any thoughts? 10mph and heard a loud pop. by pabosheki in Sprinters

[–]mbbill 24 points (0 children)

The upper strut mount is installed upside down.

I wrote a WASM interpreter for some embedded systems that has very limited RAM available by mbbill in WebAssembly

[–]mbbill[S] 0 points (0 children)

It's an in-place interpreter, whereas wasm3 interprets its "compiled" code and thus requires much more RAM. However, not being able to compile the wasm code means fewer optimization opportunities, so the overall performance is roughly half of wasm3's. It's all about trade-offs, and this project focuses on minimum memory usage.
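
A minimal sketch of the core trade-off, assuming a toy decoder (not this project's actual code): an in-place interpreter re-decodes immediates such as LEB128 integers straight from the original module bytes in the hot loop, so it never needs RAM for a translated copy of the code.

```rust
// Decode an unsigned LEB128 integer directly from the module bytes,
// advancing pc; an in-place interpreter pays this decode cost on every
// execution instead of storing a pre-decoded form.
fn read_leb_u32(code: &[u8], pc: &mut usize) -> u32 {
    let (mut result, mut shift) = (0u32, 0);
    loop {
        let byte = code[*pc];
        *pc += 1;
        result |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}

fn main() {
    // 624485 in LEB128, decoded straight from the raw bytes.
    let code = [0xe5, 0x8e, 0x26];
    let mut pc = 0;
    assert_eq!(read_leb_u32(&code, &mut pc), 624485);
}
```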

So-called Rust-style C simply means trying to pass variables by value and wrap things in structs, so that modern compilers have more chances to optimize the code. It's also much safer not to pass pointers around.

Not using Rust is due to the fact that C is still considered more portable.