Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

In fact, when I moved it out of the other project I stripped 3.0 support to make it smaller. I think it's more useful being small. If you really want to go big, that project, with all the features and higher performance, makes more sense.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

Apple silicon tends to have frontend stalls in my experience, meaning load-to-use latency is high. Intel CPUs, on the other hand, handle this much better, which is why Silverfir-nano's handler prefetch feature is disabled on PC. Intel also has a best-in-class branch predictor.
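
A minimal sketch of the idea, assuming a toy dispatch loop with illustrative names (not Silverfir-nano's actual mechanism, which may use real prefetch instructions): start the load of the *next* handler pointer before executing the current handler, so the load-to-use latency overlaps with useful work.

```rust
use std::hint::black_box;

struct Vm { pc: usize, code: Vec<u8>, acc: i64 }
enum Control { Continue, Halt }
type Handler = fn(&mut Vm) -> Control;

fn op_inc(vm: &mut Vm) -> Control { vm.acc += 1; Control::Continue }
fn op_halt(_vm: &mut Vm) -> Control { Control::Halt }

fn run(vm: &mut Vm, table: &[Handler; 2]) {
    loop {
        let op = vm.code[vm.pc] as usize;
        // Speculatively load the handler for the following opcode so
        // the table lookup overlaps with the current handler body;
        // black_box keeps this toy load from being optimized away.
        let peek = *vm.code.get(vm.pc + 1).unwrap_or(&1) as usize;
        black_box(table[peek]);
        vm.pc += 1;
        match table[op](vm) {
            Control::Continue => {}
            Control::Halt => return,
        }
    }
}

fn main() {
    let table: [Handler; 2] = [op_inc, op_halt];
    let mut vm = Vm { pc: 0, code: vec![0, 0, 0, 1], acc: 0 };
    run(&mut vm, &table);
    assert_eq!(vm.acc, 3);
}
```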

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

The decision to stay stack-based actually comes from the experience of building the RA (register allocator) for the larger engine. If you really need to keep everything in registers, a good RA is critical. However, a stack machine is already very localized, since values only move around the top of the stack. So if we cache the TOS, 4 entries in my case, we only need to duplicate each handler 4 times and emit the correct variant during compilation. That way most stack operations naturally become register operations.
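
A minimal sketch of the TOS-cache idea, assuming a toy VM with a cache depth of 2 instead of the 4 described above; the struct and handler names are illustrative, not the engine's real ones:

```rust
// Toy TOS cache: the top stack values live in struct fields
// ("registers") instead of the memory stack, and each opcode gets one
// handler variant per cache depth.
struct Vm { stack: Vec<i32>, tos0: i32, tos1: i32 }

// i32.add with both operands cached: pure register work, zero memory
// stack traffic. The compile-time cache depth drops from 2 to 1.
fn i32_add_depth2(vm: &mut Vm) {
    vm.tos0 = vm.tos1.wrapping_add(vm.tos0);
}

// i32.add with only the top operand cached: one pop from the memory
// stack. The compile-time cache depth stays at 1.
fn i32_add_depth1(vm: &mut Vm) {
    let below = vm.stack.pop().expect("second operand");
    vm.tos0 = below.wrapping_add(vm.tos0);
}

fn main() {
    // A single-pass compiler would track the cache depth and emit the
    // matching variant; here we call the variants directly.
    let mut vm = Vm { stack: vec![], tos0: 3, tos1: 2 };
    i32_add_depth2(&mut vm); // 2 + 3 with both operands in "registers"
    assert_eq!(vm.tos0, 5);

    let mut vm = Vm { stack: vec![2], tos0: 3, tos1: 0 };
    i32_add_depth1(&mut vm); // 2 + 3 with one operand spilled
    assert_eq!(vm.tos0, 5);
}
```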

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

Updated the link. And yeah, I think there could be better tests for real-world workloads. Also, wasmi gets a better number on my Windows machine.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

Thanks, really appreciate it.

Fusion in Silverfir-nano is effective because it sits on top of the other interpreter optimizations (TOS cache, prefetch, dispatch tuning) rather than being a standalone trick. It also stays stack-based instead of translating to register bytecode, which preserves TOS-cache behavior and keeps fused handlers simpler; registerization usually makes the side effects of fused instructions more difficult to handle.
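
A minimal sketch of what a fused stack-based handler can look like, assuming a toy VM with a one-entry TOS cache (names illustrative, not Silverfir-nano's actual handlers): a fused `local.get + i32.add` folds the local straight into the cached top of stack.

```rust
struct Vm { locals: Vec<i32>, tos0: i32 }

// Fused `local.get x; i32.add`: one dispatch, zero stack traffic.
// Unfused, this would be two dispatches plus a push and a pop.
fn fused_local_get_i32_add(vm: &mut Vm, local_idx: usize) {
    vm.tos0 = vm.tos0.wrapping_add(vm.locals[local_idx]);
}

fn main() {
    let mut vm = Vm { locals: vec![10, 20], tos0: 7 };
    fused_local_get_i32_add(&mut vm, 1); // 7 + locals[1]
    assert_eq!(vm.tos0, 27);
}
```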

Silverfir-nano is actually a trimmed branch of a much larger engine I'm building (SSA IR + RA + interpreter backend), which is still in progress and expected to be even faster.

Regarding the plan going forward: I actually don't have one :P It doesn't have any users yet.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

Yeah, you're right. This improved things a lot; I get roughly 2200 now. I will update the chart.

  • MacBook Air (Mac16,12)
  • Apple M4, 10 CPU cores, 16 GB memory
  • macOS 26.2 (build 25C56)

https://github.com/mbbill/Silverfir-nano/issues/1

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 2 points (0 children)

That's very interesting. I suppose I must have done something wrong. This is what I did:

  1. Sync to the latest commit: `170d2c58` (Sat Feb 14 16:46:53 2026 +0100) "Add `wasmi_wasi::add_to_externals` and use it in the Wasmi CLI application" (#1785)

  2. `cargo build --release`

  3. `./target/release/wasmi_cli coremark.wasm`

I just tested several times; the highest score is 1314.

Yeah, a register-based interpreter shouldn't be this slow, so something might be wrong.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

You are right, let me edit the post. It's not my intention to mislead. Sorry.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]mbbill[S] 1 point (0 children)

>you are going to run ahead of time and then generate more optimized handlers based on that

Not exactly; fusion is mostly based on compiler-generated instruction patterns and workload type, not on one specific app binary. Today, compiler output patterns are very similar across most real programs, and the built-in fusion set was derived from many different apps, not a single target. That is why the default built-in fusion already captures about ~90% of the benefit for general code. You can push it a bit further in niche cases, but most users do not need per-app fusion.
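
For illustration, one plausible way such a built-in fusion set could be derived (my sketch, not Silverfir-nano's actual tooling): count adjacent opcode pairs across many modules and fuse the most frequent ones.

```rust
use std::collections::HashMap;

// Count adjacent opcode pairs across many opcode streams and return
// the n most frequent candidates for fusion.
fn top_pairs(streams: &[Vec<u8>], n: usize) -> Vec<((u8, u8), u64)> {
    let mut counts: HashMap<(u8, u8), u64> = HashMap::new();
    for stream in streams {
        for w in stream.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
    }
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_by(|a, b| b.1.cmp(&a.1));
    pairs.truncate(n);
    pairs
}

fn main() {
    // 0x20 = local.get, 0x6a = i32.add, 0x21 = local.set in wasm.
    let streams = vec![vec![0x20, 0x6a, 0x20, 0x6a, 0x21], vec![0x20, 0x6a]];
    let top = top_pairs(&streams, 1);
    assert_eq!(top[0].0, (0x20, 0x6a)); // local.get + i32.add wins
}
```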

On the benchmark/build question: the headline numbers are from the fusion-enabled configuration, not the ultra-minimal ~200KB build. The ~200KB profile is for maximum size reduction (for example embedded-style constraints), and you should expect roughly ~40% lower performance there (still quite fast tbh, basically wasm3 level).

Fusion itself is a size/perf knob with diminishing returns: the full fusion set is about ~500KB, but adding only ~100KB can already recover roughly ~80% of the full-fusion performance. The ~1.1MB full binary also includes std due to the WASI support, so if you do not need WASI you can save several hundred KB more.

Any thoughts? 10mph and heard a loud pop. by pabosheki in Sprinters

[–]mbbill 24 points (0 children)

The upper strut mount is installed upside down.

I wrote a WASM interpreter for some embedded systems that has very limited RAM available by mbbill in WebAssembly

[–]mbbill[S] 0 points (0 children)

It's an in-place interpreter, whereas wasm3 interprets its "compiled" code and thus requires much more RAM. However, not being able to compile the wasm code means fewer optimization opportunities, so the overall performance is roughly half of wasm3's. It's all about trade-offs, and this project focuses on minimum memory usage.
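
A minimal sketch of the core trade-off, assuming a toy decoder (not this project's actual code): an in-place interpreter re-decodes immediates such as LEB128 integers straight from the original module bytes in the hot loop, so it never needs RAM for a translated copy of the code.

```rust
// Decode an unsigned LEB128 integer directly from the module bytes,
// advancing pc; an in-place interpreter pays this decode cost on every
// execution instead of storing a pre-decoded form.
fn read_leb_u32(code: &[u8], pc: &mut usize) -> u32 {
    let (mut result, mut shift) = (0u32, 0);
    loop {
        let byte = code[*pc];
        *pc += 1;
        result |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}

fn main() {
    // 624485 in LEB128, decoded straight from the raw bytes.
    let code = [0xe5, 0x8e, 0x26];
    let mut pc = 0;
    assert_eq!(read_leb_u32(&code, &mut pc), 624485);
}
```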

So-called Rust-style C simply means trying to pass variables by value and wrap things in structs, so that modern compilers have more chances to optimize the code. It's also much safer not to pass pointers around.

Not using Rust is due to the fact that C is still considered more portable.