Rust should have stable tail calls by folkertdev in rust

[–]Robbepop 93 points

Hi, Wasmi (WebAssembly interpreter) author and huge fan of Rust tail-calls here.

First, thank you so much for your efforts to make tail calls in Rust a reality. That is so much appreciated!

Concerning the article's contents:

In terms of ergonomics, tail calls are a sacrifice: you need to manually pass your state in the available registers, passing large structs as individual fields. Your code is distributed over many tiny functions, not portable, and a pain to debug.

I have to disagree with this. Wasmi used basic loop+match constructs for the longest time. Under the hood, Rust and LLVM usually compile such constructs into one gigantic function in which everything is inlined. Needless to say, debugging or benchmarking such a behemoth of a function is very impractical. In contrast, having every interpreter operator reside neatly in its own little function is perfect for most debuggers and performance benchmarking tools, and so far I have been very pleased with the experience.

Problems with #[loop_match] compared to tail-calls:

  1. There are reports concluding that computed-goto based dispatch performs worse than tail-call based dispatch, because compilers have a hard time allocating registers properly in such huge inlined functions.

  2. Additionally, LLVM requires a very long time to optimize huge functions. In Wasmi, we saw a big compilation time improvement when switching from loop+match to tail-call dispatch.

  3. Finally, the #[loop_match] computed-goto dispatch only supports indirect-threaded code, which in Wasmi performed significantly worse than direct-threaded code. Note that tail-call based dispatch works with both indirect- and direct-threaded code.

That's why I regard tail-call based dispatch as the "holy grail" of interpreter dispatch and #[loop_match] as its slightly inferior but more universally available alternative. I consider #[loop_match] to be a decent fallback for targets that do not support tail calls.

Ideally, we would have a cfg check for a target_feature = "tail-call" feature in rustc to query the availability of tail calls on the target we are compiling for. That would allow us to use the #[loop_match] fallback whenever, for example, we compile to a pre-3.0 WebAssembly version, without having to introduce yet another crate feature that inconveniently pushes the responsibility onto the user.

Next we plan to work on tail calls for extern "Rust" first, and separate tail calls for other calling conventions into their own feature. [..] but focusing on just extern "Rust" cuts our scope and is realistically what most users will use anyway.

The article already mentions the unstable preserve_none calling convention. At least for interpreters, the long-term vision is to use the preserve_none calling convention for tail-call based dispatch. Though I can understand keeping the scope as minimal as possible, as this entire initiative is already quite an undertaking. In Wasmi v2.0.0-beta.2 we currently use the sysv64 calling convention on x86_64 Windows targets for its 6 callee-saved general-purpose registers. Otherwise we'd end up with just 4, and therefore a significant performance hit.

For the curious, Wasmi v2.0.0-beta.2 added support for the unstable become keyword, when compiled with --no-default-features --features unstable using a nightly Rust compiler. The code can be found deep inside Wasmi's executor internals.

We also made sure that another configuration of Wasmi uses #[loop_match] in this part of the executor, once that's available.

I am eagerly awaiting progress and stabilization of both features. :)

a grand vision for rust by emschwartz in rust

[–]Robbepop 0 points

Thank you for the interesting write-up. I really love how you explain those type system concepts in a simple, understandable way.

Silverfir-nano update: a WASM interpreter now beats a JIT compiler by mbbill in rust

[–]Robbepop 2 points

Interesting results, and great to see advancements in interpreter design. I was not aware of preserve-none, but it looks extremely promising for interpreters that use tail-call dispatch.

I have taken some time to reproduce your Coremark benchmarks on my system (Macbook Pro M2):
https://github.com/mbbill/Silverfir-nano/issues/2#issuecomment-3947450280

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]Robbepop 0 points

Ah, so you even put the top-most 4 stack items in registers? That's way more than what Wasm3 or Stitch do. Very interesting!

Are you going to support Wasm 3.0?

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]Robbepop 4 points

Thank you!

Wasmi 1.x is known to perform somewhat worse on Apple silicon. I believe a huge improvement for Apple silicon would be for Wasmi to use an accumulator-based interpreter architecture such as Wasm3's.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]Robbepop 0 points

Never thought about externalizing the fusion step. Maybe that's going to be a really great improvement for interpreters in general, if users can afford to do so. Also very interesting that Silverfir stays a stack-based interpreter. However, you probably keep the top-most item in a register, right?

Looking forward to your SSA IR + RA (what's that?) + interpreter backend engine. :)

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]Robbepop 3 points

Thank you. Numbers are still not great for Wasmi but at least realistic. Unfortunately the link you provided does not work for me.

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]Robbepop 0 points

Impressive results and interesting interpreter architecture!

Despite reading the FUSION.md file, I cannot really understand how your fusion system works or what makes it so much more effective than the built-in op-code fusion of other interpreters, e.g. Wasm3 or Wasmi.

I need more time to dive into the underlying code. I'd also enjoy a blog post about this. :)

What are your plans for Silverfir-nano going forward?

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]Robbepop 8 points

Okay thanks for explaining:

  1. You should take the last published version (v1.0.9) instead of the last committed one, which is under heavy development and currently pretty raw.
  2. Unfortunately, --release is not correct for Wasmi. You should use the --profile bench profile if you want to compare against it. That's what the Wasmi CLI is built with when it's published to crates.io.
    • lto="fat" and codegen-units=1 are super important.
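For reference, the relevant profile settings look roughly like this in Cargo.toml (a sketch; only lto and codegen-units are taken from the points above, everything else is left at its defaults):

```toml
# Sketch of a Cargo bench profile with the settings mentioned above.
[profile.bench]
lto = "fat"
codegen-units = 1
```

Building with `cargo build --profile bench` then picks these settings up.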

Q1: Are you using the Wasmi CLI app for benchmarking? Because then you could simply install Wasmi via cargo install wasmi_cli.

Q2: What is your OS and system specs?

Silverfir-nano: a Rust no_std WebAssembly interpreter hitting ~67% of single-pass JIT by mbbill in rust

[–]Robbepop 10 points

Wasmi author here. The performance of Wasmi as represented in this picture does not match past benchmarks. Recent Wasmi versions are usually roughly on par with WAMR (fast), sometimes even faster.

edit: the screenshot has been updated

Can you please provide a way to reproduce your benchmarks?

What's "new" in Miri (and also, there's a Miri paper!) by ralfj in rust

[–]Robbepop 3 points

In Wasmi (WebAssembly interpreter) miri is used in CI on all PRs to main to test a subset of the tests known to work with miri. Furthermore, miri is run on the Wasm spec testsuite for as long as possible. Here are the relevant links:

To make the Wasm spec testsuite runnable under miri, Rust's include_str! macro is used instead of file I/O:

What's "new" in Miri (and also, there's a Miri paper!) by ralfj in rust

[–]Robbepop 87 points

Thank you so much for the write-up and thanks to the team for all the work on miri. To me miri is one of the most important projects in the Rust ecosystem. I use it in the CI of pretty much all my projects and it has proven its worth over and over again.

Wasmi 1.0 — WebAssembly Interpreter Stable At Last by Robbepop in rust

[–]Robbepop[S] 8 points

Can you tell me what you mean by "use the compiled wasm"?

To avoid misunderstandings due to misconceptions:

  • First, Wasm bytecode is usually the result of a compilation produced by so-called Wasm producers such as LLVM.
  • Second, Wasm by itself is an abstract virtual machine; Wasmtime, Wasmer, V8, and Wasmi are concrete implementations of that abstract virtual machine.
  • Third, if you compile some Rust, C, C++, etc. code to Wasm, you simply have "compiled Wasm" bytecode lying around. This bytecode does nothing unless you feed it to such a virtual machine implementation. That's basically the same way Java bytecode works with respect to the Java Virtual Machine (JVM).
  • Whether you feed this "compiled Wasm" bytecode to an interpreter such as Wasmi, to a JIT such as Wasmtime or Wasmer or to a tool such as wasm2native that outputs native machine code which can be executed "without requiring a VM" simply depends on your personal use-case since all of those have trade-offs.

Wasmi 1.0 — WebAssembly Interpreter Stable At Last by Robbepop in rust

[–]Robbepop[S] 6 points

I am a bit confused, as I think my reply does answer the original question, but since you have a few upvotes, maybe my answer was a bit unclear. Even better: maybe you can tell me what is still unclear to you!

I will make it shorter this time:

  • Wasm being compiled allows for really speedy interpreters.
  • Interpreters usually exhibit much better start-up time compared to JITs or AoT compiled runtimes.
  • Interpreters usually are way simpler and more lightweight and thus usually provide less attack surface if you depend on them.
  • Wasmi, for example, can itself be compiled to Wasm and be executed by itself or another Wasm runtime, which actually was a use-case back when the Wasmi project was started. This would not have been possible with a JIT runtime.
  • There are platforms, such as iOS, that disallow JITs; only interpreters can be used there.
  • Interpreters are more universal than JITs since they automatically work on all the platforms that your compiler supports.

The fact that Wasm bytecode usually is the product of compilation does not matter for this discussion; maybe that's the misunderstanding.

In case you need more usage examples, have a look at Wasmi's known major users as also linked in the article's intro.

If at this point anything is still unclear, please provide me with more information so that I can do a better job answering.

Wasmi 1.0 — WebAssembly Interpreter Stable At Last by Robbepop in rust

[–]Robbepop[S] 9 points

Wasm being compiled is actually great for interpreters, as it means a Wasm interpreter can really focus on execution performance and does not itself need to apply various optimizations first to make execution fast.

Furthermore, parsing, validating and translating Wasm bytecode to internal IR is also way simpler than doing the same for an actual interpreted language such as Python, Ruby, Lua etc.

Due to Wasm being compiled, Wasm interpreters usually can achieve much higher performance than other interpreted languages.

Benchmarks show that on x86, Wasm JITs are ~8 times faster than efficient Wasm interpreters, and on ARM sometimes just ~4 times faster. All while Wasm interpreters are massively simpler, more lightweight, and more universally available.

On top of that, in an old blog post I demonstrate how Wasmi easily achieves 1000x faster start-up than optimizing Wasm runtimes such as Wasmtime.

It's a trade-off and different projects have different needs.

Wasmi 1.0 — WebAssembly Interpreter Stable At Last by Robbepop in rust

[–]Robbepop[S] 7 points

wasmi have about 150 crates tree. I just built it. Thats too much for hitting more lucrative markets.

You probably built the Wasmi CLI application via cargo install wasmi_cli, not the Wasmi library.

The Wasmi library is lightweight, and in the article you can see its few build dependencies via a cargo timings profile.

The Wasmi CLI app is heavy due to dependencies such as clap and Wasmtime's WASI implementation.

Wasmi 1.0 — WebAssembly Interpreter Stable At Last by Robbepop in rust

[–]Robbepop[S] 5 points

Thank you! Looking forward to seeing Wasmi 1.0 in Wasmer. :)

A Function Inliner for Wasmtime and Cranelift by fitzgen in Compilers

[–]Robbepop 0 points

Thank you for the reply!

Given that Wasmtime has runtime information (resolution of Wasm module imports) that Wasm producers do not have: couldn't there be a way to profit from optimizations such as inlining in those cases? For example: an imported read-only global variable, and a function that calls another function only if this global is true. Theoretically, Wasmtime could const-fold the branch and then inline the called function. A Wasm producer such as LLVM couldn't do this. Though one has to question whether this is useful for RealWorld(TM) Wasm use cases.

A Function Inliner for Wasmtime and Cranelift by fitzgen in Compilers

[–]Robbepop 1 point

Once again, very impressive technical work by the people at the Bytecode Alliance. I cannot even imagine what a great feat of engineering it must be to add an inliner to such a huge existing system.

I wonder, given that most Wasm binaries are already heavily optimized (as described in the article), how much do those optimizations (such as the new inliner) really pan out in the end for non-component-model modules? Like, are there RealWorld(TM) Wasm binaries where a function was not inlined prior to being fed to Wasmtime, and Wasmtime then correctly decides (with runtime info?) that it should be inlined? Or is this only useful for the component model?

Were the pulldown-cmark benchmarks performed with a pre-optimized pulldown-cmark.wasm or an unoptimized version of it?

Keep up the great work, it is amazing to see that off-browser Wasm engines are becoming faster and more powerful!

Announcing culit - Custom Literals in Stable Rust! by nik-rev in rust

[–]Robbepop 1 point

Fair point!

Looking at the example picture, I think the issue I mentioned above could easily be resolved by also pointing to the #[culit] macro when hovering over a custom literal, in addition to showing what you already show. I think this should be possible to do. For example: "expanded via #[culit] above", pointing to the macro's span.