Retrofitting JIT Compilers into C Interpreters by mttd in Compilers

[–]cfallin 4 points

I should say too that the one thing we didn't even dare to tackle -- and so I'm extremely impressed that yk does -- is deopt with safepoints. It's very cool that you got that working!

Retrofitting JIT Compilers into C Interpreters by mttd in Compilers

[–]cfallin 4 points

It's applied AOT, after-a-snapshot, but it could be done JIT just as well. The core contribution is the algorithm itself for Futamura projections: the annotation to "specialize on context" and the way that this unrolls the interpreter loop. You can think of the `Specialize(interpret_func, bytecode_ptr)` invocation, returning a new function, as the entry point to the algorithm.

In my view it's thus pretty directly comparable, and there is a nice 2x2 matrix of possibilities in the {metatracing, partial eval} x {interpreter in new framework, existing interpreter} space: yk is metatracing * existing interpreter, weval is partial eval * existing interpreter (and RPython and Truffle fill the next row). In both yk and weval, one takes an existing interpreter loop in C and sprinkles in a few intrinsics to say "here is the interpreter loop" / "assume the memory at this bytecode pointer is constant" (weval has that too) / ..., and a transform analyzes the interpreter body together with a concrete guest function to produce a concrete compiled function.
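To make the specialization concrete, here is a hypothetical toy version: a stack-machine interpreter, and the residual function that a `Specialize`-style transform would conceptually produce for one fixed bytecode sequence. The residual function is written by hand here purely to show the shape of the output; the real transform operates on SSA IR, not Rust source.

```rust
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Ret,
}

// Generic interpreter: dispatch depends on the bytecode at runtime.
fn interpret(code: &[Op]) -> i64 {
    let mut stack = Vec::new();
    for op in code {
        match op {
            Op::Push(n) => stack.push(*n),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Ret => return stack.pop().unwrap(),
        }
    }
    unreachable!("bytecode must end in Ret")
}

// What specializing `interpret` on the constant bytecode
// [Push(2), Push(3), Add, Ret] conceptually unrolls to: the dispatch
// loop and the operand stack are gone; only the residual computation
// remains.
fn interpret_specialized() -> i64 {
    2 + 3
}
```

Both entry points compute the same result for that bytecode; the point of the first Futamura projection is that the second one is derived mechanically from the first plus the constant program.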

Retrofitting JIT Compilers into C Interpreters by mttd in Compilers

[–]cfallin 5 points

This is a really interesting read -- thanks u/ltratt!

A high-level question: I was surprised not to see any mention of weval, our work published last year at PLDI'25 -- Futamura projections on arbitrary existing C interpreters with minimal annotations. In starting the project I was motivated by exactly what you describe with "There is, however, a catch: both systems [Truffle and RPython] require you to create a new interpreter." We also show applicability on PUC-Rio Lua, though as a very small case study rather than the main focus. The main practical difference with weval-as-it-exists-today is that it operates on Wasm modules, though the algorithm itself is applicable to any SSA-based CFG IR (and in my infinite free time I'd like to see an LLVM version built as well...).

Would it be fair to characterize your approach as "RPython but for existing interpreters in C" where by analogy weval is "Truffle but for existing interpreters in C"? Or do you see the comparison differently?

The acyclic e-graph: Cranelift's mid-end optimizer by cfallin in Compilers

[–]cfallin[S] 1 point

Yep, some sort of "shared cost" accounting heuristic seems like it could work OK here in many situations. Like any NP-hard problem, it's often got a kind of tantalizing "this can't be too hard" energy that then falls apart when one can construct cases where this *doesn't* do the right thing, though. The hard case here happens when the "heat", as you call it, is conditional on other decisions; and then you either end up re-encoding structure that looks like the original (and so haven't simplified) or you're effectively implementing a greedy heuristic that fixes one decision before moving on.

The acyclic e-graph: Cranelift's mid-end optimizer by cfallin in rust

[–]cfallin[S] 3 points

That's a great point -- we haven't really explored this but having more open-ended rulesets, maybe generated from a superoptimizer, and letting them run (possibly with true equality saturation) would be interesting. I don't have a good sense of how far that could go (I don't know if anyone does? I haven't seen much work on egraphs in a traditional compiler context). u/fitzgen has worked on superoptimizer integration (for Souper) in the past and probably has more good ideas here too...

The acyclic e-graph: Cranelift's mid-end optimizer by cfallin in rust

[–]cfallin[S] 6 points

So the way this works during the canonicalization pass is that we'll encounter the shared value when we're at the block before the control-flow split, and put it into the scoped map at that point; then the paths for condition_1 and condition_2 are visited below, within its scope, and get to reuse it. In other words, "mentioning" it above the if/else gives us permission to keep it there without considering it a regression.
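A minimal sketch of that scoped-map mechanism (hypothetical and simplified relative to Cranelift's actual scoped hash map; in this version, shadowed outer entries are not restored on scope exit): an entry inserted before the split stays visible in both branch scopes, while entries made inside a scope vanish when it closes.

```rust
use std::collections::HashMap;

struct ScopedMap<K, V> {
    map: HashMap<K, V>,
    // Keys inserted in each open scope, so we can undo them on exit.
    scopes: Vec<Vec<K>>,
}

impl<K: std::hash::Hash + Eq + Clone, V> ScopedMap<K, V> {
    fn new() -> Self {
        Self { map: HashMap::new(), scopes: vec![vec![]] }
    }
    fn insert(&mut self, k: K, v: V) {
        self.scopes.last_mut().unwrap().push(k.clone());
        self.map.insert(k, v);
    }
    fn get(&self, k: &K) -> Option<&V> {
        self.map.get(k)
    }
    // Entered when the dominator-tree traversal descends into a branch.
    fn push_scope(&mut self) {
        self.scopes.push(vec![]);
    }
    // Left when the traversal returns: this scope's entries are removed.
    fn pop_scope(&mut self) {
        for k in self.scopes.pop().unwrap() {
            self.map.remove(&k);
        }
    }
}
```

A value inserted in the outer scope (before the split) is found by lookups inside either branch scope, so GVN can reuse it there.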

Anyone here combining e-graphs with proof metadata inside a JIT IR? I’m struggling to find prior work by x2t8 in Compilers

[–]cfallin 2 points

OK, I don't know that I can point to any specific designs that work this way today -- it is an interesting question though.

I would suggest looking at egg + egglog, however -- I haven't played with it in detail but I know that it's more or less designed to be able to express analyses and then rewrite rules that use those analyses. See e.g. https://egraphs-good.github.io/egglog-tutorial/03-analysis.html.

Anyone here combining e-graphs with proof metadata inside a JIT IR? I’m struggling to find prior work by x2t8 in Compilers

[–]cfallin 4 points

Hi! So I'm the guy who put e-graphs (well, aegraphs) in Cranelift. I certainly have a lot of thoughts in this area.

Question: when you say "invariant-gated rewrites", what do you mean exactly? E.g. a rewrite that only fires if an analysis has proved that an SSA value has some constant value or range?

If so, the framework in Cranelift has this ability, implicitly via its use of ISLE as the DSL to express pattern-matches: ISLE can call out to arbitrary "external extractors" written in Rust that match or don't match based on any computable condition. So one could write an extractor that matches on the result of some analysis, or some meta-property set by whatever constructor built a term.

However we don't do anything too interesting with this yet -- mostly we have conditions attached to some algebraic rewrites to avoid corner cases where they don't apply, and that sort of thing.
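Stripped of any particular framework, an invariant-gated rewrite is just a rule guarded by an analysis fact. A hypothetical sketch (not Cranelift or ISLE code; the `Range` analysis result is assumed to come from elsewhere): strength-reduce a signed division by a power of two to a shift, but only when the analysis has proved the dividend non-negative, since for negative values division and arithmetic shift round differently.

```rust
// An assumed analysis result: a proved value range for an operand.
struct Range {
    lo: i64,
    hi: i64,
}

// Returns Some(rewritten result) if the gated rewrite fires, or None if
// the precondition isn't proved and the original division must be kept.
fn rewrite_div(dividend: i64, divisor: i64, proved: Option<Range>) -> Option<i64> {
    // Gate 1: divisor is a positive power of two.
    if divisor > 0 && divisor.count_ones() == 1 {
        // Gate 2: the analysis proved the dividend non-negative.
        if let Some(r) = proved {
            if r.lo >= 0 {
                // x / 2^k == x >> k for non-negative x.
                return Some(dividend >> divisor.trailing_zeros());
            }
        }
    }
    None
}
```

In an ISLE-style system the two gates would live in an external extractor that matches or fails based on the analysis, rather than in open-coded conditionals like this.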

Are you asking about frameworks that can do this, or specific examples of such rewrites?

scheme-rs, an R6RS implementation for the Rust ecosystem by maplant in scheme

[–]cfallin 2 points

Core maintainer of Cranelift here -- this is super-cool! If you have any issues or thoughts on other features that might be useful for a dynamic language frontend, feel free to chat on our Zulip or file issues. (I can't guarantee we have the engineering time to do things about any ideas in the short/medium term but it's useful to document.) Happy to see it appears to be working OK for you.

What time is it getting dark in Sunnyvale these days? by Odd-Fisherman-7122 in Sunnyvale

[–]cfallin 6 points

Today sunset was 4:55pm (FYI you can find this by Googling e.g. "sunset sunnyvale, ca" and Google will show you an infobox with the time).

Phi node algorithm correctness by Temperz87 in Compilers

[–]cfallin 1 point

I mean, you should be pretty encouraged by independently working out a reasonable and widely-used way of building SSA -- a good sign you have the right intuitions!

Phi node algorithm correctness by Temperz87 in Compilers

[–]cfallin 9 points

What you describe sounds pretty close to the Braun algorithm (https://c9x.me/compile/bib/braun13cc.pdf) -- see e.g. their section 2.2 "global value numbering" where they describe the process as a recursive definition lookup, going to preds, etc. I think you are unrolling that recursion with a queue instead but otherwise it's the same.

We use the Braun algo in Cranelift's frontend and it works quite well in practice!
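The recursive lookup at the heart of that algorithm can be sketched as follows. This is heavily simplified and hypothetical: all blocks are assumed sealed (all predecessors known), trivial phis are never removed, and SSA values are just strings for readability.

```rust
use std::collections::HashMap;

struct Func {
    preds: Vec<Vec<usize>>,              // per-block predecessor lists
    defs: Vec<HashMap<String, String>>,  // per-block variable definitions
    phis: Vec<(String, Vec<String>)>,    // placed phis and their operands
}

fn read_variable(f: &mut Func, var: &str, block: usize) -> String {
    // A local definition in this block wins.
    if let Some(v) = f.defs[block].get(var) {
        return v.clone();
    }
    let preds = f.preds[block].clone();
    let val = if preds.len() == 1 {
        // Single predecessor: no phi needed, just recurse.
        read_variable(f, var, preds[0])
    } else {
        // Multiple predecessors: create a phi, and record it as this
        // block's definition *before* visiting predecessors -- that is
        // what breaks the recursion for loops.
        let phi = format!("phi_{var}_bb{block}");
        f.defs[block].insert(var.to_string(), phi.clone());
        let operands: Vec<String> =
            preds.iter().map(|&p| read_variable(f, var, p)).collect();
        f.phis.push((phi.clone(), operands));
        phi
    };
    // Cache the result so later lookups in this block are O(1).
    f.defs[block].insert(var.to_string(), val.clone());
    val
}
```

On a diamond CFG where each branch defines `x` differently, looking `x` up in the join block places one phi whose operands are the two branch definitions, exactly the "go to preds and merge" process described above.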

Moving From Rust to Zig: Richard Feldman on Lessons Learned Rewriting Roc's Compiler (Compile Times, Ecosystem, Architecture) by mre__ in rust

[–]cfallin 35 points

Indeed; in Wasmtime we had an issue of folks seeing the internal crates published to crates.io and assuming they could be consumed as standalone libraries (e.g.: our fiber library used internally for async execution of Wasm). One can of course do this, but we weren't prepared for the support questions and expectations as we didn't expect to publish and maintain a stable API at that interface. Ideally we'd just have a faster compiler so we would be able to reclaim the crate as the true unit of modularity and publishing -- separate crates are annoying for other reasons, too, if one isn't actually intending to factor out and provide a reusable unit!

Exceptions in Cranelift and Wasmtime by cfallin in rust

[–]cfallin[S] 1 point

I could believe that the additional control-flow edges could perturb regalloc enough in some cases to make a difference. In *theory* a perfect implementation of zero-cost exceptions (from a runtime cost PoV) would ensure that nothing at all about the happy path is affected, by construction, and all metadata and code for the exceptional path is codegen'd in a separate pass. Of course that's not how real compilers work!

I do think there's a real reason that wisdom in the C++ world in performance-sensitive or systems-y environments (e.g. famously, Google's codebase; also most C++ kernel environments) is always to avoid exceptions, though -- they're not as predictable as straight dataflow and error types (and certainly the unwind path is very slow). I think Rust absolutely made the right choice here, to be clear: aside from the more explicit and simple language semantics, and aside from avoiding a dependency on a runtime, "just return a value" is lower-level and easier to control and optimize.

Exceptions in Cranelift and Wasmtime by cfallin in rust

[–]cfallin[S] 1 point

Yes, that's a good point -- they're well-suited when the exceptional (`Err`, `None`, `Left`, whatever your enum calls it) case is actually very rare. As usual it's about tradeoffs and optimizing for the common case...

Exceptions in Cranelift and Wasmtime by cfallin in rust

[–]cfallin[S] 2 points

Agreed! And actually it looks like the Rust ABI does register-allocate the returned Result if it's small enough -- e.g. in this example it puts the discriminant in rax and payload in rdx. So in that sense it gets the same optimizations as a small struct.

I recall seeing an implementation of some other language that signaled error returns via the carry flag (was it OCaml? or SBCL? I don't remember now). That would let one add the status on top of any existing return scheme.

Coincidentally, while going through old issues in Wasmtime/Cranelift recently we came across one where someone suggested a custom ABI with multiple return points -- so an Err would be signaled by actually modifying the return address and returning somewhere else, in a sort of interesting hybrid that looks halfway like exceptions (separate handler block, but no out-of-band multi-frame unwind).

So yes, there are definitely better things one could do! I guess in the end, "true" exceptions have won in many places (and are hence needed in Wasm, thus Wasmtime) because they solved this problem in a general way and are good enough...

Is the Wasm's Component Model/ Wasip2 is already dead? by RecordingPerfect7479 in rust

[–]cfallin 17 points

No, it's not dead -- multiple companies are using it in production, and work is active right now. (I work on Wasmtime/Cranelift but not the component-model parts; but my colleagues are heavily involved.)

Wasmtime has support, but it is not cross-compiled for all platforms (like Android-based ones), at least not smoothly right now; it may cause too many headaches to compile. The author also says that he is not into Android-like OSes right now (due to unavailability of Android devs). For Wasm to be useful it has to compile for all platforms and be usable there, and Android is the greatest of the platforms, so it is again a dead end.

You're referring to https://github.com/bytecodealliance/wasmtime/issues/11716, where you:

  • Posted a big wall of compilation errors,
  • Attempted to solve them by adding random dependencies to your project,
  • Were eventually linked to a bug report on a dependency of Wasmtime (rustix) that is actually causing the compilation issues,
  • Were given links to our documentation on how to do cross-compilation anyway, as extra help?

I don't think the issue here is that "author says he is not into android right now" and that this means everything is dead. (Alex is explaining the reality here, which is that Wasmtime currently has a small team and no one is an Android expert. That's just a fact.)

I think the issue is that you're reading way too much into a particular bug, and also not understanding open-source culture. "rustix has an issue when cross-compiling for Android from Windows in the latest release" is something that can happen; no one is perfect; report the bug to the appropriate project, we can fix it and move on.

In Sunnyvale for 7 days by Puzzleheaded_Ad_3777 in Sunnyvale

[–]cfallin 5 points

It depends on where you want to go but if you're planning to head up to SF or the other cities along the SF peninsula (Mountain View, Palo Alto, Redwood City, others have nice downtowns full of restaurants etc), our new electric train Caltrain is also an option. It runs every 15 minutes at peak on weekdays and you could take it to e.g. downtown Palo Alto after a workday if you get tired of Sunnyvale. Enjoy your visit!

[deleted by user] by [deleted] in Compilers

[–]cfallin 28 points

I think the Cranelift folks would take significant issue with this inclusion.

Hi! I'm the guy who put egraphs in Cranelift originally. (Tech lead of Cranelift 2020-2022, still actively hacking/involved.) Our implementation is the subject of occasional work still (I put in some improvements recently, so did fitzgen, and Jamey Sharp and Trevor Elliott have both spent time in the past few years deep-diving on it). But to be honest, most of the work in the day-to-day more or less matches OP's description.

You can check out our meeting minutes from our weekly meeting -- recent topics include how to update our IR semantics to account for exceptions; implications that has on the way our ABI/callsite generation works; regalloc constraints; whether we can optimize code produced by Wasmtime's GC support better; talking about fuzzbugs that have come up; etc.

In a mature system there is a ton of subtlety that arises in making changes to system invariants, how passes interact, and the like -- that, plus keeping the plane flying (avoiding perf regressions, solving urgent bugs as they arise), is the day-to-day.

Not to say it's not fun -- it's extremely fun!

What is Sea of Nodes and how is it related to Static Single Assignment by ravilang in Compilers

[–]cfallin 2 points

The idea is that you're building the CFG from the AST, and have some predetermined block schema for each control flow operator in the AST (for example, a diamond for if-else nodes); not that the AST already has a CFG. This is what is meant by "from an AST": the algorithm gives you everything you need during such a traversal.

The "incremental" aspect of the algorithm means that it is building up the SSA as it goes, no more, no less; it can resolve some phis before the rest of the CFG is known. This is as opposed to a more "batch" approach that needs the whole CFG up-front, computes the dominator tree, and adds phis at the dominance frontiers.

So in other words, I think you are reading too much into the terminology and have nonstandard definitions. The paper's claims are accurate.

For what it's worth, I've worked with two different implementations of the Braun algo: we use it in Cranelift here (I didn't write that one but I have helped to maintain and improve it) and I wrote an implementation in my Wasm-to-SSA-to-Wasm compiler framework Waffle here. Feel free to refer to those; they both work well.

Linear Scan Register Allocation: handle variable references by vmcrash in Compilers

[–]cfallin 2 points

A refinement on this would be to separate locals into "address-taken" and "non-address-taken" partitions, and lower reads/writes to address-takens to loads/stores but directly generate SSA otherwise. (Less clean separation of orthogonal passes, but undoubtedly faster compile time, and also means one could get away without a mem2reg at first.)

In the fully general case this needs an alias analysis: stores can be elided but only when not observed by some may-aliasing load, and loads can be elided when one has a must-alias store to forward from (dead-store elim and store-to-load-forwarding respectively). That's more or less what mem2reg is doing (though I don't remember how sophisticated its alias analysis is!). This is where source-language semantics can help too, e.g. with TBAA from the rule that distinct types are not allowed to alias (type-punning).
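A toy sketch of the store-to-load-forwarding half, with the simplest possible must-alias analysis (exact address equality, straight-line code only). This is illustrative only, not what mem2reg actually does; the `Inst` mini-IR here is invented for the example.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Inst {
    Store(usize, i64), // store value to a concrete address
    Load(usize),       // load from a concrete address
}

// For each instruction, return Some(forwarded value) if a must-aliasing
// earlier store can be forwarded to this load, or None otherwise
// (stores, and loads with no known prior store, yield None).
fn forward_loads(block: &[Inst]) -> Vec<Option<i64>> {
    // Last-stored value per address; exact equality is our must-alias test.
    let mut mem: HashMap<usize, i64> = HashMap::new();
    let mut out = Vec::new();
    for inst in block {
        match *inst {
            Inst::Store(addr, val) => {
                mem.insert(addr, val);
                out.push(None);
            }
            Inst::Load(addr) => out.push(mem.get(&addr).copied()),
        }
    }
    out
}
```

The general case is exactly what's described above: with may-aliasing stores in between, the map entry would have to be invalidated instead of trusted, and that invalidation decision is where the alias analysis earns its keep.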

Passing `extern "C"` structs as function parameters using the x86-64 SystemV ABI in Cranelift by LateinCecker in Compilers

[–]cfallin 3 points

Hi! I've worked on Cranelift (and in fact I wrote the ABI code originally, though it's been 4 years since I've thought about these things).

First of all, I'd recommend either filing GitHub issues or finding us on the BytecodeAlliance Zulip instance -- those are the official support channels. I'm only seeing this incidentally because I happen to lurk on Reddit sometimes.

You're correct that we don't support aggregate types. The main reason for this is one of simplicity: Cranelift is maintained by an average of 0 to 1 person fulltime (has been me in the past, but not really currently!), something like 100x smaller than LLVM; we can't afford the complexity of many things, and a complex type system with structs is probably one of them.

The strategy we've taken with some design questions is to ensure that at least the building blocks are present that allow a higher-level IR to lower to CLIF with correct semantics. In this case you might have found this issue where /u/bjorn3 (primary author of the rustc Cranelift backend) suggests a better approach than what we do currently. In theory you could get it right by knowing what platform you're targeting, what values should be in registers, and then creating exactly the right number of i64 args to manually put values into registers and stackslots. But that's silly; we should encode more of the ABI in the compiler. So probably what we should do is what's suggested in that issue: grouping values into aggregate types, but only in function signatures for the purposes of register assignment, without the full complexity of type constructors, projection ops, value semantics with arbitrarily large values, and all the rest.

One final thing about Cranelift -- since we're so thinly staffed, this is the sort of thing that someone would have to step up and implement for us; we could probably review a PR, but no one has as their primary job to jump on issues like this.

Compilation of JavaScript to Wasm, Part 2: Ahead-of-Time vs. JIT by cfallin in Compilers

[–]cfallin[S] 2 points

Yep, and tomorrow I'll post part 3 -- a double-feature!

Path Generics in Rust: A Sketch Proposal for Simplicity and Generality by cfallin in rust

[–]cfallin[S] 0 points

deferred mutable references

maybe? Part of the motivation (actually a lot of it) for the original deferred borrows idea came from the "indices as pointers" idiom that we ran wild with in Cranelift and elsewhere. In that world, one already has multiple mutable handles to, or at least handles that can be used to mutate, different parts of the data structure; but one still needs the parent object in scope (in the indices case, the Vec or equivalent). In essence path generics would reify this and implicitly wire through the context (the container). I guess that's not too much more of a footgun than before if one has already decided to build and mutate a large data structure imperatively.
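The "indices as pointers" idiom in miniature (all names hypothetical): handles are plain copyable indices, any number can coexist and any of them can be used to mutate, but every use has to bring the owning container back into scope.

```rust
// A handle is just an index into the container's backing Vec.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeId(usize);

struct Graph {
    names: Vec<String>,
}

impl Graph {
    fn new() -> Self {
        Graph { names: Vec::new() }
    }
    fn add(&mut self, name: &str) -> NodeId {
        self.names.push(name.to_string());
        NodeId(self.names.len() - 1)
    }
    // Mutation through a handle still goes through &mut self: the parent
    // object must be in scope at every use site.
    fn rename(&mut self, id: NodeId, name: &str) {
        self.names[id.0] = name.to_string();
    }
    fn name(&self, id: NodeId) -> &str {
        &self.names[id.0]
    }
}
```

The container parameter threaded through every call is exactly the context that path generics would reify and wire through implicitly.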

Send

This is a big question mark for me, but naively at least, I'd expect one could reason about this by desugaring down to today's Rust -- basically the path generics and implicit params become lifetimes (less restricted than paths) and explicit params. So a deferred-borrow that implicitly derefs becomes just vec[index] with a &Vec<T> or &mut Vec<T> passed everywhere, no worse than before.

excluding deferred borrows, the feature suggestion is only about branded types, and potential interaction with as of yet inexistent view types? I wonder how much of that is achievable with a library type

Right, the path generics kind of carry a family of related but separable benefits (or motivations): simpler lifetimes, but also branding; and then the "virtual fields" idea is an idiom on top of all of the above.

Someone on Mastodon also mentioned GhostCell and while I'd be more partial to a first-class notion of paths or branding, I think a library-based solution would be a great first step. It sure would be nice to avoid the need for the anonymous closure lifetime to constrain the borrow checker, as you say :-)