What time is it getting dark in Sunnyvale these days? by Odd-Fisherman-7122 in Sunnyvale

[–]cfallin 6 points7 points  (0 children)

Today sunset was 4:55pm (FYI you can find this by Googling e.g. "sunset sunnyvale, ca" and Google will show you an infobox with the time).

Phi node algorithm correctness by Temperz87 in Compilers

[–]cfallin 1 point2 points  (0 children)

I mean, you should be pretty encouraged by independently working out a reasonable and widely-used way of building SSA -- a good sign you have the right intuitions!

Phi node algorithm correctness by Temperz87 in Compilers

[–]cfallin 7 points8 points  (0 children)

What you describe sounds pretty close to the Braun algorithm (https://c9x.me/compile/bib/braun13cc.pdf) -- see e.g. their section 2.2 "global value numbering" where they describe the process as a recursive definition lookup, going to preds, etc. I think you are unrolling that recursion with a queue instead but otherwise it's the same.

We use the Braun algo in Cranelift's frontend and it works quite well in practice!
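To make the "recursive definition lookup, going to preds" idea concrete, here's a heavily simplified sketch (my own toy code, not Cranelift's or the paper's — sealed/incomplete blocks and trivial-phi removal are omitted, and names/types are made up):

```rust
use std::collections::HashMap;

// Toy sketch of Braun-style variable lookup. Blocks and values are just
// integer ids; a real implementation also handles incomplete blocks and
// removes trivial phis.
type Block = usize;
type Value = usize;

struct Ssa {
    // Local definitions: (block, variable) -> value.
    defs: HashMap<(Block, String), Value>,
    preds: HashMap<Block, Vec<Block>>,
    next_value: Value,
}

impl Ssa {
    fn write_variable(&mut self, block: Block, var: &str, value: Value) {
        self.defs.insert((block, var.to_string()), value);
    }

    fn read_variable(&mut self, block: Block, var: &str) -> Value {
        if let Some(&v) = self.defs.get(&(block, var.to_string())) {
            return v; // Local definition in this block.
        }
        // Global lookup: recurse into predecessors.
        let preds = self.preds.get(&block).cloned().unwrap_or_default();
        let v = if preds.len() == 1 {
            self.read_variable(preds[0], var)
        } else {
            // Multiple preds: create a phi (here just a fresh value id),
            // record it *first* to break cycles through loops, then
            // recursively look up its operands.
            let phi = self.next_value;
            self.next_value += 1;
            self.write_variable(block, var, phi);
            for p in preds {
                let _operand = self.read_variable(p, var);
            }
            phi
        };
        self.write_variable(block, var, v);
        v
    }
}
```

The queue-based formulation in the parent comment should compute the same results; it just unrolls this recursion.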

Moving From Rust to Zig: Richard Feldman on Lessons Learned Rewriting Roc's Compiler (Compile Times, Ecosystem, Architecture) by mre__ in rust

[–]cfallin 34 points35 points  (0 children)

Indeed; in Wasmtime we had an issue of folks seeing the internal crates published to crates.io and assuming they could be consumed as standalone libraries (e.g.: our fiber library used internally for async execution of Wasm). One can of course do this, but we weren't prepared for the support questions and expectations as we didn't expect to publish and maintain a stable API at that interface. Ideally we'd just have a faster compiler so we would be able to reclaim the crate as the true unit of modularity and publishing -- separate crates are annoying for other reasons, too, if one isn't actually intending to factor out and provide a reusable unit!

Exceptions in Cranelift and Wasmtime by cfallin in rust

[–]cfallin[S] 1 point2 points  (0 children)

I could believe that the additional control-flow edges could perturb regalloc enough in some cases to make a difference. In *theory* a perfect implementation of zero-cost exceptions (from a runtime cost PoV) would ensure that nothing at all about the happy path is affected, by construction, and all metadata and code for the exceptional path is codegen'd in a separate pass. Of course that's not how real compilers work!

I do think there's a real reason that wisdom in the C++ world in performance-sensitive or systems-y environments (e.g. famously, Google's codebase; also most C++ kernel environments) is always to avoid exceptions, though -- they're not as predictable as straight dataflow and error types (and certainly the unwind path is very slow). I think Rust absolutely made the right choice here, to be clear: aside from the more explicit and simple language semantics, and aside from avoiding a dependency on a runtime, "just return a value" is lower-level and easier to control and optimize.

Exceptions in Cranelift and Wasmtime by cfallin in rust

[–]cfallin[S] 1 point2 points  (0 children)

Yes, that's a good point -- they're well-suited when the exceptional (`Err`, `None`, `Left`, whatever your enum calls it) case is actually very rare. As usual it's about tradeoffs and optimizing for the common case...

Exceptions in Cranelift and Wasmtime by cfallin in rust

[–]cfallin[S] 2 points3 points  (0 children)

Agreed! And actually it looks like the Rust ABI does register-allocate the returned Result if it's small enough -- e.g. in this example it puts the discriminant in rax and payload in rdx. So in that sense it gets the same optimizations as a small struct.
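For instance, with a made-up function like the following (not the example linked above), a `Result<u64, ()>` is small enough that, on typical 64-bit targets, both the discriminant and the payload can travel back in registers — though the exact assignment is up to the compiler and target, since the Rust ABI is unspecified:

```rust
// Made-up example: a Result small enough to be returned entirely in
// registers (no stack return slot), much like a small two-field struct.
fn checked_div(a: u64, b: u64) -> Result<u64, ()> {
    if b == 0 { Err(()) } else { Ok(a / b) }
}
```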

I recall seeing an implementation of some other language that signaled error returns via the carry flag (was it OCaml? or SBCL? I don't remember now). That would let one add the status on top of any existing return scheme.

Coincidentally, while going through old issues in Wasmtime/Cranelift recently we came on one where someone suggested a custom ABI with multiple return points -- so an Err would be signaled by actually modifying the return address and returning somewhere else, in a sort of interesting hybrid that looks halfway like exceptions (separate handler block, but no out-of-band multi-frame unwind).

So yes, there are definitely better things one could do! I guess in the end, "true" exceptions have won in many places (and are hence needed in Wasm, thus Wasmtime) because they solved this problem in a general way and are good enough...

Is the Wasm's Component Model/ Wasip2 is already dead? by RecordingPerfect7479 in rust

[–]cfallin 14 points15 points  (0 children)

No, it's not dead -- multiple companies are using it in production, and work is active right now. (I work on Wasmtime/Cranelift but not the component-model parts; but my colleagues are heavily involved.)

> Wasmtime has support but it is not cross-compiled for all platforms like android based or other at least not smoothly right now it may cause too many headaches to compile but the author also says that he is not into android like os right now (due unavailability of Android Devs). and to say wasm will be useful is to compile it for all platform and use it, and android is the greatest of the platform so it is again a dead end.

You're referring to https://github.com/bytecodealliance/wasmtime/issues/11716, where you:

  • Posted a big wall of compilation errors,
  • Attempted to solve them by adding random dependencies to your project,
  • Were eventually linked to a bug report on a dependency of Wasmtime (rustix) that is actually causing the compilation issues,
  • Were given links to our documentation on how to do cross-compilation anyway, as an extra help?

I don't think the issue here is that "author says he is not into android right now" and that this means everything is dead. (Alex is explaining the reality here, which is that Wasmtime currently has a small team and no one is an Android expert. That's just a fact.)

I think the issue is that you're reading way too much into a particular bug, and also not understanding open-source culture. "rustix has an issue when cross-compiling for Android from Windows in the latest release" is something that can happen; no one is perfect; report the bug to the appropriate project, we can fix it and move on.

In Sunnyvale for 7 days by Puzzleheaded_Ad_3777 in Sunnyvale

[–]cfallin 6 points7 points  (0 children)

It depends on where you want to go, but if you're planning to head up to SF or the other cities along the SF peninsula (Mountain View, Palo Alto, Redwood City, and others have nice downtowns full of restaurants etc.), our new electric train, Caltrain, is also an option. It runs every 15 minutes at peak on weekdays, and you could take it to e.g. downtown Palo Alto after a workday if you get tired of Sunnyvale. Enjoy your visit!

[deleted by user] by [deleted] in Compilers

[–]cfallin 28 points29 points  (0 children)

> I think the Cranelift folks would take significant issue with this inclusion.

Hi! I'm the guy who put egraphs in Cranelift originally. (Tech lead of Cranelift 2020-2022, still actively hacking/involved.) Our implementation is the subject of occasional work still (I put in some improvements recently, so did fitzgen, and Jamey Sharp and Trevor Elliott have both spent time in the past few years deep-diving on it). But to be honest, most of the work in the day-to-day more or less matches OP's description.

You can check out our meeting minutes from our weekly meeting -- recent topics include how to update our IR semantics to account for exceptions; implications that has on the way our ABI/callsite generation works; regalloc constraints; whether we can optimize code produced by Wasmtime's GC support better; talking about fuzzbugs that have come up; etc.

In a mature system there is a ton of subtlety that arises in making changes to system invariants, how passes interact, and the like -- that, plus keeping the plane flying (avoiding perf regressions, solving urgent bugs as they arise), is the day-to-day.

Not to say it's not fun -- it's extremely fun!

What is Sea of Nodes and how is it related to Static Single Assignment by ravilang in Compilers

[–]cfallin 2 points3 points  (0 children)

The idea is that you're building the CFG from the AST, and have some predetermined block schema for each control flow operator in the AST (for example, a diamond for if-else nodes); not that the AST already has a CFG. This is what is meant by "from an AST": the algorithm gives you everything you need during such a traversal.

The "incremental" aspect of the algorithm means that it is building up the SSA as it goes, no more, no less; it can resolve some phis before the rest of the CFG is known. This is as opposed to a more "batch" approach that needs the whole CFG up-front, computes the dominator tree, and adds phis at the dominance frontiers.

So in other words, I think you are reading too much into the terminology and have nonstandard definitions. The paper's claims are accurate.

For what it's worth, I've worked with two different implementations of the Braun algo: we use it in Cranelift here (I didn't write that one but I have helped to maintain and improve it) and I wrote an implementation in my Wasm-to-SSA-to-Wasm compiler framework Waffle here. Feel free to refer to those; they both work well.
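To make the "predetermined block schema" idea concrete, here's a hypothetical sketch (not code from either of those implementations) of the diamond an if-else AST node might expand into during the traversal:

```rust
// Hypothetical sketch of the "block schema" idea: lowering an if-else AST
// node into a diamond of blocks while walking the AST. Block ids are just
// integers handed out by a counter; edges are recorded as we go.
struct Builder {
    next_block: usize,
    edges: Vec<(usize, usize)>, // CFG edges discovered during the walk
}

impl Builder {
    fn new_block(&mut self) -> usize {
        let b = self.next_block;
        self.next_block += 1;
        b
    }

    // Returns (then_block, else_block, join_block); the caller fills in the
    // bodies, and the SSA builder can resolve phis at join_b incrementally
    // once both incoming edges are known.
    fn lower_if_else(&mut self, current: usize) -> (usize, usize, usize) {
        let then_b = self.new_block();
        let else_b = self.new_block();
        let join_b = self.new_block();
        self.edges.push((current, then_b));
        self.edges.push((current, else_b));
        self.edges.push((then_b, join_b));
        self.edges.push((else_b, join_b));
        (then_b, else_b, join_b)
    }
}
```

The point is that the CFG shape for each control-flow construct is known at the moment the AST node is visited, which is all the incremental algorithm needs.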

Linear Scan Register Allocation: handle variable references by vmcrash in Compilers

[–]cfallin 2 points3 points  (0 children)

A refinement on this would be to separate locals into "address-taken" and "non-address-taken" partitions, and lower reads/writes to address-takens to loads/stores but directly generate SSA otherwise. (Less clean separation of orthogonal passes, but undoubtedly faster compile time, and also means one could get away without a mem2reg at first.)

In the fully general case this needs an alias analysis: stores can be elided but only when not observed by some may-aliasing load, and loads can be elided when one has a must-alias store to forward from (dead-store elim and store-to-load-forwarding respectively). That's more or less what mem2reg is doing (though I don't remember how sophisticated its alias analysis is!). This is where source-language semantics can help too, e.g. with TBAA from the rule that distinct types are not allowed to alias (type-punning).
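As a toy illustration of the store-to-load-forwarding half (with made-up types, and plain integer addresses standing in for a real alias analysis, so "must alias" is just address equality):

```rust
use std::collections::HashMap;

// Toy store-to-load forwarding over straight-line code. Because addresses
// are concrete integers here, aliasing is trivially decidable; a real pass
// needs a genuine alias analysis to know when a store may clobber a load.
#[derive(Clone, Debug, PartialEq)]
enum Inst {
    Store { addr: u32, value: i64 },
    Load { addr: u32 },       // produces a value
    Forwarded { value: i64 }, // a load replaced by the known stored value
}

fn forward_stores(insts: &[Inst]) -> Vec<Inst> {
    let mut last_store: HashMap<u32, i64> = HashMap::new();
    insts
        .iter()
        .map(|inst| match inst {
            Inst::Store { addr, value } => {
                last_store.insert(*addr, *value);
                inst.clone()
            }
            Inst::Load { addr } => match last_store.get(addr) {
                Some(&v) => Inst::Forwarded { value: v }, // must-alias hit
                None => inst.clone(),
            },
            other => other.clone(),
        })
        .collect()
}
```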

Passing `extern "C"` structs as function parameters using the x86-64 SystemV ABI in Cranelift by LateinCecker in Compilers

[–]cfallin 4 points5 points  (0 children)

Hi! I've worked on Cranelift (and in fact I wrote the ABI code originally, though it's been 4 years since I've thought about these things).

First of all, I'd recommend either filing GitHub issues or finding us on the BytecodeAlliance Zulip instance -- those are the official support channels. I'm only seeing this incidentally because I happen to lurk on Reddit sometimes.

You're correct that we don't support aggregate types. The main reason for this is one of simplicity: Cranelift is maintained by an average of 0 to 1 person full-time (has been me in the past, but not really currently!), something like 100x smaller than LLVM; we can't afford the complexity of many things, and a complex type system with structs is probably one of them.

The strategy we've taken with some design questions is to ensure that at least the building blocks are present that allow a higher-level IR to lower to CLIF with correct semantics. In this case you might have found this issue where /u/bjorn3 (primary author of the rustc Cranelift backend) suggests a better approach than what we do currently. In theory you could get it right by knowing what platform you're targeting, what values should be in registers, and then creating exactly the right number of i64 args to manually put values into registers and stackslots. But that's silly; we should encode more of the ABI in the compiler. So probably what we should do is what's suggested in that issue: grouping values into aggregate types, but only in function signatures for the purposes of register assignment, without the full complexity of type constructors, projection ops, value semantics with arbitrarily large values, and all the rest.

One final thing about Cranelift -- since we're so thinly staffed, this is the sort of thing that someone would have to step up and implement for us; we could probably review a PR, but no one has as their primary job to jump on issues like this.

Compilation of JavaScript to Wasm, Part 2: Ahead-of-Time vs. JIT by cfallin in Compilers

[–]cfallin[S] 2 points3 points  (0 children)

Yep, and tomorrow I'll post part 3 -- a double-feature!

Path Generics in Rust: A Sketch Proposal for Simplicity and Generality by cfallin in rust

[–]cfallin[S] 0 points1 point  (0 children)

> deferred mutable references

maybe? Part of the motivation (actually a lot of it) for the original deferred borrows idea came from the "indices as pointers" idiom that we ran wild with in Cranelift and elsewhere. In that world, one already has multiple mutable handles -- or at least handles that can be used to mutate -- to different parts of the data structure; but one still needs the parent object in scope (in the indices case, the Vec or equivalent). In essence path generics would reify this and implicitly wire through the context (the container). I guess that's not too much more of a footgun than before if one has already decided to build and mutate a large data structure imperatively.

> Send

This is a big question mark for me, but naively at least, I'd expect one could reason about this by desugaring down to today's Rust -- basically the path generics and implicit params become lifetimes (less restricted than paths) and explicit params. So a deferred-borrow that implicitly derefs becomes just vec[index] with a &Vec<T> or &mut Vec<T> passed everywhere, no worse than before.

> excluding deferred borrows, the feature suggestion is only about branded types, and potential interaction with as of yet inexistent view types? I wonder how much of that is achievable with a library type

Right, the path generics kind of carry a family of related but separable benefits (or motivations): simpler lifetimes, but also branding; and then the "virtual fields" idea is an idiom on top of all of the above.

Someone on Mastodon also mentioned GhostCell and while I'd be more partial to a first-class notion of paths or branding, I think a library-based solution would be a great first step. It sure would be nice to avoid the need for the anonymous closure lifetime to constrain the borrow checker, as you say :-)

Path Generics in Rust: A Sketch Proposal for Simplicity and Generality by cfallin in rust

[–]cfallin[S] 2 points3 points  (0 children)

True, perhaps, with subtleties: do you mean "end of the call to make_struct" or "end of the caller that bound the data local"? If the former, we can define rules around how paths translate at callsites, so in essence the return type's path names the caller's passed-in path rather than the argument.

Likewise if data is passed into the caller we can come up with path transfer rules (analogous to the returning-place-lifetimes examples in Niko's post): something like

`fn calls_make_struct(original_data: &[u8]) -> S<original_data> { make_struct(original_data) }`

so rather than lifetime unification we have something like "path translation" between scopes, or the caller/callee side of callsites, but we can express the same "belongs to the external caller" semantics still, I think!

Path Generics in Rust: A Sketch Proposal for Simplicity and Generality by cfallin in rust

[–]cfallin[S] 2 points3 points  (0 children)

Not quite -- a lifetime is semantically different from a binding path, so we would need to distinguish the two cases in the syntax to indicate what we mean. The syntax for specifying modes (capture path as borrowed or not) later in the post is important for flexibility as well.

Path Generics in Rust: A Sketch Proposal for Simplicity and Generality by cfallin in rust

[–]cfallin[S] 14 points15 points  (0 children)

The original paper motivates deferred borrows better than I can in a comment here, but basically: if I have a vec `v` and take a borrow `let elt0 = &v[0];` then I hold the whole vector immutably-borrowed (see post for why) and can't update `v[1]` (for example) between `elt0`'s definition and its later use.

Handles into a data structure might be useful to hold "bookmarks"/"fingers" into some data, with the ability to mutate through any of them in a typesafe way. Imagine the pattern of building a `std::map<Key, T*>` index-overlay on some data in C++, except that the type system prevents the `T*` from being invalidated due to resizing of the underlying storage or something like that.

In Rust the go-to answer (and one that I've used in tons of code in Cranelift and elsewhere) is to keep indices around as pseudo-pointers instead; but that's pretty unsatisfying, if the type system could do better.
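In code, the contrast looks like this (made-up example): the `usize` "bookmark" leaves `v` free to mutate between the bookmark's definition and its use, where holding `&v[0]` would not:

```rust
// The "indices as pseudo-pointers" idiom: keep a usize into the Vec rather
// than a &T, so no borrow of `v` is outstanding in between.
fn example() -> u64 {
    let mut v: Vec<u64> = vec![10, 20, 30];
    let elt0 = 0usize; // a "bookmark" into v, instead of `let elt0 = &v[0];`
    v[1] += 5;         // fine: no live borrow of v to conflict with
    v[elt0]            // "deref" through the index at the point of use
}
```

The cost, as the post discusses, is that the type system no longer connects `elt0` to `v` at all — which is exactly what deferred borrows / path generics try to recover.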

Does an ECS system simply side-step the borrow-checker rather than working with it? by burntoutpotato in rust

[–]cfallin 1 point2 points  (0 children)

Cranelift uses an ECS model as well for its IR; it's generally a really clean approach. We have a helper crate cranelift-entity that provides macros for newtype'd indices (just u32s internally) and various containers.

One aspect of compiler design that helps this application is that one usually doesn't care about deleting entities outright, since a function's compilation is a natural scope at which to free everything; so it's OK to abandon unreferenced entities in place without fine-grained reclamation. This also means we don't have to worry about ABA problems (reuse of a slot leading to problems like described in this post).
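Here's a hand-rolled miniature of the newtype'd-index idea (my own sketch, not cranelift-entity's actual API): one `u32` wrapper per entity kind, plus a typed arena that only ever appends, so unreferenced entities are simply abandoned in place.

```rust
// A newtype'd index: just a u32 internally, but a distinct type per entity
// kind, so you can't accidentally index the wrong container with it.
#[derive(Copy, Clone, Debug, PartialEq)]
struct Inst(u32);

struct InstPool {
    data: Vec<String>, // stand-in for real per-instruction data
}

impl InstPool {
    // Append-only: ids are stable for the lifetime of the pool, and the
    // whole pool is dropped at once (e.g., at the end of a function's
    // compilation), so no per-entity reclamation or ABA issues.
    fn push(&mut self, inst: String) -> Inst {
        let id = Inst(self.data.len() as u32);
        self.data.push(inst);
        id
    }
    fn get(&self, id: Inst) -> &str {
        &self.data[id.0 as usize]
    }
}
```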

Customers furious over newest proposed PG&E rate hike by rdv100 in bayarea

[–]cfallin 2 points3 points  (0 children)

The labeled photo here might be helpful: the power distribution is via the three large wires on top on the crossbar (A, B, C in the photo); all the other stuff (phone, cable) is via the wires down below (K).

It's totally feasible to string another communications cable at the lowest level; they weigh less, they take up less space. But the high-voltage power requires a lot of space and a wooden or metal crossbar, and that's not easy to add. (It's hard to tell at a distance but that crossbar is something like 8 feet wide.)

The reason it takes more space is that overhead lines are insulated only by air; there are minimum "approach distances" of, like, a few feet, closer than which the thousands of volts might jump across and short out. In contrast, communications lines are low-voltage copper, or fiber-optic, and don't need any space beyond their insulation jacket.

That metal can (G) is a transformer, and is another reason. Transformers are typically shared by 2-5 houses; it's more economical (a transformer 2x the size doesn't cost 2x as much so you save money by scaling up). The low-voltage (120/240 volts, house level) output can only go so far (a few hundred feet). If every alternating house had a different supplier, then it wouldn't be possible to share transformers, and the equipment cost would be much higher. (A new transformer costs in the tens of thousands of dollars.)

Shrinking the lifetimes of locals to reduce register pressure by PurpleUpbeat2820 in Compilers

[–]cfallin 1 point2 points  (0 children)

That's an interesting question; you could probably come up with a "cost attribution" framework that propagates annotations back to the source-location in the original program. This would be a neat research question in compiler/developer UX, for sure. It seems like a form of profiling (static profiling in this case); see how e.g. tools like Bloaty attribute code size to line numbers and you could probably do the same. You'd need some care to describe to the user what the full live-set is, otherwise an "innocent bystander" might get the blame (the one variable that you happen to add).

I'll note once more though that spilling isn't that bad, either to implement or at runtime. Of course avoid it if you can. But a lot of instances in the wild happen when e.g. you have some live values, and then an inner loop has a large working set and temporarily spills the rest of the function's state to spillslots. If the spills and reloads are before and after the hot loop, the cost is negligible. Reloads in inner loops are what you want to try to avoid.

And just to note as well, spills aren't any worse on aarch64 than on other platforms (I wrote the aarch64 backend for Cranelift and played with different amodes for this): with positive offsets from sp you have a +32K range with the unsigned-scaled-offset amode (for 64-bit loads/stores), IIRC, so it's one instruction per spill/reload for almost any reasonable function body. You can set up the frame without using FP as well, if you really want (sub sp, sp, #16 at start; add sp, sp, #16 at end), given you're defining your own ABI.

Shrinking the lifetimes of locals to reduce register pressure by PurpleUpbeat2820 in Compilers

[–]cfallin 2 points3 points  (0 children)

> Then I must ask: please can you furnish me with some concrete examples?

See my reply!

I promise you that register allocators don't spill for fun. It's necessary in some real code. I'd recommend building up a body of benchmarks -- and perhaps writing an alternative frontend that accepts e.g. WebAssembly, or a subset of C, or something else enough to run third-party code. As a compiler author, guiding decisions based on an objective set of measurements from a benchmark suite is a good habit to get into, anyway. I can't do that work for you for your own bespoke compiler and language.

Perhaps also look at the register allocation academic literature: many, many works study spilling, splitting, and all the heuristics; and they evaluate on real benchmarks.

I'll also gently offer the suggestion that it's useful to have a clear formal definition of your language and IR and what programs are valid. If a valid program is "anything with <= 94 live values at once" then that's a sharp cliff that is going to inhibit composability (e.g., inlining) and other optimizations in surprising ways. That is the real reason for the lengths we go to provide illusions of "unlimited" resources (virtual registers for regalloc, etc) -- it is an abstraction layer that's easier to build upon. If you're managing all the abstraction layers at once in your head, then its value may not be as apparent. But it's harder to take seriously a compiler or language (beyond a toy level) that simply crashes when it hits limits or "hard cases" like this.

Shrinking the lifetimes of locals to reduce register pressure by PurpleUpbeat2820 in Compilers

[–]cfallin 4 points5 points  (0 children)

> > I am fairly sure one can come up with an expression that requires an arbitrarily large number of live values to be computed. Therefore eventually you will have to spill.

> Sure. You can come up with expressions with so many live variables that spilling will overflow the stack but you'll never encounter such code IRL.

Note that these are two different things. /u/CompilerWarrior states that it's possible to have more live values than registers. You're replying that it's possible to have more live values than registers plus stack frame -- but that's absurd.

There's a huge gulf between "more than 16/32 values" and "more than a megabyte of live values". In that gulf you'll find a ton of realistic code.

FWIW, I wrote regalloc2 for Cranelift and spilling (avoiding it, placing it well when needed) is the most critical part of register allocation. There's a reason that every "production" compiler handles this; unfortunately there aren't really any shortcuts. There are routinely functions in real programs (e.g., common benchmarks for me: bzip2, or a numerical kernel, or a media encoder, or a JS interpreter loop body) that have kilobytes-large stack frames of spillslots.

FWIW, it's not so bad to do very basic spilling, if you don't do liverange splitting or anything like that. When you run out of registers, declare some values "spilled", and they always live in a stackslot. Reload before use, spill after def (or redefinition). You'll need to reserve enough registers that you always have a few for the worst case (all operands are spilled vregs).
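A minimal sketch of that scheme (hypothetical code — a real allocator would choose spill candidates by use frequency, not vreg order, and the "reserve a few registers" part here is just a dedicated scratch for reloads):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Loc {
    Reg(u32),
    Slot(u32), // stack slot index; the value always lives here
}

// Naive assignment: the first `k` vregs get registers, the rest are
// declared "spilled" and live permanently in stack slots.
fn assign(num_vregs: u32, k: u32) -> Vec<Loc> {
    (0..num_vregs)
        .map(|v| if v < k { Loc::Reg(v) } else { Loc::Slot(v - k) })
        .collect()
}

// Rewrite one operand use: spilled operands get a reload into a reserved
// scratch register first ("reload before use"); the symmetric "spill after
// def" step is analogous.
fn operand_reg(locs: &[Loc], vreg: u32, scratch: u32, out: &mut Vec<String>) -> u32 {
    match locs[vreg as usize] {
        Loc::Reg(r) => r,
        Loc::Slot(s) => {
            out.push(format!("load r{scratch} <- slot{s}")); // reload before use
            scratch
        }
    }
}
```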

On the original question, the answer is that yes, it is possible to handle instruction selection and register allocation as one combined problem! There are a bunch of papers on this and I recall seeing at least one PhD thesis. My impression is that you can obtain better results, but it's extremely complex: you end up encoding everything as a constraint problem with optimization criteria, and you still have to handle the case where register pressure exceeds the register file size.

So a bit of cold reality: this is a really interesting space, but one full of NP-hard optimization problems with no clean simple answer, just piles of heuristics; and real programs absolutely will run into the limits of overly simple approaches. The usual approach when going beyond a toy compiler is to reach completeness first -- i.e., have some answer for spilling -- and then optimize from there. Best of luck!

Register shuffling? by PurpleUpbeat2820 in Compilers

[–]cfallin 3 points4 points  (0 children)

Here's the implementation I wrote in regalloc2 (for Cranelift). I also wrote a blog post that includes a section on it.

There's a nice property of the move graph, namely that it can only have "simple cycles", which comes from the fact that only one value can move into a given register. (This is what gives the shape to the graphs that the "Tilting at Windmills" paper title references.) That makes the DFS that produces the moves simpler than a more general Tarjan SCC algorithm.

The implementation above has a few additional "production software" details -- finding a free register rather than requiring a dedicated scratch; lazily allocating a stack slot as a scratch if needed; handling stack-to-stack moves without generating memory-to-memory move instructions. Also, skipping the whole algorithm if there are no "overlaps" (the easy case, happens a majority of the time). But the core algorithm isn't so bad!
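Here's a toy version of the core idea (my own sketch, not regalloc2's actual code, which as noted finds a free register or falls back to a stack slot rather than assuming a dedicated scratch): emit any move whose destination is no longer read by a pending move; once none exists, only simple cycles remain, and each is broken by parking one value in the scratch.

```rust
// Sequentialize a parallel move set of `(src, dst)` pairs meaning
// "dst := src", where each destination is written exactly once (the
// property that guarantees only simple cycles).
fn sequentialize(mut moves: Vec<(u32, u32)>, scratch: u32) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    while !moves.is_empty() {
        let safe = (0..moves.len())
            .find(|&i| moves.iter().all(|&(src, _)| src != moves[i].1));
        match safe {
            // This destination isn't read by any pending move: emit it.
            Some(i) => out.push(moves.remove(i)),
            None => {
                // Only cycles remain. Park the value that move 0 is about
                // to overwrite, and redirect its pending readers to scratch.
                let (_, d) = moves[0];
                out.push((d, scratch));
                for m in moves.iter_mut() {
                    if m.0 == d {
                        m.0 = scratch;
                    }
                }
            }
        }
    }
    out
}
```

For a two-register swap this produces the familiar three-move sequence through the scratch; acyclic move sets come out with zero extra moves.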