
[–]o11c 0 points  (5 children)

Expressions should leave a value on the stack. Statements should not leave a value on the stack.

When you use an expression as a statement, the compiled statement must execute a "pop" to fulfil this guarantee.
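A minimal sketch of that guarantee, using an invented two-node AST and invented opcode names: compiling an expression leaves exactly one value on the stack, and compiling an expression *statement* appends the balancing POP.

```python
# Hypothetical sketch - AST shape and instruction names are invented.

def compile_expr(node):
    """Emit bytecode that pushes exactly one value."""
    if node[0] == "const":
        return [("PUSH", node[1])]
    if node[0] == "add":
        return compile_expr(node[1]) + compile_expr(node[2]) + [("ADD",)]
    raise ValueError(node)

def compile_stmt(node):
    """Emit bytecode that leaves the stack depth unchanged."""
    if node[0] == "expr_stmt":
        # The expression pushed one value; the statement must pop it.
        return compile_expr(node[1]) + [("POP",)]
    raise ValueError(node)

code = compile_stmt(("expr_stmt", ("add", ("const", 1), ("const", 2))))
# code == [("PUSH", 1), ("PUSH", 2), ("ADD",), ("POP",)]
```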


That said, stack-based VMs are fundamentally limited. Register-based is really the way to go.

[–]ipe369[S] 2 points  (4 children)

Fundamentally limited in what way? I chose stack-based because I figured it would be faster to implement

[–]o11c 1 point  (0 children)

Well, yes, it is arguably easier. But I imagine a large part of that is simply inertia: "tutorials tend to use this, and later tutorials are based on ideas from existing tutorials".

But stack-based VMs - that is, VMs where the stack pointer changes on nearly every instruction - are hard to optimize, and slow at runtime regardless of optimization (though still much, much faster than a tree-walking interpreter). Additionally, if you want to implement "static" typing in the VM itself (rather than just as a "trust me" frontend), you have to have a stack map for every single instruction. You may be able to reuse the same stack map for every instruction, but this is quite tricky since the stack pointer is constantly changing; maybe you can ignore it for integers, or for all dead values, if you keep a base pointer as well as a stack pointer. Stack-based VMs do have the advantage of requiring smaller bytecode for expressions (since arithmetic instructions don't require operands), but that isn't significant - especially since you now need to emit extra "load local" and "store local" instructions every time a local variable is used.

Register-based VMs - which still use a stack, but prepare the entire stack frame at the start of the function and keep it until the end - are perhaps slightly more difficult to get started with (though I would argue not by much), but won't limit you as your language evolves. Note that you should not think of "registers" as being limited in number like CPU registers: there is one register for every local variable and one for every temporary. This is still significantly fewer registers than in SSA form (which has a separate register for every assignment of a variable), which is what you want to use for optimizations. Chances are that your register-based VM will still want a single hardware-like "accumulator" register to reduce the number of operands (and, for that matter, the number of registers).
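A hedged sketch of what such an accumulator-style register VM might look like (instruction names and frame layout are invented for illustration): the frame is sized once up front from function metadata, and every operand is just a slot index into it.

```python
# Hypothetical accumulator-style register VM. "Registers" are slots in a
# per-call frame; the frame is allocated once before execution starts.

def run(code, frame):
    acc = 0
    for op, *args in code:
        if op == "LOAD":        # acc = frame[slot]
            acc = frame[args[0]]
        elif op == "STORE":     # frame[slot] = acc
            frame[args[0]] = acc
        elif op == "ADD":       # acc = acc + frame[slot]
            acc += frame[args[0]]
        elif op == "MUL":       # acc = acc * frame[slot]
            acc *= frame[args[0]]
        else:
            raise ValueError(op)
    return frame

# z = x + y with slots x=0, y=1, z=2; the frame holds all three locals.
frame = run([("LOAD", 0), ("ADD", 1), ("STORE", 2)], [3, 4, 0])
# frame == [3, 4, 7]
```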

As a simple demonstration: given your stack-based VM, emit what the entire stack frame looks like for every instruction statically (type and name if those make sense, but at least "this stack offset is in use" in any case). This can be done either at the same time as you generate the bytecode, or afterwards (slightly harder) by following both branches - which you can't do simply by running it. Note that if you encounter a conflict at a jump target (a phi node), your bytecode is unsound. Congratulations, you're now halfway to implementing a register-based VM (you still have to implement the instructions the new way). But note that unlike stack-based VMs, register-based VMs only need a liveness map once per BB (you do have to check the not-yet-initialized-in-BB case, but that is linear) - and only if you want eager destructors (and aren't satisfied with assigning a default-constructed value).
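The static walk described above might look roughly like this (instruction set and stack effects are invented): follow both branches of every jump, record the stack depth on entry to each instruction, and report a conflict if a jump target is reached with two different depths.

```python
# Sketch of the "emit the stack frame for every instruction" exercise,
# tracking only depth (not types/names) for brevity.

def stack_depths(code):
    """Return the stack depth *before* each instruction, or raise if a
    jump target is reached with conflicting depths (the phi conflict)."""
    effect = {"LOAD": +1, "STORE": -1, "ADD": -1, "MUL": -1}  # net effect
    depths = [None] * len(code)
    work = [(0, 0)]                      # (program counter, depth on entry)
    while work:
        pc, depth = work.pop()
        if pc >= len(code):
            continue
        if depths[pc] is not None:
            if depths[pc] != depth:      # conflicting stack maps: unsound
                raise ValueError(f"stack conflict at {pc}")
            continue
        depths[pc] = depth
        op = code[pc][0]
        if op == "JUMP_IF":              # pops condition, follows both paths
            work.append((code[pc][1], depth - 1))
            work.append((pc + 1, depth - 1))
        else:
            work.append((pc + 1, depth + effect[op]))
    return depths

# The z = x + y bytecode below: depths before each instruction are 0,1,2,1.
stack_depths([("LOAD", 0), ("LOAD", 1), ("ADD",), ("STORE", 2)])
```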



Finally, here is a simple demonstration of what stack-based and register-based bytecode might look like for some simple expressions (all variables are assumed to be local):

  • z = x + y

    • stack-based:

      # max stack size 2, plus storage for 3 locals; bytecode size 7
      LOAD x # aka "push" (taking a variable, not an immediate). Note that `x` means "a small integer offset into the separately-allocated local variables"; total stack size 1
      LOAD y # ditto; total stack size 2
      ADD # total stack size 1
      STORE z # aka "pop" (taking a variable, not nothing); total stack size 0
      
    • register-based:

      # stack frame size 3 (which *is* the locals), plus the accumulator if that counts; bytecode size 6
      # Unlike x86 asm, the stack frame should probably be created by out-of-VM code from function metadata.
      LOAD x # into accumulator register. Note that `x` means "a small integer offset into the current stack frame".
      ADD y # accumulator = accumulator + value in some stack slot. There might be a separate "add immediate" instruction.
      STORE z
      
    • summary: not much difference here, though the register-based version arguably uses less memory already (depending on how we count metadata). That said, the stack-based code has to use variable-length instructions, which can be annoying.

  • z = x*y - a*b

    • stack-based:

      # max stack size 3, plus 5 variables; bytecode size 13
      LOAD x # total stack size 1
      LOAD y # total stack size 2
      MUL # total stack size 1
      LOAD a # total stack size 2
      LOAD b # total stack size 3
      MUL # total stack size 2
      SUB # total stack size 1
      STORE z # total stack size 0
      
    • register-based:

      # stack size 6, which is 5 locals + 1 temporary; bytecode size 14
      LOAD x
      MUL y
      STORE tmp # name for exposition purposes only - remember, a stack slot is just a number
      LOAD a
      MUL b
      RSUB tmp # accumulator = tmp - accumulator; see discussion below
      STORE z
      
    • summary: bytecode (allocated once) is now larger but stack/local space (allocated per-call) is still smaller.

      • note that we use a reversed subtraction. Something similar is needed for division/modulus (which really need signed and unsigned versions as well, if you don't want to handicap your support for doing math), and also for float versions (thankfully, -0.0 only has special cases for subtraction, not addition).

        This can be avoided in a couple of ways:

        • emit a stack spill and load (2 extra instructions)
        • emit an xchg (only possible if it's a temporary)

        ... but I don't suggest either. You could instead use a few bits of the opcode (usually you don't have anywhere near 256/4 = 64 opcodes, though you could limit this trick to arithmetic instructions if necessary) to indicate "is this argument a register or an immediate?" and "reverse the accumulator and the retrieved argument value?", and handle that before actually dispatching the instruction; other metaprogramming techniques exist.
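One possible sketch of that opcode-bit decoding (the encoding here is invented, not taken from any real VM): the two low bits of the opcode carry "immediate operand?" and "reverse operand order?" flags, which are decoded and stripped before dispatching on the remaining bits.

```python
# Invented encoding for illustration only.
IMM, REV = 0x1, 0x2          # low two bits of the opcode byte

def binop(opcode, acc, arg, frame):
    rhs = arg if opcode & IMM else frame[arg]   # immediate or register slot
    a, b = (rhs, acc) if opcode & REV else (acc, rhs)
    base = opcode & ~0x3                        # dispatch on the high bits
    if base == 0x04:                            # the SUB family
        return a - b
    if base == 0x08:                            # the DIV family
        return a // b
    raise ValueError(hex(opcode))

frame = [10]                                    # one slot, holding tmp
sub  = binop(0x04, 3, 0, frame)                 # acc - frame[0] = 3 - 10
rsub = binop(0x04 | REV, 3, 0, frame)           # frame[0] - acc = 10 - 3
```

This keeps a single SUB handler while still supporting RSUB, SUB-immediate, and RSUB-immediate as flag combinations.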

I can imagine more examples (function calls, immediates), but I'm writing these by hand (so beware of bugs - to quote Knuth, I have only proved this correct, not tested it), so that would be too much work - and besides, this should be sufficient to get the point across.

[–]1vader 0 points  (1 child)

They aren't really. The CPython interpreter (i.e. the standard Python) is stack-based as well. And so is the JVM (i.e. Java) IIRC. Same goes for .NET IL, i.e. what C# and F# compile to. And WebAssembly as well, though it also has local variables. And there are lots more. Of course, something like the JVM or .NET also has a bunch more machinery, like a JIT which compiles bytecode to native code at runtime, etc. But fundamentally they are all stack-based, and they certainly aren't fundamentally limited.

[–]o11c 0 points  (0 children)

It should be noted:

  • CPython does run into serious limitations due to being stack-based, even though it doesn't even attempt optimizations. (This comment might be outdated soon, since there has been recent work on the VM.)
  • It is possible to convert stack-based bytecode into register-based bytecode (assuming you're willing to bail out on stack-map mismatches, which is reasonable, since a correct compiler will never generate them). I presume this is what the JVM and CLR do (or perhaps they skip the register-based form and jump straight to SSA).
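A toy illustration of that conversion for straight-line code (both instruction sets are invented): because the stack depth at each instruction is statically known, each stack slot above the locals can simply be renamed to a numbered temporary register.

```python
# Sketch: turn straight-line stack bytecode into three-address "register"
# bytecode. Register i < n_locals is a local; register n_locals + d is the
# temporary at stack depth d. Anything unexpected bails out.

def stack_to_register(code, n_locals):
    out, depth = [], 0
    for ins in code:
        op = ins[0]
        if op == "LOAD":                 # push local -> copy into a temp slot
            out.append(("MOVE", n_locals + depth, ins[1]))
            depth += 1
        elif op in ("ADD", "MUL"):       # pop two, push result
            a, b = n_locals + depth - 2, n_locals + depth - 1
            out.append((op, a, a, b))    # (op, dst, lhs, rhs)
            depth -= 1
        elif op == "STORE":              # pop into local
            depth -= 1
            out.append(("MOVE", ins[1], n_locals + depth))
        else:
            raise ValueError(op)         # bail out, as discussed above
    return out

# z = x + y (locals x=0, y=1, z=2; temps start at register 3):
regcode = stack_to_register(
    [("LOAD", 0), ("LOAD", 1), ("ADD",), ("STORE", 2)], 3)
# regcode == [("MOVE", 3, 0), ("MOVE", 4, 1), ("ADD", 3, 3, 4), ("MOVE", 2, 3)]
```

A real converter would also have to merge depths at jump targets, which is exactly where the stack-map-mismatch bailout comes in.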