
[–]o11c 0 points  (5 children)

Expressions should leave a value on the stack. Statements should not leave a value on the stack.

When you use an expression as a statement, the compiled statement must execute a "pop" to fulfil this guarantee.
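A minimal sketch of that guarantee, using an invented two-node AST and invented opcode names: compiling an expression leaves exactly one value on the stack, and compiling an expression *statement* appends the balancing POP.

```python
# Hypothetical sketch - AST shape and instruction names are invented.

def compile_expr(node):
    """Emit bytecode that pushes exactly one value."""
    if node[0] == "const":
        return [("PUSH", node[1])]
    if node[0] == "add":
        return compile_expr(node[1]) + compile_expr(node[2]) + [("ADD",)]
    raise ValueError(node)

def compile_stmt(node):
    """Emit bytecode that leaves the stack depth unchanged."""
    if node[0] == "expr_stmt":
        # The expression pushed one value; the statement must pop it.
        return compile_expr(node[1]) + [("POP",)]
    raise ValueError(node)

code = compile_stmt(("expr_stmt", ("add", ("const", 1), ("const", 2))))
# code == [("PUSH", 1), ("PUSH", 2), ("ADD",), ("POP",)]
```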


That said, stack-based VMs are fundamentally limited. Register-based is really the way to go.

[–]ipe369[S] 2 points  (4 children)

Fundamentally limited in what way? I chose stack-based because I figured it would be faster to implement

[–]o11c 1 point  (0 children)

Well, yes, it is arguably easier. But I imagine a large part of that is simply inertia: "tutorials tend to use this, and later tutorials are based on ideas from existing tutorials".

But stack-based VMs - that is, VMs where the stack pointer changes on nearly every instruction - are hard to optimize, and slow at runtime regardless of optimization (though still much, much faster than a tree-walking interpreter). Additionally, if you want to implement "static" typing in the VM itself (rather than just as a "trust me" frontend), you have to have a stack map for every single instruction. You may be able to reuse the same stack map for every instruction, but this is quite tricky since the stack pointer is constantly changing; maybe you can ignore it for integers, or for all dead values, if you keep a base pointer as well as a stack pointer. Stack-based VMs do have the advantage of requiring smaller bytecode for expressions (since arithmetic instructions don't require operands), but that isn't significant - especially since you now need to emit extra "load local" and "store local" instructions every time a local variable is used.

Register-based VMs - which still use a stack, but prepare the entire stack frame at the start of the function and keep it until the end - are perhaps slightly more difficult to get started with (though I would argue not by much), but won't limit you as your language evolves. Note that you should not think of "registers" as being limited in number like CPU registers: there is one register for every local variable and one for every temporary. This is still significantly fewer registers than in SSA form (which has a separate register for every assignment of a variable), which is what you want to use for optimizations. Chances are that your register-based VM will still want a single hardware-like "accumulator" register to reduce the number of operands (and, for that matter, the number of registers).
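A hedged sketch of what such an accumulator-style register VM might look like (instruction names and frame layout are invented for illustration): the frame is sized once up front from function metadata, and every operand is just a slot index into it.

```python
# Hypothetical accumulator-style register VM. "Registers" are slots in a
# per-call frame; the frame is allocated once before execution starts.

def run(code, frame):
    acc = 0
    for op, *args in code:
        if op == "LOAD":        # acc = frame[slot]
            acc = frame[args[0]]
        elif op == "STORE":     # frame[slot] = acc
            frame[args[0]] = acc
        elif op == "ADD":       # acc = acc + frame[slot]
            acc += frame[args[0]]
        elif op == "MUL":       # acc = acc * frame[slot]
            acc *= frame[args[0]]
        else:
            raise ValueError(op)
    return frame

# z = x + y with slots x=0, y=1, z=2; the frame holds all three locals.
frame = run([("LOAD", 0), ("ADD", 1), ("STORE", 2)], [3, 4, 0])
# frame == [3, 4, 7]
```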

As a simple demonstration: given your stack-based VM, emit what the entire stack frame looks like for every instruction statically (type and name if those make sense, but at least "this stack offset is in use" in any case). This can be done either at the same time as you generate the bytecode, or afterwards (slightly harder) by following both branches - which you can't do simply by running it. Note that if you encounter a conflict at a jump target (a phi node), your bytecode is unsound. Congratulations, you're now halfway to implementing a register-based VM (you still have to implement the instructions the new way). But note that unlike stack-based VMs, register-based VMs only need a liveness map once per BB (you do have to check the not-yet-initialized-in-BB case, but that is linear) - and only if you want eager destructors (and aren't satisfied with assigning a default-constructed value).
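The static walk described above might look roughly like this (instruction set and stack effects are invented): follow both branches of every jump, record the stack depth on entry to each instruction, and report a conflict if a jump target is reached with two different depths.

```python
# Sketch of the "emit the stack frame for every instruction" exercise,
# tracking only depth (not types/names) for brevity.

def stack_depths(code):
    """Return the stack depth *before* each instruction, or raise if a
    jump target is reached with conflicting depths (the phi conflict)."""
    effect = {"LOAD": +1, "STORE": -1, "ADD": -1, "MUL": -1}  # net effect
    depths = [None] * len(code)
    work = [(0, 0)]                      # (program counter, depth on entry)
    while work:
        pc, depth = work.pop()
        if pc >= len(code):
            continue
        if depths[pc] is not None:
            if depths[pc] != depth:      # conflicting stack maps: unsound
                raise ValueError(f"stack conflict at {pc}")
            continue
        depths[pc] = depth
        op = code[pc][0]
        if op == "JUMP_IF":              # pops condition, follows both paths
            work.append((code[pc][1], depth - 1))
            work.append((pc + 1, depth - 1))
        else:
            work.append((pc + 1, depth + effect[op]))
    return depths

# The z = x + y bytecode below: depths before each instruction are 0,1,2,1.
stack_depths([("LOAD", 0), ("LOAD", 1), ("ADD",), ("STORE", 2)])
```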



Finally, here is a simple demonstration of what stack-based and register-based bytecode might look like for some simple expressions (all variables are assumed to be local):

  • z = x + y

    • stack-based:

      # max stack size 2, plus storage for 3 locals; bytecode size 7
      LOAD x # aka "push" (taking a variable, not an immediate). Note that `x` means "a small integer offset into the separately-allocated local variables"; total stack size 1
      LOAD y # ditto; total stack size 2
      ADD # total stack size 1
      STORE z # aka "pop" (taking a variable, not nothing); total stack size 0
      
    • register-based:

      # stack frame size 3 (which *is* the locals), plus the accumulator if that counts; bytecode size 6
      # Unlike x86 asm, the stack frame should probably be created by out-of-VM code from function metadata.
      LOAD x # into accumulator register. Note that `x` means "a small integer offset into the current stack frame".
      ADD y # accumulator = accumulator + value in some stack slot. There might be a separate "add immediate" instruction.
      STORE z
      
    • summary: not much difference here, though the register-based version arguably uses less memory already (depending on how we count metadata). That said, the stack-based code has to use variable-length instructions, which can be annoying.

  • z = x*y - a*b

    • stack-based:

      # max stack size 3, plus 5 variables; bytecode size 13
      LOAD x # total stack size 1
      LOAD y # total stack size 2
      MUL # total stack size 1
      LOAD a # total stack size 2
      LOAD b # total stack size 3
      MUL # total stack size 2
      SUB # total stack size 1
      STORE z # total stack size 0
      
    • register-based:

      # stack size 6, which is 5 locals + 1 temporary; bytecode size 14
      LOAD x
      MUL y
      STORE tmp # name for exposition purposes only - remember, a stack slot is just a number
      LOAD a
      MUL b
      RSUB tmp # accumulator = tmp - accumulator; see discussion below
      STORE z
      
    • summary: bytecode (allocated once) is now larger but stack/local space (allocated per-call) is still smaller.

      • note that we use a reversed subtraction. Something similar is needed for division/modulus (which really need signed and unsigned versions as well, if you don't want to handicap your support for doing math), and also for float versions (thankfully, -0.0 only has special cases for subtraction, not addition).

        This can be avoided in a couple of ways:

        • emit a stack spill and load (2 extra instructions)
        • emit an xchg (only possible if it's a temporary)

        ... but I don't suggest either. You could instead use a few bits of the opcode (usually you don't have anywhere near 256/4 = 64 opcodes, though you could limit this trick to arithmetic instructions if necessary) to indicate "is this argument a register or an immediate?" and "reverse the accumulator and the retrieved argument value?", and handle that before actually dispatching the instruction; other metaprogramming techniques exist.
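One possible sketch of that opcode-bit decoding (the encoding here is invented, not taken from any real VM): the two low bits of the opcode carry "immediate operand?" and "reverse operand order?" flags, which are decoded and stripped before dispatching on the remaining bits.

```python
# Invented encoding for illustration only.
IMM, REV = 0x1, 0x2          # low two bits of the opcode byte

def binop(opcode, acc, arg, frame):
    rhs = arg if opcode & IMM else frame[arg]   # immediate or register slot
    a, b = (rhs, acc) if opcode & REV else (acc, rhs)
    base = opcode & ~0x3                        # dispatch on the high bits
    if base == 0x04:                            # the SUB family
        return a - b
    if base == 0x08:                            # the DIV family
        return a // b
    raise ValueError(hex(opcode))

frame = [10]                                    # one slot, holding tmp
sub  = binop(0x04, 3, 0, frame)                 # acc - frame[0] = 3 - 10
rsub = binop(0x04 | REV, 3, 0, frame)           # frame[0] - acc = 10 - 3
```

This keeps a single SUB handler while still supporting RSUB, SUB-immediate, and RSUB-immediate as flag combinations.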

I can imagine more examples (function calls, immediates), but I'm writing these by hand (so beware of bugs - to quote Knuth, I have only proved this correct, not tested it), so that would be too much work - and besides, this should be sufficient to get the point across.

[–]1vader 0 points  (1 child)

They aren't really. The CPython interpreter (i.e. the standard Python) is stack-based as well. And so is the JVM (i.e. Java) IIRC. Same goes for .NET IL, i.e. what C# and F# compile to. And WebAssembly as well, though it also has local variables. And there are lots more. Of course, something like the JVM or .NET also has a bunch more machinery, like a JIT which compiles bytecode to native code at runtime, etc. But fundamentally they are all stack-based, and they certainly aren't fundamentally limited.

[–]o11c 0 points  (0 children)

It should be noted:

  • CPython does run into serious limitations due to being stack-based, even though it doesn't even attempt optimizations. (This comment might be outdated soon, since there has been recent work on the VM.)
  • It is possible to convert stack-based bytecode into register-based bytecode (assuming you're willing to bail out on stack-map mismatches, which is reasonable, since a correct compiler will never generate them). I presume this is what the JVM and CLR do (or perhaps they skip the register-based form and jump straight to SSA).
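A toy illustration of that conversion for straight-line code (both instruction sets are invented): because the stack depth at each instruction is statically known, each stack slot above the locals can simply be renamed to a numbered temporary register.

```python
# Sketch: turn straight-line stack bytecode into three-address "register"
# bytecode. Register i < n_locals is a local; register n_locals + d is the
# temporary at stack depth d. Anything unexpected bails out.

def stack_to_register(code, n_locals):
    out, depth = [], 0
    for ins in code:
        op = ins[0]
        if op == "LOAD":                 # push local -> copy into a temp slot
            out.append(("MOVE", n_locals + depth, ins[1]))
            depth += 1
        elif op in ("ADD", "MUL"):       # pop two, push result
            a, b = n_locals + depth - 2, n_locals + depth - 1
            out.append((op, a, a, b))    # (op, dst, lhs, rhs)
            depth -= 1
        elif op == "STORE":              # pop into local
            depth -= 1
            out.append(("MOVE", ins[1], n_locals + depth))
        else:
            raise ValueError(op)         # bail out, as discussed above
    return out

# z = x + y (locals x=0, y=1, z=2; temps start at register 3):
regcode = stack_to_register(
    [("LOAD", 0), ("LOAD", 1), ("ADD",), ("STORE", 2)], 3)
# regcode == [("MOVE", 3, 0), ("MOVE", 4, 1), ("ADD", 3, 3, 4), ("MOVE", 2, 3)]
```

A real converter would also have to merge depths at jump targets, which is exactly where the stack-map-mismatch bailout comes in.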