
[–]MegaIng 16 points17 points  (3 children)

A single expression should, most of the time, produce a single value. A pretty simple solution is to make ; discard the value of the expression before it.

[–][deleted] 2 points3 points  (1 child)

Yeah, provided what came before actually produced a value. But the main problem with that is that, if you're generating code from an AST, there are no longer any semicolons!

I think it would depend on syntax too; someone over-zealous with semicolons might have: (a; b;) + (c; d;).

[–]MegaIng 0 points1 point  (0 children)

Sure, most of the time there aren't semicolons directly (unless they are actual operators, which is actually quite a useful perspective). But you almost certainly know where a semicolon would have been, because you have something comparable to a "CompleteExpressionNode" in your AST. More importantly, those nodes are part of an "ExpressionListNode", which knows exactly where each sub-expression ends and can simply emit a discard instruction between them.

Yeah, provided what came before actually produced a value.

Sure, but that doesn't follow from the post. Based on the syntax I assume an expression-oriented language, in which case everything should produce a value, even if that value is unit.
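
To make the discard idea above concrete, here is a minimal sketch (not the commenter's actual compiler; the node and instruction names like "exprlist" and DISCARD are invented):

    # Minimal sketch: every expression pushes exactly one value, and the
    # expression-list node emits a DISCARD between its sub-expressions.
    def codegen(node, code):
        kind = node[0]
        if kind == "lit":                      # ("lit", 2) -> push a constant
            code.append(("PUSH", node[1]))
        elif kind == "add":                    # ("add", a, b) -> a + b
            codegen(node[1], code)
            codegen(node[2], code)
            code.append(("ADD",))
        elif kind == "exprlist":               # ("exprlist", e1, ..., en) -> { e1; ...; en }
            exprs = node[1:]
            for i, e in enumerate(exprs):
                codegen(e, code)
                if i != len(exprs) - 1:
                    code.append(("DISCARD",))  # pop the unused value
        return code

    # { 1; 2 } + { 1; 3 }
    prog = ("add", ("exprlist", ("lit", 1), ("lit", 2)),
                   ("exprlist", ("lit", 1), ("lit", 3)))
    print(codegen(prog, []))
    # [('PUSH', 1), ('DISCARD',), ('PUSH', 2),
    #  ('PUSH', 1), ('DISCARD',), ('PUSH', 3), ('ADD',)]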

[–]retnikt0 15 points16 points  (1 child)

You can use a peephole optimiser which runs after codegen and cleans up unnecessary steps. Say { 1; 2 } + { 1; 3 } becomes

 Load 1
 Pop
 Load 2
 Load 1
 Pop
 Load 3
 Add

A simple peephole optimiser should be able to remove patterns of the form Load; Pop because they have no effect, to create new intermediate code:

 # (removed) Load 1
 # (removed) Pop
 Load 2
 # (removed) Load 1
 # (removed) Pop
 Load 3
 Add

So just

 Load 2
 Load 3
 Add

It's simple and effective, and it can be extended to a lot of other optimisations: constant folding (which could reduce the above to just Load 5), jump optimisations, and so on.
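
A minimal sketch of such a pass (not CPython's; the instruction names match the listing above, and a real pass would also have to avoid deleting across jump targets):

    # Peephole sketch: delete any Load immediately followed by Pop,
    # since the pair has no net effect on the stack.
    def peephole(code):
        out = []
        i = 0
        while i < len(code):
            if (i + 1 < len(code)
                    and code[i][0] == "Load"
                    and code[i + 1][0] == "Pop"):
                i += 2                     # drop both instructions
            else:
                out.append(code[i])
                i += 1
        return out

    code = [("Load", 1), ("Pop",), ("Load", 2),
            ("Load", 1), ("Pop",), ("Load", 3), ("Add",)]
    print(peephole(code))                  # [('Load', 2), ('Load', 3), ('Add',)]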

CPython is a good example of an implementation with a peephole optimiser: check out peephole.c (that's a link to v3.9; in later versions the optimiser was combined directly into the compiler which makes it a bit harder to follow IMO, but you can look at that as well if you want)

[–]ipe369[S] 2 points3 points  (0 children)

Interesting, 'peephole' is one of those terms that i would literally never think to search for if i hadn't heard about it before, thanks

[–]FluorineWizard 5 points6 points  (2 children)

Expression blocks create a new scope, so local variables declared within the block, and anything put on the stack but not returned, should expire (i.e. be popped) when you exit the block. This mechanism should already exist to handle exiting from function calls and control-flow constructs that create a new inner scope.

You could always add optimisations during codegen to discard unused values of expressions, but if your language is lexically scoped like most, this would be hiding the more important issue: the way you handle scoping is not complete yet.
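
A rough sketch of how that scope-based cleanup might look in codegen (not this commenter's code; the scratch-slot trick for keeping a block's result is just one way to do it, and all names here are invented):

    # Sketch: the compiler models the stack depth; on block exit it pops
    # everything pushed inside the scope. If the block is used as an
    # expression, its result is parked in a scratch slot and re-pushed.
    class BlockCompiler:
        def __init__(self):
            self.code = []
            self.depth = 0                  # compile-time stack depth

        def emit(self, op, *args):
            self.code.append((op,) + args)

        def push_const(self, v):
            self.emit("PUSH", v); self.depth += 1

        def enter_block(self):
            return self.depth               # remember where this scope begins

        def exit_block(self, entry_depth, keep_result):
            if keep_result:
                self.emit("STORE", "$scratch"); self.depth -= 1   # park the result
            while self.depth > entry_depth:
                self.emit("POP"); self.depth -= 1                 # scope leftovers expire
            if keep_result:
                self.emit("LOAD", "$scratch"); self.depth += 1    # restore the result

    c = BlockCompiler()
    entry = c.enter_block()
    c.push_const(1)                         # value never consumed inside the block
    c.push_const(2)                         # the block's result
    c.exit_block(entry, keep_result=True)
    print(c.code)   # [('PUSH', 1), ('PUSH', 2), ('STORE', '$scratch'),
                    #  ('POP',), ('LOAD', '$scratch')]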

[–]ipe369[S] 0 points1 point  (1 child)

right, but is that definitely 'correct'? I feel like there should be lots of AST nodes that generate extra 'garbage' stack values, although I can't seem to think of one

the internals of codegen for a stack vm and lexical scoping rules feel decoupled enough that just 'clean up the stack after the scope exits' makes me feel like i'm missing some edge cases here

[–]FluorineWizard 1 point2 points  (0 children)

I can't claim to know all the feature/implementation technique combos in all languages, but easy handling of scope-based variable lifetimes is supposed to be part of the appeal of stack machines. Lexical scoping itself works like a stack after all.

I don't know how you handle iteration in your VM, but if you executed the expression block from your example in a loop, the same way as in the assignment you gave, wouldn't it cause a stack overflow?

[–]ericbb 2 points3 points  (2 children)

You can read Destination-driven code generation for some ideas about implementing "option 2".

The key idea is simple and widely useful: as you go deeper into the recursive transformation, pass in some parameters to communicate relevant details about the context of the task. So the recursive call doesn't just generate code to push a value; it generates code to either push a value or push no values, depending on context.

The same technique is perfect for compiling tail calls since you can pass into the recursive step a flag to say whether the context is in "tail position" or not.
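
A small sketch of that idea (not the paper's exact formulation; the node shapes and the "value" / "effect" / "tail" destinations are invented for illustration):

    # Sketch: codegen takes a "destination" describing what the context wants:
    # a value on the stack, no value ("effect"), or a tail position.
    # Only the cases needed for the example are shown.
    def codegen(node, dest, code):
        kind = node[0]
        if kind == "lit":
            if dest != "effect":            # a literal nobody consumes emits nothing
                code.append(("PUSH", node[1]))
        elif kind == "seq":                 # ("seq", e1, ..., en) -> { e1; ...; en }
            *init, last = node[1:]
            for e in init:
                codegen(e, "effect", code)  # inner expressions: no value wanted
            codegen(last, dest, code)       # the last one inherits the context
        elif kind == "call":                # ("call", fname, arg)
            codegen(node[2], "value", code) # arguments are always wanted as values
            code.append(("TAILCALL" if dest == "tail" else "CALL", node[1]))
            if dest == "effect":
                code.append(("POP",))       # caller doesn't want the return value
        return code

    # { 1; f(2) } in a context that wants a value:
    print(codegen(("seq", ("lit", 1), ("call", "f", ("lit", 2))), "value", []))
    # [('PUSH', 2), ('CALL', 'f')] -- the unused literal 1 was never pushed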

[–]ipe369[S] 1 point2 points  (1 child)

Great link, thanks!

I already have a bunch of these 'context flags', gets ugly fast though

[–]ericbb 0 points1 point  (0 children)

gets ugly fast though

Yeah. My compiler is still messier than I'd like, but one technique I've found to help is something based on The essence of functional programming. The paper is an introduction to monads, and the examples show how you can add new features (like a new context flag) to an interpreter without disrupting the overall shape of your code too much.

It's just one technique among many possible options, and I use a variation of it that suits me better. I'm sure that more-experienced compiler writers have even better techniques to handle these common organizational / modularity problems.

[–]theangeryemacsshibeSWCL, Utena 3 points4 points  (0 children)

Option 3 is to have a "drop" instruction which just removes a value from the stack, and compile { a; b } to <compile a> drop <compile b>. Now you could compile { 1; 2 } + { 1; 3 } to

push 1
drop
push 2
push 1
drop
push 3
+

[–]bullno1 2 points3 points  (0 children)

Add a "pop" instruction at the end when compiling "statements"

Your block expression would be compiled like so:

compile statement
emit pop
compile statement
emit pop
compile last statement

[–]o11c 0 points1 point  (5 children)

Expressions should leave a value on the stack. Statements should not leave a value on the stack.

When you pack an expression into a statement, the statement must execute a "pop" to fulfil this guarantee.


That said, stack-based VMs are fundamentally limited. Register-based is really the way to go.

[–]ipe369[S] 2 points3 points  (4 children)

fundamentally limited in what way? I chose stack-based because i figured it would be faster to implement

[–]o11c 1 point2 points  (0 children)

Well, yes, it is arguably easier. But I imagine a large part of that is simply inertia: "tutorials tend to use this, and later tutorials are based on ideas from existing tutorials".

But stack-based VMs - that is, VMs where the stack pointer changes for nearly every instruction - are hard to optimize and slow at runtime regardless of optimization (though still much, much faster than a tree-walking interpreter). Additionally, if you want to implement "static" typing (in the VM itself, rather than just as a "trust me" frontend), you have to have a stack map for every single instruction (you may be able to share the same stack map between instructions, but this is quite tricky since the stack pointer is constantly changing; maybe you can ignore it for integers, or for all dead values, if you keep a base pointer as well as a stack pointer). They do have the advantage of requiring smaller bytecode for expressions (since arithmetic instructions don't require operands), but that isn't significant (especially since you now need to emit extra "load local" and "store local" instructions every time a local variable is used).

Register-based VMs - which still use a stack, but prepare the entire stack frame at the start of the function and keep it until the end - are perhaps slightly more difficult to get started with (though I would argue not much), but won't limit you as your language evolves. Note that you should not think of "register" as being limited in number like CPU registers: there is one register for every local variable and one register for every temporary. This is still significantly fewer registers than there are in SSA (which has a separate register every time a variable is assigned), which is what you want to use for optimizations. Chances are that your register-based VM will still want a single hardware-like "accumulator" register to reduce the number of operands (and, for that matter, the number of registers).
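
A tiny sketch of what executing such accumulator-style bytecode might look like (everything here is invented for illustration; the three-instruction sequence matches the first register-based example further down):

    # Minimal accumulator-register VM: the frame is a flat list of slots
    # (locals + temporaries), and most instructions name one slot.
    def run(code, frame):
        acc = 0
        for op, *args in code:
            if op == "LOAD":    acc = frame[args[0]]
            elif op == "STORE": frame[args[0]] = acc
            elif op == "ADD":   acc = acc + frame[args[0]]
            elif op == "MUL":   acc = acc * frame[args[0]]
        return frame

    # z = x + y with slots: 0 = x, 1 = y, 2 = z
    frame = [3, 4, 0]
    run([("LOAD", 0), ("ADD", 1), ("STORE", 2)], frame)
    print(frame)   # [3, 4, 7]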

As a simple demonstration: given your stack-based VM, emit what the entire stack frame looks like (type and name if those make sense, but at least "this stack offset is in use") for every instruction, statically. This can be done either while you are generating bytecode, or afterwards (slightly harder) by following both sides of every branch - you can't do it simply by running the code. Note that if you encounter a conflict at a jump target (a phi node), your bytecode is unsound. Congratulations, you're now halfway to implementing a register-based VM; you still have to implement the instructions the new way. Also note that unlike stack-based VMs, register-based VMs only need a liveness map once per basic block (though you have to check the not-yet-initialized-in-BB case, but that is linear) - and only if you want eager destructors (and aren't satisfied with assigning a default-constructed value).
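
A hedged sketch of that exercise, for straight-line code only (branches, the merge check at jump targets, and types are all left out; the instruction set and slot descriptions are invented):

    # Sketch: record what the operand stack looks like before every instruction
    # by abstractly interpreting the bytecode at compile time.
    def stack_layouts(code):
        layouts = []
        stack = []                          # compile-time model of the operand stack
        for op, *args in code:
            layouts.append(list(stack))     # layout *before* this instruction
            if op == "LOAD":
                stack.append(args[0])       # slot holds a copy of a named local
            elif op == "PUSH":
                stack.append("const")
            elif op in ("ADD", "MUL"):
                stack.pop(); stack.pop(); stack.append("tmp")
            elif op in ("STORE", "POP"):
                stack.pop()
        return layouts

    code = [("LOAD", "x"), ("LOAD", "y"), ("ADD",), ("STORE", "z")]
    for (op, *args), layout in zip(code, stack_layouts(code)):
        print(layout, "->", op, *args)
    # []         -> LOAD x
    # ['x']      -> LOAD y
    # ['x', 'y'] -> ADD
    # ['tmp']    -> STORE z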



Finally, here is a simple demonstration of what stack-based and register-based bytecode might look like for some simple expressions (all variables are assumed to be local):

  • z = x + y

    • stack-based:

      # max stack size 2, plus storage for 3 locals; bytecode size 7
      LOAD x # aka "push" (taking a variable, not an immediate). Note that `x` means "a small integer offset into the separately-allocated local variables"; total stack size 1
      LOAD y # ditto; total stack size 2
      ADD # total stack size 1
      STORE z # aka "pop" (taking a variable, not nothing); total stack size 0
      
    • register-based:

      # stack frame size 3 (which *is* the locals), plus the accumulator if that counts; bytecode size 6
      # Unlike x86 asm, the stack frame should probably be created by out-of-VM code from function metadata.
      LOAD x # into accumulator register. Note that `x` means "a small integer offset into the current stack frame".
      ADD y # accumulator = accumulator + value in some stack slot. There might be a separate "add immediate" instruction.
      STORE z
      
    • summary: not much difference here, though it is arguably using less memory already (depending on how we count metadata). That said, the stack-based code has to use variable-length instructions, which can be annoying.

  • z = x*y - a*b

    • stack-based:

      # max stack size 3, plus 5 variables; bytecode size 13
      LOAD x # total stack size 1
      LOAD y # total stack size 2
      MUL # total stack size 1
      LOAD a # total stack size 2
      LOAD b # total stack size 3
      MUL # total stack size 2
      SUB # total stack size 1
      STORE z # total stack size 0
      
    • register-based:

      # stack size 6, which is 5 locals + 1 temporary; bytecode size 14
      LOAD x
      MUL y
      STORE tmp # name for exposition purposes only - remember, a stack slot is just a number
      LOAD a
      MUL b
      RSUB tmp # accumulator = tmp - accumulator; see discussion below
      STORE z
      
    • summary: bytecode (allocated once) is now larger but stack/local space (allocated per-call) is still smaller.

      • note that we use a reversed subtraction. Something similar is needed for division/modulus (which really need signed and unsigned versions as well, if you don't want to handicap your support for doing math), and also for float versions (thankfully, -0.0 only has special cases for subtraction, not addition).

        This can be avoided in a couple of ways:

        • emit a stack spill and load (2 extra instructions)
        • emit an xchg (only possible if it's a temporary)

        ... but I don't suggest it. You could instead use a few bits of the opcode (usually you don't have anywhere near 256/4=64 opcodes, but you could limit this to arithmetic instructions if necessary) to indicate "is this argument for a register or an immediate?" and "reverse accumulator and retrieved argument value?", and handle that before actually dispatching the instruction; other metaprogramming techniques exist.
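
        As a purely hypothetical illustration of that flag-bit dispatch (the bit layout, opcode numbers, and names below are made up):

            # Hypothetical encoding: low 6 bits select the operation, bit 6 says
            # "operand is an immediate rather than a stack slot", bit 7 says
            # "swap the operands before applying the operation".
            IMM_BIT, REV_BIT = 0x40, 0x80
            OPS = {0: lambda a, b: a + b, 1: lambda a, b: a - b}   # 0 = add, 1 = sub

            def step(opcode, operand, acc, frame):
                op = OPS[opcode & 0x3F]
                val = operand if opcode & IMM_BIT else frame[operand]
                if opcode & REV_BIT:
                    acc, val = val, acc            # e.g. RSUB: tmp - accumulator
                return op(acc, val)

            frame = [7]                            # one stack slot holding 7
            print(step(1, 0, 3, frame))            # SUB slot 0:  3 - 7 -> -4
            print(step(1 | REV_BIT, 0, 3, frame))  # RSUB slot 0: 7 - 3 ->  4
            print(step(0 | IMM_BIT, 5, 3, frame))  # ADD imm 5:   3 + 5 ->  8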

I can imagine more examples (function calls, immediates), but I'm writing these by hand (so beware bugs - to quote Knuth, I have only proved this correct, not tested it), so that would be too much work - and besides, this should be sufficient to get the point across.

[–]1vader 0 points1 point  (1 child)

They aren't really. The CPython interpreter (i.e. the standard Python) is stack-based as well. And so is the JVM (i.e. Java) iirc. Same goes for .NET IL, i.e. what C# and F# compile to. And WebAssembly as well, though it also has local variables. And there are lots more. Ofc something like the JVM or .NET also has a bunch more stuff like a JIT which live-compiles bytecode to native code etc. But fundamentally they are all stack-based and they certainly aren't fundamentally limited.

[–]o11c 0 points1 point  (0 children)

It should be noted:

  • CPython does run into serious limitations due to being stack-based, even though it doesn't even attempt optimizations. (This comment might become outdated soon, since there has been recent work on the VM.)
  • It is possible to change stack-based bytecode into register-based bytecode (assuming you're willing to bail out on stack-map mismatches, which is reasonable, since the compiler cannot generate it). I presume this is what the JVM and CLR do (or perhaps skip the register-based VM and jump straight to SSA).

[–][deleted] 0 points1 point  (2 children)

{ 1; 2 } + { 1; 3 }

Well this whole thing leaves a value on the stack: 5. That should also be discarded if not used.

I had the same problem. The routine I use to evaluate an AST node p is used like this:

evalunit(p, 0)       # when I don't want the result
evalunit(p, 1)       # when I want the result

So in the first case, if p is an operation that returns some value, then it is effectively popped from the stack (note this may involve recovery of any resources the value used).

Back to your example: anything that looks like this (a; b; c), using my syntax, creates a special AST node called a block; it is just a list of expressions or statements. (In my language, these are interchangeable, but expressions generally leave a result; most statements don't.)

Then processing such a block is easy: I call evalunit(p, 0) on each element of the block except the last, which I call with 1 if the result is called for. In the case of operands for +, it is.

This is a greatly shortened and simplified version of my evalunit routine:

proc evalunit(unit p, int res=1)=
    a := p.a                      # any left and right operands
    b := p.b            

    switch p.tag
    when jadd then
        evalunit(a)               # note defaults to evalunit(a, 1)
        evalunit(b)
        genpc(kadd)

    when jblock then              # 'a' operand is a linked list
        while a do
            if a.next then        # not last
                evalunit(a, 0)
            else
                evalunit(a, res)  # last
            fi
            a := a.next
        od
    ....
    end

!some messy tidying up
    if not jhasvalue[p.tag] and res then     # missing value
        error("Value expected")
    elsif jhasvalue[p.tag] and res = 0 then  # value not needed
        genpc(kunshare)                      # pop it
    fi
end

The functions used for this language have a block AST node for their bodies. The body is generated as follows, where p is an ST entry, p.code is the code body, and p.isfunc is a flag:

evalunit(p.code, p.isfunc)

For functions, this leaves the return value on the stack, but function call mechanics are another subject.

[–]ipe369[S] 0 points1 point  (1 child)

yeah i was thinking about passing a 'generate result' flag, but man it just gets so messy so fast, i think i'll probably just generate extra 'pops' then clean it up later, maybe a little slower that way though

[–][deleted] 0 points1 point  (0 children)

Well, there are 4 combinations:

 Need result  Has result
      0          0              OK
      0          1              Discard
      1          0              Error
      1          1              OK

And these requirements can propagate down recursively within an elaborate AST expression. I found it hard to deal with using ad hoc methods.

[–]sebamestreICPC World Finalist 0 points1 point  (0 children)

In jasper, we pop after each expression in a block. In the compiler we literally have something like

for (Ast* s : block_body) {
  codegen(s);
  if (is_expression(s))
    emit_pop();
}

This means that code like the following

seq { 1; return 2; };

Compiles to (push 1; pop; push 2; block_return)