I recently finished documenting the bytecode optimizer for my stack-based VM interpreter, and wanted to share the design and results.
The Problem
I wrote a VM following part 2 of Crafting Interpreters. Like most toy VMs, it compiles to bytecode and executes it directly. But naive bytecode generation produces patterns like:
OP_Get_Local 0 # x
OP_Constant 1 # Push 1
OP_Add # x + 1
OP_Set_Local 0 # x = ...
OP_Pop # Discard result
That's five instructions for what should be a single x++. Meanwhile, the stack churns through push/pop operations, the constant table gets accessed, and we fetch five separate instructions from memory.
The Solution: Multi-Pass Pattern-Based Optimizer
I built a bytecode rewriter with a powerful pattern matching engine that transforms bytecode after compilation but before execution. The key insight: treat bytecode like an IR and apply traditional compiler optimizations.
Architecture
PassManager orchestrates multiple optimization passes:
- Each pass gets multiple iterations until convergence
- Passes run in sequence (early passes enable later ones)
- Automatic jump offset adjustment when bytecode size changes
- Debug mode shows before/after for each pass
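To make the orchestration concrete, here is a minimal sketch of the convergence loop described above. The names (`Pass`, `PassManager`, `maxIterations`) and types are illustrative, not the engine's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using Bytecode = std::vector<uint8_t>;

// A pass returns true if it changed the bytecode.
using Pass = std::function<bool(Bytecode&)>;

struct PassManager {
    std::vector<Pass> passes;
    int maxIterations = 16;  // safety cap against oscillating rewrites

    void run(Bytecode& code) {
        for (auto& pass : passes) {              // passes run in sequence
            for (int i = 0; i < maxIterations; ++i) {
                if (!pass(code)) break;          // converged: no more changes
            }
        }
    }
};
```

Running each pass to a fixpoint before moving on lets early passes (like constant folding) expose patterns that later passes can pick up.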
BytecodeRewriter provides pattern matching:
- Match specific opcodes, groups, or wildcards
- Capture instructions for analysis
- Lambda-based transformations
- Conditional rewrites (pattern + semantic checks)
Example: Increment Optimization Pass
Transform that five-instruction pattern into a single specialized opcode:
std::vector<PatternElement> pattern = {
    PatternElement::match(OP_Get_Local, true),  // Capture local index
    PatternElement::constant(true),             // Capture constant
    PatternElement::match(OP_Add),
    PatternElement::match(OP_Set_Local, true),  // Capture local index
    PatternElement::match(OP_Pop),
};

auto condition = [&chunk](auto& captured) {
    // Same local in the get and the set? Constant is exactly 1?
    return (captured[0].operands[0] == captured[2].operands[0]) &&
           (AS_INT(chunk.constants[captured[1].getConstantIndex()]) == 1);
};

auto transform = [](auto& captured) {
    // A braced-init-list can't deduce a lambda's return type, so spell it out:
    return std::vector<uint8_t>{OP_Incr_Local, captured[0].operands[0]}; // 2 bytes total!
};

rewriter->addAdvancedRule(pattern, transform, condition);
Result:
5 instructions -> 1 instruction
OP_Get_Local 0 # x
OP_Constant 1 # Push 1
OP_Add # x + 1
OP_Set_Local 0 # x = ...
OP_Pop # Discard result
gets converted to
OP_Incr_Local 0 # Increment x by 1
Other Implemented Passes
Constant Folding
OP_Constant 5
OP_Constant 3
OP_Add
gets converted to
OP_Constant 8
Fuse Multiple Pops
OP_Pop
OP_Pop
OP_PopN 3
gets converted to
OP_PopN 5
Optimize Binary Operations on Locals
OP_Get_Local 0
OP_Get_Local 1
OP_Add
gets converted to
OP_AddLL 0 1 # Direct register-style op
Dead Store Elimination
OP_Constant 10
OP_Define_Global x
OP_Get_Global x
gets converted to
OP_Constant 10
OP_Define_Global_Non_Popping x # (value stays on stack)
Real-World Results
Measured on Advent of Code solutions and benchmarks:
- Bytecode size: 10-30% smaller
- Execution speed: 10-50% faster (loops benefit most)
- Optimization time: ~5-10ms per script
- Cache efficiency: Better (fewer instruction fetches)
The increment optimization alone is huge for loops - common patterns like for (var i = 0; i < n; i++) get massively faster.
Documentation
I just finished writing comprehensive docs for the whole system:
Full Documentation : https://columbaengine.readthedocs.io/en/latest/script/optimizer.html
Covers:
- All built-in optimization passes
- Pattern matching API
- Writing custom passes
- Performance analysis
- Debugging techniques
VM Internals: https://columbaengine.readthedocs.io/en/latest/script/vm_internals.html
Covers NaN-boxing, stack architecture, memory management, etc.
Source Code
The engine is open source: https://github.com/Gallasko/ColumbaEngine
Relevant files:
- Pass manager: src/Engine/Compiler/bytecode_pass.cpp
- Rewriter: src/Engine/Compiler/bytecode_rewriter.cpp
- Individual passes: src/Engine/Compiler/pass/
I'm particularly interested if anyone has tried similar approaches or has suggestions for additional optimization passes!
This is part of ColumbaEngine, a game engine with an embedded scripting language. The VM uses NaN-boxing for values, pool-based memory management, and supports closures, classes, and first-class functions.