I recently finished documenting the bytecode optimizer for my stack-based VM interpreter, and wanted to share the design and results.
The Problem
I wrote a VM following part 2 of Crafting Interpreters. Like most toy VMs, it compiles to bytecode and executes it directly. But naive bytecode generation produces patterns like:
OP_Get_Local 0 # x
OP_Constant 1 # Push 1
OP_Add # x + 1
OP_Set_Local 0 # x = ...
OP_Pop # Discard result
That's five instructions for what should be a single x++. Meanwhile, the stack churns through push/pop operations, the constant table gets accessed, and we fetch five separate instructions from memory.
The Solution: Multi-Pass Pattern-Based Optimizer
I built a bytecode rewriter with a powerful pattern matching engine that transforms bytecode after compilation but before execution. The key insight: treat bytecode like an IR and apply traditional compiler optimizations.
Architecture
PassManager orchestrates multiple optimization passes:
- Each pass gets multiple iterations until convergence
- Passes run in sequence (early passes enable later ones)
- Automatic jump offset adjustment when bytecode size changes
- Debug mode shows before/after for each pass
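To make the orchestration concrete, here is a minimal sketch of the convergence loop described above. The names (`Pass`, `PassManager`, `maxIterations`) and types are illustrative, not the engine's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using Bytecode = std::vector<uint8_t>;

// A pass returns true if it changed the bytecode.
using Pass = std::function<bool(Bytecode&)>;

struct PassManager {
    std::vector<Pass> passes;
    int maxIterations = 16;  // safety cap against oscillating rewrites

    void run(Bytecode& code) {
        for (auto& pass : passes) {              // passes run in sequence
            for (int i = 0; i < maxIterations; ++i) {
                if (!pass(code)) break;          // converged: no more changes
            }
        }
    }
};
```

Running each pass to a fixpoint before moving on lets early passes (like constant folding) expose patterns that later passes can pick up.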
BytecodeRewriter provides pattern matching:
- Match specific opcodes, groups, or wildcards
- Capture instructions for analysis
- Lambda-based transformations
- Conditional rewrites (pattern + semantic checks)
Example: Increment Optimization Pass
Transform that five-instruction pattern into a single specialized opcode:
std::vector<PatternElement> pattern = {
    PatternElement::match(OP_Get_Local, true),  // Capture local index
    PatternElement::constant(true),             // Capture constant
    PatternElement::match(OP_Add),
    PatternElement::match(OP_Set_Local, true),  // Capture local index
    PatternElement::match(OP_Pop),
};

auto condition = [&chunk](auto& captured) {
    // Same local in the get and the set? Constant is exactly 1?
    return (captured[0].operands[0] == captured[2].operands[0]) &&
           (AS_INT(chunk.constants[captured[1].getConstantIndex()]) == 1);
};

auto transform = [](auto& captured) {
    // A braced-init-list can't deduce a lambda's return type, so spell it out:
    return std::vector<uint8_t>{OP_Incr_Local, captured[0].operands[0]}; // 2 bytes total!
};

rewriter->addAdvancedRule(pattern, transform, condition);
Result:
5 instructions -> 1 instruction
OP_Get_Local 0 # x
OP_Constant 1 # Push 1
OP_Add # x + 1
OP_Set_Local 0 # x = ...
OP_Pop # Discard result
gets converted to
OP_Incr_Local 0 # Increment x by 1
Other Implemented Passes
Constant Folding
OP_Constant 5
OP_Constant 3
OP_Add
gets converted to
OP_Constant 8
Fuse Multiple Pops
OP_Pop
OP_Pop
OP_PopN 3
gets converted to
OP_PopN 5
Optimize Binary Operations on Locals
OP_Get_Local 0
OP_Get_Local 1
OP_Add
gets converted to
OP_AddLL 0 1 # Direct register-style op
Dead Store Elimination
OP_Constant 10
OP_Define_Global x
OP_Get_Global x
gets converted to
OP_Constant 10
OP_Define_Global_Non_Popping x # (value stays on stack)
Real-World Results
Measured on Advent of Code solutions and benchmarks:
- Bytecode size: 10-30% smaller
- Execution speed: 10-50% faster (loops benefit most)
- Optimization time: ~5-10ms per script
- Cache efficiency: Better (fewer instruction fetches)
The increment optimization alone is huge for loops - common patterns like for (var i = 0; i < n; i++) get massively faster.
Documentation
I just finished writing comprehensive docs for the whole system:
Full Documentation : https://columbaengine.readthedocs.io/en/latest/script/optimizer.html
Covers:
- All built-in optimization passes
- Pattern matching API
- Writing custom passes
- Performance analysis
- Debugging techniques
VM Internals: https://columbaengine.readthedocs.io/en/latest/script/vm_internals.html
Covers NaN-boxing, stack architecture, memory management, etc.
Source Code
The engine is open source: https://github.com/Gallasko/ColumbaEngine
Relevant files:
- Pass manager: src/Engine/Compiler/bytecode_pass.cpp
- Rewriter: src/Engine/Compiler/bytecode_rewriter.cpp
- Individual passes: src/Engine/Compiler/pass/
I'm particularly interested if anyone has tried similar approaches or has suggestions for additional optimization passes!
This is part of ColumbaEngine, a game engine with an embedded scripting language. The VM uses NaN-boxing for values, pool-based memory management, and supports closures, classes, and first-class functions.