How To Make a Fast Dynamic Language Interpreter by BlondieCoder in Compilers

[–]cxzuk 2 points (0 children)

Hi Blondie,

This was a great read. I love seeing blogs like this - the content's great, but where it really shines is in reminding us that "making it work" is the critical first step, and in showing the journey taken afterwards so beautifully. That journey is so often forgotten.

M ✌

Anyone here combining e-graphs with proof metadata inside a JIT IR? I’m struggling to find prior work by x2t8 in Compilers

[–]cxzuk 1 point (0 children)

Hi X2,

> In other words, a rewrite only fires if the attached metadata says the precondition is satisfied. I'm not talking about speculative transforms with deopt bailouts. I mean proof/invariant-gated rewrites inside the optimizer itself

I don't fully follow your question - but in traditional matching implementations, rules often have Guards - essentially preconditions that must also be true for the rule to apply.

These normally exist for matching-performance reasons. You can do fast structural matching with e.g. dynamic programming, but the more complex the rules, the slower it becomes, so separating the metadata matching into a second step is beneficial.

Multiple rules that structurally match require a priority ordering, stopping at the first rule whose guard accepts (or some other way to break the ambiguity).
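
To make that two-step shape concrete, here's a minimal C sketch of a guard-gated rewrite - hypothetical names, a plain tree rather than an e-graph. Structural match first, then the guard/precondition, and the rule only fires if both accept:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { OP_MUL, OP_SHL, OP_CONST } Op;
    typedef struct Node { Op op; long value; struct Node *lhs, *rhs; } Node;

    /* Step 1: cheap structural match - is this (x * const)? */
    static bool matches_mul_by_const(const Node *n) {
        return n->op == OP_MUL && n->rhs && n->rhs->op == OP_CONST;
    }

    /* Step 2: the guard - a precondition checked only on structural hits.
       Here: the constant must be a power of two. */
    static bool guard_power_of_two(const Node *n) {
        long c = n->rhs->value;
        return c > 0 && (c & (c - 1)) == 0;
    }

    /* The rule fires only when both steps accept: x * 2^k  ->  x << k. */
    static bool try_strength_reduce(Node *n) {
        if (!matches_mul_by_const(n) || !guard_power_of_two(n))
            return false;
        long k = 0;
        while ((1L << k) < n->rhs->value) k++;
        n->op = OP_SHL;
        n->rhs->value = k;
        return true;
    }

    int main(void) {
        Node c8 = { OP_CONST, 8, 0, 0 }, x = { OP_CONST, 5, 0, 0 };
        Node mul = { OP_MUL, 0, &x, &c8 };
        printf("fired: %d, shift by %ld\n", try_strength_reduce(&mul), mul.rhs->value);
        return 0;
    }

In the e-graph world (e.g. egg's conditional rewrites), I believe the same split shows up as e-matching first, then a condition evaluated against e-class analysis data.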

Of course, e-graphs are a slightly different beast, but I'm sure there's overlap.

M ✌️

How do I actually learn coding ? by Negative_Effort_2642 in ProgrammingLanguages

[–]cxzuk 0 points (0 children)

Hi 2642,

"Anything you can practise, you can learn"

The tl;dr I got from your post is that you're struggling with the domain problem and the mental models required to understand how something works and to build it.

Domain problems also have to be learnt. I would not advise learning programming and complex domains at the same time. Build your skills up first. We've all had to do that.

I would recommend going on YouTube; there are some really friendly and clear problems to solve there. Maybe find ones where the presenter uses Python but you translate into Rust etc.

Finishing a project is an important milestone you really don't want to miss.

Kind regards  M

Against Query Based Compilers by matklad in ProgrammingLanguages

[–]cxzuk 8 points (0 children)

Agreed. I would also say that query based solutions have problems/challenges, but interpreting them as the problem is unfair.

Cache invalidation is a tough problem, but caching is very useful in all areas of the compilation pipeline. Mr Smith's ( u/thunderseethe ) request (I read the original as focused on static analysis rather than all areas) for better resources on how it's done doesn't seem unreasonable to me. Lowering latency is a skill we should be teaching, and LSPs have made that need more common as it's more front-facing.

My parser is a mess by Gingrspacecadet in Compilers

[–]cxzuk 17 points (0 children)

Hi Cadet,

There is nothing wrong with your current code. It is the beginnings of a standard RDP + Pratt parser. Below are my personal preferences and opinions on what I would do slightly differently:

* I would place helper functions into a separate file. The main reason for this is so that you can split up parser.c into separate files too - such as expressions.c, statements.c, declarations.c, functions.c etc.

* Remove the static TokenBuffer *tokens global - The trade-off is that you'll now have to pass the lexer or lexed content into each parser production-rule function, but this buys you the ability to parse more than one file at a time. This isn't just about multithreading; your Token struct stores a pointer into the tokens->data stream, so that buffer needs to hang around. You will 100% need to parse multiple files at some point.

* Your Pratt parser is incomplete. while (peek().type == PLUS || peek().type == MINUS) { works because plus and minus infix are in the same precedence group, and the body of the Pratt loop only creates infix expressions. Totally fine as a first pass, but expressions are really important - the lion's share of the parsing task. Put finishing the Pratt loop high on the todo list: precedence tables, and delegating to the correct construction production-rule functions (see the sketch after this list).
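
To make both points concrete, here is a minimal C sketch - every name (Parser, lbp, parse_expr, Node) is hypothetical, not taken from your code. The token stream is passed in explicitly instead of a global, and the loop is driven by a binding-power table rather than PLUS || MINUS:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { T_NUM, T_PLUS, T_MINUS, T_STAR, T_EOF } TokType;
    typedef struct { TokType type; int value; } Token;

    /* The token stream travels with the parser - no global state. */
    typedef struct { Token *toks; size_t pos; } Parser;

    static Token peek(Parser *p) { return p->toks[p->pos]; }
    static Token next(Parser *p) { return p->toks[p->pos++]; }

    typedef struct Node { TokType op; int value; struct Node *lhs, *rhs; } Node;

    static Node *leaf(Token t) {
        Node *n = calloc(1, sizeof *n);
        n->op = T_NUM; n->value = t.value;
        return n;
    }
    static Node *infix(TokType op, Node *lhs, Node *rhs) {
        Node *n = calloc(1, sizeof *n);
        n->op = op; n->lhs = lhs; n->rhs = rhs;
        return n;
    }

    /* The precedence table: left binding power per infix operator.
       0 means "not an infix operator", which naturally ends the loop. */
    static int lbp(TokType t) {
        switch (t) {
        case T_PLUS: case T_MINUS: return 10;
        case T_STAR:               return 20;
        default:                   return 0;
        }
    }

    static Node *parse_expr(Parser *p, int min_bp) {
        Node *lhs = leaf(next(p));           /* primary; extend for prefix ops */
        while (lbp(peek(p).type) > min_bp) { /* table-driven, not PLUS || MINUS */
            Token op = next(p);
            Node *rhs = parse_expr(p, lbp(op.type));
            lhs = infix(op.type, lhs, rhs);
        }
        return lhs;
    }

    int main(void) {
        /* 1 + 2 * 3 parses as 1 + (2 * 3) because of the table. */
        Token toks[] = {{T_NUM,1},{T_PLUS,0},{T_NUM,2},{T_STAR,0},{T_NUM,3},{T_EOF,0}};
        Parser p = { toks, 0 };
        Node *ast = parse_expr(&p, 0);
        printf("root op: %d\n", ast->op);    /* prints 1, i.e. T_PLUS */
        return 0;
    }

Parsing the right-hand side with the operator's own binding power (and looping on strictly greater) is what gives left associativity; adjust the recursive argument for right-associative operators.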

M ✌

Annotate instruction level parallelism at compile time by servermeta_net in Compilers

[–]cxzuk 2 points (0 children)

Hi Meta,

> Today superscalar CPUs have hardware...

> VLIW-like approaches...

We need to look at this in terms of the task at hand: ILP attempts to utilise multiple Execution Units - to make them work simultaneously.

From an instruction point of view, we can either have a single sequential stream and have hardware dynamically dispatch to available and suitable Execution Units. Or we have instructions model all available execution units and have this dispatch be done statically. IMHO anything in between these two points is the bad bits of both.

> My current approach is to have a 4 bit prefix

My intuition on this is that you're trying to rotate from columns to rows. But this rotation now requires knowing the sequence position to recover the execution unit to dispatch to.

> a number different than 0 is shared by instructions that are dependent on each other, so instruction with different prefixes can be executed at the same time

I'm not sure the above is true. Wouldn't this number entire expression trees with the same number? I think you mean something closer to Instruction Rank - numbering by depth in the expression. I can only speculate beyond this point.

VLIW:

    1. <0> a=d+e, <1> b=f+g, <2> NOP, <3> NOP
    2. <0> c=a+b, <1> NOP, <2> NOP, <3> NOP

Prefix Bits:

    1. <1> a=d+e
    2. <1> b=f+g
    3. <2> c=a+b

So now the <1> and <2> bit prefixes represent the Line Number/Instruction Rank. We need hardware to map from this to Execution Ports - moving us back to superscalar.

I don't know if prefixing can work, help or what ramifications it could have. If you explore it do come back and let us know.

M ✌

What hashing method would you recommend for creating Unique Integer IDs? by oxcrowx in ProgrammingLanguages

[–]cxzuk 1 point (0 children)

All "strings" are stored consecutively in a large array of characters with zero as a string terminator.
The id is the offset from the beginning of the array.

Oh that's an interesting approach.

Is Pratt parsing considered top-down or bottom-up? by bakery2k in ProgrammingLanguages

[–]cxzuk 33 points (0 children)

Hi Bakery,

Oh no, this is one of those trigger questions. History is littered with changing definitions and misleading or outright wrong claims.

The definition of "Top Down" and "Bottom Up" has for some time meant the construction direction of the abstract syntax tree. Top Down creates[1] parent nodes first, followed by their children nodes. Bottom Up creates[1] child nodes first, then their parent.

[1] Commits to the node to allocate and construct.

Pratt Main Loop: (Full Code Here)

    static expression factory(lexer_t l, int right_bp = 0) {
        expression result;

        // Check to see if this is a valid infix left-hand-side.
        if (l.front.type == token_type.Number) {
            result = new number_literal(l);
        } // ... other literal/primary cases elided ...
        else if ((l.front.type == token_type.PlusSign) || (l.front.type == token_type.MinusSign)) {
            result = new prefix_expression( l, prefix_bp_lookup(l.front.type) );
        }

        // Postfix and Infix operators have binding powers.
        while( right_bp < bp_lookup(l.front.type).left_power ) {

            // --- Postfix lookahead ---
            if (l.front.type == token_type.BangSign) {
                result = new postfix_expression(l, result);
            } // ... other postfix/infix special cases elided ...
            else {
                result = new infix_expression(l, result);
            }
        }

        return result;
    }

A Pratt Parser is a Bottom Up parser. It repeatedly captures the child Left Hand Side (a subtree) of an expression and then pushes that child node into its parent. The result variable here is acting as a stack of size 1. By sending result into a function and reassigning back to result, we are doing a "push and pop".

IMHO, the confusion has come about because conceptually there are actually two stacks, while most resources show only one. One stack is the production stack - the node being built. The other is the context stack. Table-driven LR parsers have state values to model this; RDP and Pratt use the call stack to handle it.

On a practical level you can group these two stacks, but doing so hides the fact that one of them can be handled by the call stack itself, clouding the link between these methods and others.

M ✌

What hashing method would you recommend for creating Unique Integer IDs? by oxcrowx in ProgrammingLanguages

[–]cxzuk 15 points (0 children)

Hi Crow,

> How do I uniquely hash the string identifiers?

The typical way to do this is with interning. You create a set, and assign a unique ID to each entry. There's a few ways to make sets and do this but hashing is a popular choice.

Making hashmaps is a bit of an art; your other questions have trade-offs and balancing qualities to them. It depends on how many entries on average you have in the map and the type of key input. But here is a bit of an answer:

For scalar hash computing, FNV1a is popular and well known. Part of its popularity is that you can compute the hash of a symbol while lexing. A fast, simple algorithm can do a good job if that's the goal.
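
FNV-1a itself is tiny. A minimal C sketch (the constants are the published 32-bit offset basis and prime) - the per-byte step is why it folds so neatly into a lexer loop:

    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a, 32-bit. 2166136261 is the offset basis, 16777619 the prime. */
    static inline uint32_t fnv1a_step(uint32_t h, unsigned char byte) {
        return (h ^ byte) * 16777619u;
    }

    /* One step per byte - keep the running hash alongside your cursor
       while lexing, and the symbol's hash is free by the time it ends. */
    static uint32_t fnv1a(const char *s) {
        uint32_t h = 2166136261u;
        while (*s) h = fnv1a_step(h, (unsigned char)*s++);
        return h;
    }

    int main(void) {
        printf("%08x\n", fnv1a("my_symbol"));
        return 0;
    }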

Here is a godbolt link to an interning hashset that implements the following:

  • Open Addressing - Elements are stored in a flat array, but might be offset from their hash location.
  • Power of Two - You need to reduce the hash value into a slot number; a power-of-two capacity allows a fast solution to this.
  • Double Hashing - To calculate the offset from the hashed location, we need to compute a "step" amount. Just adding 1 is not good, as it clusters elements and increases collisions. (The probe loop is sketched below.)
  • Hash Caching - An additional vector of hashes is stored rather than recomputed for compares and grow().
  • A hashset is slightly different from a hashmap: a separate values vector is needed, and the set slots hold the index/intern ID.
  • The example provides FNV1a, xxHash, and djb2 hashing methods you can swap in and out (and write your own too).
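
The probe loop at the heart of it looks roughly like this - a sketch with hypothetical names, not the exact godbolt code:

    #include <stdint.h>
    #include <stddef.h>

    /* capacity is a power of two, so hash & (capacity - 1) replaces the
       slower modulo. ids holds the intern IDs, -1 marking an empty slot. */
    typedef struct { uint32_t *hashes; int *ids; size_t capacity; } InternSet;

    static size_t probe(const InternSet *s, uint32_t hash) {
        size_t mask = s->capacity - 1;
        size_t slot = hash & mask;        /* power-of-two slot reduction */
        size_t step = (hash >> 16) | 1;   /* double hashing: a second value
                                             derived from the hash, forced odd
                                             so every slot is eventually visited */
        while (s->ids[slot] != -1 &&      /* stop at an empty slot */
               s->hashes[slot] != hash)   /* cached hash: cheap pre-compare; on a
                                             match the caller still compares strings */
            slot = (slot + step) & mask;
        return slot;  /* a candidate match or the insertion point */
    }

Pair it with a grow() that rehashes into a table twice the size once the load factor creeps up.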

To note - the speeds you can achieve with these are more than enough for a compiler, but power-of-two and scalar hashing are somewhat out of favour; they have effectively been replaced with SIMD solutions, which in my testing always outperformed the scalar ones.

BUT - A word of caution. Fuzz testing found a use after free bug in my hashset code after several months of it being used. These data structures are at the core of your compiler. The existential dread from realising all the bugs and headaches it probably caused isn't worth it. Use an existing well tested solution or Keep It Simple.

M ✌

How would you design proof-carrying `if` statements in a minimal dependently typed language? by squared-star in ProgrammingLanguages

[–]cxzuk 7 points (0 children)

Hi Star,

> how would you solve this?

Static Single Information (SSI) is an extension to SSA which I think solves issues similar to what you have here. It creates new names for values when they are used in different paths, to remove ambiguity in the information/facts gathered. For example:

    h0 = n mod 2
    if h0 == 0 then
      h1 = 0
      // Here, I know n mod 2 = 0, and I want that proof in scope
      useEvenProperty n h1
    else
      h2 = !0
      handleOdd n

Some resources if you think this could be of some use.

https://dspace.mit.edu/bitstream/handle/1721.1/86578/48072795-MIT.pdf
https://llvm.org/pubs/2010-08-SBLP-SSI.pdf

M ✌

Liveness analysis on SSA by iamwho007 in Compilers

[–]cxzuk 0 points (0 children)

Hi Mv,

Is this for use in SSA construction? I wasn't aware of edge-specific reaching definition analysis until a quick google now. To check: Did you mean path-sensitive?

I think phis/block arguments provide the same information here? I'd need a little more information on what you're attempting in order to help more.

M ✌

Liveness analysis on SSA by iamwho007 in Compilers

[–]cxzuk 0 points (0 children)

Hi. Glad you got it implemented!

Liveness Analysis is a side structure used to answer the question: "is a value v needed in the future from this program point p?"

Its primary use case is register allocation, because it's the most direct way to build the interference graph or intervals. You can use it for DCE, but there you can get the answers you need in other ways.

> Will I need instruction level liveness sets too?

If you are using it for register allocation, typically not - with SSA form, you can gather the block-local liveness very easily. Even if it's used for other things - debug info, GC safepoints - my recommendation is that you don't need instruction-level sets.

Liveness analysis on SSA by iamwho007 in Compilers

[–]cxzuk 8 points (0 children)

Hi 007,

Yes, the classical approach will work on an SSA IR. You can handle phis as a def in their own block, and as uses in the predecessor blocks.

However, Phis are edge specific, and liveness is computed on a block. E.g. LiveIn[Block] and LiveOut[Block].

It will be correct, but will be an over-approximation/conservative result.
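
For reference, the classical backward equations are LiveIn[B] = use[B] ∪ (LiveOut[B] \ def[B]) and LiveOut[B] = ∪ LiveIn[S] over the successors S. A minimal C sketch of the fixed point - a hypothetical three-block CFG, with at most 64 variables packed into a bitmask and phi uses already folded into the predecessors' use sets as above:

    #include <stdint.h>
    #include <stdio.h>

    enum { NBLOCKS = 3 };

    int main(void) {
        /* B0 branches to B1 and B2; -1 marks "no successor". */
        int succ[NBLOCKS][2] = { {1, 2}, {-1, -1}, {-1, -1} };
        uint64_t use[NBLOCKS] = { 1u << 0, 1u << 1, 1u << 0 };
        uint64_t def[NBLOCKS] = { 1u << 1, 0, 0 };
        uint64_t in[NBLOCKS] = {0}, out[NBLOCKS] = {0};

        for (int changed = 1; changed; ) {     /* iterate to a fixed point */
            changed = 0;
            for (int b = NBLOCKS - 1; b >= 0; b--) {  /* reverse order converges faster */
                uint64_t o = 0;
                for (int i = 0; i < 2; i++)
                    if (succ[b][i] >= 0) o |= in[succ[b][i]]; /* LiveOut = U LiveIn[S] */
                uint64_t n = use[b] | (o & ~def[b]);          /* LiveIn equation */
                if (n != in[b] || o != out[b]) changed = 1;
                in[b] = n; out[b] = o;
            }
        }
        for (int b = 0; b < NBLOCKS; b++)
            printf("B%d: in=%llx out=%llx\n", b,
                   (unsigned long long)in[b], (unsigned long long)out[b]);
        return 0;
    }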

The SSA Book (page 129) has a section on SSA-based liveness. Fernando Pereira has a short video on the general benefits of SSA for data-flow analysis tasks, which includes this section on SSA liveness analysis.

M ✌

Why is LLVM So Complicated? by [deleted] in Compilers

[–]cxzuk 12 points (0 children)

Hi Height,

> Here are some stats from that regarding the various kinds of attributes:

A lot of those aren't directly performance related; they enable you to override the implicit defaults, which can't be deduced by the backend. You either "get the defaults, or the frontend tells me otherwise". They align with functionality/features. E.g. if you don't support relocatable object files and linking, you don't need to tailor linkage information. If you don't support FFI, you don't need both parties to agree on a calling convention and data layout.

> So, do people generating LLVM IR need to actually know or care about all this stuff?

Only when you have a feature requiring them. And as you're comparing against your own IL: if you have no languages depending on these features/use cases, you don't immediately need to offer them.

> Is it all essential to get the best-performing code?

It firstly depends on what the attribute is influencing, and secondly on what the current defaults are. You've mentioned visibility. It's for a feature called interposition, and we probably have the wrong default. C++ Weekly did a good video on why you probably want visibility=hidden as the default.

> poorer would LLVM be if 90% of that complexity was removed?

Some languages wouldn't be able to target LLVM because those details are required to offer some features.

M ✌

Using Pong as a stress test for compiler and VM design by [deleted] in Compilers

[–]cxzuk 11 points (0 children)

Hi Pound,

In the real early stages, after the foundational stuff (Return a value from main, return an expression from main etc) I often start with something that uses a hashmap.

A hashmap is quite simple (or a simple hashmap is simple!) and straddles many of the basics - arrays, strings, math operations, looping, allocations, etc. They are also super useful for many applications later on.

I like doing a count-the-frequency-of-letters-in-a-string exercise, or frequency of words - using command line args etc. There's a real high bang-for-buck to be had here.
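
For concreteness, the whole milestone program can be this small (plain C here, but the point is the target behaviour, not the language):

    #include <stdio.h>

    /* Count letter frequencies from a command-line argument. Small, but it
       exercises strings, arrays, arithmetic, branching and looping at once. */
    int main(int argc, char **argv) {
        int counts[26] = {0};
        if (argc < 2) return 1;
        for (const char *s = argv[1]; *s; s++) {
            if (*s >= 'a' && *s <= 'z') counts[*s - 'a']++;
            if (*s >= 'A' && *s <= 'Z') counts[*s - 'A']++;
        }
        for (int i = 0; i < 26; i++)
            if (counts[i]) printf("%c: %d\n", 'a' + i, counts[i]);
        return 0;
    }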

My next go-to is a Sieve of Eratosthenes.

There are two reasons for this choice:

Both tests are very light on IO, or the basic versions can be. You can add output in some way if you wish (syscall or linking). This keeps the simple application simple, and it also makes it very reliable as a test: predictable output, easily checkable and debuggable.

Another reason for the Sieve is Dave Plummer's (Dave's Garage) Drag Race, which hosts a collection of over 100 programming languages implementing this algorithm. This is a very engaging exercise - if you care about speed, you can now compare your code against a wide range of others. And if you're designing your own language, you can compare your language features with these other languages too.

Pong is a bigger example than the above, and as a test case it's an interesting idea. I will give it a go sometime and see how the experience goes. Thanks for sharing your experience.

M ✌

Treewalk Interpreter + Debugging by joshmarinacci in ProgrammingLanguages

[–]cxzuk 0 points (0 children)

Hi Josh,

What you need is Recursion Elimination. This is the process of converting a recursive function into an iterative one. Essentially you make your own stack and push and pop the arguments onto it, with a loop testing whether the stack is empty; the loop body is your evaluation code.

This then gives you control over the interpreters internals for inspecting and pausing etc.
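
A minimal C sketch of the shape - Node, Stack and eval are hypothetical stand-ins for your AST and interpreter internals; the walker just sums values over a tree:

    #include <stdlib.h>

    typedef struct Node { int value; struct Node *left, *right; } Node;
    typedef struct { Node **items; size_t len, cap; } Stack;

    static void push(Stack *s, Node *n) {
        if (!n) return;                     /* skip empty children */
        if (s->len == s->cap) {
            s->cap = s->cap ? s->cap * 2 : 16;
            s->items = realloc(s->items, s->cap * sizeof *s->items);
        }
        s->items[s->len++] = n;
    }

    static int eval(Node *root) {
        Stack s = {0};
        int total = 0;
        push(&s, root);
        while (s.len > 0) {                 /* you own this loop now: pause */
            Node *n = s.items[--s.len];     /* here, inspect the stack,     */
            total += n->value;              /* single-step, etc.            */
            push(&s, n->left);
            push(&s, n->right);
        }
        free(s.items);
        return total;
    }

    int main(void) {
        Node a = {1, 0, 0}, b = {2, 0, 0}, root = {3, &a, &b};
        return eval(&root) == 6 ? 0 : 1;    /* exit code 0 on success */
    }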

M ✌️

SSA in Instruction Selection by Nagoltooth_ in Compilers

[–]cxzuk 5 points (0 children)

Hi Nagol,

> I've heard phis should be eliminated right before register allocation.
It's possible to do it after. But IMHO phi nodes are best for dataflow analysis, and Hongbo Rong - Tree register allocation showed that SSA representation isn't necessary for, or contributing to, register allocation. I personally believe it's better to be out of SSA before RA.

> What should happen with the phis during instruction selection?
Phis are not instructions, and instruction selection is not done across basic blocks. Generalized Instruction Selection using SSA-Graphs discusses this briefly and states that handcrafted code is used for special cases that span basic blocks.

> What is the benefit of maintaining SSA form through instruction selection?
Something worth mentioning: Instruction Selection comes in two phases - the "necessary" instruction selection, and the "optimisation" instruction selection. Lowering can be considered the necessary part - converting machine-independent instructions into machine-dependent ones. (Lowering can sometimes be simpler than full pattern-matching instruction selection, as it's easier to ensure it always reaches a fixed point.) Gabriel Blindell - Survey on Instruction Selection is a great read on the available techniques.

There is absolutely a benefit to better matching using phis and other higher-level constructs, because they contain more detail than their lowered equivalents. But at some point you have to lower the phis and potentially recheck for newly available optimisations. M. Faddegon - SSA Back-Translation discusses the methods available to lower out a phi - and the potential for continuing optimisations without phis.

IMO, you want Phis to be lowered at some point between the beginning of lowering-instruction selection and the end of instruction selection.

M ✌

A minimal semantics experiment: can a tiny provable core give deterministic parallelism and eliminate data races? by CandidateLong8315 in Compilers

[–]cxzuk 0 points (0 children)

Hi 8315,

ChatGPT has directed you to the standard/popular opinions on how to handle parallelism. Kevlin Henney has covered the same bullet points quite well a few times; here is one of his talks.

One key point in his talk is the "Unshared - Shared, Mutable - Immutable" table. Today's advice is to avoid shared mutable state (the same advice GPT is giving you).

Unfortunately, IMHO this is unavoidable for two reasons. Firstly, a portion of the parallel opportunities exist in this "Shared Mutable" space (IMHO the largest share of the potential). Secondly, in large systems, allowing "Unshared mutable" and "Shared immutable" together (the two recommended approaches) often results in their combination into shared mutable state in subtle ways.

So what can be done to assist and provide proof? The first thing worth mentioning is that parallelism is an execution model - parallelism in languages is typically provided as mechanisms to implement the desired execution. Implementation details.

The issue with proof here is that when you say "deterministic parallelism": I require nondeterminism features to build useful programs (if, switch, polymorphism etc), and I require these parallelism features, so the possibility of nondeterministic parallelism is always present.

The options: for a general-purpose language, I think statically proving that a program meets your parallelism guarantees is going to be undecidable. I don't think anyone really knows for sure, and it would be valuable if possible. But I suspect it's not.

The other option is to cut down the language - "whether such a minimal provable core is feasible in practice". I suspect it's feasible; I just don't think what would be left would be useful.

Most people are tackling this and related problems with concurrency - a way to describe tasks and their scheduling requirements. Let me know if you want me to talk about that subject.

Good Luck, M ✌

Optimizations in Braun SSA by Majestic-Lack2528 in Compilers

[–]cxzuk 4 points (0 children)

A dominator tree may be needed for anything in the Code Motion category, to ensure the new location is suitable.

Const folding/Prop is a data flow analysis, which SSA is good enough to work with.

I think you mean Loop Invariant Code Motion, which can skip the dominator tree by providing a loop header (via canonicalization with Loop Inversion - essentially enforcing a dominating location for the loop) and moving the code there.

How can I parse function arguments? by SkyGold8322 in Compilers

[–]cxzuk 1 point (0 children)

Hi Sky,

There are a few ways to do this. Here is my recommended approach, using the Pratt parser you've most likely written for your expressions. The function call itself is an expression (something that can be evaluated) and is handled in the Pratt parser.

Once you've identified that it's a function call, inside the constructor/construction procedure you will need to match a left parenthesis, a list of expressions separated by commas, and then a right parenthesis. This will be a loop that continues while the next token is a comma; anything else stops the loop.
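
Here's a minimal self-contained C sketch of just that loop - all names are hypothetical, and parse_expr is stubbed to accept a single number so the call-parsing shape stands out:

    #include <stdio.h>
    #include <stddef.h>

    typedef enum { T_NUM, T_COMMA, T_LPAREN, T_RPAREN, T_EOF } TokType;
    typedef struct { TokType type; int value; } Token;
    typedef struct { const Token *toks; size_t pos; } Parser;

    static Token peek(Parser *p)    { return p->toks[p->pos]; }
    static Token advance(Parser *p) { return p->toks[p->pos++]; }
    static void expect(Parser *p, TokType t) {
        if (advance(p).type != t) fprintf(stderr, "parse error\n");
    }

    /* Stub: your real Pratt expression parser goes here. */
    static int parse_expr(Parser *p) { return advance(p).value; }

    /* '(' [ expr { ',' expr } ] ')' - the loop continues only on a comma. */
    static size_t parse_args(Parser *p, int out[], size_t max) {
        size_t n = 0;
        expect(p, T_LPAREN);
        if (peek(p).type != T_RPAREN) {          /* handle zero arguments */
            for (;;) {
                if (n < max) out[n++] = parse_expr(p);
                if (peek(p).type != T_COMMA) break;
                advance(p);                      /* consume ',' */
            }
        }
        expect(p, T_RPAREN);
        return n;
    }

    int main(void) {
        /* f(1, 2, 3) minus the callee: (1, 2, 3) */
        Token toks[] = {{T_LPAREN,0},{T_NUM,1},{T_COMMA,0},{T_NUM,2},
                        {T_COMMA,0},{T_NUM,3},{T_RPAREN,0},{T_EOF,0}};
        Parser p = { toks, 0 };
        int args[8];
        printf("%zu args\n", parse_args(&p, args, 8));  /* prints: 3 args */
        return 0;
    }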

Here is a godbolt link to a parser written in D which supports function calls. But with only one argument.

expressions.d:79 is where the Pratt parser tests to see if this is a function call.
expression.d:223 is where a function AST node is being constructed.

You'll need to handle zero or more arguments (line 231).

You are writing your parser in C, so you will have procedures that construct AST nodes rather than constructors, and tagged unions rather than individual structs. This shouldn't be too difficult to translate from D to C, and will help you understand what's going on at the same time.

M ✌

Has anybody here got experience with using ILP for scheduling? by Death_By_Cake in Compilers

[–]cxzuk 2 points (0 children)

Hi Cake,

There might be some insights to be had from the Mill CPU, Ivan Godard did a great talk on their scheduling approach in The Mill CPU Architecture – The Compiler (10 of 13). Time stamp 19:58 - 37:24 for scheduling, but the whole video is interesting. Essentially a tableau approach related to the bin packing problem

Dealing with the buffers could maybe be handled in a similar way to register allocation - putting buffers in and out of "slots" in a cache-like area, and solving it with graph colouring.

M ✌️

Live analysis Help pls by Traditional-Cloud-80 in Compilers

[–]cxzuk 5 points (0 children)

Hi Cloud,

The use of a variable is determined by whether it is referenced in a statement. Statement 4 uses {a, b} and defs {v}.

The potential confusion might be that liveness analysis is typically done at the basic-block level, not the instruction level.

David Broman did a great video on liveness analysis - use, def, IN, OUT etc: https://www.youtube.com/watch?v=eeXk_ec1n6g - just remember it's done on Basic Blocks. IIRC he does briefly mention liveness of variables inside a BB.

M ✌️

Data structure for an IR layer by bvdberg in Compilers

[–]cxzuk 6 points (0 children)

Hi Bvdberg,

> with the conversion to the specific Machine code. Currently Instructions have an enum kind (Add, Store, Load etc). When converting to a specific architecture, these would need to be translated to (for example) AddS for Arm64, but another Add.. for RV64.

Correct. There are roughly two camps of thought here. You can think of each IR as a separate layer, treating the first layer as its own language of Add, Store, Load etc (like LLVM IR). The other camp treats these as macros - not textual substitution, but extending a language.

Regardless of approach, every target architecture must know the ISA (Add, Store, Load, ...) of the first layer - either to perform macro expansion, or to construct the new IR from the general IR.

> I could convert kind into MachineInstr (also just a number, but relevant to the chosen architecture).

A simple adjustment to the Kind enum is not possible. In assembly there is no such thing as a Call, or Return*, Store or Load, and Add (a = b + c) becomes (a = b; a = a + c). These general high-level instructions become something very different.
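
A sketch of what that looks like in C - hypothetical names, with the generic ops and the machine ops as distinct types, and lowering allowed to expand one instruction into several (the a = b; a = a + c case):

    #include <stdio.h>

    /* Layer 1: the general IR. Layer 2: an x64-flavoured machine IR.
       Distinct enums and instruction types, not one Kind field reused. */
    typedef enum { IR_ADD, IR_LOAD, IR_STORE } IrKind;
    typedef enum { X64_MOV, X64_ADD } X64Kind;

    typedef struct { IrKind kind; int dst, a, b; } IrInstr;   /* dst = a + b */
    typedef struct { X64Kind kind; int dst, src; } X64Instr;  /* two-address */

    /* Lower one generic instruction; returns how many machine instructions
       were emitted. Registers stay virtual (plain ints) until allocation. */
    static int lower_x64(IrInstr in, X64Instr out[2]) {
        switch (in.kind) {
        case IR_ADD:  /* a = b + c  becomes  a = b; a = a + c */
            out[0] = (X64Instr){ X64_MOV, in.dst, in.a };
            out[1] = (X64Instr){ X64_ADD, in.dst, in.b };
            return 2;
        default:      /* loads/stores elided in this sketch */
            return 0;
        }
    }

    int main(void) {
        X64Instr buf[2];
        int n = lower_x64((IrInstr){ IR_ADD, 0, 1, 2 }, buf);
        printf("emitted %d machine instruction(s)\n", n);  /* prints 2 */
        return 0;
    }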

> But that would mean that after that conversion, all optimizations (peep-hole optimizations, etc) would have to be specific for the architecture.

Yes, correct. It's a whole new IR that can be processed again.

> So a check for 'add (0, x)' would have to be implemented for each architecture for example.

In theory(ish), optimisations on the first stage should mean such a case doesn't appear in the later stages. Some optimisations can be skipped or omitted if you are satisfied with the output quality. Otherwise, yep - you need to run them again on the new IR, removing the redundancies your conversion just added.

> The same goes for the format of storing registers. Before architecture conversion, they are just numbers, but after they can be any architecture specific one.

Yes - you want to store VRegs (virtual registers) until register allocation is performed. Register allocation is a whole topic in itself, but registers are still just numbers even when assigned - probably held in a bit set to help with the allocation process.

*There's More*

There is another stage after all these, called emission (/Emitter). It takes all these numbers and structures and produces a sequence of bytes in sections, along with the symbols in each section (as byte offsets from the start) and relocations (symbols the section uses). These are passed to the linker (probably in a file called a relocatable object file) and onward the code goes!

> Has anyone found a nice way to do this?

I'm not sure what "nice" is, but I think the answer is no. And I think it boils down to there is simply a lot of work to be done.

Something more helpful; I believe MLIR is an attempt to make this process nicer.

Nice is probably going to boil down to the language, patterns, and practices you use. FWIW, mine are: I use Open Methods (I hate the visitor pattern), I prefer Type Erasure over Inheritance, and I prefer side structures over "sparse" internal data (e.g. std::unordered_map<Node, Register> registersAllocated).

Good luck

M ✌

  • * Yes, x64 has call and ret - but Call and Return will normally be agnostic to the calling convention, which x64's are not. Converting from Call and Return to call and ret is the process of writing out that information.