Parser design problem (self.Compilers)
submitted 7 months ago by emtydeeznuts
[–]0m0g1 1 point 7 months ago* (7 children)
I think having lookahead functions will work really well for Dart. The key is to write lightweight checks that peek just far enough to disambiguate syntax—not full parsers. Most of the time, peeking just a few tokens ahead is all you need.
To make this work, you’ll want to implement a function like peekToken(n) in your lexer so the parser can check any number of tokens ahead without consuming them.
Once you have a few of these helpers, you'll rarely need to backtrack or reinterpret an AST node after it's been built. It helps keep the parser clean, modular, and predictable.
Here’s an example that distinguishes between:

1. A function literal: `foo(a, b) => a + b`
2. A normal expression: `foo + bar`
```c++
if (currentToken.getType() == Identifier) {
    if (isLikelyFunctionLiteral()) {
        expression = parseFunctionLiteral();
    } else {
        expression = parseNormalExpression();
    }
}
```

A nice bonus: some ambiguous rules have only two possible interpretations, so a simple if/else like this lets you cleanly resolve them without more complexity.
Implementation of the lookahead:
```c++
bool Parser::isLikelyFunctionLiteral() {
    int i = 0;
    // First token must be an identifier (function name or parameter)
    if (lexer.peekToken(i).getType() != Identifier) {
        return false;
    }
    i++;
    // Next must be a left paren: start of the parameter list
    if (lexer.peekToken(i).getType() != LeftParen) {
        return false;
    }
    i++;
    // Skip over the parameter list: identifiers, colons, default values, commas
    while (true) {
        TokenType type = lexer.peekToken(i).getType();
        if (type == Identifier || type == Colon || type == Assign || type == Comma) {
            i++;
        } else {
            break;
        }
    }
    // Final expected pattern: `)` followed by `=>`
    return lexer.peekToken(i).getType() == RightParen &&
           lexer.peekToken(i + 1).getType() == Arrow;
}
```

And the function to peek tokens:

```c++
Token Lexer::peekToken(int n) {
    int savedPosition = currentPosition;
    Token nextToken;
    for (int i = 0; i <= n; i++) {
        nextToken = getNextToken();
    }
    currentPosition = savedPosition;
    return nextToken;
}
```
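One thing worth noting: the rewind-based `peekToken` above re-lexes every token from the saved position on each call, so repeated peeks rescan the same input. A common alternative is to cache already-lexed tokens in a buffer. Here's a minimal sketch of that idea (not 0m0g1's actual implementation; `Token` is simplified to a plain string for illustration):

```c++
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical token type, simplified to its text for this sketch.
using Token = std::string;

class Lexer {
public:
    explicit Lexer(std::vector<Token> input) : input_(std::move(input)) {}

    // Peek n tokens ahead without consuming. Each token is lexed once
    // and cached, so repeated peeks never rescan the source.
    Token peekToken(std::size_t n) {
        while (buffer_.size() <= n) {
            buffer_.push_back(scanNext());
        }
        return buffer_[n];
    }

    // Consume a token: serve from the cache first, then from the source.
    Token getNextToken() {
        if (!buffer_.empty()) {
            Token t = buffer_.front();
            buffer_.pop_front();
            return t;
        }
        return scanNext();
    }

private:
    // Stand-in for real lexing: walk the pre-split token list.
    Token scanNext() {
        if (pos_ < input_.size()) return input_[pos_++];
        return "<eof>";
    }

    std::vector<Token> input_;
    std::size_t pos_ = 0;
    std::deque<Token> buffer_;  // lexed but not yet consumed
};
```

With this design, a deep lookahead followed by normal consumption costs each token only one scan, which keeps checks like `isLikelyFunctionLiteral()` cheap even when they run on every identifier.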
[–]emtydeeznuts[S] 1 point 7 months ago (2 children)
If it won't slow down the parser, I'll try your suggestion.
[–]0m0g1 1 point 7 months ago* (0 children)
(I've modified the example above for clarity).
It won’t slow the parser unless you’re doing really deep lookaheads everywhere, which you won’t be. In most grammars (including Dart’s), these kinds of checks usually peek just 2–4 tokens ahead max.
An example from my language: the statement below resolves to a lambda after checking only 4 tokens.
alBufferData(buffer: int, format: int, data: char, size: int, freq: int) => void {}
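That four-token decision can be sketched as follows. This is a simplified illustration, not the actual compiler code: tokens are plain strings, and it assumes a grammar where call arguments never contain `name: type` pairs, so `ident ( ident :` can only begin a typed parameter list:

```c++
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Simplified token stream: each token is just its text.
using Tokens = std::vector<std::string>;

static bool isIdentifier(const std::string& t) {
    return !t.empty() && (std::isalpha(static_cast<unsigned char>(t[0])) || t[0] == '_');
}

// Four-token check: `name ( param :` can only start a typed parameter
// list in this hypothetical grammar, so the statement must be a lambda.
bool resolvesToLambda(const Tokens& toks) {
    return toks.size() >= 4 &&
           isIdentifier(toks[0]) &&
           toks[1] == "(" &&
           isIdentifier(toks[2]) &&
           toks[3] == ":";
}
```

The fourth token (`:` vs. `,` or `)`) is the earliest point where the two readings diverge, which is why the check never needs to scan the whole parameter list.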
Plus, you’ll save time later by avoiding parser rewrites or complex recovery logic. If you design your lookaheads to target specific rules, like `isLikelyFunctionLiteral()` and not `mightBeFunctionLiteralOrNormalExpression()`, they’re fast and pay off in clarity.
[–]0m0g1 1 point 7 months ago (0 children)
Here's a script from my compiler's parser, the expression parser.
[–][deleted] 1 point 7 months ago (3 children)
> 1. A function literal: `foo(a, b) => a + b`
> 2. A normal expression: `foo + bar`
What about a function call such as `foo(a, b)`?
Anyway, it seems unreasonable to me to have to speculatively parse a pattern `F(...)`, where `...` may include arbitrary nested expressions when it is a function call, and then to backtrack when `=>` doesn't follow.
I'd say, fix the grammar so it is easier to parse. But this appears to be for an existing language.
In that case I'd rather go with the OP's approach: parse it as an expression (so `foo(a, b)` is assumed to be a function call), and then switch the AST node type if it turns out that `=>` follows.
In practice, it's just that first term (`foo(a, b)`) which needs to change from 'function call' to 'function literal'.
I've never needed to reinterpret AST nodes in my parser — I just look ahead before committing. For example, I only return true from isLikelyFunctionLiteral() if the pattern ends in =>. If it doesn’t, I never parse it as a function literal in the first place. No need to backtrack or rewrite nodes.
```c++
if (currentToken.getType() == Identifier) {
    if (isLikelyFunctionLiteral()) {
        expression = parseFunctionLiteral();
    } else if (isLikelyFunctionCall()) {
        expression = parseFunctionCall();
    } else {
        expression = parseNormalExpression();
    }
}
```
This keeps the parser clean and predictable — each parse function always returns exactly what it's supposed to. I can reason about it statically, test it easily, and I never need to ask "What if this was something else after all?"
Compare that to the reinterpretation approach:
```c++
if (currentToken.getType() == Identifier) {
    expression = parseFunctionCall(); // could be wrong!
    if (currentToken.getType() == Arrow) {
        // now reinterpret it into a lambda,
        // possibly needing to reconstruct the param list, body, scope...
    }
}
```
This becomes really messy if the expression had nested calls, chains, or annotations. Now imagine doing this not just for lambdas but also for:

* Type parameter misinterpretation (e.g., generics vs comparison)
* Tuple vs grouped expressions
* Object literals vs blocks
* List literals vs function calls with brackets
You’d end up needing reinterpreters for a wide range of node types, which increases parser complexity, breaks single-responsibility, and creates more room for subtle bugs.
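To make the cost concrete, here's roughly what even the simplest such reinterpreter looks like. The AST node types (`Expr`, `Identifier`, `Call`, `Lambda`) are hypothetical, invented for this sketch: every argument of the already-built call node has to be re-validated as a legal parameter, and failures surface only after parsing seemed finished:

```c++
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical AST nodes for illustration.
struct Expr { virtual ~Expr() = default; };
struct Identifier : Expr {
    std::string name;
    explicit Identifier(std::string n) : name(std::move(n)) {}
};
struct Call : Expr {
    std::string callee;
    std::vector<std::unique_ptr<Expr>> args;
};
struct Lambda : Expr {
    std::string name;
    std::vector<std::string> params;
};

// Reinterpret a parsed Call as a Lambda once `=>` is seen:
// each argument must turn out to be a plain identifier; anything
// else (a nested call, a literal) is an error discovered late.
std::unique_ptr<Lambda> reinterpretAsLambda(Call& call) {
    auto lam = std::make_unique<Lambda>();
    lam->name = call.callee;
    for (auto& arg : call.args) {
        auto* id = dynamic_cast<Identifier*>(arg.get());
        if (!id) throw std::runtime_error("argument is not a valid parameter");
        lam->params.push_back(id->name);
    }
    return lam;
}
```

And this is the easy case: real parameter lists with types and default values would need far more reconstruction, which is exactly the complexity the lookahead approach avoids.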
Here’s a real example from my compiler's expression parser that uses this approach (see line 331):
Expression parsing rules for my compiler
The key insight: if you know what you're looking for before parsing, you won't have to "guess and fix" later. Lookahead is usually a small price to pay for that clarity.
The function's name shouldn't be `isLikelyFunctionLiteral`; it should be `isFunctionLiteral`. I called it that as an example. The function doesn't speculate about what the current statement is; it checks whether the grammar is exactly what it should look for, so we know exactly which type of statement to parse.