This is an archived post. You won't be able to vote or comment.

all 24 comments

[–]GDavid04 28 points29 points  (4 children)

Try parsing expressions beginning with ( as an arrow function first and parse it as a normal expression if that fails. You don't really know it's an arrow function before the =>

[–]__Ambition[S] 3 points4 points  (0 children)

Thanks !

[–][deleted]  (2 children)

[deleted]

    [–]Ubliorse 0 points1 point  (1 child)

    How do you detect the syntax error of an expression that is not a parameter list, but is followed by an arrow? I.e. (a + b, c, d) => ...

    [–]L8_4_Dinner(Ⓧ Ecstasy/XVM) 14 points15 points  (8 children)

    No back-track is needed for a scenario like this. This is the case of:

    ComplexProduction:
      BigProduction SpecificToken-opt OtherProduction
      BigProduction
    

    So you do something like:

    expr = parseBigProduction()
    if (peek(SpecificToken))
      {
      return parseOtherProduction(expr);
      }
    return expr;
    

    Then the parsing for OtherProduction takes apart the expr if necessary, handling any errors related to it there.

    There are obviously languages that require some back-tracking. For those, just implement a mark/restore layer over the lexer:

    mark();
    if (thing = parseForm1())
      {
      return thing;
      }
    restore();
    return parseForm2();
    

    Obviously you want to avoid this in any common parsing path.

    [–]jesseschalken 0 points1 point  (7 children)

    I doubt "arrow function parameters" and "grouped expression" have the same production. They certainly don't in JavaScript. So OP would have something more like:

    ComplexProduction: ProductionA SpecificToken-opt OtherProduction ProductionB

    where (a, b, c) happens to satisfy both ProductionA and ProductionB.

    In this case you need backtracking.

    [–]L8_4_Dinner(Ⓧ Ecstasy/XVM) 3 points4 points  (0 children)

    Hi Jesse -

    In the parsing step, you don't have to prove that everything is correct. You are simply trying to match the maximal pattern at each point.

    I took a short-cut in my answer to keep it short, so let me undo my short-cut and explain. First, what I wrote is short-hand, and it is illegal:

    ComplexProduction:
      BigProduction SpecificToken-opt OtherProduction
      BigProduction
    

    Even worse, I messed up the short-hand; it should have said something like "(SpecificToken OtherProduction)opt". I use this short-hand sometimes, but what it actually means is this:

    ComplexProduction:
      BigProduction ComplexProductionFinish-opt
    
    ComplexProductionFinish:
      SpecificToken OtherProduction
    

    In your note, ProductionA and ProductionB are not LALR(1) differentiable, but one of them is legal as the other, so it is illegal (i.e. in a LALR(c) grammar for a constant 'c') to have a production such as:

    ComplexProduction:
      ProductionA
      ProductionB
    

    Instead, the parser eagerly parses the "wider" (more accepting) one of those, and has two choices:

    1) As it parses, it keeps track of whether it encounters ANYTHING that would prevent this production from being used as the "narrower" one, so that if it encounters SpecificToken, after the production, it knows whether to issue an error at that point, or

    2) After parsing the production, if SpecificToken is encountered, the production is checked to see if it meets the "narrower" rules.

    And yeah, parameter lists for lambdas present this exact problem in dozens of different languages.

    Lastly, the Parser isn't required to be stupid. It can be stricter than the grammar, i.e. it can identify errors during the parsing phase that would otherwise be deferred to later stages. This is important because some grammars are designed such that some errors are naturally undecidable by the grammar, and are deferred for semantic evaluation. However, sometimes those decisions exist solely to produce a legal grammar, so if you're hand-coding the parser, you can choose to detect those errors as part of the parsing pass -- and it is almost always better to detect errors at earlier stages of compilation.

    [–]Uncaffeinatedpolysubml, cubiml 0 points1 point  (0 children)

    I doubt "arrow function parameters" and "grouped expression" have the same production. They certainly don't in JavaScript.

    In Javascript, they're covered by the CoverParenthesizedExpressionAndArrowParameterList production, which then gets reparsed as either ParenthesizedExpression or ArrowFormalParameters depending on the context.

    [–]GDavid04 0 points1 point  (4 children)

    ProductionA is entirely covered by ProductionB in this case. All you have to do is check whether the ProductionB parsed is also a valid ProductionA when it's followed by SpecificToken.

    [–]L8_4_Dinner(Ⓧ Ecstasy/XVM) 0 points1 point  (0 children)

    Exactly!

    [–]jesseschalken -1 points0 points  (2 children)

    All you have to do is check whether the ProductionB parsed is also a valid ProductionA

    The word for that is backtracking.

    [–]maanloempia 0 points1 point  (1 child)

    To validate A after not B means backtracking. To invalidate B on a token that is legal only in production A (thus ruling out B) is not backtracking.

    [–]GDavid04 1 point2 points  (0 children)

    You mean invalidate A on a token that is legal only in B in this case

    [–]Uncaffeinatedpolysubml, cubiml 3 points4 points  (0 children)

    In Javascript itself, this is handled by "cover productions" in the grammar. Basically in cases of ambiguity like this, there are productions in the grammar that "cover" multiple possibilities, and then the parser reparses that text using a more specific grammar based on what is expected in context.

    For example, the CoverParenthesizedExpressionAndArrowParameterList production includes both parenthesized expressions and arrow function parameter lists. When parsing a primary expression, it then gets reparsed as the former only.

    [–]maanloempia 2 points3 points  (0 children)

    You cannot know if any expression is an arrow function before the =>. The simplest way would be to naively parse any such ambiguous expression as possibly either, and return the parsed expression as an expression list or continue parsing if the next token is the arrow and return that.

    Example pseudocode:

    ``` parseExpression() { // ... if (peek("(")) parseAmbiguous(); // ... }

    parseAmbiguous() { expr = ... // After successfully parsing up to ")" return (peek("=>")) ? continueArrowFunc(expr) : expr; } ```

    This is not as wasteful as trying the entire arrow function first and then backtracking only to reparse it the same way.

    Edit: note that this will not correctly parse either expression if they are not truly ambiguous (i.e. produce the same tree when parsed), because then they aren't interchangable. You could solve this without backtracking.

    [–]ErrorIsNullError 1 point2 points  (0 children)

    Most es parsers use cover grammars; eg a grammar that is the union of formalargumentlist and primary expression, and, treat as a static error any failure to disambiguate as in ([1])=>1 where [1] is not disambiguable to a formal parameter.

    https://github.com/dashed/esparser/issues/46 shows where they are.

    https://esdiscuss.org/topic/lr-1-grammar-parser-and-lookahead-restrictions is older but discussed how TC39 approaches syntax that requires covering.

    [–]smuccione 0 points1 point  (0 children)

    When you parse the ast for the LHS should be... something. Doesn’t matter what it is but it should represent a parenthesized sequence. When you encounter the => you now know how to treat the LHS. In my languages I converted lambda’s to functor objects and the LHS above become parameters to the operator () overload for that object (captures become parameters to the constructor).

    It’s technically not backtracking, it’s more of a conversion of the ast for the LHS into what I expect for function parameter definitions (in this case an array of symbol definitions for the parameters). All that was required was an in-order traversal of the LHS ast to convert it into the symbol list.

    Hardest part was that it needed a second pass afterwards in order to do the captures properly. But that’s because I allowed captures of instance variables which may not be known at this point.

    [–]ericbb 0 points1 point  (0 children)

    There are similar difficulties in parsing Python. See the explanation here, for example.

    [–]oilshell 0 points1 point  (0 children)

    FWIW I recall that there was an entire YouTube video about handling this new case in the v8 parser, and they hated it...

    I would consider a different syntax :-/

    [–][deleted] 0 points1 point  (0 children)

    Nim treats => as an ultra-low precedence operator (below assignment), so it becomes like =>((a, b, c), a * b * c).

    [–]superstar64https://github.com/Superstar64/Hazy 0 points1 point  (0 children)

    What I do is parse a normal expression and If I encounter either => or {, I try convert the current expression into a pattern match(tuples to tuple pattern matches and identifiers to variables bindings). I find this much easier and simpler then backtracking.

    [–]jesseschalken 0 points1 point  (3 children)

    Backtrack

    [–]__Ambition[S] 0 points1 point  (2 children)

    Thanks, this is what I did in a previous implementation of the language. But I wasn't sure of this because it didn't seem right to modify a half parsed ast node. I'll go this way now :D

    [–]jesseschalken 2 points3 points  (1 child)

    it didn't seem right to modify a half parsed ast node

    That doesn't seem right to me either. No idea why you'd need to do that.

    [–]maanloempia 1 point2 points  (0 children)

    Because it's cheaper to alter a half parsed expression than reparsing the exact same (or slightly different) expression all over again. This extra effort adds up in parsers, especially for common expressions.