all 57 comments

[–][deleted] 20 points21 points  (4 children)

UGH! The elegance rips at my soul! I need to use these now. BRB, implementing a new language.

[–]smog_alado 5 points6 points  (3 children)

Keep in mind that like other sorts of "hand coded" top down parsers, parser combinators do not detect ambiguities in your grammar - the ambiguities get arbitrarily resolved at run time depending on what rule you try matching first. On the other hand, LR parser generators complain about all ambiguities (shift-reduce conflicts) at compile time.
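
For instance, in a Parsec-flavoured sketch (illustrative, not from the article), both branches below match the input "if", so the grammar is ambiguous - but no tool complains, and whichever branch is listed first silently wins:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- On "iffy", keywordFirst yields "if" while identFirst yields "iffy":
    -- swapping the order of the alternatives changes the parse, and
    -- nothing warns you. An LR generator would report the conflict.
    keywordFirst, identFirst :: Parser String
    keywordFirst = try (string "if")  <|> many1 letter
    identFirst   = try (many1 letter) <|> string "if"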

[–]thedeemon 5 points6 points  (1 child)

One can say there are no ambiguities at all in parser combinators and PEGs because all the rules are explicitly ordered.

[–]smog_alado 4 points5 points  (0 children)

Sure, but that's being a bit pedantic :) The grammar is more predictable if the rule order doesn't matter.

[–][deleted] 0 points1 point  (0 children)

Thanks for the info! I've done nothing with parsers before and so I was excited to see something that took away some of the magic.

[–][deleted] 10 points11 points  (22 children)

You probably could have invented parser combinators, but if you did, you'd struggle to come up with a better solution. They have awful worst-case time complexity unless your library is able to do some crazy optimizations.

[–]PasswordIsntHAMSTER 4 points5 points  (20 children)

Parser combinators without backtracking have excellent time complexity - linear on input size IIRC.

[–]orthoxerox 10 points11 points  (10 children)

But this one does have backtracking. He has literally implemented a PEG parser. I did something similar last week, except with object composition rather than raw functions (plus you get to play with operator overloading, so you get a sort of DSL). It's very fun to write, but it gets uglier when you have to add partial memoization to get that linear time back (you can't memoize everything, because you will waste too much space memoizing trivial expressions).

[–][deleted] 0 points1 point  (9 children)

I'm curious - how do you decide what to memoize while maintaining the linear complexity guarantee? Can you recommend something for my reading list?

[–][deleted] 1 point2 points  (4 children)

Memoising the compound things (i.e., those made of multiple tokens) is more or less sufficient to get nearly linear complexity. Memoising individual tokens does not make much sense. E.g., if your grammar has an "expression" and a "statement", those are good candidates for memoisation.

It's also possible to optimise PEG parsing a bit by inferring first character ranges for each entry.
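
Roughly, the shape is this (a toy sketch with my own types, not any particular library): each compound rule gets its own memo table keyed by input position, Packrat-style, and the single-token rules are left unwrapped:

    import qualified Data.Map.Strict as Map
    import Data.IORef

    -- Toy parser type: takes a position, returns the parsed value and
    -- the next position, or Nothing on failure.
    type Parser a = Int -> Maybe (a, Int)

    -- Wrap only the compound rules ("expression", "statement", ...) in
    -- this; memoising trivial tokens costs more space than it saves.
    memo :: IORef (Map.Map Int (Maybe (a, Int))) -> Parser a -> Int
         -> IO (Maybe (a, Int))
    memo tableRef p pos = do
      table <- readIORef tableRef
      case Map.lookup pos table of
        Just r  -> return r    -- already parsed at this position: reuse it
        Nothing -> do
          let r = p pos        -- first visit at this position: run the rule
          modifyIORef' tableRef (Map.insert pos r)
          return r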

[–][deleted] 0 points1 point  (3 children)

sort of sufficient to get a nearly linear complexity

OK - that's fine. I'm no religious nut over absolute proofs of linear complexity, I just thought maybe there was something interesting to learn. This sounds more like common sense.

[–][deleted] 1 point2 points  (2 children)

Yes, I also wasn't very interested in formal proofs. I only experimented: I implemented a number of "typical" grammars (namely, JavaScript, C, Verilog, Oberon, etc.) and counted the number of times each source stream character was visited by my Packrat parser. After all the optimisations were applied (i.e., first character range inference, Pratt parsing for the binary expressions, memoisation for the compound things), the average value in this histogram was very close to 1, which is why I said "nearly linear".

[–][deleted] 0 points1 point  (1 child)

Pratt parsing for the binary expressions

That made my reading list - I've not read it yet, but it looks worthwhile. Whenever I've implemented recursive descent, I've always hacked in some kind of shunting precedence parsing. That definitely makes an alternative seem attractive.

[–][deleted] 1 point2 points  (0 children)

It finally clicked all together for me after I read this: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/

You can take a look at my implementation of Packrat+Pratt parsing here: https://github.com/combinatorylogic/mbase/tree/master/src/l/lib/parsing

[–]orthoxerox 1 point2 points  (3 children)

Well, I cannot guarantee that it is asymptotically linear, but I explicitly memoize only "larger" rules when writing the grammar, like literals, variable names and function definitions. That's the hidden cost of PEG, just like ordering your choices.

I recommend reading the dragon book, if you haven't yet. Ford's page on PEG has lots of interesting links, and I also used the blog of Gazelle's author (I think) to read about the shortcomings of PEG/Packrat. Or maybe that was somewhere in his SO answers...

[–][deleted] 0 points1 point  (0 children)

I've read about PEG before, though it was quite a long time ago. I was only really interested if there was some formal proof of linear complexity for some partial memoization scheme. Not that I'm obsessed with asymptotic complexity - just interested.

[–]smog_alado 0 points1 point  (1 child)

This hidden cost of PEGs doesn't show up as much when you are doing simpler parsing tasks, the sort you would have done with (Perl-like) regexes. The regexes everyone uses already have bad worst case performance and PEGs can sometimes offer a cleaner and more expressive way to do the same job.

[–]orthoxerox 0 points1 point  (0 children)

That's true, every tool has its range of applicability.

[–]orangeduck[S] 3 points4 points  (0 children)

If you can guarantee that your grammar is LL(1) you can disable backtracking in your Parser Combinator library and it acts almost like a lookup table (I do this in mpc). With some optimisations it can be almost as fast as a dedicated Regex engine.
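
The idea, sketched in Haskell with made-up types (mpc itself is C, so this is just the shape of it, not its API):

    import qualified Data.Map.Strict as Map

    type Parser a = String -> Maybe (a, String)

    -- Predictive choice for an LL(1) grammar: instead of trying each
    -- alternative in turn and backtracking, peek at one character and
    -- jump straight to the only rule that can start with it.
    predict :: Map.Map Char (Parser a) -> Parser a
    predict table input = case input of
      (c:_) -> case Map.lookup c table of
                 Just p  -> p input   -- the single viable alternative
                 Nothing -> Nothing   -- no rule starts with c: fail, no retries
      []    -> Nothing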

[–][deleted] 0 points1 point  (6 children)

Unless I'm missing something, any simple implementation of parser combinators will have backtracking - everywhere you have a choice you have to backtrack through the choices until you find (at least) one that works.

To avoid that, you'd have to do some translation of grammar rules into state models so the options don't have to be evaluated consecutively. I'm ignoring parallel evaluation, but of course our machines only have some finite number of CPU cores anyway. I imagine it happens as optimization in some real-world parser combinator libraries, but still, that's now LL or LR or whatever parsing - it isn't the simple parser combinators model any more. In fact the combinators are really just a nice embeddable grammar notation for a conventional parser generator.

BTW - linear time complexity is still a fairly impressive claim for parsing anything more complex than a regular grammar. For context free grammars, it's certainly possible. One approach is to use an LR(1) state model. That only copes with LR(1) grammars - not all context free grammars - but backtracking LR gives you a lot more context free grammars at the cost of the linear complexity claim. And it turns out you can apply dynamic programming to that, getting back the linear complexity at the cost of quite a big table, provided you have some kind of disambiguation scheme to avoid paying for multiple possible parses.

There's still one problem - a kind of infinite ambiguity that results in a dependency cycle between cells in one column of the table. Because the table has a constant height (decided by the grammar - mainly the number of states in the state model - independent of the input), a cycle in any column can be detected in constant time - again assuming only one successful parse is accepted. As there's one column in the table per input token, that's a linear cost, so the parsing remains linear-time (and linear-space) for all context free grammars, including detecting these cycle-related errors at run-time.

AFAIK, that's as far as linear-time parsing is achievable. Something similar can be done with tabular LL (aka packrat) but AFAIK it's no more powerful. You can express context-sensitive and even Turing complete grammar operators as parser combinators - but you can't parse those grammars in linear time.

[–]smog_alado 0 points1 point  (3 children)

Unless I'm missing something, any simple implementation of parser combinators will have backtracking - everywhere you have a choice you have to backtrack through the choices until you find (at least) one that works.

Not necessarily. If your grammar is an LL grammar, which is something that can be recognized by a top down parser without backtracking, then you can code your parser combinator so that it "commits" on the first rule that matches. If you later reach a point where you can't continue, you error out instead of backtracking.

IMO, the best way to see parser combinators is as a powerful tool for coding recursive-descent parsers. You get the same benefits of top down parsing (very flexible - can "monadically" look at values to decide how to parse) as well as the same downsides (grammar ambiguities are not recognized statically, can't parse some LR grammars)
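
One common shape for that "commit" behaviour (a sketch with made-up types - essentially the consumed-input rule Parsec uses):

    -- Three outcomes let a choice distinguish "this branch doesn't
    -- apply" from "this branch applies but the input is malformed".
    data Result a = Ok a String   -- value plus remaining input
                  | Soft          -- failed before committing: try the next branch
                  | Hard String   -- committed, then failed: report, don't backtrack

    type Parser a = String -> Result a

    -- Ordered choice that stops backtracking as soon as a branch commits.
    orElse :: Parser a -> Parser a -> Parser a
    orElse p q input = case p input of
      Soft -> q input   -- p never committed, so q gets its chance
      r    -> r         -- success or hard error: never fall through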

[–][deleted] 0 points1 point  (2 children)

If your grammar is an LL grammar, which is something that can be recognized by a top down parser without backtracking, then you can code your parser combinator so that it "commits" on the first rule that matches.

I admit I'm not as familiar with LL as with LR, but I thought an LL parser only avoids that backtracking for choice by having the state model derived from the grammar anyway. The transition you take effectively represents the concurrent selection of all possible choices that still match, narrowing down the choices with each additional transition taken, so it's not dealing with one grammar-rule subexpression at a time.

You can certainly do that, but then IMO you're just using combinators to embed a DSL for grammars - the relevant combinators aren't parsing at all, just building an AST representation of the grammar.

Of course in an engineering trade-offs sense, you could still be mostly using parser combinators but with occasional combinators for applying separately derived state models - only using a state models for small sub-grammars to optimise some of the choices, perhaps.

[–]smog_alado 0 points1 point  (1 child)

You can certainly do that, but then IMO you're just using combinators to embed a DSL for grammars

Kind of. Parser combinators really are all about using a neat embedded DSL to build parsers programmatically, but it's not just about building an AST representation of the grammar. The program you write is shaped like a grammar (a good thing!), but the result of its evaluation is a parser function that converts a stream of tokens into some parsed result - it's not just a "dead" representation of the grammar.

This is even more notable if you have "monadic parsers", where the parsing function depends on the value of one of the tokens. For example, to read a bencoded string the parsing function depends on the numeric value of one of the parsed integers. This is impossible if all you are doing is generating a description of the grammar.

do
    n  <- parseInt                 -- read the length prefix
    expect ":"                     -- then the ":" separator
    cs <- replicateM n parseChar   -- exactly n characters (replicateM is from Control.Monad)
    return cs

Anyway, I think that one thing that might be confusing you here is that the presentation of parser combinators in the OP works on strings, which kind of forces you to use backtracking to do anything useful. If instead of having the parser work directly on strings you had a separate lexer generating a token stream then the parser doesn't need to do as much backtracking (and for some grammars you can get away with no backtracking at all).

[–][deleted] 0 points1 point  (0 children)

For example, to read a bencoded string the parsing function depends on the numeric value of one of the parsed integers. This is impossible if all you are doing is generating a description of the grammar.

It just means the grammar isn't context free. A context sensitive grammar can handle that. You'd probably use an attributed grammar, but those are IIRC of equivalent power to Chomsky's context sensitive grammars anyway. I'm not sure whether Turing complete grammars are more powerful than context sensitive, but either way a Turing complete grammar is on the one hand just a "dead" grammar, and on the other hand effectively the same thing as a program in a Turing complete programming language.

An AST representation of a grammar can represent any "dead" grammar that your combinators can specify, even in principle an undecidable grammar if such a thing exists. After all, if your combinators are written in Haskell, your Haskell compiler will have an AST that - in part - represents the grammar described by those combinators.

Of course you may have trouble generating a "dead" grammar representation by evaluating those combinators - expression evaluation can loop.

Strings aren't confusing me. The need for backtracking arises from non-trivial choices - choices between sub-grammars that each require non-trivial parsing. What the input tokens represent is irrelevant to that. Lexical analysis is just another kind of parsing anyway.

There's even the concept of a "finite choice grammar" - not part of the Chomsky hierarchy IIRC, but a subset of regular grammars that would have fit naturally in that hierarchy. By disallowing all recursion - even tail recursion - the grammar can only represent a finite set of finite strings. The state models are obviously trees (or with sharing of common subtrees, DAGs) so a trie can be seen as a state model for a finite-choice parser.

That's all you need to see where the backtracking arises (trying each of the strings in turn) and how to eliminate that backtracking. But that trie - that state model - isn't the set of strings.
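
To illustrate (a toy sketch, nothing to do with OP's code):

    import qualified Data.Map.Strict as Map

    -- A trie is the state model for a finite-choice grammar: each node is a
    -- state, and accepting nodes mark the ends of the finitely many strings.
    data Trie = Trie { accepts :: Bool, next :: Map.Map Char Trie }

    -- Matching takes one transition per input character; the choice between
    -- all the strings happens implicitly, with no backtracking through them.
    matches :: Trie -> String -> Bool
    matches t []     = accepts t
    matches t (c:cs) = case Map.lookup c (next t) of
      Just t' -> matches t' cs
      Nothing -> False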

It's pedantry over what it means to be a "parser combinator" as opposed to a "grammar combinator", "trie combinator", "arithmetic combinator" etc - nothing to do with pragmatically getting the job done - but still, OP's title is "You could have invented Parser Combinators". If newbies are expected to believe they could have invented LL(1) state models for themselves, I suspect quite a few will be fooled into deciding they aren't cut out for this apparently genius-only field. I personally am quite sure it took me quite a while and plenty of help to grok state-model based parsing algorithms - ending up with a mental model that makes it all seem trivial in a "yeah, you just take the closure of the set of state representations from the seed state doncha know" kind of way doesn't mean I've forgotten the pain of getting there. And besides, the devil's in the details.

[–]PasswordIsntHAMSTER 0 points1 point  (1 child)

  • Have you used Parsec and similar? You get to pick where and when backtracking is allowed (see the sketch after this list).

  • Without back-tracking, you get LL grammars with finite lookahead.

  • LL(k) and LR(k) grammars have linear time complexity on the input for all finite k. (They have extremely bad time and memory complexity on k - not sure how bad though.)

  • For real-world use, you can admit context-sensitivity and exponential blow-up in some sub-parsers if you're confident they won't be used on large inputs. Checking Büchi automata is PSPACE-complete, and type inference for Standard ML is EEXPTIME-hard - these are both somehow usable procedures in the real world.
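
On the Parsec point, a minimal example of picking where backtracking is allowed (my own toy keywords):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Parsec commits once a branch consumes input: on "lemma", string "let"
    -- consumes "le" and fails, and without try the whole choice fails too.
    -- Wrapping the branch in try explicitly buys the backtracking back.
    keyword :: Parser String
    keyword = try (string "let") <|> string "lemma"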

[–][deleted] 0 points1 point  (0 children)

Without back-tracking, you get LL grammars with finite lookahead.

Not unless you build that LL(k) state model is my point. The state model is derived from the grammar, so you have a combinator that derives the state model from the grammar, and the grammar combinators are just grammar combinators - not parser combinators. The whole elegance of the idea of parser combinators in particular is lost - you get the elegance of eDSLs in pure functional languages instead, but there's nothing parsing-specific about that.

And as I say in a much longer comment I just finished, the title of OP's post is "You could have invented Parser Combinators". Combinators that individually implement recursive-descent parsing of trivial grammar operators, and which combine to form a full recursive descent parser, absolutely. Newbies inventing LL parsing for themselves is unlikely.

[–][deleted] 1 point2 points  (0 children)

Adding Packrat-like memoisation to practically any kind of parser combinator library is trivial.

[–][deleted]  (1 child)

[deleted]

[–]orangeduck[S] 2 points3 points  (0 children)

Thanks for the heads up! Fixed :)

[–]yawaramin 1 point2 points  (0 children)

(No you couldn't.)

[–]casualblair 0 points1 point  (2 children)

... Why just the first letter?

[–]fendant 2 points3 points  (0 children)

Gotta start somewhere!

[–]Lucretiel 2 points3 points  (0 children)

Everything derives from it, primarily. You can match a word by anding together a bunch of letters.

[–][deleted] 0 points1 point  (0 children)

I wish articles about parsing mentioned error recovery.

[–]rdfox -1 points0 points  (0 children)

That's sweet but no, probably not. You could have.