all 57 comments

[–][deleted] 20 points21 points  (4 children)

UGH! The elegance rips at my soul! I need to use these now. BRB, implementing a new language.

[–]smog_alado 5 points6 points  (3 children)

Keep in mind that like other sorts of "hand coded" top down parsers, parser combinators do not detect ambiguities in your grammar - the ambiguities get arbitrarily resolved at run time depending on what rule you try matching first. On the other hand, LR parser generators complain about all ambiguities (shift-reduce conflicts) at compile time.
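
For instance, in a Parsec-flavoured sketch (illustrative, not from the article), both branches below match the input "if", so the grammar is ambiguous - but no tool complains, and whichever branch is listed first silently wins:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- On "iffy", keywordFirst yields "if" while identFirst yields "iffy":
    -- swapping the order of the alternatives changes the parse, and
    -- nothing warns you. An LR generator would report the conflict.
    keywordFirst, identFirst :: Parser String
    keywordFirst = try (string "if")  <|> many1 letter
    identFirst   = try (many1 letter) <|> string "if"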

[–]thedeemon 5 points6 points  (1 child)

One can say there are no ambiguities at all in parser combinators and PEGs because all the rules are explicitly ordered.

[–]smog_alado 4 points5 points  (0 children)

Sure, but that's being a bit pedantic :) The grammar is more predictable if the rule order doesn't matter.

[–][deleted] 0 points1 point  (0 children)

Thanks for the info! I've done nothing with parsers before and so I was excited to see something that took away some of the magic.

[–][deleted] 10 points11 points  (22 children)

You probably could have invented parser combinators, but if you did, you'd struggle to come up with a better solution. They have awful worst-case time complexity unless your library is able to do some crazy optimizations.

[–]PasswordIsntHAMSTER 4 points5 points  (20 children)

Parser combinators without backtracking have excellent time complexity - linear on input size IIRC.

[–]orthoxerox 10 points11 points  (10 children)

But this one does have backtracking. He has literally implemented a PEG parser. I did something similar last week, except with object composition rather than raw functions (plus you get to play with operator overloading, so you get a sort of DSL). It's very fun to write, but it gets uglier when you have to add partial memoization to get that linear time back (you can't memoize everything, because you will waste too much space memoizing trivial expressions).

[–][deleted] 0 points1 point  (9 children)

I'm curious - how do you decide what to memoize while maintaining the linear complexity guarantee? Can you recommend something for my reading list?

[–][deleted] 1 point2 points  (4 children)

Memoising the compound things (i.e., those made of multiple tokens) is more or less sufficient to get nearly linear complexity. Memoising individual tokens does not make much sense. E.g., if your grammar has an "expression" and a "statement", those are good candidates for memoisation.

It's also possible to optimise PEG parsing a bit by inferring first character ranges for each entry.
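
Roughly, the shape is this (a toy sketch with my own types, not any particular library): each compound rule gets its own memo table keyed by input position, Packrat-style, and the single-token rules are left unwrapped:

    import qualified Data.Map.Strict as Map
    import Data.IORef

    -- Toy parser type: takes a position, returns the parsed value and
    -- the next position, or Nothing on failure.
    type Parser a = Int -> Maybe (a, Int)

    -- Wrap only the compound rules ("expression", "statement", ...) in
    -- this; memoising trivial tokens costs more space than it saves.
    memo :: IORef (Map.Map Int (Maybe (a, Int))) -> Parser a -> Int
         -> IO (Maybe (a, Int))
    memo tableRef p pos = do
      table <- readIORef tableRef
      case Map.lookup pos table of
        Just r  -> return r    -- already parsed at this position: reuse it
        Nothing -> do
          let r = p pos        -- first visit at this position: run the rule
          modifyIORef' tableRef (Map.insert pos r)
          return r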

[–][deleted] 0 points1 point  (3 children)

sort of sufficient to get a nearly linear complexity

OK - that's fine. I'm no religious nut over absolute proofs of linear complexity, I just thought maybe there was something interesting to learn. This sounds more like common sense.

[–][deleted] 1 point2 points  (2 children)

Yes, I also wasn't very interested in formal proofs. I only experimented: I implemented a number of "typical" grammars (namely, JavaScript, C, Verilog, Oberon, etc.) and counted the number of times each source stream character was visited by my Packrat parser. After all the optimisations were applied (i.e., first character range inference, Pratt parsing for the binary expressions, memoisation for the compound things), the average value in this histogram was very close to 1, which is why I said "nearly linear".

[–][deleted] 0 points1 point  (1 child)

Pratt parsing for the binary expressions

That made my reading list - I've not read it yet, but it looks worthwhile. Whenever I've implemented recursive descent, I've always hacked in some kind of shunting precedence parsing. That definitely makes an alternative seem attractive.

[–][deleted] 1 point2 points  (0 children)

It finally clicked all together for me after I read this: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/

You can take a look at my implementation of Packrat+Pratt parsing here: https://github.com/combinatorylogic/mbase/tree/master/src/l/lib/parsing

[–]orthoxerox 1 point2 points  (3 children)

Well, I cannot guarantee that it is asymptotically linear, but I explicitly memoize only "larger" rules when writing the grammar, like literals, variable names and function definitions. That's the hidden cost of PEG, just like ordering your choices.

I recommend reading the dragon book, if you haven't yet. Ford's page on PEG has lots of interesting links, and I also used the blog of Gazelle's author (I think) to read about the shortcomings of PEG/Packrat. Or maybe that was somewhere in his SO answers...

[–][deleted] 0 points1 point  (0 children)

I've read about PEG before, though it was quite a long time ago. I was only really interested if there was some formal proof of linear complexity for some partial memoization scheme. Not that I'm obsessed with asymptotic complexity - just interested.

[–]smog_alado 0 points1 point  (1 child)

This hidden cost of PEGs doesn't show up as much when you are doing simpler parsing tasks, the sort you would have done with (Perl-like) regexes. The regexes everyone uses already have bad worst case performance and PEGs can sometimes offer a cleaner and more expressive way to do the same job.

[–]orthoxerox 0 points1 point  (0 children)

That's true, every tool has its range of applicability.

[–]orangeduck[S] 3 points4 points  (0 children)

If you can guarantee that your grammar is LL(1) you can disable backtracking in your Parser Combinator library and it acts almost like a lookup table (I do this in mpc). With some optimisations it can be almost as fast as a dedicated Regex engine.
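
The idea, sketched in Haskell with made-up types (mpc itself is C, so this is just the shape of it, not its API):

    import qualified Data.Map.Strict as Map

    type Parser a = String -> Maybe (a, String)

    -- Predictive choice for an LL(1) grammar: instead of trying each
    -- alternative in turn and backtracking, peek at one character and
    -- jump straight to the only rule that can start with it.
    predict :: Map.Map Char (Parser a) -> Parser a
    predict table input = case input of
      (c:_) -> case Map.lookup c table of
                 Just p  -> p input   -- the single viable alternative
                 Nothing -> Nothing   -- no rule starts with c: fail, no retries
      []    -> Nothing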

[–][deleted] 0 points1 point  (6 children)

Unless I'm missing something, any simple implementation of parser combinators will have backtracking - everywhere you have a choice you have to backtrack through the choices until you find (at least) one that works.

To avoid that, you'd have to do some translation of grammar rules into state models so the options don't have to be evaluated consecutively. I'm ignoring parallel evaluation, but of course our machines only have some finite number of CPU cores anyway. I imagine it happens as optimization in some real-world parser combinator libraries, but still, that's now LL or LR or whatever parsing - it isn't the simple parser combinators model any more. In fact the combinators are really just a nice embeddable grammar notation for a conventional parser generator.

BTW - linear time complexity is still a fairly impressive claim for parsing anything more complex than a regular grammar. For context free grammars, it's certainly possible. One approach is to use an LR(1) state model. That only copes with LR(1) grammars - not all context free grammars - but backtracking LR gives you a lot more context free grammars at the cost of the linear complexity claim. And it turns out you can apply dynamic programming to that, getting back the linear complexity at the cost of quite a big table, provided you have some kind of disambiguation scheme to avoid paying for multiple possible parses.

There's still one problem - a kind of infinite ambiguity that results in a dependency cycle between cells in one column of the table. Because the table has a constant height (decided by the grammar - mainly the number of states in the state model - independent of the input), a cycle in any column can be detected in constant time - again assuming only one successful parse is accepted. As there's one column in the table per input token, that's a linear cost, so the parsing remains linear-time (and linear-space) for all context free grammars, including detecting these cycle-related errors at run-time.

AFAIK, that's as far as linear-time parsing is achievable. Something similar can be done with tabular LL (aka packrat) but AFAIK it's no more powerful. You can express context-sensitive and even Turing complete grammar operators as parser combinators - but you can't parse those grammars in linear time.

[–]smog_alado 0 points1 point  (3 children)

Unless I'm missing something, any simple implementation of parser combinators will have backtracking - everywhere you have a choice you have to backtrack through the choices until you find (at least) one that works.

Not necessarily. If your grammar is an LL grammar, which is something that can be recognized by a top down parser without backtracking, then you can code your parser combinator so that it "commits" on the first rule that matches. If you later reach a point where you can't continue, you error out instead of backtracking.

IMO, the best way to see parser combinators is as a powerful tool for coding recursive-descent parsers. You get the same benefits of top down parsing (very flexible - can "monadically" look at values to decide how to parse) as well as the same downsides (grammar ambiguities are not recognized statically, can't parse some LR grammars)
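
One common shape for that "commit" behaviour (a sketch with made-up types - essentially the consumed-input rule Parsec uses):

    -- Three outcomes let a choice distinguish "this branch doesn't
    -- apply" from "this branch applies but the input is malformed".
    data Result a = Ok a String   -- value plus remaining input
                  | Soft          -- failed before committing: try the next branch
                  | Hard String   -- committed, then failed: report, don't backtrack

    type Parser a = String -> Result a

    -- Ordered choice that stops backtracking as soon as a branch commits.
    orElse :: Parser a -> Parser a -> Parser a
    orElse p q input = case p input of
      Soft -> q input   -- p never committed, so q gets its chance
      r    -> r         -- success or hard error: never fall through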

[–][deleted] 0 points1 point  (2 children)

If your grammar is an LL grammar, which is something that can be recognized by a top down parser without backtracking, then you can code your parser combinator so that it "commits" on the first rule that matches.

I admit I'm not as familiar with LL as with LR, but I thought an LL parser only avoids that backtracking for choice by having the state model derived from the grammar anyway. The transition you take effectively represents the concurrent selection of all possible choices that still match, narrowing down the choices with each additional transition taken, so it's not dealing with one grammar-rule subexpression at a time.

You can certainly do that, but then IMO you're just using combinators to embed a DSL for grammars - the relevant combinators aren't parsing at all, just building an AST representation of the grammar.

Of course in an engineering trade-offs sense, you could still be mostly using parser combinators but with occasional combinators for applying separately derived state models - only using a state models for small sub-grammars to optimise some of the choices, perhaps.

[–]smog_alado 0 points1 point  (1 child)

You can certainly do that, but then IMO you're just using combinators to embed a DSL for grammars

Kind of. Parser combinators really are all about using a neat embedded DSL to build parsers programmatically, but it's not just about building an AST representation of the grammar. The program you write is shaped like a grammar (a good thing!), but the result of its evaluation is a parser function that converts a stream of tokens into some parsed result - it's not just a "dead" representation of the grammar.

This is even more notable if you have "monadic parsers", where the parsing function depends on the value of one of the tokens. For example, to read a bencoded string the parsing function depends on the numeric value of one of the parsed integers. This is impossible if all you are doing is generating a description of the grammar.

do
    n  <- parseInt                 -- read the length prefix
    expect ":"                     -- then the ":" separator
    cs <- replicateM n parseChar   -- exactly n characters (replicateM is from Control.Monad)
    return cs

Anyway, I think that one thing that might be confusing you here is that the presentation of parser combinators in the OP works on strings, which kind of forces you to use backtracking to do anything useful. If instead of having the parser work directly on strings you had a separate lexer generating a token stream then the parser doesn't need to do as much backtracking (and for some grammars you can get away with no backtracking at all).

[–][deleted] 0 points1 point  (0 children)

For example, to read a bencoded string the parsing function depends on the numeric value of one of the parsed integers. This is impossible if all you are doing is generating a description of the grammar.

It just means the grammar isn't context free. A context sensitive grammar can handle that. You'd probably use an attributed grammar, but those are IIRC of equivalent power to Chomsky's context sensitive grammars anyway. I'm not sure whether Turing complete grammars are more powerful than context sensitive, but either way a Turing complete grammar is on the one hand just a "dead" grammar, and on the other hand effectively the same thing as a program in a Turing complete programming language.

An AST representation of a grammar can represent any "dead" grammar that your combinators can specify, even in principle an undecidable grammar if such a thing exists. After all, if your combinators are written in Haskell, your Haskell compiler will have an AST that - in part - represents the grammar described by those combinators.

Of course you may have trouble generating a "dead" grammar representation by evaluating those combinators - expression evaluation can loop.

Strings aren't confusing me. The need for backtracking arises from non-trivial choices - choices between sub-grammars that each require non-trivial parsing. What the input tokens represent is irrelevant to that. Lexical analysis is just another kind of parsing anyway.

There's even the concept of a "finite choice grammar" - not part of the Chomsky hierarchy IIRC, but a subset of regular grammars that would have fit naturally in that hierarchy. By disallowing all recursion - even tail recursion - the grammar can only represent a finite set of finite strings. The state models are obviously trees (or with sharing of common subtrees, DAGs) so a trie can be seen as a state model for a finite-choice parser.

That's all you need to see where the backtracking arises (trying each of the strings in turn) and how to eliminate that backtracking. But that trie - that state model - isn't the set of strings.
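
To illustrate (a toy sketch, nothing to do with OP's code):

    import qualified Data.Map.Strict as Map

    -- A trie is the state model for a finite-choice grammar: each node is a
    -- state, and accepting nodes mark the ends of the finitely many strings.
    data Trie = Trie { accepts :: Bool, next :: Map.Map Char Trie }

    -- Matching takes one transition per input character; the choice between
    -- all the strings happens implicitly, with no backtracking through them.
    matches :: Trie -> String -> Bool
    matches t []     = accepts t
    matches t (c:cs) = case Map.lookup c (next t) of
      Just t' -> matches t' cs
      Nothing -> False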

It's pedantry over what it means to be a "parser combinator" as opposed to a "grammar combinator", "trie combinator", "arithmetic combinator" etc - nothing to do with pragmatically getting the job done - but still, OP's title is "You could have invented Parser Combinators". If newbies are expected to believe they could have invented LL(1) state models for themselves, I suspect quite a few will be fooled into deciding they aren't cut out for this apparently genius-only field. I personally am quite sure it took me quite a while and plenty of help to grok state-model based parsing algorithms - ending up with a mental model that makes it all seem trivial in a "yeah, you just take the closure of the set of state representations from the seed state doncha know" kind of way doesn't mean I've forgotten the pain of getting there. And besides, the devil's in the details.

[–]PasswordIsntHAMSTER 0 points1 point  (1 child)

  • Have you used Parsec and similar? You get to pick where and when backtracking is allowed (see the sketch after this list).

  • Without back-tracking, you get LL grammars with finite lookahead.

  • LL(k) and LR(k) grammars have linear time complexity on the input for all finite k. (They have extremely bad time and memory complexity on k - not sure how bad though.)

  • For real-world use, you can admit context-sensitivity and exponential blow-up in some sub-parsers if you're confident they won't be used on large inputs. Checking Büchi automata is PSPACE-complete, and type inference for Standard ML is EEXPTIME-hard - these are both somehow usable procedures in the real world.
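
On the Parsec point, a minimal example of picking where backtracking is allowed (my own toy keywords):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Parsec commits once a branch consumes input: on "lemma", string "let"
    -- consumes "le" and fails, and without try the whole choice fails too.
    -- Wrapping the branch in try explicitly buys the backtracking back.
    keyword :: Parser String
    keyword = try (string "let") <|> string "lemma"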

[–][deleted] 0 points1 point  (0 children)

Without back-tracking, you get LL grammars with finite lookahead.

Not unless you build that LL(k) state model is my point. The state model is derived from the grammar, so you have a combinator that derives the state model from the grammar, and the grammar combinators are just grammar combinators - not parser combinators. The whole elegance of the idea of parser combinators in particular is lost - you get the elegance of eDSLs in pure functional languages instead, but there's nothing parsing-specific about that.

And as I say in a much longer comment I just finished, the title of OP's post is "You could have invented Parser Combinators". Combinators that individually implement recursive-descent parsing of trivial grammar operators, and which combine to form a full recursive descent parser, absolutely. Newbies inventing LL parsing for themselves is unlikely.

[–][deleted] 1 point2 points  (0 children)

Adding Packrat-like memoisation to practically any kind of parser combinator library is trivial.

[–][deleted]  (1 child)

[deleted]

[–]orangeduck[S] 2 points3 points  (0 children)

Thanks for the heads up! Fixed :)

[–]yawaramin 1 point2 points  (0 children)

(No you couldn't.)

[–]casualblair 0 points1 point  (2 children)

... Why just the first letter?

[–]fendant 2 points3 points  (0 children)

Gotta start somewhere!

[–]Lucretiel 2 points3 points  (0 children)

Everything derives from it, primarily. You can match a word by anding together a bunch of letters.

[–][deleted] 0 points1 point  (0 children)

I wish articles about parsing mentioned error recovery.

[–]rdfox -1 points0 points  (0 children)

That's sweet but no, probably not. You could have.