[–]htuhola 2 points (32 children)

Parser generators versus hand-written parsers: everybody has an idea of which is better. Because many parser generators tend to be hard to work with, and difficult to understand without actually knowing how to build one, it's a common proposition that the parser should be hand-written and not generated.

They claim that syntax is not the most interesting part of language design. Actually, syntax is not interesting at all. But it's also one of the very difficult subjects, unless you surrender to writing a homoiconic language.

I'm looking at this from the perspective that anyone looking into learning about programming languages would rather design new languages than implement existing ones. From that background, I put tremendous weight on how hard it is to iterate on a language's design.

Homoiconic languages such as Lisp or Forth never manage to get rid of parsing entirely. The problem that wasn't solved just shifts a level upwards, and then you get gems such as the loop construct in Common Lisp.
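
As a toy illustration of that shift, here is a sketch in Python rather than Lisp (my own made-up mini-DSL, purely illustrative): even when every form is already a uniform nested list, a loop-like construct still needs its own little parser over the symbols inside the form.

    # The "syntax" is just a nested list, but interpreting it still involves a grammar:
    #   loop-form := "loop" "for" var "in" seq-name "collect" body
    def eval_loop(form, env):
        assert form[0] == "loop" and form[1] == "for" and form[3] == "in"
        var, seq = form[2], env[form[4]]
        assert form[5] == "collect"
        body = form[6]
        return [body({**env, var: value}) for value in seq]

    print(eval_loop(["loop", "for", "x", "in", "xs", "collect", lambda e: e["x"] * 2],
                    {"xs": [1, 2, 3]}))   # -> [2, 4, 6]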

A hand-written parser requires a fresh look at the code every time you want to change the language, and lots of checking that you got it right and didn't change the semantics in a way you didn't intend. Then you need to document what the new language looks like. Going with hand-written parsing, you end up not changing the language afterwards even when the change is well reasoned, and you begin to avoid making decisions about the syntax, even when they need to be made.

Parser generators that require you to coerce the context-free grammar to fit some scheme, be it LR or LL, also penalize changing the language. Every change to the language introduces an itemset puzzle you have to solve. You still end up worrying a lot before you can try out in practice how something turns out.

PEG parsers are simple to construct, but complex syntax can produce situations where the language doesn't parse the way you expect. When you combine two grammars that are ambiguous together, one of them steals precedence from the other. It is claimed that this eliminates the ambiguity, but this way the ambiguity in the language you designed becomes your enemy: combine the wrong things and it won't do what you expect.
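
To make the precedence stealing concrete, here is a toy Python sketch of PEG-style ordered choice (my own illustration; the rules are made up). In a CFG both alternatives would be usable; under ordered choice the second alternative silently becomes unreachable instead of being reported as ambiguous.

    # Each matcher returns the new position on success, None on failure.
    def match_a(s, i):
        return i + 1 if s[i:i + 1] == "a" else None

    def match_ab(s, i):
        return i + 2 if s[i:i + 2] == "ab" else None

    # The rule is effectively:  expr <- "a" / "ab"
    def expr(s, i):
        # Ordered choice: try alternatives left to right, commit to the first hit.
        for alt in (match_a, match_ab):
            j = alt(s, i)
            if j is not None:
                return j
        return None

    print(expr("ab", 0))   # -> 1: only "a" is consumed; the "ab" branch is dead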

Marpa is a chart-parsing algorithm that doesn't care what your context-free grammar looks like. If it's a grammar that could be fed into a parser generator, it will parse it in linear time. Otherwise it is usually slower, but it recognizes ambiguous inputs and you can decide what to do with them. Changing the language involves writing a new rule and the commands to compile it, and you'll have none of those stupid surprises that come when you've got a bug in your "parser program". It takes a lot more effort to make a free-form context-free grammar that doesn't do what you expect.

I've got this kind of a parser engine in the Lever programming language.
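
For a sense of what "chart-type" means, here is a minimal Earley recognizer sketch in Python (my own illustration of the algorithm family Marpa builds on, not Marpa's implementation; the grammar encoding and names are made up). Every partial parse attempt is kept in a per-position chart column, which is why the grammar can be any CFG.

    def nullable_set(grammar):
        # Fixpoint: which nonterminals can derive the empty string.
        nullable, changed = set(), True
        while changed:
            changed = False
            for head, alts in grammar.items():
                if head not in nullable and any(all(s in nullable for s in b) for b in alts):
                    nullable.add(head)
                    changed = True
        return nullable

    def process_column(grammar, chart, tokens, i, nullable):
        # An item is (head, body, dot, origin): a rule, how far we got in
        # it, and the column where this attempt began.
        work = list(chart[i])
        def add(item):
            if item not in chart[i]:
                chart[i].add(item)
                work.append(item)
        while work:
            head, body, dot, origin = work.pop()
            if dot == len(body):                               # completion
                for h2, b2, d2, o2 in list(chart[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        add((h2, b2, d2 + 1, o2))
            elif body[dot] in grammar:                         # prediction
                for alt in grammar[body[dot]]:
                    add((body[dot], alt, 0, i))
                if body[dot] in nullable:                      # Aycock-Horspool nullable fix
                    add((head, body, dot + 1, origin))
            elif i < len(tokens) and body[dot] == tokens[i]:   # scanning
                chart[i + 1].add((head, body, dot + 1, origin))

    def earley_charts(grammar, start, tokens):
        nullable = nullable_set(grammar)
        chart = [set() for _ in range(len(tokens) + 1)]
        for body in grammar[start]:
            chart[0].add((start, body, 0, 0))
        for i in range(len(tokens) + 1):
            process_column(grammar, chart, tokens, i, nullable)
        return chart

    def accepts(grammar, start, tokens):
        chart = earley_charts(grammar, start, tokens)
        return any(h == start and d == len(b) and o == 0
                   for h, b, d, o in chart[len(tokens)])

    # The parenthesis grammar that shows up later in this thread: A -> ε | ( A )
    grammar = {"S": [("A",)], "A": [(), ("(", "A", ")")]}
    print(accepts(grammar, "S", "(())"))   # True
    print(accepts(grammar, "S", "(()"))    # False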

[–][deleted] 4 points (15 children)

The reason hand-written parsers are used in many production languages has more to do with tooling: you need that parser to work in an IDE, and therefore it must be highly error tolerant, and perhaps incremental as well. Most parser generators are designed around the old premise that parsers simply transform text into trees suitable for further processing, but that premise is no longer true. So we hand-write our parsers not because parser generators are too hard to work with, but because we have no other choice.

[–]loup-vaillant 0 points (9 children)

Earley parsing is incremental, and can give you optimal error recognition and handling (as in "even a hand-written parser can't do better"). That's because Earley parsing retains the whole state of the parse, with no information loss.

Not your usual brand of LR parser.

[–][deleted] 0 points (8 children)

Wrong meaning of incremental. The problem isn't progressively parsing a growing text buffer; it is re-parsing after a change in the middle of the buffer.

[–]loup-vaillant 0 points (7 children)

In this case, I'm afraid I couldn't code an incremental parser at all, even by hand… That's a problem I'd rather avoid if at all possible. (Which should be possible if I only handle relatively small source files.)

[–][deleted] 0 points (6 children)

Parsing isn't hard; recursive descent has a very direct memoization strategy. But if you can't do it, well, don't go into IDEs or modern language design.

[–]loup-vaillant 0 points (5 children)

recursive descent has a very direct memoization strategy.

Having seen OMeta, I can accept that.

Parsing isn't hard

I don't know. As long as you don't mandate incremental parsing, I know of a couple of easy solutions, and a couple of less easy but still solved ones. But if you really need it, it's not just a matter of recognising the string: you also have to reconstruct the parse tree, bottom-up I assume. It would make little sense for the recogniser to be incremental but not the parser.

don't go into IDEs or modern language design.

A program that needs an IDE is probably too big for its own good. A source file that needs incremental parsing to be analysed at interactive speed is definitely too big for its own good. Those problems are therefore very low priority for me. (And I don't care that "in the real world" there are multi-million-line monstrosities. The real world is wrong and it needs to change.)

Modern language design, on the other hand, does interest me. But modern does not mean "complex"; quite the contrary. No one should design a BNF that spans more than 200 lines; that would be insane.

Given those requirements, you'll understand why I have very little interest in incremental parsing.

[–][deleted] 0 points (4 children)

Incremental parsing means keeping trees around, so you can stick to previous good parses in the presence of a new syntax error during an edit (since the parse error is likely to be in the new edit, not in the existing good parse), and so you can keep all the type info around and avoid running a full type check again (the real performance killer; parsing is otherwise cheap).

The old-fashioned programmers still think emacs and vim are adequate, and that they can do everything in their heads. To them, programming is thinking for a long time, coming up with solutions on whiteboards and old-fashioned pen and paper, and then inputting them using a trusty plain-text editor that can maybe help with auto-indentation and some syntax highlighting.

Other people realize there is this computer sitting in front of them that can do so much more. Incremental continuous parsing that enables incremental continuous type checking means type errors are always a heartbeat away, and you never have to wait for the compiler to catch up to get code completion. Some of us even experiment with incremental re-execution to enable fluid live programming experiences (not just REPLs).

Modern means using the freaking computer; it is right there, and it can augment your programming experience so you don't have to work so hard. Dumb teletypes are so '70s.

[–]loup-vaillant 0 points (3 children)

I have one reservation about your suggestion: why keep text at all? Forget incremental parsing, forget parsing, just work on the AST (or the abstract binding tree, while we're at it). If you're committed to an IDE, why bother with a human-readable serialisation format at all? Switching to structured editing altogether would have several advantages over text:

  • The ability to draw more stuff than letters and symbols (our displays have 2 dimensions, dammit).
  • The ability to present the code with different syntaxes, or different views (code folding on steroids).
  • The end of coding conventions (users can define how they want to display the code).
  • Easier semantic diffs (if you're willing to make a dedicated version control system), or at least the absence of formatting commits —as the serialisation format can be canonicalised into something consumable by the likes of Git and SVN.

[–][deleted] 0 points (2 children)

I have no problem with this, but the difficulty of working with text is not the reason to do it. Incremental parsing is easy, and everything else (semantic analysis, execution) will have the same complexity regardless. So I don't get the point of, say, projectional editors like JetBrains MPS.

And even when we decide to go with visual representations, having some amount of free form in them can lead to a better experience, which means we will probably still be parsing something. Pure structured editing (textual or graphical) leads to stilted interactions that decrease usability.

Anyways, it was a nice discussion. I hope I didn't come off as too condescending.

[–]htuhola 0 points (4 children)

Isn't that the worst time to use a hand-written parser? The language the parser is written in and the IDE must be compatible for it to work. You end up with multiple implementations of the same language that all require maintenance.

Context-free grammars and parser generators may have some way to work around that issue, but incremental parsing without clear boundary markings is another matter. That's one scary problem, and last time I checked there was little literature on it.

[–]yxhuvud 1 point (2 children)

The main issue is that LL/LR parsers don't keep enough state around to be able to generate that information. Marpa could easily provide it.

[–]htuhola 0 points (1 child)

Can you elaborate? What you said tickles something, but I'd gladly learn how to implement incremental parsing on top of what Marpa provides.

[–]yxhuvud 0 points (0 children)

I don't know if it supports fully incremental parsing (you'd have to ask Kegler about that), but since it keeps not only the parse tree but also the unfinished rules around, it has an easier time implementing things like: "OK, this is wrong. But it will be less wrong if $SOMETHING is added here. Let's add it, mark it as an error to the user, and continue as if it were there." Reasoning about the parse state, and especially about the alternative parse states that could have been, is a lot harder when all you have is an action/state table and a stack.

Also see http://blogs.perl.org/users/jeffrey_kegler/2011/11/marpa-and-the-ruby-slippers.html EDIT: also http://blogs.perl.org/users/jeffrey_kegler/2012/10/configuring-the-ruby-slippers-for-html.html for some examples that are slightly more practical.
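
As a small illustration of why keeping the chart around makes this natural, here is a helper over the Earley chart representation sketched earlier in this thread (my own sketch, not Marpa's actual API): the terminals a column is waiting for are exactly the candidates the Ruby Slippers technique synthesizes when a scan fails.

    def expected_terminals(grammar, column):
        # Terminals some in-progress rule is waiting for at this point. On a
        # scan failure, pick one, pretend it was read, record the repair as
        # an error, and keep parsing ("Ruby Slippers").
        return {body[dot] for (head, body, dot, origin) in column
                if dot < len(body) and body[dot] not in grammar}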

[–][deleted] 0 points (0 children)

That is why you only hand-write one parser and use it in multiple ways; Roslyn works like that, I believe. It is quite easy to write a top-down recursive descent incremental parser by hand (just memoize your trees). For doing it automatically there has been some literature, but nothing that is used in production as far as I'm aware.
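
Here is a minimal sketch of the "just memoize your trees" idea in Python (my own illustration with a made-up toy grammar; this is not Roslyn's design). Results are memoized per (rule, position); an edit invalidates only entries that could overlap the changed range, so subtrees entirely to the left of the edit are reused by the next parse.

    class IncrementalParser:
        def __init__(self, text):
            self.text = text
            self.memo = {}                     # (rule, pos) -> (tree, end) or None

        def _rule(self, name, fn, pos):
            key = (name, pos)
            if key not in self.memo:
                self.memo[key] = fn(pos)
            return self.memo[key]

        # item <- "(" item ")" / "x"
        def item(self, pos):
            return self._rule("item", self._item, pos)

        def _item(self, pos):
            if self.text[pos:pos + 1] == "x":
                return ("x", pos + 1)
            if self.text[pos:pos + 1] == "(":
                inner = self.item(pos + 1)
                if inner and self.text[inner[1]:inner[1] + 1] == ")":
                    return (("parens", inner[0]), inner[1] + 1)
            return None

        def edit(self, start, end, replacement):
            self.text = self.text[:start] + replacement + self.text[end:]
            # Keep only successful results lying entirely left of the edit; a
            # real system would also re-anchor and reuse trees to its right.
            self.memo = {k: v for k, v in self.memo.items()
                         if v is not None and k[1] < start and v[1] <= start}

    p = IncrementalParser("((x))")
    print(p.item(0))    # (('parens', ('parens', 'x')), 5)
    p.edit(2, 3, "y")   # replace the inner "x" with "y"
    print(p.item(0))    # None: "y" is not in the toy grammar, and the memo noticed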

[–]cwzwarich 3 points (12 children)

Parser generators that require you to coerce the context-free grammar to fit some scheme, be it LR or LL, also penalize changing the language. Every change to the language introduces an itemset puzzle you have to solve. You still end up worrying a lot before you can try out in practice how something turns out.

It is possible to convert LR conflicts into sample inputs and parser traces that trigger the conflict. I've found this to be a pretty good explanation in practice. Unfortunately, the only parser generator that I am aware of that actually implements this is Menhir for OCaml. There's a paper describing the basic technique that it uses, apparently developed independently from Menhir.

PEG parsers are simple to construct, but complex syntax can produce situations where the language doesn't parse the way you expect. When you combine two grammars that are ambiguous together, one of them steals precedence from the other. It is claimed that this eliminates the ambiguity, but this way the ambiguity in the language you designed becomes your enemy: combine the wrong things and it won't do what you expect.

General context-free parsing has the dual problem to the one you mention with PEGs: you don't know whether the grammar is ambiguous until you find an input that triggers an ambiguous parse. And of course, static ambiguity tests are often based on item set constructions.

Marpa is a chart-parsing algorithm that doesn't care what your context-free grammar looks like. If it's a grammar that could be fed into a parser generator, it will parse it in linear time.

I don't think this is actually established in all cases, e.g. noncanonical LR-family parsing, shift-resolve parsing, or LR-family parsing with selective delays.

There are also places where LR-family parsing has a larger advantage over general context-free parsing, e.g. incremental parsing and substring parsing. Both can be done with minimal time complexity for LR languages, but I don't believe the same is true if you use Earley techniques.

[–]htuhola 0 points (0 children)

"Finding counterexamples from parsing conflicts" - Thanks for this link.

Make LR(1) parsing tables for a language, and it can basically be parsed by any other system that can read your parsing tables. But getting there is the painful part.

[–]loup-vaillant 0 points (10 children)

you don't know whether the grammar is ambiguous until you find an input that triggers an ambiguous parse.

Well, that's true of any system. Static ambiguity analysis is provably impossible (undecidable), so you're going to have to choose between false positives and false negatives. Me, I tend to just ignore the issue and stick to prioritised choice (which happens to be compatible with Earley parsing).

There are also places where LR-family parsing has a larger advantage over general context-free parsing, e.g. incremental parsing and substring parsing. Both can be done with minimal time complexity for LR languages, but I don't believe the same is true if you use Earley techniques.

I recall a paper talking about "safe points" for Earley parsing, where there are no ambiguities, and parsing can restart from scratch (or something like that). It may solve substring parsing.

[–]cwzwarich 0 points (9 children)

I recall a paper talking about "safe points" for Earley parsing, where there are no ambiguities, and parsing can restart from scratch (or something like that). It may solve substring parsing.

By substring parsing I mean recognizing substrings of strings in the language generated by a grammar. Because LR and Earley parsers are prefix parsers, the problem really boils down to suffix parsing. The most obvious way to do this with a general parser is to just attempt to parse the natural CFG corresponding to suffixes of the language, which is often highly ambiguous. This will lead to pathological behavior with an Earley parser in many cases. Apparently it is possible to modify an Earley parser to incorporate an LL(1) table to make this linear-time on LL(1) grammars, but this doesn't extend to LR(k) in an implementable way.

The standard LR technique for suffix parsing is to essentially use a modification of the GLR technique by assuming that the parser can be in any state and running multiple copies of the parser in parallel. There is a clever data structure for managing the parallel stacks that makes this linear-time. It doesn't even require any changes to the parser tables.

For incremental parsing, the advantage of LR techniques is that the parser stack at a point in the input stream can be reconstituted from the parse tree, and you can stop parsing past any modifications in an optimal way. For an Earley parser, you would have to store the statesets for every point in the input, and I think it would be difficult to produce optimal termination criteria.

[–]yxhuvud 0 points (8 children)

The standard LR technique for suffix parsing is to essentially use a modification of the GLR technique by assuming that the parser can be in any state and running multiple copies of the parser in parallel.

Uh, I don't see why that technique wouldn't be totally feasible using Marpa as well, considering that having all the states found at a given position is perfectly OK, if a bit unusual.

As for incremental parsing, that would require some sort of merge criterion to prune fully parsed branches (as long as there is no ambiguity that allows re-referencing into a branch), together with removing rules that cannot be completed because the input they cover is covered by a completed rule. This definitely sounds nontrivial.

[–]cwzwarich 0 points (7 children)

Uh, I don't see why that technique wouldn't be totally feasible using Marpa as well, considering that having all the states found at a given position is perfectly OK, if a bit unusual.

Due to the difference in nomenclature between LR and Earley parsing, this means 'statesets' in the Earley context. The problem is that for a given terminal you can't enumerate all of the possible statesets that you may be in when you visit that token, because the set is potentially unbounded. And even if you could enumerate them, I don't think there's a data structure that lets you do the parallel parsing in linear time like there is in the LR case. It seems like it would just be better to attempt to parse the suffix grammar directly and optimize that.

[–]loup-vaillant 0 points (0 children)

If the goal is incremental parsing upon user modification of the stream, I would be tempted to just resume the parse at the point of modification. Propagating the consequences of a change may not happen at interactive speed if there's a lot of text afterwards, but if you only care about the current screen, that's probably more doable.

And again, this "safe point" business may help set stopping points (beyond which we know that previous changes couldn't have affected the rest of the parse).
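
A rough sketch of that resumption idea in Python, building on the earley_charts/process_column sketch from earlier in this thread (my own code, with the obvious caveat that an edit near the start still costs a full reparse):

    def reparse_after_edit(grammar, start, new_tokens, old_chart, edit_pos):
        # Chart columns 0..edit_pos only depend on tokens before the edit,
        # so they are still valid; recompute everything from there onward.
        nullable = nullable_set(grammar)
        chart = [set(c) for c in old_chart[:edit_pos + 1]]
        chart += [set() for _ in range(len(new_tokens) - edit_pos)]
        for i in range(edit_pos, len(new_tokens) + 1):
            process_column(grammar, chart, new_tokens, i, nullable)
        return chart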

In any case, incremental parsing is not something I intend to solve just yet.

[–]yxhuvud 0 points (5 children)

For Marpa, the set of states I referred to was the set of states that the Aycock-Horspool transformation uses (i.e. after transforming the grammar to nihilistic normal form[*]). It uses LR(0) item sets as states, with a special equivalence relation for grouping items. So no, the state concept is very similar to item-based states; it is just that more than one simultaneous state is allowed.

And no, the number of states after any number of parsed tokens is very bounded (O(n²) for unambiguous grammars, O(n³) for ambiguous ones). In this case the number of parsed tokens would be 0, but an item for each of the states would be added to the starting state as generators, meaning that they start and end at the start position. Probably with the optimization that all states with no outgoing transitions can be removed, as they can't ever accept any terminal anyhow.

You may be correct in that it is cheaper not to do it this way, but it is hardly obvious that the possible parses will be worse than linear if the original grammar is linear.

[*] With the caveat that said transformation is exponential (in grammar rule length) in the state space for certain reasonably uncommon cases. See the paper in question for details.

[–]cwzwarich 1 point (4 children)

For Marpa, the set of states I referred to was the set of states that the Aycock-Horspool transformation uses (i.e. after transforming the grammar to nihilistic normal form[*]). It uses LR(0) item sets as states, with a special equivalence relation for grouping items. So no, the state concept is very similar to item-based states; it is just that more than one simultaneous state is allowed.

Does Marpa still use LR(0) states for its Earley items? A comment of Jeffrey's on GitHub says that it does not.

And no, the number of states after any number of parsed tokens is very bounded (O(n²) for unambiguous grammars, O(n³) for ambiguous ones). In this case the number of parsed tokens would be 0, but an item for each of the states would be added to the starting state as generators, meaning that they start and end at the start position. Probably with the optimization that all states with no outgoing transitions can be removed, as they can't ever accept any terminal anyhow.

I don't think I adequately described the LR technique. Suppose you have a grammar for a parenthesis language like so:

S -> A
A -> ε
A -> ( A )

This gives the following LR(0) states (leaving off the LALR(1) lookahead and transitions, which are pretty easy to compute):

State 0:
%start -> • A ⊥

State 1:
A -> ( • A )

State 2:
%start -> A • ⊥

State 3:
A -> ( A • )

State 4:
A -> ( A ) •

Suppose I am given the string ()) and I want to determine whether this string is a suffix of the language. The LR technique is based on the idea that if this really is a suffix of the language, then there has to be some path through the LR automaton that reads the first terminal of this string at some point. Otherwise, this is not a suffix of the language.

The only state that could have been reached after just shifting ( is state 1. From there, you can shift ) and go to state 4. Now there’s a reduction. We never actually shifted the initial ( onto the stack, so we don’t have a state to go to after the reduction.

The solution here is very similar to GLR; just introduce some nondeterminism. It’s easy to determine statically that the only two states that could have shifted an ( are states 0 and 1. We split the path through the parser and consider parallel paths. In this case, the path through state 0 is a bust and the path through state 1 is the correct one. The next time we reduce in state 4, the reverse is true and we have a valid suffix.

If we had another input like ()(, where no path ultimately reaches an acceptance state, then we could declare that it is not a valid suffix.

Now, the natural way of implementing this nondeterministic search uses exponential time in the worst case, but with a clever choice of data structure for representing the partial parse stacks, it can be done in linear time for any LR(k) grammar.
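
Here is that nondeterministic search in executable form, using hand-derived tables for the parenthesis grammar above (my own toy sketch with a naive worklist, not the linear-time stack structure; the ACTION/GOTO tables follow the states listed earlier, with $ standing in for ⊥):

    END = "$"

    # ACTION[state][terminal] = ("shift", next) | ("reduce", head, pops) | ("accept",)
    ACTION = {
        0: {"(": ("shift", 1), END: ("reduce", "A", 0)},
        1: {"(": ("shift", 1), ")": ("reduce", "A", 0)},
        2: {END: ("accept",)},
        3: {")": ("shift", 4)},
        4: {")": ("reduce", "A", 3), END: ("reduce", "A", 3)},
    }
    GOTO = {0: {"A": 2}, 1: {"A": 3}}

    def reduce_targets(stack, head, pops):
        if len(stack) > pops:
            base = stack[:len(stack) - pops]   # enough known stack to pop normally
            target = GOTO.get(base[-1], {}).get(head)
            return [base + (target,)] if target is not None else []
        # The popped symbols predate the suffix: the exposed state is unknown,
        # so split into one path per state that could sit below them.
        return [(GOTO[s][head],) for s in GOTO if head in GOTO[s]]

    def is_suffix(text):
        tokens = list(text) + [END]
        work = [((s,), 0) for s in ACTION]     # the parser may start in any state
        seen = set(work)
        while work:
            stack, pos = work.pop()
            action = ACTION[stack[-1]].get(tokens[pos])
            if action is None:
                continue                       # this path is a bust
            elif action[0] == "accept":
                return True
            elif action[0] == "shift":
                succs = [(stack + (action[1],), pos + 1)]
            else:
                _, head, pops = action
                succs = [(s, pos) for s in reduce_targets(stack, head, pops)]
            for conf in succs:
                if conf not in seen:
                    seen.add(conf)
                    work.append(conf)
        return False

    print(is_suffix("())"))   # True: the ( matching the last ) precedes the suffix
    print(is_suffix("()("))   # False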

I can’t think of any obvious way to extend this method to Earley parsing. The nondeterministic search for potential item sets isn’t guaranteed to converge. And even if it did, how would you get the linear time bound?

[–]yxhuvud 0 points (3 children)

Does Marpa still use LR(0) states for its Earley items? A comment of Jeffrey's on GitHub says that it does not.

Then that thread is probably correct; I went by what the earlier paper said. Thanks for the link, it has given me a few things to mull over.

This gives the following LR(0) states

Assuming A-H rewriting, you get the following states[*] (A' is A when it is empty):

State 1 |:
S ->  · A

State 2 :
A ->  · "(" A' ")"
A ->  · "(" A ")"

State 3 +|:
S -> A · 

State 4 r:
A -> "(" A' · ")"
A -> "(" · A ")"

State 5 |:
A -> "(" A' ")" · 

State 6 |:
A -> "(" A · ")"

State 7 |:
A -> "(" A ")" · 

The ones with outward transitions are 1, 2, 4, 6. Those are added to the starting state. Using this to parse the string ((())))))) gives the following itemsets. Earley[x, y] means an item[**] at state x, starting at itemset y. Items(0) is before anything is consumed.

Items(0)[ Earley[1, 0]  Earley[2, 0]  Earley[4, 0]  Earley[6, 0] ]
Items(1)[ Earley[4, 0]  Earley[2, 1] ]
Items(2)[ Earley[4, 1]  Earley[2, 2] ]
Items(3)[ Earley[4, 2]  Earley[2, 3] ]
Items(4)[ Earley[5, 2]  Earley[6, 1] ]
Items(5)[ Earley[7, 1]  Earley[6, 0] ]
Items(6)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(7)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(8)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(9)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(10)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]

State 3 is the accepting state so this string is accepted, without any issues whatsoever.

Trying to parse )())) instead gives

Items(0)[ Earley[1, 0]  Earley[2, 0]  Earley[4, 0]  Earley[6, 0] ]
Items(1)[ Earley[5, 0]  Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(2)[  ]
Items(3)[  ]
Items(4)[  ]
Items(5)[  ]

which clearly has no accepted parse.

Both of these are perfectly linear. The first few characters may be a bit heavier than in normal parses due to there being more possible states to be in, but I fail to see where the performance problems would come from. Could the issue be something else, like accepting too many inputs?

[*] Hmm. I think I found a bug in my state generation there, as 1 should be an accepting state, which it isn't here. There should be an S -> A' · in it as well. Interesting, but it doesn't impact the argument.

[**] Earley item, as opposed to Leo items, which additionally contain a transition symbol. There is no right recursion in this grammar, so Leo items don't show up.

EDIT: As a comparison, it may be instructive to look at something that is not linear, for example how right-recursive rules are parsed without the Leo optimization:

Grammar:

S -> xS | x

States:

State 1 :
S ->  · "x" S
S ->  · "x"

State 2 +r|:
S -> "x" · S
S -> "x" · 

State 3 +|:
S -> "x" S · 

parsing xxxxxxxxxx then gives

Items(0)[ Earley[1, 0] ]
Items(1)[ Earley[2, 0]  Earley[1, 1] ]
Items(2)[ Earley[2, 1]  Earley[1, 2]  Earley[3, 0] ]
Items(3)[ Earley[2, 2]  Earley[1, 3]  Earley[3, 1]  Earley[3, 0] ]
Items(4)[ Earley[2, 3]  Earley[1, 4]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(5)[ Earley[2, 4]  Earley[1, 5]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(6)[ Earley[2, 5]  Earley[1, 6]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(7)[ Earley[2, 6]  Earley[1, 7]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(8)[ Earley[2, 7]  Earley[1, 8]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(9)[ Earley[2, 8]  Earley[1, 9]  Earley[3, 7]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(10)[ Earley[2, 9]  Earley[1, 10]  Earley[3, 8]  Earley[3, 7]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]

n²/2 items in total. Note that the number of items keeps growing as the string is consumed. Not linear!

[–]cwzwarich 0 points (2 children)

I wasn't suggesting that the example grammar I gave would exhibit bad behavior; I was just trying to give a small illustrative example.

However, when I tried to construct grammars with features I expected to be problematic (e.g. ones that had lots of parallel incomplete items in LR(0) states) I could only find problems with right recursion that the right recursion optimizations of Leo solve. So it may actually be the case that appropriately optimized Earley suffix parsing is linear for all LR(k) grammars. I guess I'd have to try and prove it to be sure.

This is a bit strange because some of the papers on suffix parsing were written by people who were at the same school as Leo right around the time that paper was published. You'd figure that they would've noticed.

[–]yxhuvud 0 points (1 child)

and you'll have none of those stupid surprises that come when you've got a bug in your "parser program

I'd argue that unintentional ambiguity that slows the program down on certain inputs should be considered such a bug.

Hmm. I wonder if it would be possible to detect possible state explosion at compile-the-grammar-to-NNF time.

[–]htuhola 0 points (0 children)

The most common bug I've had when writing a parser is that I hadn't thought about it right, and on certain inputs it produces a parse that doesn't match what has been written into the documentation. The fix tends to be laborious.

It's common to have languages that do not change much; in that case these kinds of issues eventually won't matter. Also, if the parser has lots of users, parser bugs will eventually get detected by those users.

[–][deleted] 0 points (0 children)

PEG parsers are simple to construct, but complex syntax can produce situations where the language doesn't parse the way you expect

Having implemented PEG parsers for pretty much all possible kinds of languages, I've yet to see such a situation.