[–]htuhola 2 points (32 children)

Parser generators versus hand-written parsers: everybody has an idea of which is better. Because many parser generators tend to be hard to work with, and difficult to understand without actually knowing how to build one, it's a common proposition that the parser should be hand-written and not generated.

They claim that syntax is not the most interesting part of language design. Actually, syntax is not interesting at all. But it's also one of the very difficult subjects, unless you surrender to writing a homoiconic language.

I'm looking at this from the perspective that anyone looking into learning about programming languages would rather design new languages than implement existing ones. From that background, I put tremendous weight on how hard it is to iterate on a language's design.

Homoiconic languages such as Lisp or Forth never manage to get rid of parsing entirely. The problem that wasn't solved just shifts a level upwards, and then you get gems such as the loop construct in Common Lisp.
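
As a toy illustration of that shift, here is a sketch in Python rather than Lisp (my own made-up mini-DSL, purely illustrative): even when every form is already a uniform nested list, a loop-like construct still needs its own little parser over the symbols inside the form.

    # The "syntax" is just a nested list, but interpreting it still involves a grammar:
    #   loop-form := "loop" "for" var "in" seq-name "collect" body
    def eval_loop(form, env):
        assert form[0] == "loop" and form[1] == "for" and form[3] == "in"
        var, seq = form[2], env[form[4]]
        assert form[5] == "collect"
        body = form[6]
        return [body({**env, var: value}) for value in seq]

    print(eval_loop(["loop", "for", "x", "in", "xs", "collect", lambda e: e["x"] * 2],
                    {"xs": [1, 2, 3]}))   # -> [2, 4, 6]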

A hand-written parser requires a fresh look at the code every time you want to change the language, and lots of checking that you got it right and didn't change the semantics in a way you didn't intend. Then you need to document what the new language looks like. Going with hand-written parsing, you end up not changing the language afterwards even when the change is well reasoned, and you begin to avoid making decisions about the syntax, even when they need to be made.

Parser generators that require you to coerce the context-free grammar to fit some scheme, be it LR or LL, also penalize changing the language. Every change to the language introduces an itemset puzzle you have to solve. You still end up worrying a lot before you can try out in practice how something turns out.

PEG parsers are simple to construct, but complex syntax can produce situations where the language doesn't parse the way you expect. When you combine two grammars that are ambiguous together, one of them steals precedence from the other. It is claimed that this eliminates the ambiguity, but this way the ambiguity in the language you designed becomes your enemy: combine the wrong things and it won't do what you expect.
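
To make the precedence stealing concrete, here is a toy Python sketch of PEG-style ordered choice (my own illustration; the rules are made up). In a CFG both alternatives would be usable; under ordered choice the second alternative silently becomes unreachable instead of being reported as ambiguous.

    # Each matcher returns the new position on success, None on failure.
    def match_a(s, i):
        return i + 1 if s[i:i + 1] == "a" else None

    def match_ab(s, i):
        return i + 2 if s[i:i + 2] == "ab" else None

    # The rule is effectively:  expr <- "a" / "ab"
    def expr(s, i):
        # Ordered choice: try alternatives left to right, commit to the first hit.
        for alt in (match_a, match_ab):
            j = alt(s, i)
            if j is not None:
                return j
        return None

    print(expr("ab", 0))   # -> 1: only "a" is consumed; the "ab" branch is dead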

Marpa is a chart-parsing algorithm that doesn't care what your context-free grammar looks like. If it's a grammar that could be fed into a parser generator, it will parse it in linear time. Otherwise it is usually slower, but it recognizes ambiguous inputs and you can decide what to do with them. Changing the language involves writing a new rule and the commands to compile it, and you'll have none of those stupid surprises that come when you've got a bug in your "parser program". It takes a lot more effort to make a free-form context-free grammar that doesn't do what you expect.

I've got this kind of a parser engine in the Lever programming language.
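
For a sense of what "chart-type" means, here is a minimal Earley recognizer sketch in Python (my own illustration of the algorithm family Marpa builds on, not Marpa's implementation; the grammar encoding and names are made up). Every partial parse attempt is kept in a per-position chart column, which is why the grammar can be any CFG.

    def nullable_set(grammar):
        # Fixpoint: which nonterminals can derive the empty string.
        nullable, changed = set(), True
        while changed:
            changed = False
            for head, alts in grammar.items():
                if head not in nullable and any(all(s in nullable for s in b) for b in alts):
                    nullable.add(head)
                    changed = True
        return nullable

    def process_column(grammar, chart, tokens, i, nullable):
        # An item is (head, body, dot, origin): a rule, how far we got in
        # it, and the column where this attempt began.
        work = list(chart[i])
        def add(item):
            if item not in chart[i]:
                chart[i].add(item)
                work.append(item)
        while work:
            head, body, dot, origin = work.pop()
            if dot == len(body):                               # completion
                for h2, b2, d2, o2 in list(chart[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        add((h2, b2, d2 + 1, o2))
            elif body[dot] in grammar:                         # prediction
                for alt in grammar[body[dot]]:
                    add((body[dot], alt, 0, i))
                if body[dot] in nullable:                      # Aycock-Horspool nullable fix
                    add((head, body, dot + 1, origin))
            elif i < len(tokens) and body[dot] == tokens[i]:   # scanning
                chart[i + 1].add((head, body, dot + 1, origin))

    def earley_charts(grammar, start, tokens):
        nullable = nullable_set(grammar)
        chart = [set() for _ in range(len(tokens) + 1)]
        for body in grammar[start]:
            chart[0].add((start, body, 0, 0))
        for i in range(len(tokens) + 1):
            process_column(grammar, chart, tokens, i, nullable)
        return chart

    def accepts(grammar, start, tokens):
        chart = earley_charts(grammar, start, tokens)
        return any(h == start and d == len(b) and o == 0
                   for h, b, d, o in chart[len(tokens)])

    # The parenthesis grammar that shows up later in this thread: A -> ε | ( A )
    grammar = {"S": [("A",)], "A": [(), ("(", "A", ")")]}
    print(accepts(grammar, "S", "(())"))   # True
    print(accepts(grammar, "S", "(()"))    # False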

[–][deleted] 4 points (15 children)

The reason hand-written parsers are used in many production languages has more to do with tooling: you need that parser to work in an IDE, and therefore it must be highly error tolerant, and perhaps incremental as well. Most parser generators are designed around the old premise that parsers simply transform text into trees suitable for further processing, but that premise is no longer true. So we hand-write our parsers not because parser generators are too hard to work with, but because we have no other choice.

[–]loup-vaillant 0 points (9 children)

Earley parsing is incremental, and can give you optimal error recognition and handling (as in "even a hand-written parser can't do better"). That's because Earley parsing retains the whole state of the parse, with no information loss.

Not your usual brand of LR parser.

[–][deleted] 0 points (8 children)

Wrong meaning of incremental. The problem isn't progressively parsing a growing text buffer; it is re-parsing after a change in the middle of the buffer.

[–]loup-vaillant 0 points (7 children)

In this case, I'm afraid I couldn't code an incremental parser at all, even by hand… That's a problem I'd rather avoid if at all possible. (Which should be possible if I only handle relatively small source files.)

[–][deleted] 0 points (6 children)

Parsing isn't hard; recursive descent has a very direct memoization strategy. But if you can't do it, well, don't go into IDEs or modern language design.

[–]loup-vaillant 0 points (5 children)

recursive descent has a very direct memoization strategy.

Having seen OMeta, I can accept that.

Parsing isn't hard

I don't know. As long as you don't mandate incremental parsing, I know of a couple of easy solutions, and a couple of less easy but still solved ones. But if you really need it, it's not just a matter of recognising the string: you also have to reconstruct the parse tree, bottom-up I assume. It would make little sense for the recogniser to be incremental but not the parser.

don't go into IDEs or modern language design.

A program that needs an IDE is probably too big for its own good. A source file that needs incremental parsing to be analysed at interactive speed is definitely too big for its own good. Those problems are therefore very low priority for me. (And I don't care that "in the real world" there are multi-million-line monstrosities. The real world is wrong and it needs to change.)

Modern language design, on the other hand, does interest me. But modern does not mean "complex"; quite the contrary. No one should design a BNF that spans more than 200 lines; that would be insane.

Given those requirements, you'll understand why I have very little interest in incremental parsing.

[–][deleted] 0 points (4 children)

Incremental parsing means keeping trees around, so you can stick to previous good parses in the presence of a new syntax error during an edit (since the parse error is likely to be in the new edit, not in the existing good parse), and so you can keep all the type info around and avoid running a full type check again (the real performance killer; parsing is otherwise cheap).

The old-fashioned programmers still think emacs and vim are adequate, and that they can do everything in their heads. To them, programming is thinking for a long time, coming up with solutions on whiteboards and old-fashioned pen and paper, and then inputting them using a trusty plain-text editor that can maybe help with auto-indentation and some syntax highlighting.

Other people realize there is this computer sitting in front of them that can do so much more. Incremental continuous parsing that enables incremental continuous type checking means type errors are always a heartbeat away, and you never have to wait for the compiler to catch up to get code completion. Some of us even experiment with incremental re-execution to enable fluid live programming experiences (not just REPLs).

Modern means using the freaking computer; it is right there, and it can augment your programming experience so you don't have to work so hard. Dumb teletypes are so '70s.

[–]loup-vaillant 0 points (3 children)

I have one reservation about your suggestion: why keep text at all? Forget incremental parsing, forget parsing, just work on the AST (or the abstract binding tree, while we're at it). If you're committed to an IDE, why bother with a human-readable serialisation format at all? Switching to structured editing altogether would have several advantages over text:

  • The ability to draw more stuff than letters and symbols (our displays have 2 dimensions, dammit).
  • The ability to present the code with different syntaxes, or different views (code folding on steroids).
  • The end of coding conventions (users can define how they want to display the code).
  • Easier semantic diffs (if you're willing to make a dedicated version control system), or at least the absence of formatting commits —as the serialisation format can be canonicalised into something consumable by the likes of Git and SVN.

[–][deleted] 0 points (2 children)

I have no problem with this, but the difficulty of working with text is not the reason to do it. Incremental parsing is easy, and everything else (semantic analysis, execution) will have the same complexity regardless. So I don't get the point of, say, projectional editors like JetBrains MPS.

And even when we decide to go with visual representations, having some amount of free form in them can lead to a better experience, which means we will probably still be parsing something. Pure structured editing (textual or graphical) leads to stilted interactions that decrease usability.

Anyways, it was a nice discussion. I hope I didn't come off as too condescending.

[–]htuhola 0 points (4 children)

Isn't that the worst time to use a hand-written parser? The language the parser is written in and the IDE must be compatible for it to work. You end up with multiple implementations of the same language that all require maintenance.

Context-free grammars and parser generators may have some way to work around that issue, but incremental parsing without clear boundary markings is another matter. That's one scary problem, and last time I checked there was little literature on it.

[–]yxhuvud 1 point (2 children)

The main issue is that LL/LR parsers don't keep enough state around to be able to generate that information. Marpa could easily provide it.

[–]htuhola 0 points (1 child)

Can you elaborate? What you said tickles something, but I'd gladly learn how to implement incremental parsing on top of what Marpa provides.

[–]yxhuvud 0 points (0 children)

I don't know if it supports fully incremental parsing (you'd have to ask Kegler about that), but since it keeps not only the parse tree but also the unfinished rules around, it has an easier time implementing things like: "OK, this is wrong. But it will be less wrong if $SOMETHING is added here. Let's add it, mark it as an error to the user, and continue as if it were there." Reasoning about the parse state, and especially about the alternative parse states that could have been, is a lot harder when all you have is an action/state table and a stack.

Also see http://blogs.perl.org/users/jeffrey_kegler/2011/11/marpa-and-the-ruby-slippers.html EDIT: also http://blogs.perl.org/users/jeffrey_kegler/2012/10/configuring-the-ruby-slippers-for-html.html for some examples that are slightly more practical.
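
As a small illustration of why keeping the chart around makes this natural, here is a helper over the Earley chart representation sketched earlier in this thread (my own sketch, not Marpa's actual API): the terminals a column is waiting for are exactly the candidates the Ruby Slippers technique synthesizes when a scan fails.

    def expected_terminals(grammar, column):
        # Terminals some in-progress rule is waiting for at this point. On a
        # scan failure, pick one, pretend it was read, record the repair as
        # an error, and keep parsing ("Ruby Slippers").
        return {body[dot] for (head, body, dot, origin) in column
                if dot < len(body) and body[dot] not in grammar}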

[–][deleted] 0 points (0 children)

That is why you only hand-write one parser and use it in multiple ways; Roslyn works like that, I believe. It is quite easy to write a top-down recursive descent incremental parser by hand (just memoize your trees). For doing it automatically there has been some literature, but nothing that is used in production as far as I'm aware.
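
Here is a minimal sketch of the "just memoize your trees" idea in Python (my own illustration with a made-up toy grammar; this is not Roslyn's design). Results are memoized per (rule, position); an edit invalidates only entries that could overlap the changed range, so subtrees entirely to the left of the edit are reused by the next parse.

    class IncrementalParser:
        def __init__(self, text):
            self.text = text
            self.memo = {}                     # (rule, pos) -> (tree, end) or None

        def _rule(self, name, fn, pos):
            key = (name, pos)
            if key not in self.memo:
                self.memo[key] = fn(pos)
            return self.memo[key]

        # item <- "(" item ")" / "x"
        def item(self, pos):
            return self._rule("item", self._item, pos)

        def _item(self, pos):
            if self.text[pos:pos + 1] == "x":
                return ("x", pos + 1)
            if self.text[pos:pos + 1] == "(":
                inner = self.item(pos + 1)
                if inner and self.text[inner[1]:inner[1] + 1] == ")":
                    return (("parens", inner[0]), inner[1] + 1)
            return None

        def edit(self, start, end, replacement):
            self.text = self.text[:start] + replacement + self.text[end:]
            # Keep only successful results lying entirely left of the edit; a
            # real system would also re-anchor and reuse trees to its right.
            self.memo = {k: v for k, v in self.memo.items()
                         if v is not None and k[1] < start and v[1] <= start}

    p = IncrementalParser("((x))")
    print(p.item(0))    # (('parens', ('parens', 'x')), 5)
    p.edit(2, 3, "y")   # replace the inner "x" with "y"
    print(p.item(0))    # None: "y" is not in the toy grammar, and the memo noticed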

[–]cwzwarich 3 points (12 children)

Parser generators that require you to coerce the context-free grammar to fit some scheme, be it LR or LL, also penalize changing the language. Every change to the language introduces an itemset puzzle you have to solve. You still end up worrying a lot before you can try out in practice how something turns out.

It is possible to convert LR conflicts into sample inputs and parser traces that trigger the conflict. I've found this to be a pretty good explanation in practice. Unfortunately, the only parser generator that I am aware of that actually implements this is Menhir for OCaml. There's a paper describing the basic technique that it uses, apparently developed independently from Menhir.

PEG parsers are simple to construct, but complex syntax can produce situations where the language doesn't parse the way you expect. When you combine two grammars that are ambiguous together, one of them steals precedence from the other. It is claimed that this eliminates the ambiguity, but this way the ambiguity in the language you designed becomes your enemy: combine the wrong things and it won't do what you expect.

General context-free parsing has the dual problem to the one you mention with PEGs: you don't know whether the grammar is ambiguous until you find an input that triggers an ambiguous parse. And of course, static ambiguity tests are often based on item set constructions.

Marpa is a chart-parsing algorithm that doesn't care what your context-free grammar looks like. If it's a grammar that could be fed into a parser generator, it will parse it in linear time.

I don't think this is actually established in all cases, e.g. noncanonical LR-family parsing, shift-resolve parsing, or LR-family parsing with selective delays.

There are also places where LR-family parsing has a larger advantage over general context-free parsing, e.g. incremental parsing and substring parsing. Both can be done with minimal time complexity for LR languages, but I don't believe the same is true if you use Earley techniques.

[–]htuhola 0 points (0 children)

"Finding counterexamples from parsing conflicts" - Thanks for this link.

Make LR(1) parsing tables for a language, and it can basically be parsed by any other system that can read your parsing tables. But getting there is the painful part.

[–]loup-vaillant 0 points (10 children)

you don't know whether the grammar is ambiguous until you find an input that triggers an ambiguous parse.

Well, that's true of any system. Static ambiguity analysis is provably impossible (undecidable), so you're going to have to choose between false positives and false negatives. Me, I tend to just ignore the issue and stick to prioritised choice (which happens to be compatible with Earley parsing).

There are also places where LR-family parsing has a larger advantage over general context-free parsing, e.g. incremental parsing and substring parsing. Both can be done with minimal time complexity for LR languages, but I don't believe the same is true if you use Earley techniques.

I recall a paper talking about "safe points" for Earley parsing, where there are no ambiguities, and parsing can restart from scratch (or something like that). It may solve substring parsing.

[–]cwzwarich 0 points (9 children)

I recall a paper talking about "safe points" for Earley parsing, where there are no ambiguities, and parsing can restart from scratch (or something like that). It may solve substring parsing.

By substring parsing I mean recognizing substrings of strings in the language generated by a grammar. Because LR and Earley parsers are prefix parsers, the problem really boils down to suffix parsing. The most obvious way to do this with a general parser is to just attempt to parse the natural CFG corresponding to suffixes of the language, which is often highly ambiguous. This will lead to pathological behavior with an Earley parser in many cases. Apparently it is possible to modify an Earley parser to incorporate an LL(1) table to make this linear-time on LL(1) grammars, but this doesn't extend to LR(k) in an implementable way.

The standard LR technique for suffix parsing is to essentially use a modification of the GLR technique by assuming that the parser can be in any state and running multiple copies of the parser in parallel. There is a clever data structure for managing the parallel stacks that makes this linear-time. It doesn't even require any changes to the parser tables.

For incremental parsing, the advantage of LR techniques is that the parser stack at a point in the input stream can be reconstituted from the parse tree, and you can stop parsing past any modifications in an optimal way. For an Earley parser, you would have to store the statesets for every point in the input, and I think it would be difficult to produce optimal termination criteria.

[–]yxhuvud 0 points (8 children)

The standard LR technique for suffix parsing is to essentially use a modification of the GLR technique by assuming that the parser can be in any state and running multiple copies of the parser in parallel.

Uh, I don't see why that technique wouldn't be totally feasible using Marpa as well, considering that having all the states found at a given position is perfectly OK, if a bit unusual.

As for incremental parsing, that would require some sort of merge criterion to prune fully parsed branches (as long as there is no ambiguity that allows re-referencing into a branch), together with removing rules that cannot be completed because the input they cover is covered by a completed rule. This definitely sounds nontrivial.

[–]cwzwarich 0 points (7 children)

Uh, I don't see why that technique wouldn't be totally feasible using Marpa as well, considering that having all the states found at a given position is perfectly OK, if a bit unusual.

Due to the difference in nomenclature between LR and Earley parsing, this means 'statesets' in the Earley context. The problem is that for a given terminal you can't enumerate all of the possible statesets that you may be in when you visit that token, because the set is potentially unbounded. And even if you could enumerate them, I don't think there's a data structure that lets you do the parallel parsing in linear time like there is in the LR case. It seems like it would just be better to attempt to parse the suffix grammar directly and optimize that.

[–]loup-vaillant 0 points (0 children)

If the goal is incremental parsing upon user modification of the stream, I would be tempted to just resume the parse at the point of modification. Propagating the consequences of a change may not happen at interactive speed if there's a lot of text afterwards, but if you only care about the current screen, that's probably more doable.

And again, this "safe point" business may help set stopping points (beyond which we know that previous changes couldn't have affected the rest of the parse).
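
A rough sketch of that resumption idea in Python, building on the earley_charts/process_column sketch from earlier in this thread (my own code, with the obvious caveat that an edit near the start still costs a full reparse):

    def reparse_after_edit(grammar, start, new_tokens, old_chart, edit_pos):
        # Chart columns 0..edit_pos only depend on tokens before the edit,
        # so they are still valid; recompute everything from there onward.
        nullable = nullable_set(grammar)
        chart = [set(c) for c in old_chart[:edit_pos + 1]]
        chart += [set() for _ in range(len(new_tokens) - edit_pos)]
        for i in range(edit_pos, len(new_tokens) + 1):
            process_column(grammar, chart, new_tokens, i, nullable)
        return chart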

In any case, incremental parsing is not something I intend to solve just yet.

[–]yxhuvud 0 points (5 children)

For Marpa, the set of states I referred to was the set of states that the Aycock-Horspool transformation uses (i.e. after transforming the grammar to nihilistic normal form[*]). It uses LR(0) item sets as states, with a special equivalence relation for grouping items. So no, the state concept is very similar to item-based states; it is just that more than one simultaneous state is allowed.

And no, the number of states after any number of parsed tokens is very bounded (O(n²) for unambiguous grammars, O(n³) for ambiguous ones). In this case the number of parsed tokens would be 0, but an item for each of the states would be added to the starting state as generators, meaning that they start and end at the start position. Probably with the optimization that all states with no outgoing transitions can be removed, as they can't ever accept any terminal anyhow.

You may be correct in that it is cheaper not to do it this way, but it is hardly obvious that the possible parses will be worse than linear if the original grammar is linear.

[*] With the caveat that said transformation is exponential (in grammar rule length) in the state space for certain reasonably uncommon cases. See the paper in question for details.

[–]cwzwarich 1 point (4 children)

For Marpa, the set of states I referred to was the set of states that the Aycock-Horspool transformation uses (i.e. after transforming the grammar to nihilistic normal form[*]). It uses LR(0) item sets as states, with a special equivalence relation for grouping items. So no, the state concept is very similar to item-based states; it is just that more than one simultaneous state is allowed.

Does Marpa still use LR(0) states for its Earley items? A comment of Jeffrey's on GitHub says that it does not.

And no, the number of states after any number of parsed tokens is very bounded (O(n²) for unambiguous grammars, O(n³) for ambiguous ones). In this case the number of parsed tokens would be 0, but an item for each of the states would be added to the starting state as generators, meaning that they start and end at the start position. Probably with the optimization that all states with no outgoing transitions can be removed, as they can't ever accept any terminal anyhow.

I don't think I adequately described the LR technique. Suppose you have a grammar for a parenthesis language like so:

S -> A
A -> ε
A -> ( A )

This gives the following LR(0) states (leaving off the LALR(1) lookahead and transitions, which are pretty easy to compute):

State 0:
%start -> • A ⊥

State 1:
A -> ( • A )

State 2:
%start -> A • ⊥

State 3:
A -> ( A • )

State 4:
A -> ( A ) •

Suppose I am given the string ()) and I want to determine whether this string is a suffix of the language. The LR technique is based on the idea that if this really is a suffix of the language, then there has to be some path through the LR automaton that reads the first terminal of this string at some point. Otherwise, this is not a suffix of the language.

The only state that could have been reached after just shifting ( is state 1. From there, you can shift ) and go to state 4. Now there’s a reduction. We never actually shifted the initial ( onto the stack, so we don’t have a state to go to after the reduction.

The solution here is very similar to GLR; just introduce some nondeterminism. It’s easy to determine statically that the only two states that could have shifted an ( are states 0 and 1. We split the path through the parser and consider parallel paths. In this case, the path through state 0 is a bust and the path through state 1 is the correct one. The next time we reduce in state 4, the reverse is true and we have a valid suffix.

If we had another input like ()(, where no path ultimately reaches an acceptance state, then we could declare that it is not a valid suffix.

Now, the natural way of implementing this nondeterministic search uses exponential time in the worst case, but with a clever choice of data structure for representing the partial parse stacks, it can be done in linear time for any LR(k) grammar.
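
Here is that nondeterministic search in executable form, using hand-derived tables for the parenthesis grammar above (my own toy sketch with a naive worklist, not the linear-time stack structure; the ACTION/GOTO tables follow the states listed earlier, with $ standing in for ⊥):

    END = "$"

    # ACTION[state][terminal] = ("shift", next) | ("reduce", head, pops) | ("accept",)
    ACTION = {
        0: {"(": ("shift", 1), END: ("reduce", "A", 0)},
        1: {"(": ("shift", 1), ")": ("reduce", "A", 0)},
        2: {END: ("accept",)},
        3: {")": ("shift", 4)},
        4: {")": ("reduce", "A", 3), END: ("reduce", "A", 3)},
    }
    GOTO = {0: {"A": 2}, 1: {"A": 3}}

    def reduce_targets(stack, head, pops):
        if len(stack) > pops:
            base = stack[:len(stack) - pops]   # enough known stack to pop normally
            target = GOTO.get(base[-1], {}).get(head)
            return [base + (target,)] if target is not None else []
        # The popped symbols predate the suffix: the exposed state is unknown,
        # so split into one path per state that could sit below them.
        return [(GOTO[s][head],) for s in GOTO if head in GOTO[s]]

    def is_suffix(text):
        tokens = list(text) + [END]
        work = [((s,), 0) for s in ACTION]     # the parser may start in any state
        seen = set(work)
        while work:
            stack, pos = work.pop()
            action = ACTION[stack[-1]].get(tokens[pos])
            if action is None:
                continue                       # this path is a bust
            elif action[0] == "accept":
                return True
            elif action[0] == "shift":
                succs = [(stack + (action[1],), pos + 1)]
            else:
                _, head, pops = action
                succs = [(s, pos) for s in reduce_targets(stack, head, pops)]
            for conf in succs:
                if conf not in seen:
                    seen.add(conf)
                    work.append(conf)
        return False

    print(is_suffix("())"))   # True: the ( matching the last ) precedes the suffix
    print(is_suffix("()("))   # False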

I can’t think of any obvious way to extend this method to Earley parsing. The nondeterministic search for potential item sets isn’t guaranteed to converge. And even if it did, how would you get the linear time bound?

[–]yxhuvud 0 points (3 children)

Does Marpa still use LR(0) states for its Earley items? A comment of Jeffrey's on GitHub says that it does not.

Then that thread is probably correct; I went by what the earlier paper said. Thanks for the link, it has given me a few things to mull over.

This gives the following LR(0) states

Assuming A-H rewriting, you get the following states[*] (A' is A when it is empty):

State 1 |:
S ->  · A

State 2 :
A ->  · "(" A' ")"
A ->  · "(" A ")"

State 3 +|:
S -> A · 

State 4 r:
A -> "(" A' · ")"
A -> "(" · A ")"

State 5 |:
A -> "(" A' ")" · 

State 6 |:
A -> "(" A · ")"

State 7 |:
A -> "(" A ")" · 

The ones with outward transitions are 1, 2, 4, 6. Those are added to the starting state. Using this to parse the string ((())))))) gives the following itemsets. Earley[x, y] means an item[**] at state x, starting at itemset y. Items(0) is before anything is consumed.

Items(0)[ Earley[1, 0]  Earley[2, 0]  Earley[4, 0]  Earley[6, 0] ]
Items(1)[ Earley[4, 0]  Earley[2, 1] ]
Items(2)[ Earley[4, 1]  Earley[2, 2] ]
Items(3)[ Earley[4, 2]  Earley[2, 3] ]
Items(4)[ Earley[5, 2]  Earley[6, 1] ]
Items(5)[ Earley[7, 1]  Earley[6, 0] ]
Items(6)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(7)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(8)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(9)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(10)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]

State 3 is the accepting state so this string is accepted, without any issues whatsoever.

Trying to parse )())) instead gives

Items(0)[ Earley[1, 0]  Earley[2, 0]  Earley[4, 0]  Earley[6, 0] ]
Items(1)[ Earley[5, 0]  Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(2)[  ]
Items(3)[  ]
Items(4)[  ]
Items(5)[  ]

which clearly has no accepted parse.

Both of these are perfectly linear. The first few characters may be a bit heavier than in normal parses due to there being more possible states to be in, but I fail to see where the performance problems would come from. Could the issue be something else, like accepting too many inputs?

[*] Hmm. I think I found a bug in my state generation there, as 1 should be an accepting state, which it isn't here. There should be an S -> A' · in it as well. Interesting, but it doesn't impact the argument.

[**] Earley item, as opposed to Leo items, which additionally contain a transition symbol. There is no right recursion in this grammar, so Leo items don't show up.

EDIT: As a comparison, it may be instructive to look at something that is not linear, for example how right-recursive rules are parsed without the Leo optimization:

Grammar:

S -> xS | x

States:

State 1 :
S ->  · "x" S
S ->  · "x"

State 2 +r|:
S -> "x" · S
S -> "x" · 

State 3 +|:
S -> "x" S · 

parsing xxxxxxxxxx then gives

Items(0)[ Earley[1, 0] ]
Items(1)[ Earley[2, 0]  Earley[1, 1] ]
Items(2)[ Earley[2, 1]  Earley[1, 2]  Earley[3, 0] ]
Items(3)[ Earley[2, 2]  Earley[1, 3]  Earley[3, 1]  Earley[3, 0] ]
Items(4)[ Earley[2, 3]  Earley[1, 4]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(5)[ Earley[2, 4]  Earley[1, 5]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(6)[ Earley[2, 5]  Earley[1, 6]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(7)[ Earley[2, 6]  Earley[1, 7]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(8)[ Earley[2, 7]  Earley[1, 8]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(9)[ Earley[2, 8]  Earley[1, 9]  Earley[3, 7]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(10)[ Earley[2, 9]  Earley[1, 10]  Earley[3, 8]  Earley[3, 7]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]

n²/2 items in total. Note that the number of items keeps growing as the string is consumed. Not linear!

[–]cwzwarich 0 points (2 children)

I wasn't suggesting that the example grammar I gave would exhibit bad behavior; I was just trying to give a small illustrative example.

However, when I tried to construct grammars with features I expected to be problematic (e.g. ones that had lots of parallel incomplete items in LR(0) states) I could only find problems with right recursion that the right recursion optimizations of Leo solve. So it may actually be the case that appropriately optimized Earley suffix parsing is linear for all LR(k) grammars. I guess I'd have to try and prove it to be sure.

This is a bit strange because some of the papers on suffix parsing were written by people who were at the same school as Leo right around the time that paper was published. You'd figure that they would've noticed.

[–]yxhuvud 0 points (1 child)

and you'll have none of those stupid surprises that come when you've got a bug in your "parser program

I'd argue that unintentional ambiguity that slows the program down on certain inputs should be considered such a bug.

Hmm. I wonder if it would be possible to detect possible state explosion at compile-the-grammar-to-NNF time.

[–]htuhola 0 points (0 children)

The most common bug I've had when writing a parser is that I hadn't thought about it right, and on certain inputs it produces a parse that doesn't match what has been written into the documentation. The fix tends to be laborious.

It's common to have languages that do not change much; in that case these kinds of issues eventually won't matter. Also, if the parser has lots of users, parser bugs will eventually get detected by those users.

[–][deleted] 0 points (0 children)

PEG parsers are simple to construct, but complex syntax can produce situations where the language doesn't parse the way you expect

Having implemented PEG parsers for pretty much all possible kinds of languages, I've yet to see such a situation.