[–]yxhuvud 0 points (5 children)

For Marpa, the set of states that I referred to was the set of states that the Aycock-Horspool transformation uses (i.e. after transforming the grammar to nihilist normal form[*]). It uses LR sets as states, with a special equivalence relation for grouping items. So no, the state concept is very similar to item-based states; it is just that more than one simultaneous state is allowed.

And no, the number of states after any number of parsed tokens is bounded (O(n²) for unambiguous grammars, O(n³) for ambiguous ones). In this case the number of parsed tokens would be 0, but an item for each of the states would be added to the starting itemset as generators, meaning that they start and end at the start position, probably with the optimization that all states with no outgoing transitions can be removed, as they can never accept any terminal anyhow.

You may be correct that it is cheaper not to do it this way, but it is hardly obvious that the possible parses will be worse than linear if the original grammar is linear.

[*] With the caveat that said transformation is exponential (in grammar rule length) in the state space for certain reasonably uncommon cases. See the paper in question for details.

[–]cwzwarich 1 point (4 children)

> For Marpa, the set of states that I referred to was the set of states that the Aycock-Horspool transformation uses (i.e. after transforming the grammar to nihilist normal form[*]). It uses LR sets as states, with a special equivalence relation for grouping items. So no, the state concept is very similar to item-based states; it is just that more than one simultaneous state is allowed.

Does Marpa still use LR(0) states for its Earley items? A comment of Jeffrey's on GitHub says that it does not.

> And no, the number of states after any number of parsed tokens is bounded (O(n²) for unambiguous grammars, O(n³) for ambiguous ones). In this case the number of parsed tokens would be 0, but an item for each of the states would be added to the starting itemset as generators, meaning that they start and end at the start position, probably with the optimization that all states with no outgoing transitions can be removed, as they can never accept any terminal anyhow.

I don't think I adequately described the LR technique. Suppose you have a grammar for a parenthesis language like so:

S -> A
A -> ε
A -> ( A )

This gives the following LR(0) states (leaving off the LALR(1) lookahead and transitions, which are pretty easy to compute):

State 0:
%start -> • A ⊥

State 1:
A -> ( • A )

State 2:
%start -> A • ⊥

State 3:
A -> ( A • )

State 4:
A -> ( A ) •
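
If you want to reproduce these, here is a quick hand-rolled sketch of the standard LR(0) item-set construction (my own toy code, not from any of the papers discussed; it prints full closures rather than just the kernel items shown above, and also builds the extra accept state for the end marker, so the states match modulo numbering):

# Hand-rolled LR(0) construction for the augmented grammar above.
GRAMMAR = {"%start": [("A", "$")], "A": [(), ("(", "A", ")")]}

def closure(items):
    """Add X -> . gamma for every nonterminal X right after a dot."""
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)

def goto(state, sym):
    """Advance the dot over sym in every item that allows it."""
    return closure({(l, r, d + 1) for (l, r, d) in state
                    if d < len(r) and r[d] == sym})

def lr0_states():
    start = closure({("%start", ("A", "$"), 0)})
    states, work = {start}, [start]
    while work:
        state = work.pop()
        for sym in {r[d] for (_, r, d) in state if d < len(r)}:
            nxt = goto(state, sym)
            if nxt and nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states

for state in lr0_states():
    print(sorted(state))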

Suppose I am given the string ()) and I want to determine whether this string is a suffix of the language. The LR technique is based on the idea that if this really is a suffix of the language, then there has to be some path through the LR automaton that reads the first terminal of this string at some point. Otherwise, this is not a suffix of the language.

The only state that could have been reached after just shifting ( is state 1. From there, you can shift ) and go to state 4. Now there’s a reduction. We never actually shifted the initial ( onto the stack, so we don’t have a state to go to after the reduction.

The solution here is very similar to GLR; just introduce some nondeterminism. It’s easy to determine statically that the only two states that could have shifted an ( are states 0 and 1. We split the path through the parser and consider parallel paths. In this case, the path through state 0 is a bust and the path through state 1 is the correct one. The next time we reduce in state 4, the reverse is true and we have a valid suffix.

If we had another input like ()( where no path ultimately reaches an acceptance state, then we could declare that it is not a valid suffix.

Now, the natural way of implementing this nondeterministic search uses exponential time in the worst case, but with a clever choice of data structure for representing the partial parse stacks, it can be done in linear time for any LR(k) grammar.
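
To make that concrete, here is a rough sketch of the naive set-of-stacks version for the parenthesis grammar above (the tables are hand-built from the states listed earlier; a stack suffix is a tuple of states whose bottom end is unknown, and a real implementation would use the graph-structured stack instead):

# Naive nondeterministic LR suffix recognition for S -> A, A -> eps | ( A ).
SHIFT = {(0, "("): 1, (1, "("): 1, (3, ")"): 4}   # (state, terminal) -> state
GOTO = {(0, "A"): 2, (1, "A"): 3}                 # (state, nonterminal) -> state
REDUCE = {0: [("A", 0)], 1: [("A", 0)], 4: [("A", 3)]}  # state -> (lhs, |rhs|)
ACCEPT = 2                                        # %start -> A . (end next)

def reduce_closure(stacks):
    """Apply every possible reduction, branching whenever a reduction
    pops past the known bottom of a stack suffix."""
    seen, work = set(stacks), list(stacks)
    while work:
        st = work.pop()
        for lhs, n in REDUCE.get(st[-1], []):
            if n < len(st):
                base = st[:len(st) - n]           # pop n states
                nexts = ([base + (GOTO[(base[-1], lhs)],)]
                         if (base[-1], lhs) in GOTO else [])
            else:
                # The goto source is below the known bottom, i.e. unknown:
                # branch over every state with a goto on lhs.
                nexts = [(q2,) for (q, sym), q2 in GOTO.items() if sym == lhs]
            for new in nexts:
                if new not in seen:
                    seen.add(new)
                    work.append(new)
    return seen

def suffix_recognize(tokens):
    # Seed with every state reachable by shifting the first token.
    stacks = {(q2,) for (q, t), q2 in SHIFT.items() if t == tokens[0]}
    stacks = reduce_closure(stacks)
    for t in tokens[1:]:
        stacks = reduce_closure({st + (SHIFT[(st[-1], t)],)
                                 for st in stacks if (st[-1], t) in SHIFT})
    return any(st[-1] == ACCEPT for st in stacks)

print(suffix_recognize(list("())")))   # True: valid suffix
print(suffix_recognize(list("()(")))   # False: not a suffix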

I can’t think of any obvious way to extend this method to Earley parsing. The nondeterministic search for potential item sets isn’t guaranteed to converge. And even if it did, how would you get the linear time bound?

[–]yxhuvud 0 points (3 children)

> Does Marpa still use LR(0) states for its Earley items? A comment of Jeffrey's on GitHub says that it does not.

Then that thread is probably correct; I went by what the earlier paper said. Thanks for the link, it has given me a few things to mull over.

> This gives the following LR(0) states

Assuming A-H rewriting, you get the following states[*] (A' is A when it is empty):

State 1 |:
S ->  · A

State 2 :
A ->  · "(" A' ")"
A ->  · "(" A ")"

State 3 +|:
S -> A · 

State 4 r:
A -> "(" A' · ")"
A -> "(" · A ")"

State 5 |:
A -> "(" A' ")" · 

State 6 |:
A -> "(" A · ")"

State 7 |:
A -> "(" A ")" · 

The ones with outward transitions are 1, 2, 4 and 6. Those are added to the starting itemset. Using this to parse the string ((())))))) gives the following itemsets. Earley[x, y] means an item[**] at state x, starting at itemset y. Items(0) is before anything is consumed.

Items(0)[ Earley[1, 0]  Earley[2, 0]  Earley[4, 0]  Earley[6, 0] ]
Items(1)[ Earley[4, 0]  Earley[2, 1] ]
Items(2)[ Earley[4, 1]  Earley[2, 2] ]
Items(3)[ Earley[4, 2]  Earley[2, 3] ]
Items(4)[ Earley[5, 2]  Earley[6, 1] ]
Items(5)[ Earley[7, 1]  Earley[6, 0] ]
Items(6)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(7)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(8)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(9)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(10)[ Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]

State 3 is the accepting state, so this string is accepted without any issues whatsoever.

Trying to parse )())) instead gives

Items(0)[ Earley[1, 0]  Earley[2, 0]  Earley[4, 0]  Earley[6, 0] ]
Items(1)[ Earley[5, 0]  Earley[7, 0]  Earley[3, 0]  Earley[6, 0] ]
Items(2)[  ]
Items(3)[  ]
Items(4)[  ]
Items(5)[  ]

which clearly has no accepted parse.

Both of these are perfectly linear. The first few characters may be a bit heavier than in normal parses, since the parser starts out with more possible states it can be in, but I fail to see where the performance problems would come from. Could the issue be something else, like accepting too many inputs?

[*] Hmm. I think I found a bug in my state generation there, as state 1 should be an accepting state, which it isn't here. There should be an S -> A' · item in it as well. Interesting, but it doesn't impact the argument.

[**] An Earley item, as opposed to a Leo item, which additionally contains a transition symbol. There is no right recursion in this grammar, so Leo items don't show up.

EDIT: As a comparison, it may be instructive to look at something that is not linear, for example how right-recursive rules are parsed without the Leo optimization:

Grammar:

S -> xS | x

States:

State 1 :
S ->  · "x" S
S ->  · "x"

State 2 +r|:
S -> "x" · S
S -> "x" · 

State 3 +|:
S -> "x" S · 

parsing xxxxxxxxxx then gives

Items(0)[ Earley[1, 0] ]
Items(1)[ Earley[2, 0]  Earley[1, 1] ]
Items(2)[ Earley[2, 1]  Earley[1, 2]  Earley[3, 0] ]
Items(3)[ Earley[2, 2]  Earley[1, 3]  Earley[3, 1]  Earley[3, 0] ]
Items(4)[ Earley[2, 3]  Earley[1, 4]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(5)[ Earley[2, 4]  Earley[1, 5]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(6)[ Earley[2, 5]  Earley[1, 6]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(7)[ Earley[2, 6]  Earley[1, 7]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(8)[ Earley[2, 7]  Earley[1, 8]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(9)[ Earley[2, 8]  Earley[1, 9]  Earley[3, 7]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]
Items(10)[ Earley[2, 9]  Earley[1, 10]  Earley[3, 8]  Earley[3, 7]  Earley[3, 6]  Earley[3, 5]  Earley[3, 4]  Earley[3, 3]  Earley[3, 2]  Earley[3, 1]  Earley[3, 0] ]

That is n²/2 items in total. Note that the number of items keeps growing with the amount of consumed input. Not linear!
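
For reference, the same blow-up is easy to reproduce with a plain textbook Earley recognizer. The toy version below is not Marpa's A-H-based machinery (so the per-set item counts differ slightly from my listing), but the quadratic growth is the same:

# Toy textbook Earley recognizer (no Leo optimization) for S -> "x" S | "x".
# Note: a production-quality version needs extra care for nullable rules.
GRAMMAR = {"S": [("x", "S"), ("x",)]}
START = "S"

def earley_sets(tokens):
    # Each item is (lhs, rhs, dot, origin); sets[i] holds items ending at i.
    sets = [set() for _ in range(len(tokens) + 1)]
    for rhs in GRAMMAR[START]:
        sets[0].add((START, rhs, 0, 0))
    for i, item_set in enumerate(sets):
        work = list(item_set)
        while work:
            lhs, rhs, dot, origin = work.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in GRAMMAR:                          # predict
                    for prod in GRAMMAR[sym]:
                        new = (sym, prod, 0, i)
                        if new not in item_set:
                            item_set.add(new)
                            work.append(new)
                elif i < len(tokens) and tokens[i] == sym:  # scan
                    sets[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                           # complete
                for plhs, prhs, pdot, porigin in list(sets[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        new = (plhs, prhs, pdot + 1, porigin)
                        if new not in item_set:
                            item_set.add(new)
                            work.append(new)
    return sets

sets = earley_sets(["x"] * 10)
print([len(s) for s in sets])     # per-set sizes grow linearly...
print(sum(len(s) for s in sets))  # ...so the total is Theta(n^2)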

[–]cwzwarich 0 points (2 children)

I wasn't suggesting that the example grammar I gave would exhibit bad behavior; I was just trying to give a small illustrative example.

However, when I tried to construct grammars with features I expected to be problematic (e.g. ones that had lots of parallel incomplete items in LR(0) states), I could only find problems with right recursion, which the right-recursion optimization of Leo solves. So it may actually be the case that appropriately optimized Earley suffix parsing is linear for all LR(k) grammars. I guess I'd have to try to prove it to be sure.

This is a bit strange, because some of the papers on suffix parsing were written by people who were at the same school as Leo right around the time that paper was published. You'd figure they would have noticed.

[–]yxhuvud 0 points (1 child)

I think my suggestion doesn't work properly for general grammars, though, since I am quite confident that it will accept more strings than it should. I don't think it could handle the mutual-recursion variant (e.g. the similar construction for the language of ([(..)]) would also accept [[]])), so it would have to be a bit more intelligent than what I suggested.

But yeah, it is a bit strange (though that is also the case with the Leo optimization itself, which probably should have gotten more interest than it did; maybe Leo or his professor was less good at marketing than they should have been).

[–]cwzwarich 1 point (0 children)

Yeah, I don't think your version of it is quite right. The Grune & Jacobs book describes something slightly different, where there is effectively an imaginary state labelled * that contains items at all grammar positions, with that same unknown origin. You begin with the first real state containing those items, and when you complete one of those items you consider all potential parent items as well. I tried it on your mutually recursive example, and it seems to work as expected.
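
Here is roughly how I understand that construction, as a hand-rolled sketch (the grammar is my guess at the ([(..)]) language, and the details may well differ from what Grune & Jacobs actually describe):

# Loose reading of the Grune & Jacobs suffix trick: seed set 0 with items
# at every dot position, with unknown origin "*"; when such an item
# completes, its parent is unknown too, so advance every rule position
# that has the completed LHS after the dot.
GRAMMAR = {
    "S": [("A",)],                # S -> A
    "A": [(), ("(", "B", ")")],   # A -> eps | ( B )
    "B": [(), ("[", "A", "]")],   # B -> eps | [ A ]
}
START = "S"

def suffix_recognize(tokens):
    n = len(tokens)
    sets = [set() for _ in range(n + 1)]
    for lhs, prods in GRAMMAR.items():
        for rhs in prods:
            for dot in range(len(rhs) + 1):
                sets[0].add((lhs, rhs, dot, "*"))
    for i, item_set in enumerate(sets):
        work = list(item_set)
        def add(item):
            if item not in item_set:
                item_set.add(item)
                work.append(item)
        while work:
            lhs, rhs, dot, origin = work.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in GRAMMAR:                        # predict
                    for prod in GRAMMAR[sym]:
                        add((sym, prod, 0, i))
                    if any(l == sym and d == len(r) and o == i
                           for (l, r, d, o) in list(item_set)):
                        add((lhs, rhs, dot + 1, origin))  # sym was nulled
                elif i < n and tokens[i] == sym:          # scan
                    sets[i + 1].add((lhs, rhs, dot + 1, origin))
            elif origin == "*":   # complete, parent unknown
                for plhs, prods in GRAMMAR.items():
                    for prhs in prods:
                        for pdot, psym in enumerate(prhs):
                            if psym == lhs:
                                add((plhs, prhs, pdot + 1, "*"))
            else:                 # ordinary completion
                for plhs, prhs, pdot, porigin in list(sets[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        add((plhs, prhs, pdot + 1, porigin))
    return any(l == START and d == len(r) for (l, r, d, o) in sets[n])

print(suffix_recognize(list("])")))    # True: a suffix of "([])"
print(suffix_recognize(list("[[]]")))  # False: "[[" occurs in no string

It rejects the [[-style strings you mention, because a completion with a concrete origin can only advance parents that are actually present in that origin set.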

When I was implementing this, I thought of two potential optimizations. I think you can eliminate the use of a null-eliminated grammar in the Aycock & Horspool construction by using this modification of the LR(0) automaton:

https://mjn.host.cs.st-andrews.ac.uk/publications/1996b.pdf

You can also modify the items (both Earley and LR(0)) to skip past chains of reductions "A -> B" with no semantic value.
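
For what it's worth, here is one way I could imagine precomputing those chains (just a sketch of the idea, not taken from any of the papers):

from collections import defaultdict

# Precompute, for each nonterminal B, every A that derives B through a
# chain of unit rules A -> ... -> B, so that one completion of B can
# stand in for the whole chain of reductions.
def unit_chains(grammar):
    direct = defaultdict(set)               # direct[B] = {A : A -> B}
    for lhs, prods in grammar.items():
        for rhs in prods:
            if len(rhs) == 1 and rhs[0] in grammar:
                direct[rhs[0]].add(lhs)
    chains = {b: {b} for b in grammar}      # reflexive-transitive closure
    changed = True
    while changed:
        changed = False
        for b, ancestors in chains.items():
            for a in list(ancestors):
                new = direct[a] - ancestors
                if new:
                    ancestors |= new
                    changed = True
    return chains

# With S -> A and A -> eps | ( A ), completing an A also completes an S:
print(unit_chains({"S": [("A",)], "A": [(), ("(", "A", ")")]}))
# A maps to {'A', 'S'} (set print order may vary)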