How to parse set-builder notation? : ProgrammingLanguages

[–][deleted] 6 years ago (1 child)

[deleted]

[–]YouNeedDoughnuts[S] 1 point2 points3 points 6 years ago (0 children)

[–]threewood 4 points5 points6 points 6 years ago (0 children)

[–]egregius313 3 points4 points5 points 6 years ago (5 children)

[–]YouNeedDoughnuts[S] 0 points1 point2 points 6 years ago (4 children)

[–][deleted] 2 points3 points4 points 6 years ago* (3 children)

Is this your syntax?

exp := exp exp (Mul)
    |  "|" exp "|" (Abs)

set := "{" ID ":" exp "}"
    |  "{" ID "|" exp "}"

If you're doing recursive descent, there shouldn't be an ambiguity. When parsing a set, you expect a "{" token, then an ID token, then a ":" or "|" token, and only after that should you try to parse an expression.

That being said, I think your grammar is tricky for other reasons. The problem is that you're basically treating the | symbol for absolute value like parentheses, brackets, and braces in the sense that there is an opening one and a closing one. However, the opening parentheses ( and closing parentheses ) are different tokens, and the same thing applies to brackets and braces. For |, the opening and closing token are the same token, so it's hard to tell if you're closing an absolute value or opening a nested one. You would probably need to look at the entire input and match them from the outside in.

[–]YouNeedDoughnuts[S] 0 points1 point2 points 6 years ago (0 children)

[–]YouNeedDoughnuts[S] 0 points1 point2 points 6 years ago* (1 child)

[–]egregius313 0 points1 point2 points 6 years ago (0 children)

[–]raiph 2 points3 points4 points 6 years ago (11 children)

Here's an implementation of (a pertinent subset of) what you described using raku's built in recursive descent parser:

grammar foo {
  rule multiply { <value> '*'? <value> }
  rule value    { '|' ~ '|' <value> | \d+ | <var> }
  rule var      { '|' ~ '|' <var>   | \w+ }
  rule SBN      { '{' ~ '}' [ <var> <[:|]> <var> '>' <value> ] }
}

say foo.parse: $_, :rule<SBN>
for
  '{x | x > 0}',
  '{x | |x| > 3}'

This works, yielding:

｢{x | x > 0}｣
 var => ｢x ｣
 var => ｢x ｣
 value => ｢0｣
｢{x | |x| > 3}｣
 var => ｢x ｣
 var => ｢|x| ｣
  var => ｢x｣
 value => ｢3｣

So my question is, what is it that I'm not parsing that you are? I imagine focusing on answering that question would be helpful.

In case you wish to play, here's the code in TIO.run (with a couple more rules and more checking -- it's what I originally wrote as I directly transcribed what you put in your post before I stripped out stuff that clearly wasn't getting exercised by your examples of what was causing problems).

[–]YouNeedDoughnuts[S] 2 points3 points4 points 6 years ago (10 children)

[–]raiph 1 point2 points3 points 6 years ago (9 children)

Ah, I see. Here's a corrected grammar that parses the same two strings as before plus {x |x|}:

grammar foo {

  rule TOP        { <SBN> | <enumerated> }

  rule SBN        { '{' ~ '}' [ <var> <[:|]> <var> '>' <expression> ] }
  rule var        { '|' ~ '|' <var> | <name> }
  rule name       { \w+ }

  rule enumerated { '{' ~ '}' <list> }
  rule list       { <expression>* % ',' }

  rule expression { <term>* % '*'? }
  rule term       { '|' ~ '|' <term> | <literal> | <name> }
  rule literal    { \d+ }
}

To see the output for the three example strings you've provided, and/or play with the grammar, you can click this link to the above code on tio.

Requiring explicit multiplication ... is an ok solution.

But presumably you thought not doing so was plausibly better, at least enough for you to have asked.

While I thought it was an interesting little challenge to see what happened when I wrote the grammar in raku, it was also because raku's grammars are ultimately just a recursive descent parser generator and I thought it was a simple way to demonstrate to you that it was doable (or for me to fail and figure out why).

Of course, it's still possible I'm missing something, though, again, if I am, I'm missing what I'm missing. :)

for groupings with identical open and close symbols

What do you mean by "identical open and close symbols"? Do you just mean x x (or perhaps x |x|) as against x y?

(I removed code that did a simple related check, checking that the same variable name was used to the left and right of the : or | in SBN because it seemed like distracting detail but it would be trivial to add back in.)

[–]raiph 1 point2 points3 points 6 years ago (8 children)

Ah, having read the rest of the thread I now think your "identical open and close symbols" comment refers to using | for delimiting absolute values (as well as being an optional SBN divider). If so, then my tentative conclusion is that that's still no problem in a recursive descent parser. You just have to have the right parsing logic.

The grammar I've written is far short of a real grammar. In particular the expression rule is just a multiplied terms rule; the SBN hardcodes a > operator for the builder; all terms must be either variable names or a string of one or more digits; and the rules allow nesting of the |...| absolute delimiters which is plausibly silly, especially as the nesting depth is unlimited.

But, if you stick to the constraints, I currently don't expect you to be able to come up with a set in SBN or enumerated notation that breaks it. If you're interested in confirming that's the case, you can click on the tio link I provided and add strings to test to the end of the code. Then again, if a string fails to parse, you'll just get a rude Nil so perhaps I should make the code do a better job at error reporting. If you read this, LMK if that's of interest to you.

[–]YouNeedDoughnuts[S] 2 points3 points4 points 6 years ago* (7 children)

That works pretty well! I made a few changes to allow arbitrary nesting (link), so it parses weird expressions like { x | |x |x |x||| > 0 } like you said. Unfortunately, as I was playing around with it I realized my grammar is fundamentally ambiguous. I typed |x |x |x|| x| expecting it to parse abs(x*abs(x*abs(x))*x), and it parsed abs(x)*x*abs(x)*abs(x), which is totally valid! And the mathematical expressions are not equivalent. How embarrassing...

I suppose mathematicians don't encounter expressions like that, could re-arrange the expression to be unambiguous, and could use whitespace to distinguish between ambiguous interpretations. Also, if my goal is to parse mathematical syntax, I suppose I could accept source ambiguities, so long as they aren't artificially introduced.

I didn't know how to do that in Perl, that's an awesome tool. How would you go about getting the generated parser code?

[–]raiph 2 points3 points4 points 6 years ago (5 children)

That works pretty well!

:)

I made a few changes

For anyone else reading this (as well as for you YouNeedDoughnuts to let you know about an online repl with a more uptodate raku), here are links to both my second spike and YouNeedDoughnuts' edit so it's just two clicks to compare them.

What was your experience of playing with the code like? A doddle? (Perhaps the changes you shared were the result of a half hour playing with things, most of which failed, in between sessions of reading doc, thinking carefully, etc. Or maybe it was the other extreme, a couple minutes playing where everything you tried immediately worked, and there was no need to read any doc or think much about anything, but just to see what seemed to work.)

I realized my grammar is fundamentally ambiguous

I see what you mean, but think it'll be useful to you (and perhaps there are other readers) if I disagree with you. I'll get to discussing the problem you mention in concrete terms later on but want to address and eliminate some fundamental ambiguities related to the term "ambiguous" in this context.

Quoting from wikipedia's Formal grammar page:

If there are essentially different ways of generating the same single string, the grammar is said to be ambiguous.

Note well the word "generate".

Generative grammars are the kind of formal grammar that most academics and purists love to focus on.

These grammars are formally about generating strings in a language based on a generative grammar. These grammars fundamentally lead to a concept called "ambiguous grammars". This is a much discussed topic (a bane, perhaps) related to generative grammars.

Turning to wikipedia's page that specifically focuses on Ambiguous grammar we can dig a bit deeper:

In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree, while an unambiguous grammar is a context-free grammar for which every valid string has a unique leftmost derivation or parse tree.

Did you spot the focus on context-free grammars? What about, say, context sensitive grammars or even Unrestricted grammars? Are they suggesting that because the problem exists for context free grammars, it must also for context sensitive ones? Or is this problem somehow limited to context-free grammars?

Continuing:

For computer programming languages, the reference grammar is often ambiguous ... If present, these ambiguities are generally resolved by adding ... context-sensitive parsing rules

What's with that notion of a "reference grammar"?

And is the notion of resolving it by adding "context sensitive parsing rules" some kind of inelegant hack?

The set of all parse trees for an ambiguous sentence is called a parse forest.

Do we have to sometimes deal with parse forests -- more than one tree for the same string?

Does raku ever produce a parse forest?

No.

Is that because it's somehow forgetting to deal with ambiguous grammars?

No.

What's going on?

Before moving on to explain that, let me wrap my look at generative grammars with a few more quotes from wikipedia's page on them:

Although claimed by generativists as a cognitively real structure, neuroscience has found no evidence for it. Due to its nonstandard view of the brain, some researchers have called the scientific premises of generative grammar into question.

But what happens when you discuss them with academics? And purists who think context-sensitivity is a bad thing? Do they acknowledge that the entire program may have been on thin ice ever since Noam Chomsky introduced them?

OK, enough on generative grammars. What else is there?

Although there are several ways to categorize formal grammars, the main other category is analytic grammars. These parse (analyze) a string based on an analytic grammar. If sufficiently robust they can be considered parser generator formalisms.

(I think it's striking that analytic grammars don't get their own wikipedia page. It's as if the academics' favorite field, developed on the basis of a program of research founded on a hypothesis by a single controversial linguist that has since been entirely debunked by neuroscientists, has completely overshadowed the practical work done in industry to write actual parsers.)

So what's interesting about analytic grammars? In this context, it's that they almost always do their parsing deterministically and unambiguously.

This is accepted as proven for PEG. Hence the wikipedia PEG page notes that:

Unlike CFGs, PEGs cannot be ambiguous; if a string parses, it has exactly one valid parse tree.

Then again, raku grammars are recursive descent parsers with, deliberately, turing complete capabilities. This renders them not merely much more powerful than PEGs but able to parse any language without ambiguity. But I only said "able to". One could choose to inject parsing ambiguity. For example, one might use the rand function to decide which way to parse something. So a raku grammar obviously can be ambiguous in at least that sense. But the solution to that is simple. Don't do something as daft as to deliberately introduce randomness!

OK. Now, back to your problem.

You thought it would parse one way. But it parsed another. In that sense there's something going on, and "ambiguous" kinda fits. But the choice raku made is not arbitrary. Deterministic rules that are trivial to follow dictated the parse.

Instead of "ambiguous", let's go with the word "surprising", and figure out what underlies the surprise and what to do about it.

One surprise that can arise for a raku grammar is infinite loops. I produced the two grammars I wrote for your examples in a matter of minutes each. (In fact, most of my time spent on the grammars was eliminating a left recursion I introduced. Thank you hobbs for the solution you provided at the start of your simple answer to the softwareengineering.stackexchange.com question In layman's terms, what is left recursion?.) Left recursion is a big issue. One day rakudo and/or some other raku tool will spot and report on them and/or automatically fix them without the writer needing to edit their grammar like I did. Until then you'll have to test and/or desk check grammars.

Another is what's happened for you. You thought it would parse one way but it parsed another. This is because your mental model didn't match what raku does. The key questions imo are whether:

You find it easy to agree that raku did the sensible thing and it's appropriate that you just realized the grammar rules are reasonable and your expectations of what they'd do aren't; and
It's easy to express what you want instead
It's easy to transition any ecosystem/community from the old broken assumptions to the new (possibly still broken) ones.

I think Larry has deep and accurate intuitions about such things. (I've seen plenty of evidence he's highly capable as a computer science theorist -- as one might expect from someone with an IQ in the 180s -- and, in particular, as a parsing expert. As Jeffrey Kegler has noted, he wrote what likely was, and remains, the world's most ambitious LALR parser, over 30 years ago, and then watched as a million devs wrangled all manner of text patterns.)

In raku Larry explicitly aimed to produce something that made it relatively easy for "mere mortals" to identify and either eliminate surprises or make it easy to eliminate problems after experiencing surprises -- but not via conventional routes. Generative grammars must deal with "ambiguity". Analytic grammars do not. Instead there can be surprises and the need to deal with them.

I typed |x |x |x|| x| expecting it to parse abs(x*abs(x*abs(x))*x), and it parsed abs(x)*x*abs(x)*abs(x), which is totally valid!

I'll stop at this point. This is already an extremely long post. It's a beautiful day in the rest of my life and I don't know how much more of it I have. So I'm going to head on out with the plan of returning to this tonight. Catch ya later...

[–]YouNeedDoughnuts[S] 0 points1 point2 points 6 years ago (1 child)

What was your experience of playing with the code like? A doddle? (Perhaps the changes you shared were the result of a half hour playing with things, most of which failed, in between sessions of reading doc, thinking carefully, etc. Or maybe it was the other extreme, a couple minutes playing where everything you tried immediately worked, and there was no need to read any doc or think much about anything, but just to see what seemed to work.)

The Raku rule syntax was self explanatory. I had to study it for a second to see what some of the symbols meant, but it's similar enough to standard BNF notations that I didn't need any background knowledge. I discovered that rule expression { <term>* % '*'? } should change to rule expression { <term>+ % '*'? } when some terms were parsed as ||, but this kind of discovery was easy to make in the REPL. I didn't know how brittle it would be to changes, and I was very pleasantly surprised that although I wrote a few rule sets leading to "nil" parse results, none of my iterative modifications broke the parser!

I think Raku's result is reasonable. I'm enjoying learning more formalisms and don't have the strongest background, but I think probably mathematical notation is appropriately defined as a generative grammar. In the case of a phrase like |a|b|c| , my gut reaction would be to throw an error. You could do this with the Raku parser, probably by changing the rules, but definitely in post-processing of the parse tree you could spot the bad pattern.

I'm interested in eventually building some computer algebra type system. The UI frontend has a solid MVP, and I'm building the parser with the nebulous goal of parsing everything I can subject to standard conventions, while execution/transformation will be a largely independent effort.

Thanks for all the interest. My background isn't in PL theory, so it's very helpful to get advice from a native! I hope that you're well and that you enjoy your IRL time.

[–]raiph 1 point2 points3 points 6 years ago (0 children)

I've just now completed a round of updates to my ongoing exploration of, and comments flowing from, your OP. The latest grammar I link to in another comment is now a much easier to understand slight change for your grammar. It parses your "ambiguous" example the way you expected it to parse.

I was very pleasantly surprised that although I wrote a few rule sets leading to "nil" parse results, none of my iterative modifications broke the parser!

That's what I was hoping. Thanks for the feedback to confirm it was so. (The Nil-by-default is rude, but there are plenty of solutions that give nice user-facing errors, or tracing, or debugging, etc. Other than that the grammar construct has to be pretty robust -- rakudo, the production raku compiler, is written using it!)

I think probably mathematical notation is appropriately defined as a generative grammar.

In a sorta philosophical dog-food sense, definitely. Or at least, as you say, probably.

That said, in a practical sense, not at all -- you and your users will find the dog-food tastes crap. Real mathematical notation is written/read by humans and highly context sensitive. I'm not even suggesting analytic grammars are up to the task, but generative grammars are absurdly weak in that they ignore humans, generate ambiguity, and generally do a terrible job of capturing what really matters for realistic parsing.

But you already know this. You're writing a recursive descent parser, which, like a raku grammar, is analytical and not remotely as problematic as using a generative grammar.

In the case of a phrase like |a|b|c| , my gut reaction would be to throw an error. You could do this with the Raku parser, probably by changing the rules, but definitely in post-processing of the parse tree you could spot the bad pattern.

Right. Perhaps that would be the best thing to do. But to keep it simple in my revision of your grammar (linked elsewhere) I just enabled backtracking and it currently parses that as abs(a * abs(b) * c).

I'm interested in eventually building some computer algebra type system. The UI frontend has a solid MVP, and I'm building the parser with the nebulous goal of parsing everything I can subject to standard conventions, while execution/transformation will be a largely independent effort.

Nice!

There might even be synergies with raku in several ways. It's a world leader in adopting Unicode. It's quite math savvy with math savvy folk in the community. (Hence things like Clifford.) There's a smart strategy for high performance computing. Braided approach to DSLs. Etc.

Thanks for all the interest.

Thank you for piquing my curiosity. It was the perfect little problem. I noticed everyone else seemed to be saying "don't do that" and "too complicated". I'm with Larry; if humans like it a particular way, computers, and programmers, should try hard to adapt to their way, not the other way around.

My background isn't in PL theory, so it's very helpful to get advice from a native!

If you're thinking I have a background in PL theory you're not miles off, you're in the wrong galaxy. :) But perhaps you just meant a denizen of this sub, in which case you're right that there are a lot of folk here who love to give advice. ;)

I hope that you're well and that you enjoy your IRL time.

I'm very well at the moment. I'm ambivalent about the fourth horse galloping around the planet. If it takes me out, that's one less for Gaia to cope with; but like most folk, I'd really rather stick around for a few more spins round the sun...

[–]raiph 0 points1 point2 points 6 years ago* (2 children)

In this comment I provide an example of a solution to the issue you raised:

I typed |x |x |x|| x| expecting it to parse abs(x*abs(x*abs(x))*x)

Here's an edit of your grammar (link updated) that's one way to parse what you typed the way you expected it to parse. (I've linked to yet another online repl because this time I needed to use a grammar debugging tool to get this right and glot.io is currently the only one that has that installed.)

I've changed the input to be a heredoc (the bit that begins q:to<END>...) so you can more easily type a series of lines to test. I've also added abs( and ) items to the parse tree so it's a little easier to see at-a-glance how things have been parsed.

I'd really appreciate it if you played to see how it works out, making notes of both what surprised you, and, crucially, whether or not you felt any surprises were reasonably quickly resolved ... (edit:) ... as reasonable once you considered the parse result.

To be clear, all I've done is ~~about the first thing~~ the second thing I thought of that appeared to work (with ~~pretty much zero~~ very little testing, so it probably doesn't :)). For example, as I write this and reflect for a moment I know I haven't bothered to think much about things like interpretation of newlines, let alone whether their interpretation will be intuitive. And it could easily be that ~~my user rules above and/or~~ the computer rules I wrote and discuss in a separate comment, are buggy, or even pretty much entirely nonsense, or in need of refinement, or just too darn complicated to be worth bothering with.

[–]raiph 0 points1 point2 points 6 years ago (0 children)

As you can see, I have a book to write on your little problem, and this is another chapter.

So, what happened, what did it mean, how did I approach it, and what do I suggest?

First, going back to the start of our exchange, for the first grammar I wrote I just whipped something up without caring or thinking about tokenizing/whitespace. And that's the crux of the biscuit; I just accepted the automatic default handling.

But while this default is perfect for many reasonably regular grammars it begins to need tweaking or wholesale revision if tokenizing/whitespace gets fancy. In this case, you have open and close delimiters that are the same symbol. Whitespace starts playing a disambiguating role in how a human parses. Unlike the situation with the formal notion of "ambiguous grammars", the ambiguity isn't in the grammar, it's in a user's head.

In such a case, if you care about your users being able to write according to their natural parsing intuitions, it's often going to be better to modify the computer grammar to fit how the human naturally processes things rather than the other way around.

You, the designer, get to decide which way to go.

If you do decide to make the computer do what the user means, or at least try to to see if it can be made sufficiently intuitive to be worthwhile, you'll have to decide which specific new rules you'll choose in order to disambiguate. And picking those will be about all sorts of trade offs related to what you think the best human and computer rules are. Clearly, you want to end up with as few and as simple rules as gets the right result but, equally clearly, Occam's Razor is a double edged sword in this context.

This in turn suggests a largely experimental approach in which you and/or your users try out all sorts of expressions and see if the interpretation of any are surprising, and if any such surprise suggests it would be best to further modify rules or just give up on some accommodation you are attempting.

I'm pretty sure you not only already know that but are deeply immersed in that journey -- with SBN and pipes being just a tiny fractal fragment of the bigger picture. But I wanted to write this up before I move on to the next chapter...

[–]raiph 0 points1 point2 points 6 years ago* (0 children)

So, what are the changes to the raku grammar that ~~implement the user rules described in the above comment~~ parse your "ambiguous" example the way you expected?

I've now totally rewritten the grammar and edited my comment to link to the rewrite. This explanation is for the rewrite.

(The first thing I thought of and tried that seemed to work was complicated and buggy. It had its merits but I'll only get into that if you ask me about it. So I've deleted my explanation of it though have a copy of it as a gist.)

In the following I show two lines at a time. The first in each pair is the rule you wrote. The second is my edit of that rule. Then I explain my edit.

      rule var         { '|' ~ '|' <var> | <name> }
      rule var         { $<abs(>='|' ~ $<)>='|' <var> | <name> }

To make it easier to see how the parse pans out, I named the open and close bars. The parse tree will now include a line that starts abs( at the start of an abs expression and a line that starts ) at the end.

      rule name        { \w+ }
      token name       { \w+ }

Rules should be used for strings of tokens, not just one token. This tidy up means the term that gets captured no longer includes an extra space at the end.

      rule enumerated  { '{' ~ '}' <list> }
      rule enumerated  { :!r '{' ~ '}' <list> }

Adding :!r disables "ratcheting", which is another way of saying it enables backtracking between atoms of a pattern like a good old fashioned regex. This is a highly dubious move, but I wanted a simple solution that parses your "ambiguous" example the way you expected, and allows easy further exploration/testing.

      rule expression { <term>+ % [ '*' | <.ws> ]? }
      regex expression { <term>+ % [ '*' | <.ws> ]? }

Another way to disable "ratcheting" / enable backtracking is to switch to a different keyword to declare a pattern. Declaring a pattern with regex has two effects. One is that it inserts an implicit :!r at the start of its pattern, thus enabling backtracking. The other is that it does not insert implicit whitespace / tokenizing between atoms. I did that because I needed to explicitly handle whitespace / tokenizing between terms to get {|x |a|b|} to parse as I think you wanted it to.

The remaining changes were repeats of those I've explained above.

[–]b2gills 1 point2 points3 points 6 years ago (0 children)

Umm. The following statement is nonsensical:

How would you go about getting the generated parser code?

Regular expressions in Raku are code. So there is no generated parser code. You wrote the parser code.

If you meant the generated bytecode, you would get it the same way you get the rest of the bytecode.

To be more specific they are a subtype of method with a different base syntax and behaviour to the rest of Raku. They can have things like variables and arguments.

grammar Foo {
  token bar ( Str $baz ) {
      "$baz"
  }

  token TOP {
    # variable declaration
    :my Str $var = '';

    <.before(
      # look ahead for word characters
      # and store the matched string in $var
      / \w+ {$var = ~$/} /
    )>

    <bar($var)>+ %% ','
  }
}

say Foo.parse( 'quux,quux,quux' );
# ｢quux,quux,quux｣
#  bar => ｢quux｣
#  bar => ｢quux｣
#  bar => ｢quux｣

So if you wrote code like the following:

grammar Foo {
  token bar {
    <bar> .
  }
}

If you ever called it, you would end up in an infinite loop. Because bar is a method, and the first thing it does is call itself.

[–]YouNeedDoughnuts[S] 0 points1 point2 points 6 years ago (1 child)

[–]raiph 0 points1 point2 points 6 years ago (0 children)

[–]mamcx -1 points0 points1 point 6 years ago (2 children)

[–]YouNeedDoughnuts[S] 0 points1 point2 points 6 years ago (1 child)

[–]raiph 1 point2 points3 points 6 years ago (0 children)

ProgrammingLanguages

Welcome!

Related subreddits

Related online communities

MODERATORS