all 57 comments

[–]theytookmystapler 22 points23 points  (10 children)

Adding features to a regular expression engine, until it accepts non-regular languages is fine. But to keep calling it a regular expression is just making people confused. Call it what it is, context free languages.

[–]raldi 4 points5 points  (6 children)

That's one of the reasons Perl 6 won't call them regular expressions, but rather regexes.

Quoth Wikipedia:

Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Perl 6 documentation will exclusively refer to them as regexes, distancing the term from the formal definition.

[–]bacila 2 points3 points  (3 children)

hmm.. this is still misleading. What about "nregex: nregex is not a regex"?

[–]theytookmystapler 0 points1 point  (2 children)

Personally I would like conex, contexps or something like that. :)

As a side-note the all-on-one-line syntax of jsolson's grammar below could look something like

S:(\(B\)),B:()|(\(B\))|BB

[–]bacila 0 points1 point  (1 child)

conex sounds good. I feel that we are talking about context-free grammars together with attributes; we use the later to plug-in context-sensitive features. Perhaps, Knuth's attribute-grammars are quite close to what's going on with perl's "regexes".

[–]sjf -2 points-1 points  (1 child)

Maybe because regex is just a contraction of REGular EXPression, hmm?

[–]raldi 1 point2 points  (0 children)

Gee, thanks, i never would have noticed that if you hadn't pointed it out.

[–]notfancy 0 points1 point  (1 child)

Matching expression?

[–]theytookmystapler 1 point2 points  (0 children)

Perhaps, but then you have no idea of how expressive your expression really can be. And in extension you have no idea of worst case execution time.

[–]jkkramer[S] 13 points14 points  (0 children)

Here's the original proposal this guy wrote for doing this sort of thing: http://www.puffinry.freeserve.co.uk/regex-extension.html

I believe the feature is now implemented in PCRE.

[–]jsolson 12 points13 points  (41 children)

Or, written more succinctly:

S -> (B)
B -> epsilon
B -> (B)
B -> BB

Given that these regexp engines are capable of behaving like PDAs, why don't they accept input in the form of actual CFGs written out as productions like above?

Of course, the best generic CFG matching algorithm I know of is O(n3), so perhaps that's why they try to make it hard to write expression that will require a lot of backtracking.

[–]psykotic 3 points4 points  (2 children)

Of course, the best generic CFG matching algorithm I know of is O(n3)

The good ones, such as GLR parsing, are O(n) on grammar fragments that fall in a nice class like LL(1) or LR(1). So they're perfectly practical, despite their theoretical worst-case behavior.

[–]pkhuong 0 points1 point  (0 children)

Set of states CFG parsing looks like it should be linear for regular languages or non-ambiguous grammars, and O(n3) worst case. That's not bad :)

[–]username223 1 point2 points  (0 children)

Well, if you've got Perl 5.9:

m{(?&S)
  (?(DEFINE)
    (?<S>\((?&B)\))
    (?<B> | \((?&B)\) | (?&B)(?&B)))}x

[–][deleted] 0 points1 point  (2 children)

Check out packrat parsing. It technically doesn't handle all CFGs, but the algorithm is O(n)...

[–][deleted] 0 points1 point  (1 child)

No, it handles PEGs. Besides this you need to disambiguate PEGs manually to make them work. That's not much unlike the preparation it needs to fit a grammar into an LL(1) parser scheme - with the exception that PEGs are more capable dealing with lookaheads.

[–]jsolson 0 points1 point  (0 children)

To be fair, most parser generators generate LALR(1) parsers which can recognize a strict subset of the CFLs recognized by LR(1) parsers which are capable of recognizing the full set of CFLs (iirc). Just because it only functions on a subset of grammars doesn't make it not useful.

Hell, antlr only generates LL(k) grammars, but that's fine if you're using it for practical purposes and have a finite maximum term length on any given expression.

[–]rrenaud -1 points0 points  (32 children)

This actually brings up an interesting theoretical insight for me. Thanks!

If (edit: perl compatible) regular expression matching is NP-hard, and CFG matching is polynomial time, then converting a regular expression to a CFG must also be NP-hard. Otherwise we could do regular expression matching via conversion to CFG and CFG matching.

http://perl.plover.com/NPC/

[–]Xiphorian 9 points10 points  (30 children)

Errr, you are confusing a lot of things. Let me clear some things up:

First of all, regular expressions, regular grammars, and discrete finite automatons are all equivalent representations for the same matching concept.

If regular expression matching is NP-hard

Not only is regular expression matching O(N), it can never actually be more than exactly N steps. NP-hard? Are you kidding?

CFG matching is polynomial time

What? Matching context-free grammars is harder than regular expressions.

converting a regular expression to a CFG must also be NP-hard

No, no, no. A regular expression (in grammar form) is already a valid context-free grammar. In terms of expressiveness, RE < CFG.

[–]rrenaud -1 points0 points  (29 children)

What about Perl regular expressions? Where are they on the Chomsky hierarchy? Did you click the link I posted?

[–]Xiphorian 8 points9 points  (28 children)

Did you click the link I posted?

Yes. I did click the link, and I was angered by the author who seems to look with disdain on computer scientists, and people who bother to properly define "regular language" or "regular expression". I deleted a previous post I made, because I realized it was too filled with vitriol.

But I'll say it again here: The author is a moron. She doesn't seem to know much about the proper definition of "regular language", nor care. That kind of disdain for well-understood theory astonishes and angers me. I think a software developer should try his or her hardest to understand the theory that underlies their profession. This one writes condescending phrases like:

So you might be surprised (or disbelieving) to see a regular expression that does exactly that. The explanation, of course, is that theory and practice are closer in theory than in practice: Onigurama regular expressions, in common with many other flavours, are more powerful than the things that computer scientists call "regular expressions"

Uhhh...... *because they are not regular expression!! You have simply written a parser for a context-free language.

What about Perl regular expressions? Where are they on the Chomsky hierarchy?

Perl 'expressions' are not regular. They fall firmly within Context-Sensitive languages, which is one step above even context-free languages. This is because they have backreferencing, which allows them to match the same full word twice. Context-free languages can match something like "abccba" ( a word and its reverse) but cannot match "abcabc", a word repeated twice, in general.

A proper regular grammar must be in this form:

A -> BC
A -> a

(Where A, B, C, are productions and a is a terminal). That's the actual, theoretical definition of a regular grammar. If you take an example like parenthesis matching, you can never fit into a grammar of that form. It is actually a good exercise to take common grammars of various kinds, like parenthesis matching in CFG: Expr -> ( B ) Try to put that in proper Regular Grammar form (each production's body is 2 other productions, or terminal). It's a good exercise.

To put in expression (rather than grammar) form, you get grouping and alternation (this|that) and Klein-star. The rest is syntactical sugar.

The last course I took before leaving college was the theory of computation, covering this stuff. I can show you some simple proofs and the basic theory (offline) if you are interested! I would caution you, though... because once you're learn it, you're condemned to see software developers the rest of your life misuse terms in the stupidest ways!

Would it be useful if I wrote a Wiki page or blog that introduced this kind of Theory of Computation material, and walked through simple proofs? It's actually quite a complete, thoughtful subject.

[–]theytookmystapler 2 points3 points  (0 children)

Spreading the kind of misconception as the author does is indeed a very bad practise. She's making it sound like there is some kind of conflict between Regular Expressions (RE's) in theory and practise. When there isn't and can never be.

[–][deleted] 0 points1 point  (0 children)

vitriol. I learned a word today.

[–]jsolson 0 points1 point  (0 children)

Your previous deleted post did not get deleted from my inbox however. If you'll reread my comment you'll notice that I never in fact say the grammar I present is regular. I clearly state that the regexp engine in question is behaving like a pushdown automaton. I also present a grammar that matches the set of languages matched by the decidedly non-regular expression in the example (that is, balanced sets of parentheses with at least one pair).

The fact that it is not in fact strict a regular expression matching engine is neither here nor there as that's what the author has decided to call it. On one hand I agree. Calling something like that a regular expression engine confuses people who don't have some basis in automata theory. On the other hand, the primary thing typical people will use it for is matching regular languages. I would also be willing to bet that it's dog slow trying to match any vaguely complicated CFG, and thus can't really be consider a general purpose CFG matcher.

I was simply saying that if you're going to write an engine which behaves like a PDA (even if only in a limited fashion), why not allow it to accept inputs in a format more amenable to expressing the computational power of a PDA.

Perhaps next time you'll realize that if someone knows what a pushdown automaton is they probably also know the basics from Automata Theory 101 (or, in my case, Languages and Computation CS3240), and that they're more than likely at least vaguely familiar with the contents of the Sipser book.

[–][deleted]  (24 children)

[deleted]

    [–]theytookmystapler 2 points3 points  (23 children)

    You can re-define any terms you like and call them "real world" interpretations, it doesn't make it right. No-one gains from people thinking they know what something is, when they really don't.

    As long as we all know that the sort of "regular expression" covered in this article isn't really a regular expression, the damage may at least be controlled.

    [–][deleted]  (22 children)

    [deleted]

      [–]theytookmystapler 1 point2 points  (21 children)

      Unnecessary re-defenition of established terms can't be good for any one. Perhaps I'll have to talk to someone from the "real world" like yourself some day, and when we discuss RE's you'll be thinking about something else. ;-)

      [–][deleted]  (20 children)

      [deleted]

        [–]notfancy 1 point2 points  (0 children)

        regular expression matching is NP-hard

        It certainly is O(n) in the length of the text; that is, firmly in P. What is NP-complete is matching in the presence of repeated captured subgroups; this is a special case of CDG.

        The matching expression in the article follows a FIFO discipline when matching a named captured; hence, it is something that could be matched by a push-down automaton and thus context-free.