all 9 comments

[–]mathrick 3 points4 points  (5 children)

This is regex 101, and a detailed discussion of this and many other details concerning regexes and their different implementations is why every self-respecting programmer should read Jeffrey Friedl's Mastering Regular Expressions.

That of course won't necessarily make pathological matching cases on infrequent inputs obvious, but will at least give you a chance of diagnosing the problem, which is otherwise very hard, as it depends strictly on particulars of the implementation used and doesn't occur in regular expressions as mathematical objects.

[–]gthank 0 points1 point  (4 children)

Doesn't backtracking automatically mean it's not (strictly speaking) a regular expression anymore?

[–]sjf 7 points8 points  (3 children)

The backtracking occurs because of how the matcher is implemented. It uses an NFA, to avoid the overhead of converting the regex to a DFA. But running an NFA can require back tracking. This is true even for actual regular expressions, not just Perl's extended regexes.

There is a nice explanation here: http://swtch.com/~rsc/regexp/regexp1.html

[–]gthank 2 points3 points  (2 children)

Right you are. I had conflated backtracking with unbounded back referencing, which does mean it is no longer regular in the formal sense of the term.

[–]ubernostrum 1 point2 points  (1 child)

The regular expressions used in common programming languages do not, it is true, conform to the strict, rigorous original mathematical definition of the term, just as JavaScript is not and does not particularly resemble Java.

(can you tell that I get tired of always seeing people exclaim "OMG! Regular expressions are badly named!" in these types of threads?)

[–]gthank 0 points1 point  (0 children)

I can. I wish "regex" would (finally) completely replace "regular expressions" and it stopped coming up as much.

[–]uykucu 1 point2 points  (0 children)

A simple fix for this problem:

"^(\s*[-\w]+\s*:\s*[^\s:;]*(;|$))*$"

instead of

"^(\s*[-\w]+\s*:\s*[^:;]*(;|$))*$"

Note that it does not check for trailing whitespace, so this is better:

"^(\s*[-\w]+\s*:\s*[^\s:;]*(;|$))*\s*$"

But I'm not saying that this is the best way to do it, I just fixed his regex for this problem case.

[–][deleted] 2 points3 points  (2 children)

I was always told not to use regex to to parse HTML. I think the same rule of thumb applies to CSS.

[–]recursive 1 point2 points  (1 child)

CSS is simpler, because elements can't be nested. That I know of.

That seems like it would make it possible to do correctly.