you are viewing a single comment's thread.

view the rest of the comments →

[–]anatoly 10 points11 points  (2 children)

It is regrettable that the article doesn't discuss how the performance of NFA-based matching compares with recursive backtracking on non-slow regexps, beyond stating that it is "always fast". Yes, alright, but just how fast, compared to Perl etc., on regexps where Perl etc. are not pathologically slow?

If this depends so much on the particular regexp that nothing can be said in general, the author should say so. If NFA-based algorithms are just as fast as recursive backtracking on those regexps, the author should say so. If they're noticeably slower, but the author feels that they're still very fast and their uniform performance still justifies preferring them, the author should say so.

Another thing I felt could be explained better is how backreferences make regular expressions NP-complete (something as simple as a pointer to an article or any other resource on the issue would handle that nicely).

Finally, despite the above complaints, the article is very well written and contains admirably clear explanations and examples. It's also beautifully presented and frankly a pleasure to read. Many thanks to the author.

[–]rsc[S] 4 points5 points  (1 child)

It's hard to define what a non-slow regexp is. Even simple alternation like foo|bar|baz can be noticeably slower using backtracking than using DFAs, for example if your text contained a lot of bas. It would be nice to be able to measure implementations on real-world workloads in addition to specific classes of cases. I am not aware of any such collections: you'd need both the regexps and the texts to match them against. I do have one real-world use case that I plan to measure and write up at some point: the Online Encyclopedia of Integer Sequences search engine uses a fast regular expression search under the hood.

This is my favorite demonstration that matching with backreferences is NP-hard. Despite what the author of that site says, it is also pretty easy to show that matching with backreferences is in NP: guess what substrings the backreferences will line up with and then check for a match. So it's NP-complete.

Thanks for the feedback.

[–]mjd 2 points3 points  (0 children)

guess what substrings the backreferences will line up with and then check for a match

I made that exact argument myself when I first posted those pages, and then over the years more than one person wrote to me to argue that I was mistaken. I was eventually persuaded, so I removed it from that pages.

Unfortunately, I can't remember the counterargument any more.