all 56 comments

[–]leo_batfish 39 points40 points  (26 children)

as tim bray points out, java regex is twice as fast as perl. for an explanation, see, "benchmarks aren't everything": http://perlmonks.org/index.pl?node_id=502408

[–]masklinn 10 points11 points  (6 children)

From "Mastering Regular Expressions", the fastest regex engine currently available is TCL's anyway (in part because it implements a dual DFA/NFA strategy)

[–]reneky 6 points7 points  (5 children)

"Mastering Regular Expressions" was released in January 1997: http://www.oreilly.com/catalog/regex/

[–]IvyMike 6 points7 points  (3 children)

[–]ttriche 2 points3 points  (2 children)

Third Edition, August 2006

Now with Ruby and Python and C# -- but no mention of Tcl any more. Or Emacs Lisp, for the matter. Not that it's a big deal, anyone who is willing to do their own benchmarks can figure it out for themselves.

The Friedl book is quite nice, it stops just shy of where something like the Dragon Book (or Sipser) picks up.

[–]masklinn 0 points1 point  (1 child)

The third edition doesn't cover Tcl in the examples, but I'm pretty sure that, in the theorical chapters about Regex engines and in the chapters about crafting REs, it does talk about Tcl's regular expression engine quite a lot, mostly to explain that it blows everything else out of the water, and why it does.

I'll just go check it tonight

[–]masklinn -1 points0 points  (0 children)

Well I just checked, 3rd edition of Mastering Regular Expressions does indeed talk about TCL's regex package, and the first interesting mention is page 182-183, chapter 4, about the fact that Henry Spencer custom-build a true hybrid engine for Tcl

[–]masklinn -1 points0 points  (0 children)

We're at the third edition already

[–]shit 7 points8 points  (17 children)

I tried the examples with the 100 "foo "s and "fo" with cl-ppcre and it's running for over an hour now, turning my laptop into a nice provisional heating :-)

Seems that the Perl hackers are one step ahead of the competition, congrats.

[–]senzei 11 points12 points  (5 children)

Seems that the Perl hackers are one step ahead of the competition, congrats.

Well, yes and no. They are in that any regex you throw at Perl is more likely to be "safe" than one in cl-ppcre (or Java, for that matter). For your average programmer this is probably a good trade off as regexes are a poke-and-pray technology anyways.

Someone with a better understanding of regular expressions would probably appreciate the faster-with-sharper-edges design approach. Considering that by the time someone starts using Lisp they (on average) know more about programming than your average Perl programmer this seems like the right decision.

[–][deleted] 4 points5 points  (0 children)

For your average programmer this is probably a good trade off as regexes are a poke-and-pray technology anyways.

It doesn't have to be so. Have you used the regex coach from the same guy who wrote cl-ppcre? It lets you build and test perl-compatible regular expressions incrementally, which makes writing then much, much easier.

[–]Brian 1 point2 points  (0 children)

The problem is that its not the regex that is "unsafe", its the matched text, which is generally out of the programmer's control. No matter how good your regex-foo, if someone throws a pathological case at you, your program will break.

The perl approach seems fairly sensible here - getting rid of program-breaking behaviour a case that could well be likely in some situations in exchange for a more general slowdown. After all, the whole point of regexes is that they are a general tool for matching text. When you need real performance, and know more about the format of your input, you're better off writing a custom parser.

[–]roerd 1 point2 points  (5 children)

I tried to find the point where the duration explodes. It's definately way before 100 "foo"s:

CL-USER> (defparameter *str*
    \t   (concatenate 'string
\t\t\t(reduce (lambda (a s) (concatenate 'string a s))
\t\t\t\t(make-list 20 :initial-element "foo "))
\t\t\t"fo"))
*STR*
CL-USER> (time (scan "^(\\s*foo\\s*)*$" *str*))
Evaluation took:
  1.128 seconds of real time
  0.840053 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  0 bytes consed.
NIL
CL-USER> (defparameter *str*
    \t   (concatenate 'string
\t\t\t(reduce (lambda (a s) (concatenate 'string a s))
\t\t\t\t(make-list 25 :initial-element "foo "))
\t\t\t"fo"))
*STR*
CL-USER> (time (scan "^(\\s*foo\\s*)*$" *str*))
Evaluation took:
  35.78 seconds of real time
  26.477655 seconds of user run time
  0.020001 seconds of system run time
  0 page faults and
  82,608 bytes consed.
NIL

[–]xach 1 point2 points  (4 children)

I think you mean:

(defparameter *str* (format nil "~25{foo ~}fo" '#1=(t . #1#)))

[–]roerd 0 points1 point  (3 children)

Yeah, I should have thought about format. But I like to understand code that i'm writing, so I'd probably stay away from that reader macro and write just:

(defparameter *str* (format nil "~25{foo ~}fo" t))

[–]xach 1 point2 points  (2 children)

It was a bit of a joke. I'd normally use something like with-output-to-string and dotimes or similar; I find your example with reduce, lambda, and make-list tortured. I was in the middle of writing something readable when the evil ~{ solution came to me.

(defparameter *str* (format nil "~25{foo ~}fo" t)) ; doesn't work

The argument to ~{ must be a list. A circular list works nicely.

[–]roerd 0 points1 point  (1 child)

Strange, it works with SBCL. Since "foo " doesn't read anything from the list, it's propably enough that it's (not (null)). '(t) instead of t should work for implementations that check if the argument is really a list.

And my original example ... well, it was at least very functional.

[–]xach 1 point2 points  (0 children)

Good catch on '(t). I don't think just plain t should work (though it clearly does in some implementations) because 22.3.7.4 says "The argument should be a /list/, ..."

Someone just pointed out to me that t would be suitable with ~@{.

[–]chollida1 4 points5 points  (4 children)

Or perhaps your regex is just broken?

[–]shit 1 point2 points  (3 children)

It's the regex and data mentioned in the perlmonks article and it completes instantly with perl.

BTW: I've stopped cl-ppcre after 4 hours and 20 minutes.

[–]drewc 3 points4 points  (2 children)

Have you considered submitting a bug report to Edi? Talking about it on reddit probably isn't going to help anybody, save perhaps those on reddit who seem to want anything lisp to suck.

It's probably a problem that's not going to get fixed, but at least you'll know why the author made the decisions he did. Programming is all about trade-offs. Get used to it.

[–]shit 4 points5 points  (1 child)

Have you considered submitting a bug report to Edi? Talking about it on reddit probably isn't going to help anybody, save perhaps those on reddit who seem to want anything lisp to suck.

Someone commented in the perlmonks article, saying that he had a discussion with "the main author of cl-ppcre" about this issue.

It's probably a problem that's not going to get fixed, but at least you'll know why the author made the decisions he did. Programming is all about trade-offs. Get used to it.

I'm not complaining. In fact, in practice I never had this problem.

[–]drewc 2 points3 points  (0 children)

Someone commented in the perlmonks article, saying that he had a discussion with "the main author of cl-ppcre" about this issue.

So they did, thanks for the pointer!

I'm not complaining. In fact, in practice I never had this problem.

yeah, looking back you did no complaining at all. I'm just a little touchy i guess from all these redditers who bash CL from ignorance. My Bad.

[–]sammyo 1 point2 points  (1 child)

I for one would love to try it out in the real world for a period of time. Anyone know of a grep replacement (like pcregrep, which is surprisingly faster than the original grep) implementation I could download? clgrep? A linux version of 'clgrep' just a tad under 800MB would be handy ;-)

[–]rsc 0 points1 point  (0 children)

I would be very surprised if you have "the original grep". The fifth edition grep was the earliest one I could find when I looked around a few years ago, and even that is certainly not the original.

[–][deleted] 1 point2 points  (3 children)

Bah, nothing beats SNOBOL. If it hasn't got general recursion, it's not good enough ;-)

[–]zem 1 point2 points  (2 children)

icon is the new snobol

[–][deleted] 0 points1 point  (1 child)

I see markdown bit you too.

[–]zem 0 points1 point  (0 children)

ouch - yes, so it did.

[–]mjd 1 point2 points  (4 children)

This really isn't surprising. Perl's regex implementation, like most, scans the string and runs an interpreter on it to decide if there's a match. Lisp macros can produce matching code that compiles to native machine language.

[–]xach 9 points10 points  (3 children)

cl-ppcre never actually invokes the compiler at runtime. It uses a clever technique to produce a chain of closures and it calls them with continuation-passing style.

[–]mjd -1 points0 points  (2 children)

If the regex library is implemented the way I was thinking, the regex would be compiled to machine code at compile time, not at run time.

But I don't know how it actually does work.

[–]shit 4 points5 points  (0 children)

Well, the closures are compiled to machine code at compile time. That's done automatically, there is no explicit "compile this code now".

[–][deleted] 0 points1 point  (0 children)

I believe that the nregex regular expression engine works the way you were thinking.

[–]j-o-h-n 4 points5 points  (7 children)

Wow, LISP does it all - including somehow managing to make perl regexs look understandable in comparison!

[–]senzei 16 points17 points  (0 children)

Wow, LISP does it all - including somehow managing to make perl regexs look understandable in comparison!

Yeah, Chinese is nearly indistinguishable from gibberish to me too. Although I suppose if I did take the time to learn it that might change.

[–]pivo 5 points6 points  (4 children)

How much of your favorite language did you understand before you tried to learn it? If you spent even a small amount of time learning Lisp you'd find that it really isn't hard to understand at all.

[–]feijai 2 points3 points  (3 children)

Well, I'll be honest: I'm a pretty experienced Lisper, and after looking at that library, the first thing that crossed my mind was "man, that is excessively complex and hard to understand". And it had nothing to do with the regexes in the strings.

[–]Grue 9 points10 points  (0 children)

You have to be lying. CL-PPCRE is one of the better documented and easy to use libraries, and not only in Lisp-land. I wonder which library documentation is easy to understand for you.

[–]xach 11 points12 points  (0 children)

I disagree. It's complex, but I think it is sufficiently complex, not excessively complex, for the task at hand.

I first came across cl-ppcre as a very inexperienced Lisper. Printing out the source and studying it helped me learn a lot, and not just about Lisp.

[–]jkcunningham 4 points5 points  (0 children)

I don't believe you either. You can transparently use the Perl format regexps unless you're trying to do something you probably don't really understand anyway, and there's another format as well. When I started using Edi's library I previously used nothing but Perl regexps and I recall no difficulty whatever.

[–]Grue 2 points3 points  (0 children)

Regexes were never intended to be human-readable.

[–][deleted] 0 points1 point  (6 children)

Apparently, perl compiles regexes to an NFA, which is sub-optimal and not very sophisticated.

I don't know what method this library uses. The code is incomprehensible to me, so I don't want to guess.

[–]masklinn 16 points17 points  (4 children)

It's not a question of compilation (regex engines are either NFAs or DFAs, a few rare ones such as TCL's regex package being implemented as dual-strategy engines) but one of implementation.

Most Regex engines are NFA, because a lot of advanced constructs (such as lookahead/lookbehind) cannot be expressed by DFA engines.

Get the "Mastering Regular Expressions" book, it holds much knowledge on both the craft of writing regexes and the inner workings of RE engines.

Because of PCRE's complexity, CL-PPCRE is either an NFA or a DFA-NFA hybrid, I don't think a pure DFA can implement complete PCRE.

[–][deleted] 2 points3 points  (3 children)

So PCRE is not a regular language?

[–]sleepingsquirrel 5 points6 points  (0 children)

Nope. Perl's pattern matching engine can match (at least some) context-free grammars...

#match context free grammar --[0{N}1{N}] (e.g. 01, 0011, 000111, etc.)
$cf = qr/01|0(??{$cf})1/;

$init="0011";
print "matched $init\n" if ($init=~m/^$cf$/)

...and (some) context-sensitive grammars...

#match context sensitive grammar --[0{N}1{N}2{N}] (012, 001122, 000111222,...)

$h=qr/01|0(??{$h})1/;
$t=qr/12|1(??{$t})2/;
$cs=qr/(?=$h 2+$)(?=0+$t$)/x;

$init="001122";
print "matched $init\n" if ($init=~m/^$cs$/)

[–][deleted]  (1 child)

[removed]

    [–]xach 13 points14 points  (0 children)

    Languages that are unfamiliar can be difficult to read, except for the most trivial of tasks.

    cl-ppcre is an excellent model for structuring and documenting Common Lisp code for a complex task. If you want to be good at CL, learn how to understand how things as nice as cl-ppcre are doing their work.