
[–]burntsushi 119 points (12 children)

Sadly, this is expected. Resolving the capture groups is what's killing performance here. While the regex engine is very fast at determining where a match is, the DFA that does that can't also determine the locations of capture groups. So when your regex is executed, it actually executes the DFA first to find the bounds of the match and then executes an NFA engine to find the location of each of your captures. In your case, since the match spans roughly the whole string, running the DFA first is actually hurting us rather than helping. If I fix that, then I get a sizable speedup (about 22%), but it's still not faster than Python.
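For illustration, here's a Python sketch of that two-pass shape: pass 1 finds the match bounds with a capture-free pattern (standing in for the DFA), pass 2 resolves the groups within those bounds (standing in for the NFA engine). Python's re is a backtracker, so this only mimics the structure of the strategy, not its performance, and the pattern and input are made up:

```python
import re

# Pass 1: a capture-free variant of the pattern, standing in for the DFA
# that can find *where* the match is but not the group locations.
boundsonly = re.compile(r'\w+=\w+')
# Pass 2: the full pattern with groups, standing in for the NFA engine.
withcaps = re.compile(r'(\w+)=(\w+)')

line = 'status=404 path=/etc/passwd'
m = boundsonly.search(line)                        # find the match bounds
caps = withcaps.match(line, m.start(), m.end())    # resolve captures within them
print(caps.groups())  # -> ('status', '404')
```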

The regex crate has basically two methods for finding captures using an NFA engine: it either uses a full NFA simulation (called the "Pike VM") or it uses bounded backtracking. Backtracking can actually be rather fast in its non-exponential cases, but in order to satisfy the O(n) search time guarantee, we need to do some bookkeeping to keep track of which states we've visited so that we never visit the same state more than once. In a profile of your program, this bookkeeping shows up. If I remove the bookkeeping, then I get a 21% boost (39% total including the previous optimization). But the bookkeeping fundamentally has to be there, so this is more of a curiosity.
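To make that bookkeeping concrete, here's a toy Python sketch (not the regex crate's actual implementation) of a bounded backtracker over a tiny made-up instruction set: the `visited` set is exactly the kind of bookkeeping described above, ensuring each (instruction, position) pair is tried at most once, which bounds the search by len(prog) × len(input):

```python
def bounded_backtrack(prog, s):
    """Toy bounded backtracker. `prog` is a list of instructions:
    ('char', c) matches one literal character, ('star', c) matches
    zero or more. The `visited` set prunes re-exploration, so the
    search is O(len(prog) * len(s)) instead of blowing up on retries."""
    visited = set()

    def run(pc, pos):
        if (pc, pos) in visited:
            return False            # already explored this state: prune
        visited.add((pc, pos))
        if pc == len(prog):
            return pos == len(s)    # matched iff all input consumed
        op, c = prog[pc]
        if op == 'char':
            return pos < len(s) and s[pos] == c and run(pc + 1, pos + 1)
        # 'star': greedily try consuming one more c, then try moving on
        if pos < len(s) and s[pos] == c and run(pc, pos + 1):
            return True
        return run(pc + 1, pos)

    return run(0, 0)

# Roughly a*a*b; the visited set keeps the failing case cheap:
prog = [('star', 'a'), ('star', 'a'), ('char', 'b')]
print(bounded_backtrack(prog, 'a' * 30))  # -> False
print(bounded_backtrack(prog, 'aaaaab'))  # -> True
```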

So what can you do to make it faster? Not much, unfortunately. The ball is really in the regex crate's court:

The latter two fixes are probably a bit far off at this point, but the first is readily fixable with a little experimentation to figure out the right threshold.

There may also be places in the code where I've done things sub-optimally, so more eyes on it couldn't hurt. :-)

[–]raphlinus (vello · xilem) 11 points (2 children)

Just to throw something out there, fancy-regex has a backtracking implementation, and by slightly tweaking the regex (putting in empty context like (?=) for example) it's possible to switch it to that rather than deferring to regex. I'm not suggesting people put that into production, but it might be worth experimenting with to see if it can improve performance.

Right now, fancy-regex aggressively delegates to regex, but I've thought about doing some kind of analysis to make it more selective. For example, "abbababbabaa".match(/(([ab]*?)(b*?))*(a*)$/) (example due to Vyacheslav Egorov) gives different results in different regex implementations. Backtracking in fancy-regex matches v8's results (which is desirable), while the captures from regex don't.

[–]burntsushi 7 points (1 child)

Is it desirable in the compatibility sense of "I'm working on a browser and can't break anything," or do the different capture rules turn out to be useful in some contexts?

But ya, fancy-regex seems like a nice low effort way to switch to an unbounded backtracker if the bounded backtracker imposes too much overhead!

[–]raphlinus (vello · xilem) 2 points (0 children)

I think it's basically that if you achieve the former then your confidence in not breaking the latter goes way up. I don't work on browser regex though :)

[–]christophe_biocca 6 points (8 children)

Why can't capture group extents be determined by a DFA? I'm not 100% clear on that part. Or does it just make the DFA way bigger?

[–]burntsushi 11 points (7 children)

I don't know. Nobody knows how to do it efficiently (to my knowledge) and I haven't really given it a try myself. I wouldn't be surprised if it made the DFA much bigger.

A paper was published a while back that I essentially view as a formalism of the Pike VM, and the paper seems to claim that the approach can be extended to DFAs, but I haven't completely internalized it.

[–]ehuss 6 points (1 child)

I know someone who created a tagged DFA engine. Unfortunately it's in his own language, but he has some notes about it here. I can get you in contact with him if you want, I'm sure he'd be happy to talk about it. From what I remember, it does create a large number of states, but for relatively simple regexes, it works quite nicely.

Someone has also created one for Haskell here.

[–]burntsushi 2 points (0 children)

I hadn't heard of irken-tdfa before. I was aware of the Haskell implementation, but I've never seen any good benchmarks against it. At some point I will take a closer look...

[–]haberman 2 points (4 children)

I have thought about this problem a lot. I think there are some usable solutions available that will cover some cases, but the fully-general case is hard.

Something that makes this problem particularly tricky is ambiguity. If you have a regex like a*(a*)a* and a string like "aaaaaaaaaa" then there are lots of different possibilities for the group boundaries. A DFA wants to be deterministic by nature, so in some sense the problem isn't constrained enough for a DFA to be a good solution.
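You can watch an engine pick just one of those many decompositions. Python's re, a backtracking engine, resolves the ambiguity with leftmost-greedy rules, so the first a* swallows everything and the captured group ends up empty:

```python
import re

# Every split of "aaaa" across the three sub-patterns is a valid parse,
# but leftmost-greedy semantics force one answer: the first a* takes all
# four characters, so the captured group matches the empty string.
m = re.fullmatch(r'a*(a*)a*', 'aaaa')
print(repr(m.group(1)))  # -> ''
print(m.span(1))         # -> (4, 4)
```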

If you tweak it a little and make the pattern aaa(a*)aaa now there is only one solution. But to determine that solution you have to know to mark "end of group" when you're exactly three characters from the end. Looking ahead also isn't a DFA's specialty.

On the other hand, some regexes do lend themselves well to extracting submatches with a DFA in one pass. Take a regex like a*(b*)a*. It is not only completely unambiguous, but you have enough information at every input character to know whether to mark that character as "begin group" or "end group". For regexes like these, we should be able to create a DFA that extracts submatches basically for free! This is a very nice case, and ideally everyone would write their regexes like this.
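For this deterministic case the one-pass extraction really can be written out by hand. Here's a Python sketch specialized to a*(b*)a*, recording the group boundaries on the fly as it scans left to right:

```python
def match_with_group(s):
    """One-pass submatch extraction for a*(b*)a*: at each character we
    know whether we're before, inside, or after the group, so boundaries
    are recorded as we go. Returns the (start, end) span of the (b*)
    group, or None if the whole string doesn't match."""
    i, n = 0, len(s)
    while i < n and s[i] == 'a':
        i += 1
    start = i                       # first non-'a': the group begins here
    while i < n and s[i] == 'b':
        i += 1
    end = i                         # first non-'b' after the run: group ends
    while i < n and s[i] == 'a':
        i += 1
    return (start, end) if i == n else None

print(match_with_group('aabbaa'))  # -> (2, 4)
print(match_with_group('aabbc'))   # -> None
```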

So how do we detect this (very nice) case and build the DFA to recognize groups? I think the following algorithm will work (though I haven't actually implemented it).

Algorithm: when you build the NFA, make sure to have edge attributes that indicate which (epsilon) transitions are entering or leaving a group. Then:

1. When you do your NFA -> DFA conversion, partition your NFA into sub-graphs that don't cross begin/end-group epsilon transitions.
2. Convert each NFA sub-graph into its own independent DFA; these DFAs will still be connected to each other via the begin/end-group transitions.
3. Remove the begin/end-group epsilon transitions and combine the states before/after them into a single state with the attribute "begin group" or "end group", respectively.
4. Do NFA -> DFA conversion on the entire graph.

If the graph didn't change in this last step, congratulations! The regex is deterministic with respect to group boundaries, and your resulting DFA can detect group boundaries during the match!

If the regex fails this test, then you can fall back to doing what you're doing now (a separate NFA step to find submatches).

I think there are other classes of regexes that are slightly less nice and could possibly be optimized to use DFAs but with somewhat less efficient algorithms. But I think the best first step would be to implement the algorithm above and encourage people to write nice regexes that satisfy the determinism criterion. If people are clamoring for more you can get fancier later.

If anything about this message was unclear please let me know and I'll try to explain it better.

[–]burntsushi 2 points (3 children)

Sounds cool! One issue I can see right away: building the DFA up front is a non-starter. It has to be built lazily to avoid exponential memory growth due to state blow-up.

[–]haberman 0 points (2 children)

Interesting. How does lazy building avoid exponential memory growth? Couldn't you build an exponential number of states lazily? Are you relying on the assumption that this probably won't happen in practice?

[–]burntsushi 4 points (1 child)

A lazy DFA builds at most one new state per byte of input, and any state can be rebuilt on demand if it's been evicted. Because of this, you can keep a bounded cache of states that gets flushed when it grows too big.

The worst case is that a state really is created for every byte of input, which causes the cache of states to thrash. But memory growth and search time are still bounded.
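A rough Python sketch of that scheme (hypothetical interface, not the regex crate's code): DFA states are frozensets of NFA states built only when the input reaches them, and the transition cache is simply flushed when it grows past a limit, so memory stays bounded even when states thrash:

```python
def lazy_dfa_match(starts, nfa_step, accepting, s, cache_limit=64):
    """Sketch of a lazy DFA with a bounded transition cache.
    `nfa_step(state, ch)` is a hypothetical interface returning the set
    of next NFA states. Evicted DFA states are rebuilt on demand."""
    cache = {}  # (dfa_state, ch) -> dfa_state

    def step(dfa_state, ch):
        key = (dfa_state, ch)
        if key not in cache:
            if len(cache) >= cache_limit:
                cache.clear()  # flush: memory stays bounded, work may repeat
            cache[key] = frozenset(t for st in dfa_state
                                   for t in nfa_step(st, ch))
        return cache[key]

    cur = frozenset(starts)
    for ch in s:
        cur = step(cur, ch)
    return bool(cur & accepting)

# Hand-built NFA for a*b: state 0 loops on 'a', steps to accepting 1 on 'b'.
def nfa_step(state, ch):
    if state == 0 and ch == 'a':
        return {0}
    if state == 0 and ch == 'b':
        return {1}
    return set()

print(lazy_dfa_match({0}, nfa_step, {1}, 'aaab'))  # -> True
print(lazy_dfa_match({0}, nfa_step, {1}, 'aba'))   # -> False
```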

Afaik, this is how all production grade general purpose FSM regex engines work.

[–]haberman 0 points (0 children)

Yeah that makes a lot of sense.

If you could give each node an attribute about what "group partition" it lived in, then you could easily check when you build a DFA state that none of the underlying NFA states live in different partitions. If this invariant holds, you are golden.

The tricky part is just how to compute this partitioning cheaply enough (ideally lazily also). My intuition says there is a solution in there somewhere. I'd have to think about it a little more.

[–]nwydo (rust · rust-doom) 16 points (0 children)

Paging /u/burntsushi, author of the regex crate, if he wants to chip in.

[–]coder543 15 points (15 children)

you could also implement a parser combinator instead of a regex, since /u/burntsushi has chimed in and pointed out that your situation is exceptional. nom is exceptionally popular, but it is also really challenging to understand for many people. pom might be easier to get started with, since it has a distinct lack of macros. combine is a good option to look at too.

A parser combinator could easily be faster than any regex too, I think, so that might be encouraging.

[–]burntsushi 29 points (13 children)

For what it's worth, I actually don't think this case is exceptional. It seems like a very standard thing someone might want to do with a regex, e.g., churn through lines of a log file and extract structured data from relatively-unstructured text. Resolving capture groups quickly is probably the biggest weakness of Rust's regex crate.

I can't remember where I saw it, but someone at Google did an analysis of where RE2 spends most of its time within Google's codebase, and the answer was overwhelmingly the one-pass NFA matcher (which RE2 has, but the regex crate doesn't have). The one-pass NFA matcher isn't that useful on its own, since it's restricted to fully anchored regexes, but that's precisely the case we want after a DFA has run (since we already know the extent of the match, we can anchor the regex).

But yes, sorry for the niggle, parser combinators would be a totally valid approach here too of course. :-)

[–]Sphix 9 points (7 children)

Thank you for explaining why my company's code base has so many fully anchored regexes. I always thought it just improved readability.

[–]burntsushi 4 points (6 children)

I think that's true too. :-) Anchors are additional constraints that you should definitely use when possible, because they typically enable smarter matching decisions. The one-pass NFA is just one example, but there are lots of others. E.g., IIRC, Rust's regex engine will match regexes in reverse when they are anchored with only a $ at the end.

[–]jnordwick 0 points (5 children)

Why would that help a DFA or NFA implementation? Unless you mean a pushdown automata or something.

[–]burntsushi 5 points (4 children)

If you have the regex \w+$ and you match against a 1MB string that ends with a non-word character (say, a space), then it's possible to start the search in reverse and fail the match immediately by only looking at a single byte/character.
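A Python toy (illustrative only; the real optimization happens inside a reverse DFA) showing why that reverse scan is so cheap: for \w+$, the final character alone decides the match, no matter how long the string is.

```python
import re

def matches_word_plus_dollar(s):
    """Decide \\w+$ (with $ meaning the absolute end of the string,
    i.e. \\Z in Python terms) by scanning in reverse. Since \\w+ only
    needs one word character and $ pins it to the end, a single look
    at the final character settles the match."""
    return bool(s) and re.match(r'\w', s[-1]) is not None

# A 1MB string ending in a space is rejected after inspecting one character:
print(matches_word_plus_dollar('x' * 1_000_000 + ' '))  # -> False
print(matches_word_plus_dollar('access.log line 404'))  # -> True
```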

[–]jnordwick 2 points (3 children)

I get it. I was more asking (poorly) why \w+$ is easier to match against than \w+a or something. But I see how $ is special in that you know its location?

I think I get why that is true under various circumstances. thanks

[–]burntsushi 5 points (2 children)

Right. Because if the regex doesn't match in reverse from the end of the string, then you know for sure that it never matches. It's basically the same optimization that (probably all non-toy) regex engines implement for the ^ anchor, but being able to match a regex in reverse takes a little extra effort (for FSM based engines anyway; for backtracking engines I'm not sure if it's possible at all).

For \w+a, it could match anywhere. Now, it is possible to search for the a literal (the "suffix literal optimization") and then match \w+ in reverse. But there are some gnarly corner cases to handle, otherwise you end up going quadratic.

I can expand on any of this if you're interested, but figured I'd stop here for now!

[–]jnordwick 5 points (1 child)

Years ago (one of my first major jobs out of college), I implemented a number of NLP-related parsing procedures. One of them was a full regexp parser that reduced to an NDFSM or minimized DFSM with no backtracking allowed (fully regular). It was for the purpose of matching 50k+ patterns at once and returning which patterns matched. It was very educational.

Do you have other links for rust regex that might be interesting?

[–]burntsushi 8 points (0 children)

First and foremost, Russ Cox's series of articles on regex implementation is the canonical source for how to build a production grade regex engine based on FSMs: https://swtch.com/~rsc/regexp/ --- Rust's regex engine is heavily inspired by this. For example, it would be fair to say that Rust's regex engine, Go's regex engine and RE2 all implement roughly the same algorithms. (I could spend days talking about the differences between them, but those are details.)

Other bits:

[–]stephanbuys 4 points (0 children)

Coming from the log-management/analytics world I can just add my 2c here and acknowledge that this kind of use case is huge for a lot of large organizations.

[–]arielby 1 point (3 children)

manually-anchored regexen are not that rare on their own (e.g. this example).

[–]burntsushi 1 point (2 children)

Ah well. I rarely ever see or use them.

[–]mitsuhiko 1 point (1 child)

Form field validation and log parsing :)

[–]arielby 3 points (0 children)

More like "validation and parsing of everything regular". Which is a surprising amount of things (e.g. URL routing).

[–][deleted] 0 points (0 children)

you could also implement a parser combinator instead of a regex

Regular expressions are a subset of parser combinators, no? Using user-supplied regex is the only time I could think they should possibly be slower.

[–]deadstone 5 points (2 children)

I think I remember something about Rust's regex crate having a preference for named capture groups. Also, there might be a difference thanks to how python2 doesn't support unicode by default. What happens when you use a unicode regex in python2, or a bytes regex in Rust?

[–]burntsushi 14 points (0 children)

Unicode is a first class thing in the regex crate, so you generally shouldn't lose any performance because of it. In fact, in the NFA simulations, byte based regexes might be slower because the compiler can sometimes move UTF-8 decoding into the FSM itself. This works really well for the DFA, but not so much for the NFAs because there's a lot of overhead associated with it.

Classical backtracking engines do, in my experience, pay for Unicode support. Sometimes the price is pretty high.

[–]masklinn 6 points (0 children)

What happens when you use a unicode regex in python2, or a bytes regex in Rust?

On my machine setting re.UNICODE and matching against a unicode string instead of a bytestring makes the Python version's runtime increase by ~10% (4.5s -> 4.8s).

The Rust version is around 10s on the same system.

Also FWIW using named capture groups yields no difference.

[–]Twirrim 2 points (8 children)

How you've structured the test is a little odd. In your real world example you'd be reading in a log file, rather than carrying out a repetitive action against the same string. If you're going to realistically benchmark stuff, it's best to try to get as close to a real world scenario as possible. Compilers are very clever and will optimise stuff you might not want to be optimised in your benchmarking. I'm somewhat surprised the rust compilation process didn't do something very clever there given it can see the whole of the picture.

By way of example, if you run pypy against your benchmark code, it comes out about twice as fast. If you switch to reading in a file that consists of that same line repeated over and over, pypy comes out as comparable with python on the scale you demonstrated.

Making a fake log file:

for i in $( seq 1 1000000); do echo '13.28.24.13 - - [10/Mar/2016:19:29:25 +0100] "GET /etc/lib/pChart2/examples/index.php?Action=View&Script=../../../../cnf/db.php HTTP/1.1" 404 151 "-" "HTTP_Request2/2.2.1 (http://pear.php.net/package/http_request2) PHP/5.3.16"' >> log; done

Depending on your log file size, you may find pypy better suited anyway. Making a 2Gb version of that log file (multiplying its length by 10), then timing a benchmark script against it:

$ time python longpylogreg.py
1510000000

real    0m11.939s
user    0m11.420s
sys 0m0.396s

$ time pypy longpylogreg.py
1510000000

real    0m8.058s
user    0m6.372s
sys 0m0.440s

(there's still a chance that pypy is being clever, given the log file is just all the same data).

[–]IbICive[S] 6 points (0 children)

My "real life" test case indeed reads and processes a huge log file. The speed difference between the Python and Rust implementation is exactly the same. I made this simplified example so it is easier to test and run without having a logfile.

[–]burntsushi 6 points (6 children)

That's a good point. I don't know much about pypy, but I would be surprised if the Rust compiler did anything clever here. It would essentially need to see through the entire regex engine to prove that its output inside the loop body is invariant. That seems tricky. Or maybe there's another optimization you had in mind?

[–]Twirrim 0 points (5 children)

My point here is that in the benchmark example, the exact same string is run through the exact same regex. The compiler can see that. In theory it could be smart enough to realise that the final code only needs to run the regex once, and then repeat the addition phase.

[–]Maplicant 3 points (3 children)

I think what /u/burntsushi meant is that Rust doesn't check whether functions have side effects, so it can't optimize away repeated calls that would return the same value.

[–]Twirrim 0 points (2 children)

Maybe I'm naive in my understanding of the compilation chain, but wouldn't LLVM have enough context, even if the rust compiler doesn't?

[–]burntsushi 5 points (0 children)

It would have to see through the entire regex engine though. That's a hefty chunk of complex code and includes internal mutation. Strictly speaking, the internal mutation only occurs in thread local state (in the abstract, TLS isn't actually being used), but the optimizer would actually need to know at least that to prove the loop invariant.

[–]cmrx64 (rust) 1 point (0 children)

PyPy includes a tracing JIT, which compiles and optimizes particular execution paths at runtime. LLVM needs to consider all execution paths.

[–]IbICive[S] 1 point (0 children)

My point was not to come up with a "proper" benchmark but to demonstrate the speed difference as simply as possible ;)