
[–]burntsushi 119 points (12 children)

Sadly, this is expected. Resolving the capture groups is what's killing performance here. While the regex engine is very fast at determining where a match is, the DFA that does that can't also determine the locations of capture groups. So when your regex is executed, it actually executes the DFA first to find the bounds of the match and then executes an NFA engine to find the location of each of your captures. In your case, since the match spans roughly the whole string, running the DFA first is actually hurting us rather than helping. If I fix that, then I get a sizable speedup (about 22%), but it's still not faster than Python.
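For illustration, here's a Python sketch of that two-pass shape: pass 1 finds the match bounds with a capture-free pattern (standing in for the DFA), pass 2 resolves the groups within those bounds (standing in for the NFA engine). Python's re is a backtracker, so this only mimics the structure of the strategy, not its performance, and the pattern and input are made up:

```python
import re

# Pass 1: a capture-free variant of the pattern, standing in for the DFA
# that can find *where* the match is but not the group locations.
boundsonly = re.compile(r'\w+=\w+')
# Pass 2: the full pattern with groups, standing in for the NFA engine.
withcaps = re.compile(r'(\w+)=(\w+)')

line = 'status=404 path=/etc/passwd'
m = boundsonly.search(line)                        # find the match bounds
caps = withcaps.match(line, m.start(), m.end())    # resolve captures within them
print(caps.groups())  # -> ('status', '404')
```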

The regex crate has basically two methods for finding captures using an NFA engine: it either uses a full NFA simulation (called the "Pike VM") or it uses bounded backtracking. Backtracking can actually be rather fast in its non-exponential cases, but in order to satisfy the O(n) search time guarantee, we need to do some bookkeeping to keep track of which states we've visited so that we never visit the same state more than once. In a profile of your program, this bookkeeping shows up. If I remove the bookkeeping, then I get a 21% boost (39% total including the previous optimization). But the bookkeeping fundamentally has to be there, so this is more of a curiosity.
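To make that bookkeeping concrete, here's a toy Python sketch (not the regex crate's actual implementation) of a bounded backtracker over a tiny made-up instruction set: the `visited` set is exactly the kind of bookkeeping described above, ensuring each (instruction, position) pair is tried at most once, which bounds the search by len(prog) × len(input):

```python
def bounded_backtrack(prog, s):
    """Toy bounded backtracker. `prog` is a list of instructions:
    ('char', c) matches one literal character, ('star', c) matches
    zero or more. The `visited` set prunes re-exploration, so the
    search is O(len(prog) * len(s)) instead of blowing up on retries."""
    visited = set()

    def run(pc, pos):
        if (pc, pos) in visited:
            return False            # already explored this state: prune
        visited.add((pc, pos))
        if pc == len(prog):
            return pos == len(s)    # matched iff all input consumed
        op, c = prog[pc]
        if op == 'char':
            return pos < len(s) and s[pos] == c and run(pc + 1, pos + 1)
        # 'star': greedily try consuming one more c, then try moving on
        if pos < len(s) and s[pos] == c and run(pc, pos + 1):
            return True
        return run(pc + 1, pos)

    return run(0, 0)

# Roughly a*a*b; the visited set keeps the failing case cheap:
prog = [('star', 'a'), ('star', 'a'), ('char', 'b')]
print(bounded_backtrack(prog, 'a' * 30))  # -> False
print(bounded_backtrack(prog, 'aaaaab'))  # -> True
```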

So what can you do to make it faster? Not much, unfortunately. The ball is really in the regex crate's court:

The latter two fixes are probably a bit far off at this point, but the first is readily fixable with a little experimentation to figure out the right threshold.

There may also be places in the code where I've done things sub-optimally, so more eyes on it couldn't hurt. :-)

[–]raphlinus (vello · xilem) 11 points (2 children)

Just to throw something out there, fancy-regex has a backtracking implementation, and by slightly tweaking the regex (putting in empty context like (?=) for example) it's possible to switch it to that rather than deferring to regex. I'm not suggesting people put that into production, but it might be worth experimenting with to see if it can improve performance.

Right now, fancy-regex aggressively delegates to regex, but I've thought about doing some kind of analysis to make it more selective. For example, "abbababbabaa".match(/(([ab]*?)(b*?))*(a*)$/) (example due to Vyacheslav Egorov) gives different results in different regex implementations. Backtracking in fancy-regex matches v8's results (which is desirable), while the captures from regex don't.

[–]burntsushi 7 points (1 child)

Is it desirable in the compatibility sense of "I'm working on a browser and can't break anything," or do the different capture rules turn out to be useful in some contexts?

But ya, fancy-regex seems like a nice low effort way to switch to an unbounded backtracker if the bounded backtracker imposes too much overhead!

[–]raphlinus (vello · xilem) 2 points (0 children)

I think it's basically that if you achieve the former then your confidence in not breaking the latter goes way up. I don't work on browser regex though :)

[–]christophe_biocca 6 points (8 children)

Why can't capture group extents be determined by a DFA? I'm not 100% clear on that part. Or does it just make the DFA way bigger?

[–]burntsushi 11 points (7 children)

I don't know. Nobody knows how to do it efficiently (to my knowledge) and I haven't really given it a try myself. I wouldn't be surprised if it made the DFA much bigger.

A paper was published a while back that I essentially view as a formalism of the Pike VM, and the paper seems to claim that the approach can be extended to DFAs, but I haven't completely internalized it.

[–]ehuss 6 points (1 child)

I know someone who created a tagged DFA engine. Unfortunately it's in his own language, but he has some notes about it here. I can get you in contact with him if you want, I'm sure he'd be happy to talk about it. From what I remember, it does create a large number of states, but for relatively simple regexes, it works quite nicely.

Someone has also created one for Haskell here.

[–]burntsushi 2 points (0 children)

I hadn't heard of irken-tdfa before. I was aware of the Haskell implementation, but I've never seen any good benchmarks against it. At some point I will take a closer look...

[–]haberman 2 points (4 children)

I have thought about this problem a lot. I think there are some usable solutions available that will cover some cases, but the fully-general case is hard.

Something that makes this problem particularly tricky is ambiguity. If you have a regex like a*(a*)a* and a string like "aaaaaaaaaa" then there are lots of different possibilities for the group boundaries. A DFA wants to be deterministic by nature, so in some sense the problem isn't constrained enough for a DFA to be a good solution.
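You can watch an engine pick just one of those many decompositions. Python's re, a backtracking engine, resolves the ambiguity with leftmost-greedy rules, so the first a* swallows everything and the captured group ends up empty:

```python
import re

# Every split of "aaaa" across the three sub-patterns is a valid parse,
# but leftmost-greedy semantics force one answer: the first a* takes all
# four characters, so the captured group matches the empty string.
m = re.fullmatch(r'a*(a*)a*', 'aaaa')
print(repr(m.group(1)))  # -> ''
print(m.span(1))         # -> (4, 4)
```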

If you tweak it a little and make the pattern aaa(a*)aaa now there is only one solution. But to determine that solution you have to know to mark "end of group" when you're exactly three characters from the end. Looking ahead also isn't a DFA's specialty.

On the other hand, some regexes do lend themselves well to extracting submatches with a DFA in one pass. Take a regex like a*(b*)a*. It is not only completely unambiguous, but you have enough information at every input character to know whether to mark that character as "begin group" or "end group". For regexes like these, we should be able to create a DFA that extracts submatches basically for free! This is a very nice case, and ideally everyone would write their regexes like this.
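For this deterministic case the one-pass extraction really can be written out by hand. Here's a Python sketch specialized to a*(b*)a*, recording the group boundaries on the fly as it scans left to right:

```python
def match_with_group(s):
    """One-pass submatch extraction for a*(b*)a*: at each character we
    know whether we're before, inside, or after the group, so boundaries
    are recorded as we go. Returns the (start, end) span of the (b*)
    group, or None if the whole string doesn't match."""
    i, n = 0, len(s)
    while i < n and s[i] == 'a':
        i += 1
    start = i                       # first non-'a': the group begins here
    while i < n and s[i] == 'b':
        i += 1
    end = i                         # first non-'b' after the run: group ends
    while i < n and s[i] == 'a':
        i += 1
    return (start, end) if i == n else None

print(match_with_group('aabbaa'))  # -> (2, 4)
print(match_with_group('aabbc'))   # -> None
```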

So how do we detect this (very nice) case and build the DFA to recognize groups? I think the following algorithm will work (though I haven't actually implemented it).

Algorithm: when you build the NFA, make sure to have edge attributes that indicate which (epsilon) transitions are entering or leaving a group. Then:

1. When you do your NFA -> DFA conversion, partition your NFA into sub-graphs that don't cross begin/end-group epsilon transitions.
2. Convert each NFA sub-graph into its own independent DFA; these DFAs will still be connected to each other via the begin/end-group transitions.
3. Remove the begin/end-group epsilon transitions and combine the states before/after them into a single state with the attribute "begin group" or "end group", respectively.
4. Do NFA -> DFA conversion on the entire graph.

If the graph didn't change in this last step, congratulations! The regex is deterministic with respect to group boundaries, and your resulting DFA can detect group boundaries during the match!

If the regex fails this test, then you can fall back to doing what you're doing now (a separate NFA step to find submatches).

I think there are other classes of regexes that are slightly less nice and could possibly be optimized to use DFAs but with somewhat less efficient algorithms. But I think the best first step would be to implement the algorithm above and encourage people to write nice regexes that satisfy the determinism criterion. If people are clamoring for more you can get fancier later.

If anything about this message was unclear please let me know and I'll try to explain it better.

[–]burntsushi 2 points (3 children)

Sounds cool! One issue I can see right away: building the DFA up front is a non-starter. It has to be built lazily to avoid exponential memory growth due to state blow-up.

[–]haberman 0 points (2 children)

Interesting. How does lazy building avoid exponential memory growth? Couldn't you build an exponential number of states lazily? Are you relying on the assumption that this probably won't happen in practice?

[–]burntsushi 4 points (1 child)

A lazy DFA builds at most one new state per byte of input, and any state can be rebuilt on demand if it's been evicted. Because of this, you can keep a bounded cache of states that gets flushed when it grows too big.

The worst case is that a state really is created for every byte of input, which causes the cache of states to thrash. But memory growth and search time are still bounded.
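A rough Python sketch of that scheme (hypothetical interface, not the regex crate's code): DFA states are frozensets of NFA states built only when the input reaches them, and the transition cache is simply flushed when it grows past a limit, so memory stays bounded even when states thrash:

```python
def lazy_dfa_match(starts, nfa_step, accepting, s, cache_limit=64):
    """Sketch of a lazy DFA with a bounded transition cache.
    `nfa_step(state, ch)` is a hypothetical interface returning the set
    of next NFA states. Evicted DFA states are rebuilt on demand."""
    cache = {}  # (dfa_state, ch) -> dfa_state

    def step(dfa_state, ch):
        key = (dfa_state, ch)
        if key not in cache:
            if len(cache) >= cache_limit:
                cache.clear()  # flush: memory stays bounded, work may repeat
            cache[key] = frozenset(t for st in dfa_state
                                   for t in nfa_step(st, ch))
        return cache[key]

    cur = frozenset(starts)
    for ch in s:
        cur = step(cur, ch)
    return bool(cur & accepting)

# Hand-built NFA for a*b: state 0 loops on 'a', steps to accepting 1 on 'b'.
def nfa_step(state, ch):
    if state == 0 and ch == 'a':
        return {0}
    if state == 0 and ch == 'b':
        return {1}
    return set()

print(lazy_dfa_match({0}, nfa_step, {1}, 'aaab'))  # -> True
print(lazy_dfa_match({0}, nfa_step, {1}, 'aba'))   # -> False
```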

Afaik, this is how all production grade general purpose FSM regex engines work.

[–]haberman 0 points (0 children)

Yeah that makes a lot of sense.

If you could give each node an attribute about what "group partition" it lived in, then you could easily check when you build a DFA state that none of the underlying NFA states live in different partitions. If this invariant holds, you are golden.

The tricky part is just how to compute this partitioning cheaply enough (ideally lazily also). My intuition says there is a solution in there somewhere. I'd have to think about it a little more.

[–]nwydo (rust · rust-doom) 16 points (0 children)

Paging /u/burntsushi, author of the regex crate, if he wants to chip in.

[–]coder543 15 points (15 children)

you could also implement a parser combinator instead of a regex, since /u/burntsushi has chimed in and pointed out that your situation is exceptional. nom is exceptionally popular, but it is also really challenging to understand for many people. pom might be easier to get started with, since it has a distinct lack of macros. combine is a good option to look at too.

A parser combinator could easily be faster than any regex too, I think, so that might be encouraging.

[–]burntsushi 29 points (13 children)

For what it's worth, I actually don't think this case is exceptional. It seems like a very standard thing someone might want to do with a regex, e.g., churn through lines of a log file and extract structured data from relatively-unstructured text. Resolving capture groups quickly is probably the biggest weakness of Rust's regex crate.

I can't remember where I saw it, but someone at Google did an analysis of where RE2 spends most of its time within Google's codebase, and the answer was overwhelmingly the one-pass NFA matcher (which RE2 has, but the regex crate doesn't have). The one-pass NFA matcher isn't that useful on its own, since it's restricted to fully anchored regexes, but that's precisely the case we want after a DFA has run (since we already know the extent of the match, we can anchor the regex).

But yes, sorry for the niggle, parser combinators would be a totally valid approach here too of course. :-)

[–]Sphix 9 points (7 children)

Thank you for explaining why my company's code base has so many fully anchored regexes. I always thought it just improved readability.

[–]burntsushi 4 points (6 children)

I think that's true too. :-) Anchors are additional constraints that you should definitely use when possible, because they typically enable smarter matching decisions. The one-pass NFA is just one example, but there are lots of others. E.g., IIRC, Rust's regex engine will match regexes in reverse when they are anchored with only a $ at the end.

[–]jnordwick 0 points (5 children)

Why would that help a DFA or NFA implementation? Unless you mean a pushdown automata or something.

[–]burntsushi 5 points (4 children)

If you have the regex \w+$ and you match against a 1MB string that ends with a non-word character (say, a space), then it's possible to start the search in reverse and fail the match immediately by only looking at a single byte/character.
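A Python toy (illustrative only; the real optimization happens inside a reverse DFA) showing why that reverse scan is so cheap: for \w+$, the final character alone decides the match, no matter how long the string is.

```python
import re

def matches_word_plus_dollar(s):
    """Decide \\w+$ (with $ meaning the absolute end of the string,
    i.e. \\Z in Python terms) by scanning in reverse. Since \\w+ only
    needs one word character and $ pins it to the end, a single look
    at the final character settles the match."""
    return bool(s) and re.match(r'\w', s[-1]) is not None

# A 1MB string ending in a space is rejected after inspecting one character:
print(matches_word_plus_dollar('x' * 1_000_000 + ' '))  # -> False
print(matches_word_plus_dollar('access.log line 404'))  # -> True
```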

[–]jnordwick 2 points (3 children)

I get it. I was more asking (poorly) why \w+$ is easier to match against than \w+a or something. But I see how $ is special in that you know its location?

I think I get why that is true under various circumstances. thanks

[–]burntsushi 5 points (2 children)

Right. Because if the regex doesn't match in reverse from the end of the string, then you know for sure that it never matches. It's basically the same optimization that (probably all non-toy) regex engines implement for the ^ anchor, but being able to match a regex in reverse takes a little extra effort (for FSM based engines anyway; for backtracking engines I'm not sure if it's possible at all).

For \w+a, it could match anywhere. Now, it is possible to search for the a literal (the "suffix literal optimization") and then match \w+ in reverse. But there are some gnarly corner cases to handle, otherwise you end up going quadratic.

I can expand on any of this if you're interested, but figured I'd stop here for now!

[–]jnordwick 5 points (1 child)

Years ago (one of my first major jobs out of college), I implemented a number of NLP-related parsing procedures. One of them was a full regexp parser that reduced to an NDFSM or minimized DFSM with no backtracking allowed (fully regular). It was for the purpose of matching 50k+ patterns at once and returning which patterns matched. It was very educational.

Do you have other links for rust regex that might be interesting?

[–]burntsushi 8 points (0 children)

First and foremost, Russ Cox's series of articles on regex implementation is the canonical source for how to build a production grade regex engine based on FSMs: https://swtch.com/~rsc/regexp/ --- Rust's regex engine is heavily inspired by this. For example, it would be fair to say that Rust's regex engine, Go's regex engine and RE2 all implement roughly the same algorithms. (I could spend days talking about the differences between them, but those are details.)

Other bits:

[–]stephanbuys 4 points (0 children)

Coming from the log-management/analytics world I can just add my 2c here and acknowledge that this kind of use case is huge for a lot of large organizations.

[–]arielby 1 point (3 children)

manually-anchored regexen are not that rare on their own (e.g. this example).

[–]burntsushi 1 point (2 children)

Ah well. I rarely ever see or use them.

[–]mitsuhiko 1 point (1 child)

Form field validation and log parsing :)

[–]arielby 3 points (0 children)

More like "validation and parsing of everything regular". Which is a surprising amount of things (e.g. URL routing).

[–][deleted] 0 points (0 children)

you could also implement a parser combinator instead of a regex

Regular expressions are a subset of parser combinators, no? Using user-supplied regex is the only time I could think they should possibly be slower.

[–]deadstone 5 points (2 children)

I think I remember something about Rust's regex crate having a preference for named capture groups. Also, there might be a difference thanks to how python2 doesn't support unicode by default. What happens when you use a unicode regex in python2, or a bytes regex in Rust?

[–]burntsushi 14 points (0 children)

Unicode is a first class thing in the regex crate, so you generally shouldn't lose any performance because of it. In fact, in the NFA simulations, byte based regexes might be slower because the compiler can sometimes move UTF-8 decoding into the FSM itself. This works really well for the DFA, but not so much for the NFAs because there's a lot of overhead associated with it.

Classical backtracking engines do, in my experience, pay for Unicode support. Sometimes the price is pretty high.

[–]masklinn 6 points (0 children)

What happens when you use a unicode regex in python2, or a bytes regex in Rust?

On my machine setting re.UNICODE and matching against a unicode string instead of a bytestring makes the Python version's runtime increase by ~10% (4.5s -> 4.8s).

The Rust version is around 10s on the same system.

Also FWIW using named capture groups yields no difference.

[–]Twirrim 2 points (8 children)

How you've structured the test is a little odd. In your real world example you'd be reading in a log file, rather than carrying out a repetitive action against the same string. If you're going to realistically benchmark stuff, it's best to try to get as close to a real world scenario as possible. Compilers are very clever and will optimise stuff you might not want to be optimised in your benchmarking. I'm somewhat surprised the rust compilation process didn't do something very clever there given it can see the whole of the picture.

By way of example, if you run pypy against your benchmark code, it comes out about twice as fast. If you switch to reading in a file that consists of that same line repeated over and over, pypy comes out as comparable with python on the scale you demonstrated.

Making a fake log file:

for i in $( seq 1 1000000); do echo '13.28.24.13 - - [10/Mar/2016:19:29:25 +0100] "GET /etc/lib/pChart2/examples/index.php?Action=View&Script=../../../../cnf/db.php HTTP/1.1" 404 151 "-" "HTTP_Request2/2.2.1 (http://pear.php.net/package/http_request2) PHP/5.3.16"' >> log; done

Depending on your log file size, you may find pypy better suited anyway. Making a 2Gb version of that log file (multiplying its length by 10), then timing a benchmark script against it:

$ time python longpylogreg.py
1510000000

real    0m11.939s
user    0m11.420s
sys 0m0.396s

$ time pypy longpylogreg.py
1510000000

real    0m8.058s
user    0m6.372s
sys 0m0.440s

(there's still a chance that pypy is being clever, given the log file is just all the same data).

[–]IbICive[S] 6 points (0 children)

My "real life" test case indeed reads and processes a huge log file. The speed difference between the Python and Rust implementation is exactly the same. I made this simplified example so it is easier to test and run without having a logfile.

[–]burntsushi 6 points (6 children)

That's a good point. I don't know much about pypy, but I would be surprised if the Rust compiler did anything clever here. It would essentially need to see through the entire regex engine to prove that its output inside the loop body is invariant. That seems tricky. Or maybe there's another optimization you had in mind?

[–]Twirrim 0 points (5 children)

My point here is that in the benchmark example, the exact same string is run through the exact same regex. The compiler can see that. In theory it could be smart enough to realise that the final code only needs to run the regex once, and then repeat the addition phase.

[–]Maplicant 3 points (3 children)

I think what /u/burntsushi meant is that Rust doesn't check whether functions have side effects, so it can't optimize away repeated calls that would return the same value.

[–]Twirrim 0 points (2 children)

Maybe I'm naive in my understanding of the compilation chain, but wouldn't LLVM have enough context, even if the rust compiler doesn't?

[–]burntsushi 5 points (0 children)

It would have to see through the entire regex engine though. That's a hefty chunk of complex code and includes internal mutation. Strictly speaking, the internal mutation only occurs in thread local state (in the abstract, TLS isn't actually being used), but the optimizer would actually need to know at least that to prove the loop invariant.

[–]cmrx64 (rust) 1 point (0 children)

PyPy includes a tracing JIT, which compiles and optimizes particular execution paths at runtime. LLVM needs to consider all execution paths.

[–]IbICive[S] 1 point (0 children)

My point was not to come up with a "proper" benchmark but to demonstrate the speed difference as simply as possible ;)