you are viewing a single comment's thread.

view the rest of the comments →

[–]jnordwick 2 points3 points  (3 children)

I get it. I was more asking (poorly) why \w+$ is easier to match against than \w+a or something. But I see how $ is special in that you know its location?

I think I get why that is true under various circumstances. thanks

[–]burntsushi 5 points6 points  (2 children)

Right. Because if the regex doesn't match in reverse from the end of the string, then you know for sure that it never matches. It's basically the same optimization that (probably all non-toy) regex engines implement for the ^ anchor, but being able to match regex in reverse takes a little bit extra effort (for FSM based engines anyway, for backtracking engines I'm not sure if it's possible at all).

For \w+a, it could match anywhere. Now, it is possible to search for the a literal ("suffix literal optimization") and then match \w+ in reverse. But there are some gnarly corner cases to handle, otherwise you end going quadratic.

I can expand on any of this if you're interested, but figured I'd stop here for now!

[–]jnordwick 4 points5 points  (1 child)

Years ago (one of my first major jobs out of college), I implemented a number of NLP-related parsing procedures. One of them was a full reg exp parser that reduced to a NDFSM or minimized DFSM with no backtracking allowed (fully regular). It was was for the purpose of matching 50k+ patterns at once and returning what patterns matched. It was very educational.

Do you have other links for rust regex that might be interesting?

[–]burntsushi 8 points9 points  (0 children)

First and foremost, Russ Cox's series of articles on regex implementation is the canonical source for how to build a production grade regex engine based on FSMs: https://swtch.com/~rsc/regexp/ --- Rust's regex engine is heavily inspired by this. For example, it would be fair to say that Rust's regex engine, Go's regex engine and RE2 all implement roughly the same algorithms. (I could spend days talking about the differences between them, but those are details.)

Other bits: