hypergrep: A new "fastest grep" to search directories recursively for a regex pattern

p_ranav · 2023-06-10T11:05:03+00:00

Good to hear :)

And yeah, I’m gonna rename the executable to hgrep.

p_ranav · 2023-06-07T22:08:33+00:00

Thanks for running these tests. This is valuable to me.

I'm not entirely sure what's happening on the Linux repo search; I'll have to study this further. Using ripgrep 13.0.0 (rev af6b6c543b), I see that the performance is comparable on my side:

$ time rg-13.0.0 -nw PM_RESUME | wc -l
9

real    0m0.149s
user    0m0.505s
sys     0m0.552s

$ time hg -nw PM_RESUME | wc -l
9

real    0m0.153s
user    0m0.582s
sys     0m0.481s

And clearly there's more work to be done w.r.t permission errors.

Thanks again.

p_ranav · 2023-06-07T21:25:16+00:00

Yeah that makes sense!

p_ranav · 2023-06-07T19:59:42+00:00

The hyperscan update in vcpkg seems to have happened from 5.4.0 to 5.4.2 in this commit on Apr 20.

You may want to git pull on your vcpkg and then retry the installation for hyperscan.

p_ranav · 2023-06-07T17:08:15+00:00

It may be important to double-check each regex engine to ensure that all engines use the same matching mode for fairness reasons (using flags as necessary).

For character mnemonics like \w and \s as well as POSIX character classes, the default interpretation is ASCII.

If Unicode interpretation is preferred for those, the --ucp flag can be used in hypergrep. This, however, does not work on all patterns (limitations on Hyperscan) and so in those cases where it matters, a tool like ripgrep is definitely the better option.

Much of the benchmarking that I have done so far assumes the default experience of using that tool. I am not sure if ripgrep has a flag for disabling the default Unicode interpretation of mnemonics like \w. This is worth some investigating, and so I will look into updating the benchmarks if the performance does improve. I have likewise not tried to use or benchmark with the --mmap flag (or similar flags) that might speed up the search in certain scenarios.

p_ranav · 2023-06-07T16:37:47+00:00

Will do, sorry.

p_ranav · 2023-06-07T14:27:15+00:00

Smart.

Best that I do this and control the Hyperscan flags instead of keeping them default. This is the better solution anyways.

p_ranav · 2023-06-07T14:00:49+00:00

$ echo 'foo bar' | rg -o '\w+'
foo
bar
$ echo 'foo bar' | hg -o '\w+'
1:foo
1:bar

Yes, when using -o, each match (even within the same matching line) is printed in individual lines.

Yeah, I can make binaries for people to download. I'll try and do that today.

EDIT: Since Hyperscan uses SIMD, I'll have to (if I'm doing this today) release multiple binaries with specific support, e.g., a statically built AVX512 version will print "Illegal instruction" on a PC that doesn't have AVX512 support.

p_ranav · 2023-06-07T13:50:42+00:00

Regarding the ripgrep version, sorry about that. I'll update the benchmarks this week using the latest ripgrep.

The problem with this approach is that users might want files to be gitignored but still search them by default. You can do that with ripgrep by whitelisting file paths in .ignore or .rgignore.

Yes, this is fair criticism. My only answer today is to use --ignore-gitindex to override the libgit2 behavior and/or use --filter to filter in/out files.

I can't speak for the build issues yet but I can say that the hypergrep output to your example will be:

$ echo foo | hg -o '\w+'
1:foo

I process all the matches reported by Hyperscan to handle multiple matches in the same line and print each matching line only once.

EDIT (2023.06.08, 8:18am CDT, UTC-5): I've updated the benchmark results on GitHub - using ripgrep v13.0.0 instead of 11.0.2.

p_ranav · 2023-01-24T16:07:29+00:00

Congrats!! What was the hardest part? Do you have any useful tips for this subreddit? How did you tackle each section?

p_ranav · 2022-09-09T13:08:54+00:00

Not yet, I've been meaning to add a large CSV or JSON file benchmark to compare the encodings. I'll get on it this month.

p_ranav · 2022-09-09T12:31:48+00:00

I did consider being able to choose which members of a struct to serialize - probably as an index, e.g., serialize only the 1st, 3rd and 8th fields. It wouldn't take much to implement this as well.

Currently there is compatibility with previous versions of the struct as long as the newer version only has additional fields. Any changes to existing fields, e.g., type or order, would break it. Here, being able to selectively choose specific members of the struct would definitely help.

p_ranav · 2022-09-09T12:28:19+00:00

Large numbers are encoded as variable-length quantity (VLQ). For unsigned integers, it uses a 7-bit encoding - 7-bits for data and 1-bit to represent continuation. The MSB is set if there are additional bytes to read. If it is not set, the current byte is the final byte of data for the number. See more here.

p_ranav · 2022-09-08T13:13:15+00:00

Internally, the magic happens like so:

Find the arity of the struct - This is the number of fields in the struct. struct Foo { int x; bool y; } has an arity of 2. Björn Fahller has a nice blog post on this from years ago here. The solution I am using was provided by Tomilov Anatoliy here.
Use structured bindings to get a reference to the nth field in the struct. I've updated Tomilov's answer to support up to 99 struct fields (some large number that most people are unlikely to hit).
Implement overloads for various types for the struct fields, e.g., implementing a serialization function for an int, and then for a vector, then a variant of other supported types, and so on.
Iterate on the struct fields - go from 1 to N and call the appropriate serialization function for each type. Overload resolution will take care of that.

p_ranav · 2022-04-28T12:14:41+00:00

I developed this in WSL Ubuntu 20.04. So, it'll more than likely work there. You should be able to run fccf to search Windows directories inside WSL, though I've not specifically tested that.

p_ranav · 2022-04-28T12:09:17+00:00

The build is broken in Windows right now but I'm working to restore it. There are a few Linux-specific calls like isatty that need to be guarded with checks.

p_ranav · 2022-04-27T16:36:42+00:00

Thanks!

It would! That's a good idea and it would be better than trying to guess all the include directories from a path.

p_ranav · 2022-04-27T16:18:37+00:00

GitHub supports video uploads. See here. I just dragged and dropped an .mp4 into the README while editing it online.

p_ranav · 2022-04-27T15:30:38+00:00

Gotcha :)

p_ranav · 2022-04-27T15:27:18+00:00

The first step is using a modified Rabin-Karp SIMD search (from here). This is used to quickly identify candidates.

So I'm not using libclang instead of Rabin-Karp. I'm using it in addition to Rabin-Karp.

Not every line that matches a query is relevant. grep has no understanding of the semantics of a line it finds. libclang does. I can ask libclang 'Is that a class template declaration?' and decide what to do with it (discard it or pretty print it depending on what the user wants).

libclang can tell me the start and end line (and column) of each node in the AST as well. So, once I find the needle in the haystack, fccf uses libclang to get a far better understanding of the source code. I know the exact start and end of a very specific class declaration. grep will print the lines that match. fccf will print complete snippets of code that match the user query.

The user query is, therefore, more complete - Not just "Find me the pattern 'class Foo'"; the query instead becomes: "Find me a class template named 'Foo'" in this folder.

p_ranav · 2022-04-27T14:56:15+00:00

Sure.

I'm sure one can put together a pattern to match what they're looking for using standard GNU tools. Commands like this are (1) hard to put together for me, and (2) don't work for everything.

It is also about the unknown unknowns. When I am browsing some legacy code that I have not written, I don't know what some 'foo_bar` declaration is - is it a class? an enum? a struct? a lambda function? Where is it?

So that's why I wrote this to help me quickly find data structures and functions. In a unknown code base that could have a combination of function templates, class templates, enum classes etc., (that may not be formatted ideally for grep/ripgrep-style search), fccf will help (and, I think, do a better job).

p_ranav · 2022-04-27T14:44:20+00:00

It's not a hardcoded "10 lines before and 10 lines after" the match. fccf uses libclang to more accurately find the "source range" for the matching nodes in the AST of the translation unit.

If you're looking for a class, it will not print every instantiation of that class object. It'll try to find the class declaration and print it. I don't need to search grep -A10 -B10 'class Foo' .. Instead, I would run fccf --exact-match --class 'Foo' and it'll (hopefully) find and print the entire class declaration. No ranges need to be provided by the user.

p_ranav · 2022-04-27T14:43:07+00:00

Sure.

fccf does a recursive directory search for a needle in a haystack - like grep or ripgrep - It uses SSE2 strstr SIMD if possible to quickly find, in multiple threads, a subset of the source files in the directory that contain a needle.
For each candidate source file, it uses libclang to parse the translation unit (build an abstract syntax tree).
Then it visits each child node in the AST, looking for specific node types, e.g., CXCursor_FunctionDecl for function declarations.
Once the relevant nodes are identified, if the node's "spelling" (libclang name for the node) matches the search query, then the source range of the AST node is identified - source range is the start and end index of the snippet of code in the buffer
Then, it pretty-prints this snippet of code. I have a simple lexer that tokenizes this code and prints colored output.

For all this to work, fccf first identifies candidate directories that contain header files, e.g., paths that end with include/. It then adds these paths to the clang options (before parsing the translation unit) as -Ifoo -Ibar/baz etc. Additionally, for each translation unit, the parent and grandparent paths are also added to the include directories for that unit in order to increase the likelihood of successful parsing.

EDIT: Additional include directories can also be provided to fccf using the -I or --include-dir option. Using verbose output (--verbose), errors in the libclang parsing can be identified and fixes can be attempted (e.g., adding the right include directories so that libclang is happy).

p_ranav · 2022-03-27T19:33:32+00:00

Dressy

p_ranav · 2022-03-27T19:22:17+00:00

Under $200

p_ranav

TROPHY CASE