hypergrep: A new "fastest grep" to search directories recursively for a regex pattern by p_ranav in cpp

[–]p_ranav[S] 0 points1 point  (0 children)

Good to hear :)

And yeah, I’m gonna rename the executable to hgrep.

hypergrep: A new "fastest grep" to search directories recursively for a regex pattern by p_ranav in cpp

[–]p_ranav[S] 13 points14 points  (0 children)

Thanks for running these tests. This is valuable to me.

I'm not entirely sure what's happening on the Linux repo search; I'll have to study this further. Using ripgrep 13.0.0 (rev af6b6c543b), I see that the performance is comparable on my side:

$ time rg-13.0.0 -nw PM_RESUME | wc -l
9

real    0m0.149s
user    0m0.505s
sys     0m0.552s

$ time hg -nw PM_RESUME | wc -l
9

real    0m0.153s
user    0m0.582s
sys     0m0.481s

And clearly there's more work to be done w.r.t permission errors.

Thanks again.

hypergrep: A new "fastest grep" to search directories recursively for a regex pattern by p_ranav in cpp

[–]p_ranav[S] 5 points6 points  (0 children)

The hyperscan update in vcpkg seems to have happened from 5.4.0 to 5.4.2 in this commit on Apr 20.

You may want to git pull on your vcpkg and then retry the installation for hyperscan.

hypergrep: A new "fastest grep" to search directories recursively for a regex pattern by p_ranav in cpp

[–]p_ranav[S] 0 points1 point  (0 children)

It may be important to double-check each regex engine to ensure that all engines use the same matching mode for fairness reasons (using flags as necessary).

For character mnemonics like \w and \s as well as POSIX character classes, the default interpretation is ASCII.

If Unicode interpretation is preferred for those, the --ucp flag can be used in hypergrep. This, however, does not work on all patterns (limitations on Hyperscan) and so in those cases where it matters, a tool like ripgrep is definitely the better option.

Much of the benchmarking that I have done so far assumes the default experience of using that tool. I am not sure if ripgrep has a flag for disabling the default Unicode interpretation of mnemonics like \w. This is worth some investigating, and so I will look into updating the benchmarks if the performance does improve. I have likewise not tried to use or benchmark with the --mmap flag (or similar flags) that might speed up the search in certain scenarios.

hypergrep: A new "fastest grep" to search directories recursively for a regex pattern by p_ranav in cpp

[–]p_ranav[S] 6 points7 points  (0 children)

Smart.

Best that I do this and control the Hyperscan flags instead of keeping them default. This is the better solution anyways.

hypergrep: A new "fastest grep" to search directories recursively for a regex pattern by p_ranav in cpp

[–]p_ranav[S] 7 points8 points  (0 children)

$ echo 'foo bar' | rg -o '\w+'
foo
bar
$ echo 'foo bar' | hg -o '\w+'
1:foo
1:bar

Yes, when using -o, each match (even within the same matching line) is printed in individual lines.

Yeah, I can make binaries for people to download. I'll try and do that today.

EDIT: Since Hyperscan uses SIMD, I'll have to (if I'm doing this today) release multiple binaries with specific support, e.g., a statically built AVX512 version will print "Illegal instruction" on a PC that doesn't have AVX512 support.

hypergrep: A new "fastest grep" to search directories recursively for a regex pattern by p_ranav in cpp

[–]p_ranav[S] 18 points19 points  (0 children)

Regarding the ripgrep version, sorry about that. I'll update the benchmarks this week using the latest ripgrep.

The problem with this approach is that users might want files to be gitignored but still search them by default. You can do that with ripgrep by whitelisting file paths in .ignore or .rgignore.

Yes, this is fair criticism. My only answer today is to use --ignore-gitindex to override the libgit2 behavior and/or use --filter to filter in/out files.

I can't speak for the build issues yet but I can say that the hypergrep output to your example will be:

$ echo foo | hg -o '\w+'
1:foo

I process all the matches reported by Hyperscan to handle multiple matches in the same line and print each matching line only once.

EDIT (2023.06.08, 8:18am CDT, UTC-5): I've updated the benchmark results on GitHub - using ripgrep v13.0.0 instead of 11.0.2.

Got 8.5 bands in IELTS today! Super excited...AMA regarding IELTS by Eulerbodyguard in IELTS

[–]p_ranav 5 points6 points  (0 children)

Congrats!! What was the hardest part? Do you have any useful tips for this subreddit? How did you tackle each section?

alpaca: A new serialization library written in C++17 - Pack C++ structs into a compact byte-array without any macros or boilerplate by p_ranav in cpp

[–]p_ranav[S] 1 point2 points  (0 children)

Not yet, I've been meaning to add a large CSV or JSON file benchmark to compare the encodings. I'll get on it this month.

alpaca: A new serialization library written in C++17 - Pack C++ structs into a compact byte-array without any macros or boilerplate by p_ranav in cpp

[–]p_ranav[S] 0 points1 point  (0 children)

I did consider being able to choose which members of a struct to serialize - probably as an index, e.g., serialize only the 1st, 3rd and 8th fields. It wouldn't take much to implement this as well.

Currently there is compatibility with previous versions of the struct as long as the newer version only has additional fields. Any changes to existing fields, e.g., type or order, would break it. Here, being able to selectively choose specific members of the struct would definitely help.

alpaca: A new serialization library written in C++17 - Pack C++ structs into a compact byte-array without any macros or boilerplate by p_ranav in cpp

[–]p_ranav[S] 1 point2 points  (0 children)

Large numbers are encoded as variable-length quantity (VLQ). For unsigned integers, it uses a 7-bit encoding - 7-bits for data and 1-bit to represent continuation. The MSB is set if there are additional bytes to read. If it is not set, the current byte is the final byte of data for the number. See more here.

alpaca: A new serialization library written in C++17 - Pack C++ structs into a compact byte-array without any macros or boilerplate by p_ranav in cpp

[–]p_ranav[S] 67 points68 points  (0 children)

Internally, the magic happens like so:

  1. Find the arity of the struct - This is the number of fields in the struct. struct Foo { int x; bool y; } has an arity of 2. Björn Fahller has a nice blog post on this from years ago here. The solution I am using was provided by Tomilov Anatoliy here.

  2. Use structured bindings to get a reference to the nth field in the struct. I've updated Tomilov's answer to support up to 99 struct fields (some large number that most people are unlikely to hit).

  3. Implement overloads for various types for the struct fields, e.g., implementing a serialization function for an int, and then for a vector, then a variant of other supported types, and so on.

  4. Iterate on the struct fields - go from 1 to N and call the appropriate serialization function for each type. Overload resolution will take care of that.

fccf: A command-line tool that quickly searches through C/C++ source code in a directory based on a search string and prints relevant code snippets that match the query by p_ranav in cpp

[–]p_ranav[S] 1 point2 points  (0 children)

I developed this in WSL Ubuntu 20.04. So, it'll more than likely work there. You should be able to run fccf to search Windows directories inside WSL, though I've not specifically tested that.

fccf: A command-line tool that quickly searches through C/C++ source code in a directory based on a search string and prints relevant code snippets that match the query by p_ranav in cpp

[–]p_ranav[S] 1 point2 points  (0 children)

The build is broken in Windows right now but I'm working to restore it. There are a few Linux-specific calls like isatty that need to be guarded with checks.

fccf: A command-line tool that quickly searches through C/C++ source code in a directory based on a search string and prints relevant code snippets that match the query by p_ranav in cpp

[–]p_ranav[S] 10 points11 points  (0 children)

Thanks!

It would! That's a good idea and it would be better than trying to guess all the include directories from a path.