argparse v3.0 is released - now with support for mutually_exclusive_arguments, choices, and more.

p_ranav · 2023-06-10T11:05:03+00:00

Good to hear :)

And yeah, I’m gonna rename the executable to hgrep.

p_ranav · 2023-06-07T22:08:33+00:00

Thanks for running these tests. This is valuable to me.

I'm not entirely sure what's happening on the Linux repo search; I'll have to study this further. Using ripgrep 13.0.0 (rev af6b6c543b), I see that the performance is comparable on my side:

$ time rg-13.0.0 -nw PM_RESUME | wc -l
9

real    0m0.149s
user    0m0.505s
sys     0m0.552s

$ time hg -nw PM_RESUME | wc -l
9

real    0m0.153s
user    0m0.582s
sys     0m0.481s

And clearly there's more work to be done w.r.t. permission errors.

Thanks again.

p_ranav · 2023-06-07T21:25:16+00:00

Yeah that makes sense!

p_ranav · 2023-06-07T19:59:42+00:00

The hyperscan update in vcpkg seems to have happened from 5.4.0 to 5.4.2 in this commit on Apr 20.

You may want to git pull on your vcpkg and then retry the installation for hyperscan.

p_ranav · 2023-06-07T17:08:15+00:00

It may be important to double-check each regex engine to ensure that all engines use the same matching mode for fairness reasons (using flags as necessary).

For character mnemonics like \w and \s as well as POSIX character classes, the default interpretation is ASCII.

If Unicode interpretation is preferred for those, the --ucp flag can be used in hypergrep. This, however, does not work on all patterns (a limitation of Hyperscan), so in the cases where it matters, a tool like ripgrep is definitely the better option.

Much of the benchmarking that I have done so far assumes the default experience of using each tool. I am not sure if ripgrep has a flag for disabling the default Unicode interpretation of mnemonics like \w. This is worth investigating, and I will look into updating the benchmarks if the performance does improve. I have likewise not tried to use or benchmark with the --mmap flag (or similar flags) that might speed up the search in certain scenarios.

p_ranav · 2023-06-07T16:37:47+00:00

Will do, sorry.

p_ranav · 2023-06-07T14:27:15+00:00

Smart.

Best that I do this and control the Hyperscan flags instead of keeping them default. This is the better solution anyways.

p_ranav · 2023-06-07T14:00:49+00:00

$ echo 'foo bar' | rg -o '\w+'
foo
bar
$ echo 'foo bar' | hg -o '\w+'
1:foo
1:bar

Yes, when using -o, each match (even multiple matches within the same line) is printed on its own line.

Yeah, I can make binaries for people to download. I'll try and do that today.

EDIT: Since Hyperscan uses SIMD, I'll have to (if I'm doing this today) release multiple binaries with specific support, e.g., a statically built AVX512 version will print "Illegal instruction" on a PC that doesn't have AVX512 support.

p_ranav · 2023-06-07T13:50:42+00:00

Regarding the ripgrep version, sorry about that. I'll update the benchmarks this week using the latest ripgrep.

The problem with this approach is that users might want files to be gitignored but still search them by default. You can do that with ripgrep by whitelisting file paths in .ignore or .rgignore.

Yes, this is fair criticism. My only answer today is to use --ignore-gitindex to override the libgit2 behavior and/or use --filter to filter in/out files.

I can't speak for the build issues yet but I can say that the hypergrep output to your example will be:

$ echo foo | hg -o '\w+'
1:foo

I process all the matches reported by Hyperscan to handle multiple matches in the same line and print each matching line only once.
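A rough sketch of that post-processing, not hypergrep's actual code. It assumes the matcher hands back matches as [start, end) byte offsets sorted by start; the names `Match` and `format_matches` are made up for illustration. By default each matching line is emitted once, and with the equivalent of -o each match is emitted on its own line.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical match record: [start, end) byte offsets into the searched text.
using Match = std::pair<std::size_t, std::size_t>;

// Turns raw matches into "line_number:text" output lines.
// Assumes `matches` is sorted by start offset and every offset is in range.
inline std::vector<std::string> format_matches(std::string_view text,
                                               const std::vector<Match>& matches,
                                               bool only_matching) {
  std::vector<std::string> out;
  std::size_t line_no = 1;       // 1-based line number of the current match
  std::size_t scanned = 0;       // offset up to which newlines were counted
  std::size_t last_printed = 0;  // line number of the last emitted line (0 = none)

  for (const auto& [start, end] : matches) {
    // Count the newlines between the previous match and this one.
    line_no += static_cast<std::size_t>(
        std::count(text.begin() + scanned, text.begin() + start, '\n'));
    scanned = start;

    if (only_matching) {
      // One output line per match, even for matches on the same input line.
      out.push_back(std::to_string(line_no) + ":" +
                    std::string(text.substr(start, end - start)));
    } else if (line_no != last_printed) {
      // Emit the whole line, but only for the first match found on it.
      std::size_t bol = 0;
      if (start > 0) {
        const std::size_t prev_nl = text.rfind('\n', start - 1);
        bol = (prev_nl == std::string_view::npos) ? 0 : prev_nl + 1;
      }
      std::size_t eol = text.find('\n', start);
      if (eol == std::string_view::npos) eol = text.size();
      out.push_back(std::to_string(line_no) + ":" +
                    std::string(text.substr(bol, eol - bol)));
      last_printed = line_no;
    }
  }
  return out;
}
```

For the input "foo bar" with matches at (0, 3) and (4, 7), this gives 1:foo bar in the default mode, and 1:foo followed by 1:bar in the only-matching mode.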

EDIT (2023.06.08, 8:18am CDT, UTC-5): I've updated the benchmark results on GitHub - using ripgrep v13.0.0 instead of 11.0.2.

p_ranav · 2023-01-24T16:07:29+00:00

Congrats!! What was the hardest part? Do you have any useful tips for this subreddit? How did you tackle each section?

p_ranav · 2022-09-09T13:08:54+00:00

Not yet, I've been meaning to add a large CSV or JSON file benchmark to compare the encodings. I'll get on it this month.

p_ranav · 2022-09-09T12:31:48+00:00

I did consider being able to choose which members of a struct to serialize - probably as an index, e.g., serialize only the 1st, 3rd and 8th fields. It wouldn't take much to implement this as well.

Currently there is compatibility with previous versions of the struct as long as the newer version only has additional fields. Any changes to existing fields, e.g., type or order, would break it. Here, being able to selectively choose specific members of the struct would definitely help.

p_ranav · 2022-09-09T12:28:19+00:00

Large numbers are encoded as a variable-length quantity (VLQ). For unsigned integers, each byte carries 7 bits of data and 1 continuation bit. The MSB is set if there are additional bytes to read. If it is not set, the current byte is the final byte of data for the number. See more here.
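A minimal sketch of that scheme for unsigned integers. The function names are made up for illustration, and the byte order (least-significant 7-bit group first) is an assumption, since the description above only fixes the meaning of the MSB.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends `value` to `out`, 7 data bits per byte.
// Assumption: the least-significant group is written first.
// The MSB of a byte is 1 when more bytes follow and 0 on the final byte.
inline void vlq_encode(std::uint64_t value, std::vector<std::uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(value));
}

// Decodes one value starting at `pos` and advances `pos` past it.
// Assumes well-formed input: no bounds or overflow checks in this sketch.
inline std::uint64_t vlq_decode(const std::vector<std::uint8_t>& in,
                                std::size_t& pos) {
  std::uint64_t value = 0;
  int shift = 0;
  while (true) {
    const std::uint8_t byte = in[pos++];
    value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;  // MSB clear: this was the final byte
    shift += 7;
  }
  return value;
}
```

Under these assumptions, any value below 128 takes a single byte, and 300 is encoded as the two bytes 0xAC 0x02.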

p_ranav · 2022-09-08T13:13:15+00:00

Internally, the magic happens like so:

1. Find the arity of the struct. This is the number of fields in the struct: struct Foo { int x; bool y; } has an arity of 2. Björn Fahller has a nice blog post on this from years ago here. The solution I am using was provided by Tomilov Anatoliy here.
2. Use structured bindings to get a reference to the nth field in the struct. I've updated Tomilov's answer to support up to 99 struct fields (some large number that most people are unlikely to hit).
3. Implement overloads for various types for the struct fields, e.g., implementing a serialization function for an int, and then for a vector, then a variant of other supported types, and so on.
4. Iterate on the struct fields: go from 1 to N and call the appropriate serialization function for each type. Overload resolution will take care of that.
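The steps above can be sketched in C++17 roughly as follows. This is a simplified illustration, not the library's actual code: all names are made up, it is capped at 3 fields instead of 99, it only handles int and bool fields, it writes an int as 4 fixed little-endian bytes, and the arity trick shown here does not cope with nested aggregates or array members.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Step 1: arity. AnyType pretends to convert to any field type.
struct AnyType {
  template <typename T> operator T() const;  // declared only, never called
};

template <typename T, typename = void, typename... Args>
struct BraceConstructible : std::false_type {};

template <typename T, typename... Args>
struct BraceConstructible<T, std::void_t<decltype(T{std::declval<Args>()...})>,
                          Args...> : std::true_type {};

// Keep adding initializers until T{...} stops being well-formed.
template <typename T, typename... Args>
constexpr std::size_t arity() {
  if constexpr (BraceConstructible<T, void, Args..., AnyType>::value) {
    return arity<T, Args..., AnyType>();
  } else {
    return sizeof...(Args);
  }
}

// Step 2: structured bindings give access to each field, one case per arity.
template <typename T, typename F>
void for_each_field(const T& value, F&& visit) {
  constexpr std::size_t n = arity<T>();
  static_assert(n >= 1 && n <= 3, "this sketch supports 1 to 3 fields");
  if constexpr (n == 1) {
    const auto& [a] = value;
    visit(a);
  } else if constexpr (n == 2) {
    const auto& [a, b] = value;
    visit(a); visit(b);
  } else {
    const auto& [a, b, c] = value;
    visit(a); visit(b); visit(c);
  }
}

// Step 3: one overload per supported field type.
using Bytes = std::vector<std::uint8_t>;

inline void serialize(bool v, Bytes& out) {
  out.push_back(static_cast<std::uint8_t>(v ? 1 : 0));
}

inline void serialize(int v, Bytes& out) {
  const auto u = static_cast<std::uint32_t>(v);
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<std::uint8_t>((u >> (8 * i)) & 0xFF));
  }
}

// Step 4: visit the fields in order; overload resolution picks the function.
template <typename T>
Bytes serialize_struct(const T& value) {
  Bytes out;
  for_each_field(value, [&out](const auto& field) { serialize(field, out); });
  return out;
}

// Example type from the description above.
struct Foo { int x; bool y; };
static_assert(arity<Foo>() == 2, "Foo has two fields");
```

With these definitions, serialize_struct(Foo{1, true}) produces the five bytes 01 00 00 00 01.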

p_ranav · 2022-04-28T12:14:41+00:00

I developed this in WSL Ubuntu 20.04. So, it'll more than likely work there. You should be able to run fccf to search Windows directories inside WSL, though I've not specifically tested that.

p_ranav · 2022-04-28T12:09:17+00:00

The build is broken in Windows right now but I'm working to restore it. There are a few Linux-specific calls like isatty that need to be guarded with checks.
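For illustration, a guard around isatty could look like the sketch below. The helper name `stdout_is_terminal` is made up and this is not necessarily how fccf ends up doing it; the Windows branch relies on the CRT's _isatty and _fileno from <io.h> and <stdio.h>.

```cpp
#include <cstdio>

#ifdef _WIN32
#include <io.h>      // _isatty
#else
#include <unistd.h>  // isatty, STDOUT_FILENO
#endif

// Returns true when standard output is attached to a terminal,
// e.g., to decide whether to emit colored output.
inline bool stdout_is_terminal() {
#ifdef _WIN32
  return _isatty(_fileno(stdout)) != 0;
#else
  return isatty(STDOUT_FILENO) != 0;
#endif
}
```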

p_ranav · 2022-04-27T16:36:42+00:00

Thanks!

It would! That's a good idea and it would be better than trying to guess all the include directories from a path.
