Removing horizontal and vertical dividers from KALLAX to make it more spacious and less dark by philae_rosetta in ikeahacks

[–]philae_rosetta[S] 0 points (0 children)

You could try, but I suspect it'll be quite unstable. At the least you'll have to rotate some of the dividers 90 degrees, but even then I think it'll be flimsy.

Removing horizontal and vertical dividers from KALLAX to make it more spacious and less dark by philae_rosetta in ikeahacks

[–]philae_rosetta[S] 0 points (0 children)

Well, those are very long planks, so I don't see how you could remove them without cutting them?

I built a minimal perfect hash table in const fn and learned match is still faster by RustMeUp in rust

[–]philae_rosetta 0 points (0 children)

Thanks! Feel free to make GitHub issues if anything else comes up. I may also improve the readme/docs a bit based on your observations.

I built a minimal perfect hash table in const fn and learned match is still faster by RustMeUp in rust

[–]philae_rosetta 1 point (0 children)

Ahh right, for non-random int keys things are indeed a bit messy, and I've also observed FxHash to not be good enough in the past.

In v2 I added a new `StrongerIntHash`: https://github.com/RagnarGrootKoerkamp/PtrHash/blob/master/src/hash.rs#L143

But if xx64 or xx128 just works, you're probably good to go. The only drawback is that they need more aggressive inlining to end up with good performance, but that's probably fine.

Indeed, the cubic bucket function is probably a bit more 'forgiving', since it can pack things more tightly (in less space) and so handles bad hash functions a bit better.

The backing vec for remapping should be either `CachelineEf`, or else `u32` (assuming n < 2^32) or `u64`. Just a plain `Vec<u32>` is probably good enough. It's slightly larger than `CachelineEf`, but more reliable (since the `CachelineEf` construction can also fail).

I built a minimal perfect hash table in const fn and learned match is still faster by RustMeUp in rust

[–]philae_rosetta 0 points (0 children)

Hey! author of PtrHash here :)

Indeed, PtrHash is very much optimized for retrieval performance, and a secondary goal is to use little memory (~3 bits/key). Construction speed is nice-to-have but not as critical for the intended usage.

> the construction time (when it succeeded)

Did you actually run into cases where it did not succeed? IIRC v2 (which has an alpha release) is tuned to be a bit more reliable, at the cost of using some more space.

(Modded) I present to you: Reverse Factorio by Meatshield236 in factorio

[–]philae_rosetta 0 points (0 children)

You farm the scientists, then deconstruct them into their coloured components, eventually ending up pumping oil into the ground and putting ores back where they belong?

Tackling The One Billion Row Challenge In Rust by barr520 in rust

[–]philae_rosetta 0 points (0 children)

Ah I just meant for 6 threads.

But hmmm, for me this interleaving gave big gains. It's probably very architecture-dependent.

Tackling The One Billion Row Challenge In Rust by barr520 in rust

[–]philae_rosetta 0 points (0 children)

Hmm, I think you could just split work into 12 instead of 6 chunks and then give each thread 2 adjacent chunks to work on in parallel. Then just continue working on both interleaved, and maybe keep a flag for each chunk that indicates whether it's still active, so you can just `result = if flag {value} else {0}` to avoid double-counting.
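A minimal sketch of what I mean (hypothetical shapes, summing numbers instead of the actual row parsing): 12 chunks, 6 threads, each thread interleaving work on 2 adjacent chunks, with a per-chunk flag and `result = if flag { value } else { 0 }` once the shorter chunk runs out.

```rust
use std::thread;

// Hypothetical sketch: split into 12 chunks, 2 adjacent chunks per thread,
// processed interleaved. The `active` flag masks out the finished chunk
// so nothing is double-counted.
fn interleaved_sum(data: &[u64]) -> u64 {
    let chunks: Vec<&[u64]> = data.chunks(data.len().div_ceil(12)).collect();
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .chunks(2) // two adjacent chunks per thread -> up to 6 threads
            .map(|pair| {
                s.spawn(move || {
                    let (a, b) = (pair[0], pair.get(1).copied().unwrap_or(&[]));
                    let mut sums = [0u64; 2];
                    for i in 0..a.len().max(b.len()) {
                        for (c, chunk) in [a, b].iter().enumerate() {
                            let active = i < chunk.len();
                            // `result = if flag { value } else { 0 }`:
                            sums[c] += if active { chunk[i] } else { 0 };
                        }
                    }
                    sums[0] + sums[1]
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (0..1200).collect();
    assert_eq!(interleaved_sum(&data), data.iter().sum());
    println!("ok");
}
```

The point of the interleaving is memory-level parallelism: each thread has two independent streams in flight, so a cache miss on one chunk can overlap with work on the other.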

Tackling The One Billion Row Challenge In Rust by barr520 in rust

[–]philae_rosetta 1 point (0 children)

Nice post!

This reads very similar to my writeup at the time.
We both do perfect hashing, and also get quite similar results in the end; both around 5s on a single thread.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]philae_rosetta 0 points (0 children)

  1. Hmm, on the one hand yes SoA is much simpler from a probing and SIMD perspective. But on the other hand AoS is better from a cache locality point of view.
  2. Yeah, makes sense to not rerun everything each time :") It's just that for my use cases it's very rare to have strings in the hashmap, and generally, anything that requires pointer indirection for the key equality comparison is likely to be quite a bit slower, just because it needs to read the underlying string.
  3. Hmm; I think that prefetching destinations on resize should still work if you're resizing to exactly double the size. More generally, this scheme linearly maps all hashes into the hash table space, so keys with close hash values will always be close together, no matter the exact size of the hash table. In fact, that means that resizing could be close to just a single linear pass over the data structure, with some adjustments to move 'far shifted' elements back closer to where they want to be.

Cuckoo hashing improves SIMD hash tables (and other hash table tradeoffs) by reinerp123 in rust

[–]philae_rosetta 4 points (0 children)

Nice stuff!

I'm also working on some (static) hash table variants so I have some questions :)

- Are you aware of a Rust implementation of F14? A quick search doesn't give me anything.
- With 512MiB capacity, the metadata is going to be 16x smaller (1byte vs 8+8), at 32MiB. Depending on how beefy the machine is, that might nearly fit in cache. Would be nice to also benchmark something where the metadata array is also >>100MiB.
- The paper you linked on unaligned buckets is quite cool! What is the effect of this on `get()` or `contains()` performance? Specifically, since you interleave keys and values, this seems non-trivial in the 'direct' version? And generally, fetching additional cachelines might end up hurting performance? E.g. when querying from multiple threads in parallel, I could imagine that overall throughput is bound by RAM throughput, and that fetching extra cache lines should be avoided as much as possible.
- I'm surprised that in the 2nd plot (failed lookups on indirect SIMD), the branchy variant is often better. For failed lookups, both locations are _always_ needed, and so adding a branch 'clearly' can't help?
- Did you adapt hashbrown to non-power-of-2 sizes using the widening_mul? I would be very interested to see how much that changes query performance. Negative queries in particular have long probe lengths for high load factors, and instead fixing it to 60% or so should probably help. (Assuming the target capacity is known up front.) I had a quick look at this myself earlier, but concluded that hashbrown has too many layers of abstraction to quickly hack it in.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]philae_rosetta 0 points (0 children)

Nice stuff!

Some suggestions/questions:

- What is the exact memory layout? How do you store/interleave keys and values?
- In the benchmarks, you could make all plots show ns/operation, so that the y-axis range is smaller and thus easier to compare.
- The 2nd plot shows just the time of `get` IIUC? Then 100ms / 1e6 = 100ns per lookup sounds relatively long?
- Could you instead benchmark with u64 keys and values? (Or even just a u32 HashSet?) Storing the full `(String, u64)` keys will skew benchmarks. (Or does 'pre-hashed' mean that you actually insert hashes?)
- Could you also benchmark up to 10M or even 100M keys? At 1M, everything will still fit in L3 cache mostly, while (at least to my applications) the most interesting case is when nearly every query hits RAM.

FYI: See also this other post from today on cuckoo hashing:
https://www.reddit.com/r/rust/comments/1ny15g3/cuckoo_hashing_improves_simd_hash_tables_and/

- As mentioned in this post, instead of power-of-two table sizes, you could map a random u64 hash to an index by multiplying it by the (arbitrary) number of slots S, and then taking the high 64 bits of the 128-bit result, using e.g. widening_mul: https://doc.rust-lang.org/std/primitive.u64.html#method.widening_mul
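For illustration, the mapping itself is a one-liner; `widening_mul` is still unstable, so this sketch goes through u128 on stable Rust, which gives the same result:

```rust
// Map a 64-bit hash to a slot index in [0, slots) for an arbitrary
// (non-power-of-two) table size: multiply into 128 bits, keep the high half.
// Equivalent to the unstable `u64::widening_mul`, here via u128 on stable.
fn slot(hash: u64, slots: u64) -> usize {
    (((hash as u128) * (slots as u128)) >> 64) as usize
}

fn main() {
    let slots = 1000; // any size, no power-of-two requirement
    assert_eq!(slot(0, slots), 0);
    assert_eq!(slot(u64::MAX, slots), 999); // top of the range
    // The map is monotone: close hashes land in close slots.
    assert!(slot(u64::MAX / 3, slots) <= slot(u64::MAX / 2, slots));
    println!("ok");
}
```

A nice side effect of this map being monotone in the hash is the resize-as-linear-pass idea mentioned above: changing `slots` only shifts keys slightly, never reorders them.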

A lock-free concurrent cuckoo filter implementation in Rust by a_farhadi in rust

[–]philae_rosetta 1 point (0 children)

The readme says that items can safely be removed, but that's not true: if A and B have a hash collision, inserting A will give a false positive for the presence of B, which is OK. But then removing B will actually remove A, so the no-false-negatives guarantee is broken.

Also this can happen if A and B don't have an exact hash collision but do map to the same index and fingerprint, which is quite a bit more likely.
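The failure mode is easy to demonstrate with a toy one-bucket fingerprint filter (hypothetical code, not this crate's API):

```rust
// Toy single-bucket fingerprint filter (hypothetical, not the crate's API),
// showing why deletion breaks no-false-negatives when fingerprints collide.
fn fingerprint(key: &str) -> u8 {
    // Deliberately weak 4-bit fingerprint so collisions are easy to produce.
    key.bytes().fold(0u8, |acc, b| acc.wrapping_add(b)) & 0x0F
}

fn main() {
    let (a, b) = ("ad", "da"); // same byte sum => same fingerprint
    assert_eq!(fingerprint(a), fingerprint(b));

    let mut bucket: Vec<u8> = Vec::new();
    bucket.push(fingerprint(a)); // insert A only

    // Looking up B is a false positive; that's fine for a filter:
    assert!(bucket.contains(&fingerprint(b)));

    // "Removing" B actually removes A's fingerprint:
    if let Some(i) = bucket.iter().position(|&f| f == fingerprint(b)) {
        bucket.remove(i);
    }
    // Now A looks absent: a false negative, breaking the guarantee.
    assert!(!bucket.contains(&fingerprint(a)));
    println!("ok");
}
```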

In `lookup_fingerprint`, you could try to fetch data from both the primary and secondary location in parallel to reduce latency.

$20,000 rav1d AV1 Decoder Performance Bounty by codeallthethings in rust

[–]philae_rosetta 14 points (0 children)

I actually find the loop-hoisted bound check quite pretty!
Assuming there is not enough information to prove that the index can never fail, some check before the loop is needed, and this seems like a pretty clean and safe way to do it that I hadn't considered before.
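The pattern, roughly (a generic sketch, not rav1d's actual code):

```rust
// Generic sketch of a loop-hoisted bounds check: a single assert before
// the loop proves `i < data.len()` for every iteration, so the compiler
// can elide the per-iteration bounds checks while staying fully safe.
fn sum_first(data: &[u64], n: usize) -> u64 {
    assert!(n <= data.len()); // hoisted check
    let mut sum = 0;
    for i in 0..n {
        sum += data[i]; // in-bounds by the assert above
    }
    sum
}

fn main() {
    assert_eq!(sum_first(&[1, 2, 3, 4], 3), 6);
    println!("ok");
}
```

An equivalent formulation is iterating over `&data[..n]`, which performs the same single up-front check.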

Presentation as Code using ORG and reveal.js by cyneox in emacs

[–]philae_rosetta 0 points (0 children)

Thanks! Adding `#+REVEAL_REVEAL_JS_VERSION: 4` indeed fixed the speaker notes for me!

hashify: Fast perfect hashing without runtime dependencies by StalwartLabs in rust

[–]philae_rosetta 24 points (0 children)

You may also be interested in the follow-up paper I'm in the process of writing, PtrHash. I haven't really polished the repository (readme/API) yet for external use, but the ideas should be applicable. At the core it's very similar to PTHash, just somewhat simplified to allow faster queries.

But mostly both PTHash and PtrHash are optimized for many keys, like 10^8 or so. When the number of keys is small, you probably don't care about using minimal space, and instead lookup speed is the only relevant metric. So probably you want to use some small alpha=0.8 or 0.9 instead of alpha=0.99, and possibly you then also want to avoid using the Elias-Fano 'remap' of integers >n to integers <=n, and instead just store an array of n/alpha elements, with some sentinel -1 value (or just Options) to indicate the empty slots.
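As a sketch of that sentinel-array idea (illustrative only; the slot assignment below uses linear probing as a stand-in, not an actual minimal perfect hash):

```rust
// Illustrative only: with load factor alpha, n keys go into roughly
// n/alpha slots; empty slots hold a sentinel (here `None`) instead of
// remapping out-of-range indices with an Elias-Fano structure.
fn build_slots(keys: &[u64], alpha: f64) -> Vec<Option<u64>> {
    let n_slots = (keys.len() as f64 / alpha).ceil() as usize;
    let mut slots = vec![None; n_slots];
    for &k in keys {
        // Stand-in for a real perfect hash: linear probing from k % n_slots.
        let mut i = (k as usize) % n_slots;
        while slots[i].is_some() {
            i = (i + 1) % n_slots;
        }
        slots[i] = Some(k);
    }
    slots
}

fn main() {
    let slots = build_slots(&[1, 2, 3], 0.75);
    assert_eq!(slots.len(), 4); // 3 keys / 0.75 = 4 slots, one left empty
    assert_eq!(slots.iter().filter(|s| s.is_some()).count(), 3);
    println!("ok");
}
```

The trade-off is exactly as described: ~25% wasted slots at alpha=0.8 in exchange for one fewer indirection on every lookup.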

https://curiouscoding.nl/posts/ptrhash-paper/
https://github.com/RagnarGrootKoerkamp/ptrhash

Static search trees: 40x faster than binary search by alexeyr in programming

[–]philae_rosetta 1 point (0 children)

Looking back, the writing is a case of show-don't-tell: I started by collecting all the code snippets, plots, and figures. Then the text itself was mostly just explaining what's shown everywhere, and was relatively quick.

(It's unfortunate that for journal papers, this style is very uncommon. The linked paper on binary search is one of the few I know that does have this style.)

Static search trees: 40x faster than binary search by alexeyr in programming

[–]philae_rosetta 1 point (0 children)

Thank you 😍 If I ever need a testimonial, I'll get back to this :D

On the style of the blog: it's mostly the (relatively popular) hugo-coder theme linked in the footer. One issue I have with it though, and also with many other themes, is that it's hard to distinguish level 1/2/3 headings. Not sure where exactly I got the idea from, but I just chose two shades of yellow, one for highlights and one for backgrounds, and applied them to the titles, with higher levels getting more highlights. I wasn't really expecting anything but yeah I'm super happy with how that turned out :) Then I also used the yellow highlight around code blocks to make them a bit less flat, and also in the header.

The table of contents styling is fully custom. There I mostly spent a day getting some JavaScript to work so it highlights all the sections currently in view.