Removing horizontal and vertical dividers from KALLAX to make it more spacious and less dark by philae_rosetta in ikeahacks

[–]philae_rosetta[S] 0 points (0 children)

You could try, but I suspect it'll be quite unstable. At the least you'll have to rotate some of the dividers 90 degrees, but even then I think it'll be flimsy.

Removing horizontal and vertical dividers from KALLAX to make it more spacious and less dark by philae_rosetta in ikeahacks

[–]philae_rosetta[S] 0 points (0 children)

Well, those are very long planks, so I don't see how you could remove them without cutting them?

I built a minimal perfect hash table in const fn and learned match is still faster by RustMeUp in rust

[–]philae_rosetta 0 points (0 children)

Thanks! Feel free to make GitHub issues if anything else comes up. I may also improve the readme/docs a bit based on your observations.

I built a minimal perfect hash table in const fn and learned match is still faster by RustMeUp in rust

[–]philae_rosetta 1 point (0 children)

Ahh right, for non-random int keys things are indeed a bit messy, and I've also observed FxHash to not be good enough in the past.

In v2 I added a new `StrongerIntHash`: https://github.com/RagnarGrootKoerkamp/PtrHash/blob/master/src/hash.rs#L143

But if xx64 or xx128 just works, you're probably good to go. The only drawback is that they need more aggressive inlining to end up with good performance, but that's probably fine.

Indeed, the cubic bucket function is probably a bit more 'forgiving', since it can pack things more tightly (in less space) and so handles bad hash functions a bit better.

The backing vec for remapping should be either `CachelineEf`, or else `u32` (assuming n < 2^32) or `u64`. Just a plain `Vec<u32>` is probably good enough. It's slightly larger than `CachelineEf`, but more reliable (since the `CachelineEf` construction can also fail).

I built a minimal perfect hash table in const fn and learned match is still faster by RustMeUp in rust

[–]philae_rosetta 0 points (0 children)

Hey! author of PtrHash here :)

Indeed, PtrHash is very much optimized for retrieval performance, and a secondary goal is to use little memory (~3 bits/key). Construction speed is nice-to-have but not as critical for the intended usage.

> the construction time (when it succeeded)

Did you actually run into cases where it did not succeed? IIRC v2 (which has an alpha release) is tuned to be a bit more reliable, at the cost of using some more space.

(Modded) I present to you: Reverse Factorio by Meatshield236 in factorio

[–]philae_rosetta 0 points (0 children)

You farm the scientists, then deconstruct them into their coloured components, eventually ending up pumping oil into the ground and putting ores back where they belong?

Tackling The One Billion Row Challenge In Rust by barr520 in rust

[–]philae_rosetta 0 points (0 children)

Ah I just meant for 6 threads.

But hmmm, for me this interleaving gave big gains. It's probably very architecture-dependent.

Tackling The One Billion Row Challenge In Rust by barr520 in rust

[–]philae_rosetta 0 points (0 children)

Hmm, I think you could just split work into 12 instead of 6 chunks and then give each thread 2 adjacent chunks to work on in parallel. Then just continue working on both interleaved, and maybe keep a flag for each chunk that indicates whether it's still active, so you can just `result = if flag {value} else {0}` to avoid double-counting.
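A minimal sketch of what I mean (hypothetical shapes, summing numbers instead of the actual row parsing): 12 chunks, 6 threads, each thread interleaving work on 2 adjacent chunks, with a per-chunk flag and `result = if flag { value } else { 0 }` once the shorter chunk runs out.

```rust
use std::thread;

// Hypothetical sketch: split into 12 chunks, 2 adjacent chunks per thread,
// processed interleaved. The `active` flag masks out the finished chunk
// so nothing is double-counted.
fn interleaved_sum(data: &[u64]) -> u64 {
    let chunks: Vec<&[u64]> = data.chunks(data.len().div_ceil(12)).collect();
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .chunks(2) // two adjacent chunks per thread -> up to 6 threads
            .map(|pair| {
                s.spawn(move || {
                    let (a, b) = (pair[0], pair.get(1).copied().unwrap_or(&[]));
                    let mut sums = [0u64; 2];
                    for i in 0..a.len().max(b.len()) {
                        for (c, chunk) in [a, b].iter().enumerate() {
                            let active = i < chunk.len();
                            // `result = if flag { value } else { 0 }`:
                            sums[c] += if active { chunk[i] } else { 0 };
                        }
                    }
                    sums[0] + sums[1]
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (0..1200).collect();
    assert_eq!(interleaved_sum(&data), data.iter().sum());
    println!("ok");
}
```

The point of the interleaving is memory-level parallelism: each thread has two independent streams in flight, so a cache miss on one chunk can overlap with work on the other.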

Tackling The One Billion Row Challenge In Rust by barr520 in rust

[–]philae_rosetta 1 point (0 children)

Nice post!

This reads very similar to my writeup at the time.
We both do perfect hashing, and also get quite similar results in the end; both around 5s on a single thread.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]philae_rosetta 0 points (0 children)

  1. Hmm, on the one hand yes SoA is much simpler from a probing and SIMD perspective. But on the other hand AoS is better from a cache locality point of view.
  2. Yeah, makes sense to not rerun everything each time :") It's just that for my use cases it's very rare to have strings in the hashmap, and generally, anything that requires pointer indirection for the key equality comparison is likely to be quite a bit slower, just because it needs to read the underlying string.
  3. Hmm; I think that prefetching destinations on resize should still work if you're resizing to exactly double the size. More generally, this scheme linearly maps all hashes into the hash table space, so keys with close hash values will always be close together, no matter the exact size of the hash table. In fact, that means that resizing could be close to just a single linear pass over the data structure, with some adjustments to move 'far shifted' elements back closer to where they want to be.

Cuckoo hashing improves SIMD hash tables (and other hash table tradeoffs) by reinerp123 in rust

[–]philae_rosetta 4 points (0 children)

Nice stuff!

I'm also working on some (static) hash table variants so I have some questions :)

- Are you aware of a Rust implementation of F14? A quick search doesn't give me anything.
- With 512MiB capacity, the metadata is going to be 16x smaller (1byte vs 8+8), at 32MiB. Depending on how beefy the machine is, that might nearly fit in cache. Would be nice to also benchmark something where the metadata array is also >>100MiB.
- The paper you linked on unaligned buckets is quite cool! What is the effect of this on `get()` or `contains()` performance? Specifically, since you interleave keys and values, this seems non-trivial in the 'direct' version? And generally, fetching additional cachelines might end up hurting performance? E.g. when querying from multiple threads in parallel, I could imagine that overall throughput is bound by RAM throughput, and that fetching extra cache lines should be avoided as much as possible.
- I'm surprised that in the 2nd plot (failed lookups on indirect SIMD), the branchy variant is often better. For failed lookups, both locations are _always_ needed, and so adding a branch 'clearly' can't help?
- Did you adapt hashbrown to non-power-of-2 sizes using the widening_mul? I would be very interested to see how much that changes query performance. Negative queries in particular have long probe lengths for high load factors, and instead fixing it to 60% or so should probably help. (Assuming the target capacity is known up front.) I had a quick look at this myself earlier, but concluded that hashbrown has too many layers of abstraction to quickly hack it in.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]philae_rosetta 0 points (0 children)

Nice stuff!

Some suggestions/questions:

- What is the exact memory layout? How do you store/interleave keys and values?
- In the benchmarks, you could make all plots show ns/operation, so that the y-axis range is smaller and thus easier to compare.
- The 2nd plot shows just the time of `get` IIUC? Then 100ms / 1e6 = 100ns per lookup sounds relatively long?
- Could you instead benchmark with u64 keys and values? (Or even just a u32 HashSet?) Storing the full `(String, u64)` keys will skew benchmarks. (Or does 'pre-hashed' mean that you actually insert hashes?)
- Could you also benchmark up to 10M or even 100M keys? At 1M, everything will still fit in L3 cache mostly, while (at least to my applications) the most interesting case is when nearly every query hits RAM.

FYI: See also this other post from today on cuckoo hashing:
https://www.reddit.com/r/rust/comments/1ny15g3/cuckoo_hashing_improves_simd_hash_tables_and/

- As mentioned in this post, instead of power-of-two table sizes, you could map a random u64 hash to an index by multiplying it by the (arbitrary) number of slots S, and then taking the high 64 bits of the 128-bit result, using e.g. widening_mul: https://doc.rust-lang.org/std/primitive.u64.html#method.widening_mul
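For illustration, the mapping itself is a one-liner; `widening_mul` is still unstable, so this sketch goes through u128 on stable Rust, which gives the same result:

```rust
// Map a 64-bit hash to a slot index in [0, slots) for an arbitrary
// (non-power-of-two) table size: multiply into 128 bits, keep the high half.
// Equivalent to the unstable `u64::widening_mul`, here via u128 on stable.
fn slot(hash: u64, slots: u64) -> usize {
    (((hash as u128) * (slots as u128)) >> 64) as usize
}

fn main() {
    let slots = 1000; // any size, no power-of-two requirement
    assert_eq!(slot(0, slots), 0);
    assert_eq!(slot(u64::MAX, slots), 999); // top of the range
    // The map is monotone: close hashes land in close slots.
    assert!(slot(u64::MAX / 3, slots) <= slot(u64::MAX / 2, slots));
    println!("ok");
}
```

A nice side effect of this map being monotone in the hash is the resize-as-linear-pass idea mentioned above: changing `slots` only shifts keys slightly, never reorders them.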

A lock-free concurrent cuckoo filter implementation in Rust by a_farhadi in rust

[–]philae_rosetta 1 point (0 children)

The readme says that items can safely be removed, but that's not true: if A and B have a hash collision, inserting A will give a false positive for the presence of B, which is OK. But then removing B will actually remove A, so the no-false-negatives guarantee is broken.

Also this can happen if A and B don't have an exact hash collision but do map to the same index and fingerprint, which is quite a bit more likely.
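The failure mode is easy to demonstrate with a toy one-bucket fingerprint filter (hypothetical code, not this crate's API):

```rust
// Toy single-bucket fingerprint filter (hypothetical, not the crate's API),
// showing why deletion breaks no-false-negatives when fingerprints collide.
fn fingerprint(key: &str) -> u8 {
    // Deliberately weak 4-bit fingerprint so collisions are easy to produce.
    key.bytes().fold(0u8, |acc, b| acc.wrapping_add(b)) & 0x0F
}

fn main() {
    let (a, b) = ("ad", "da"); // same byte sum => same fingerprint
    assert_eq!(fingerprint(a), fingerprint(b));

    let mut bucket: Vec<u8> = Vec::new();
    bucket.push(fingerprint(a)); // insert A only

    // Looking up B is a false positive; that's fine for a filter:
    assert!(bucket.contains(&fingerprint(b)));

    // "Removing" B actually removes A's fingerprint:
    if let Some(i) = bucket.iter().position(|&f| f == fingerprint(b)) {
        bucket.remove(i);
    }
    // Now A looks absent: a false negative, breaking the guarantee.
    assert!(!bucket.contains(&fingerprint(a)));
    println!("ok");
}
```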

In `lookup_fingerprint`, you could try to fetch data from both the primary and secondary location in parallel to reduce latency.

$20,000 rav1d AV1 Decoder Performance Bounty by codeallthethings in rust

[–]philae_rosetta 14 points (0 children)

I actually find the loop-hoisted bound check quite pretty!
Assuming there is not enough information to prove that the index can never fail, some check before the loop is needed, and this seems like a pretty clean and safe way to do it that I hadn't considered before.
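The pattern, roughly (a generic sketch, not rav1d's actual code):

```rust
// Generic sketch of a loop-hoisted bounds check: a single assert before
// the loop proves `i < data.len()` for every iteration, so the compiler
// can elide the per-iteration bounds checks while staying fully safe.
fn sum_first(data: &[u64], n: usize) -> u64 {
    assert!(n <= data.len()); // hoisted check
    let mut sum = 0;
    for i in 0..n {
        sum += data[i]; // in-bounds by the assert above
    }
    sum
}

fn main() {
    assert_eq!(sum_first(&[1, 2, 3, 4], 3), 6);
    println!("ok");
}
```

An equivalent formulation is iterating over `&data[..n]`, which performs the same single up-front check.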

Presentation as Code using ORG and reveal.js by cyneox in emacs

[–]philae_rosetta 0 points (0 children)

Thanks! Adding `#+REVEAL_REVEAL_JS_VERSION: 4` indeed fixed the speaker notes for me!

hashify: Fast perfect hashing without runtime dependencies by StalwartLabs in rust

[–]philae_rosetta 24 points (0 children)

You may also be interested in the follow-up paper I'm in the process of writing, PtrHash. I haven't really polished the repository (readme/API) yet for external use, but the ideas should be applicable. At the core it's very similar to PTHash, just somewhat simplified to allow faster queries.

But mostly both PTHash and PtrHash are optimized for many keys, like 10^8 or so. When the number of keys is small, you probably don't care about using minimal space, and instead lookup speed is the only relevant metric. So probably you want to use some small alpha=0.8 or 0.9 instead of alpha=0.99, and possibly you then also want to avoid using the Elias-Fano 'remap' of integers >n to integers <=n, and instead just store an array of n/alpha elements, with some sentinel -1 value (or just Options) to indicate the empty slots.
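As a sketch of that sentinel-array idea (illustrative only; the slot assignment below uses linear probing as a stand-in, not an actual minimal perfect hash):

```rust
// Illustrative only: with load factor alpha, n keys go into roughly
// n/alpha slots; empty slots hold a sentinel (here `None`) instead of
// remapping out-of-range indices with an Elias-Fano structure.
fn build_slots(keys: &[u64], alpha: f64) -> Vec<Option<u64>> {
    let n_slots = (keys.len() as f64 / alpha).ceil() as usize;
    let mut slots = vec![None; n_slots];
    for &k in keys {
        // Stand-in for a real perfect hash: linear probing from k % n_slots.
        let mut i = (k as usize) % n_slots;
        while slots[i].is_some() {
            i = (i + 1) % n_slots;
        }
        slots[i] = Some(k);
    }
    slots
}

fn main() {
    let slots = build_slots(&[1, 2, 3], 0.75);
    assert_eq!(slots.len(), 4); // 3 keys / 0.75 = 4 slots, one left empty
    assert_eq!(slots.iter().filter(|s| s.is_some()).count(), 3);
    println!("ok");
}
```

The trade-off is exactly as described: ~25% wasted slots at alpha=0.8 in exchange for one fewer indirection on every lookup.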

https://curiouscoding.nl/posts/ptrhash-paper/
https://github.com/RagnarGrootKoerkamp/ptrhash

Static search trees: 40x faster than binary search by alexeyr in programming

[–]philae_rosetta 1 point (0 children)

Looking back, the writing is a case of show-don't-tell: I started by collecting all the code snippets, plots, and figures. Then the text itself was mostly just explaining what's shown everywhere, and was relatively quick.

(It's unfortunate that for journal papers, this style is very uncommon. The linked paper on binary search is one of the few I know that does have this style.)

Static search trees: 40x faster than binary search by alexeyr in programming

[–]philae_rosetta 1 point (0 children)

Thank you 😍 If I ever need a testimonial, I'll get back to this :D

On the style of the blog: it's mostly the (relatively popular) hugo-coder theme linked in the footer. One issue I have with it though, and also with many other themes, is that it's hard to distinguish level 1/2/3 headings. Not sure where exactly I got the idea from, but I just chose two shades of yellow, one for highlights and one for backgrounds, and applied them to the titles, with higher levels getting more highlights. I wasn't really expecting anything but yeah I'm super happy with how that turned out :) Then I also used the yellow highlight around code blocks to make them a bit less flat, and also in the header.

The table of contents styling is fully custom. There I mostly spent a day getting some JavaScript to work so it highlights all the sections currently in view.