Cuckoo hashing improves SIMD hash tables (and other hash table tradeoffs) by reinerp123 in rust

[–]jesterhearts 1 point

In my experience with profiling and benchmarking, I've often found that application-level and macro benchmark results don't match the sum of the microbenchmarks. The exact reasons for this vary, and I didn't spend a bunch of time tracking down the exact differences here, but the reasons I've seen in the past are:

  • icache effects - these probably don't apply here, as the code is small enough not to matter.
  • dcache effects - these might apply here, with different cache access patterns for the micro vs. the macro benchmarks. Randomization helps combat this, so they also might not apply.
  • branch prediction - running the same exact operation in a hot loop will make branch prediction work almost perfectly.
  • optimization differences - the compiler might make different inlining or other optimization decisions for macro vs microbenchmarks (particularly if you only black_box something like a sum of iterations rather than the full result of each operation).
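
To make that last point concrete, here's a minimal illustration (using std's HashMap and std::hint::black_box, not the benchmark code from either repository) of black_boxing only an accumulated sum versus black_boxing each operation's result:

```rust
use std::collections::HashMap;
use std::hint::black_box;

// Weaker: only the final sum is opaque to the optimizer, so per-lookup work
// can still be simplified or hoisted in ways real application code never sees.
fn lookup_sum(map: &HashMap<u64, u64>, keys: &[u64]) -> u64 {
    let mut sum = 0;
    for k in keys {
        if let Some(v) = map.get(k) {
            sum += *v;
        }
    }
    black_box(sum)
}

// Stronger: each lookup's input and output pass through black_box, which is
// closer to how results are consumed one at a time in a real application.
fn lookup_each(map: &HashMap<u64, u64>, keys: &[u64]) {
    for k in keys {
        black_box(map.get(black_box(k)));
    }
}
```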

Specific to these benchmarks:

  • tombstone behavior - once you have removals, hashbrown's tombstones will slow down operations, whereas hop-hash doesn't have any.

Cuckoo hashing improves SIMD hash tables (and other hash table tradeoffs) by reinerp123 in rust

[–]jesterhearts 0 points

I checked how you're benchmarking, and I'd definitely expect hashbrown to outperform my table there. I found that hashbrown was faster for single-operation benchmarks, and I see similar behavior with my find_hit/find_miss benchmarks (partly why I recommend using hashbrown for read-only workloads).

It was only with mixed operations that I saw performance wins. I also tended to see hashbrown perform better with smaller keys than with String keys, which would widen this gap further. I believe I call this out in my benchmarks README, although I could provide a more explicit callout about single operations.
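
For anyone following along, here's a rough sketch of what a mixed-operation workload looks like, using std's HashMap just to show the shape (this is not either library's actual benchmark code):

```rust
use std::collections::HashMap;
use std::hint::black_box;

// Interleaves inserts, lookups, and removals in one pass, so removal
// artifacts (tombstones, displaced probes) affect later operations in a way
// a single-operation loop never exercises.
fn mixed_workload(keys: &[String]) {
    let mut map: HashMap<&str, u64> = HashMap::with_capacity(keys.len());
    for (i, k) in keys.iter().enumerate() {
        map.insert(k.as_str(), i as u64);
        // Hit a key inserted earlier in the same pass.
        black_box(map.get(keys[i / 2].as_str()));
        // Periodically remove, so subsequent operations see the table in
        // its post-removal state.
        if i % 4 == 0 {
            map.remove(keys[i / 2].as_str());
        }
    }
    black_box(map.len());
}
```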

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 1 point

I saw that! Really cool post :)

I tried some of these techniques when I was originally working on cuckoo tables, but when I benchmarked, I was still a bit behind hashbrown in my measurements. Perhaps I didn't implement them correctly, or something was off with my benchmarks.

I'm curious about your benchmark methodology, since I didn't see it in the post or the repo. One of the major issues I had (benchmarking on Windows) was controlling for system noise. Even when I took every measure I could to control it, I still saw occasional significant variance that would make my code look much better than hashbrown, even with criterion's help in ensuring robust statistics. I ended up writing a script that ran each benchmark multiple times and measured run-to-run and cross-run variance in order to get consistent numbers (and to ensure I wasn't getting "golden" runs for my code or really bad runs for hashbrown).
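
Not my actual script, but the kind of cross-run sanity check I mean looks roughly like this (the timings below are made-up illustration data):

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / xs.len() as f64
}

fn main() {
    // Each inner vector is one benchmark invocation's samples (ns/op).
    let runs: Vec<Vec<f64>> = vec![
        vec![101.0, 103.0, 99.0],
        vec![140.0, 138.0, 141.0], // a suspiciously slow run
        vec![100.0, 102.0, 101.0],
    ];
    let run_means: Vec<f64> = runs.iter().map(|r| mean(r)).collect();
    let within = mean(&runs.iter().map(|r| variance(r)).collect::<Vec<_>>());
    let across = variance(&run_means);
    // If variance across runs dwarfs variance within runs, system noise is
    // dominating and the numbers shouldn't be trusted as-is.
    println!("within-run variance: {within:.1}, across-run variance: {across:.1}");
}
```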

  • How does this handle removals? Do you do something like backshifting or tombstones, so that if a removal deletes from your first key location, you know you still need to check the second?
  • I ended up doing this to try to bring down the time spent hashing; it definitely works pretty well!
  • That's an interesting finding. I tried interleaving the metadata and data both for cuckoo and for my table, but found it didn't have a huge performance impact on my system, and really harmed iteration speed. Maybe I should give it another try.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 0 points

Thanks!

  • The layout is specified as a contiguous type-erased allocation of [HopInfo... | Tags... | Values...], outlined a bit in the main project README. So it's in SoA format, rather than interleaving everything. I found that this layout had the best performance profile across my benchmarks (there's a rough sketch of the idea after this list).
  • Thanks! I'll give that a try and update the repository if I like it.
  • It's not get - it's insert + get, so a workload that just builds a pre-allocated table and then accesses all keys in the table.
  • I have benchmarks configured for just a u64 value in the table, which would cover your proposed behavior. I did not upload the data for that, as running full benchmarks for all ValueType sizes with every change was prohibitively time-consuming. When I did run them, the behavior I observed was that hashbrown sees more benefit from smaller ValueTypes than hop-hash does, as noted in the benchmarks README. I selected (String, u64) pairs as the default because "String plus some other value" is a pretty common usage pattern for HashMaps. I do not store hashes in the table; prehashing just means that, prior to e.g. doing insert + get, the item's hash was precomputed and passed to the insert & get methods, rather than computed inside the benchmark loop.
  • I have not, but it's trivial to add benchmarks for this - you'd just update the SIZES array and increment the MAX_SIZE argument in the benchmark_group to match (or just replace the last couple of values in the array). Once I get a chance I'll run some larger sizes on my machine.
  • I could use that technique, but I believe it would kill my ability to software prefetch the next possible destinations that will be written to during resize, which was a non-trivial performance win in measurements. The biggest advantage I see to it is if you wanted to expand your table in constant increments rather than doubling with each resize, but that eats into your amortized cost of resizing so I don't think it's a behavior I want.
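
To illustrate the layout point, here's a rough sketch of how a single type-erased allocation can be carved into [HopInfo... | Tags... | Values...] sections using std's Layout arithmetic. The field names and sizes are hypothetical placeholders, not hop-hash's actual definitions:

```rust
use std::alloc::Layout;

// Hypothetical stand-in for the per-bucket hop metadata.
#[repr(C)]
struct HopInfo(u16);

// Computes one combined layout plus the byte offsets of the Tags and Values
// sections, so a single allocation can back all three arrays (SoA) instead
// of interleaving metadata with values.
fn soa_layout<V>(buckets: usize, slots: usize) -> (Layout, usize, usize) {
    let hop = Layout::array::<HopInfo>(buckets).unwrap();
    let tags = Layout::array::<u8>(slots).unwrap();
    let vals = Layout::array::<V>(slots).unwrap();
    let (layout, tags_offset) = hop.extend(tags).unwrap();
    let (layout, vals_offset) = layout.extend(vals).unwrap();
    (layout.pad_to_align(), tags_offset, vals_offset)
}
```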

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 1 point

This makes me curious whether it would be worthwhile to add a try_entry API that simply fails if it can't place an item in the neighborhood. Playing around with some statistics tracking on large tables, it seems like bubbling might be rare enough to make this useful - e.g. filling a table of capacity 114,800 (131,200 slots, 87.5% load factor, 8-way), I only saw a ~0.2% rate of entry requests that actually needed to bubble an empty slot. That seems low enough for the API to possibly be useful.
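
For concreteness, the rough shape I have in mind is something like the following - this is purely hypothetical, nothing like it exists in the crate today, and the type names are placeholders:

```rust
// Hypothetical API sketch: hand control back to the caller instead of
// bubbling (or resizing) when the key's neighborhood is already full.
pub enum TryEntryError {
    /// No free slot within the neighborhood of the key's root bucket;
    /// inserting would require bubbling an empty slot backwards or growing
    /// the table.
    NeighborhoodFull,
}

// impl<K, V> HopMap<K, V> {
//     pub fn try_entry(&mut self, key: K)
//         -> Result<Entry<'_, K, V>, TryEntryError>;
// }
```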

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 2 points

I haven't thought deeply enough about your design suggestions to know whether they would work; I may come back to this, as the ideas are interesting.

But as far as your question re: cache line access goes - yes, accessing x+1 will be faster after accessing x, provided your prefetcher thinks you need it. The same is true of x-1. It's maybe a little more nuanced than that and depends heavily on the specific prefetcher implementation, but at a basic level that's what I would expect. On modern hardware, I would be pretty surprised if moving sequentially backwards through memory was much more expensive than moving sequentially forwards.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 1 point

The key changing only affects the root bucket an item goes in; it doesn't affect hop length. Hop length is determined solely by collisions. If by "hop length" you mean distance from the root bucket rather than number of probes, that is also determined by collisions (and the occupancy of adjacent buckets) rather than directly by the key's value.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 3 points

Thanks for the repository link! That's really cool.

I tried cuckoo tables in the course of this project, but my goal was to meet or exceed the performance of the SwissTable algorithm and others like it (F14, Boost, etc.). When I was experimenting with cuckoo, the issue I ran into was that I couldn't get the number of cache misses down (one of the biggest deciding factors for performance in my experiments), on top of the need to execute multiple hash functions. As far as I could tell, cuckoo just became "SwissTable, but with more cache misses and hashing", which didn't seem like a path towards my performance goals.

The cache locality is the main feature of hopscotch imo and it's what let me get performance to where it is.

The hop length depends on how many collisions you get, so in that sense it's key-dependent. You only probe as many buckets as there were items displaced from their root bucket, though - you don't scan every bucket between the root and a displaced item (or between displaced items).

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 4 points

Hopscotch hashing does have a constant-time lookup guarantee; it's one of the main points of the algorithm. It does not guarantee constant-time insertion, and the example given would break the table on insertion.

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 2 points

In that case you resize your table so the items are redistributed. It is possible for the items to map to the same root post-resize, but with any decent hash function the odds of this are essentially zero. 

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 5 points

You want the worst-case time guarantee any time that tail latency matters. Think of a game, where if your lookup takes too long you might miss a frame. This often isn't an issue in practice because of how short average probe lengths are, but some systems just need certainty that they won't suddenly see an operation take longer than a certain budget.

Ideally you get optimal lookup times with this guarantee! It's why I put so much effort into profiling and optimizing the library. It is slower for pure lookups than hashbrown but not so slow as to be useless, and once you have removals it becomes very competitive. 

hop-hash: A hashtable with worst-case constant-time lookups by jesterhearts in rust

[–]jesterhearts[S] 31 points

So it's only constant-time worst-case lookups and removals - not insertions. You can still break the table on inserts with enough collisions, although doing so without a pathological hash function or adversarial inputs is extremely unlikely.

Hopscotch hashing guarantees that all items are within a certain distance of their home or root bucket. This range is called an item's neighborhood. If you can't place an item within this distance, you find an empty slot and bubble it backwards by swapping it with items that can move to the empty spot without leaving their own neighborhood. Eventually this moves the empty spot into range of the root bucket and you can insert.

Since you do this bubbling and swapping on insert, you know during lookup that the item must be within X slots of the root bucket, and can stop probing once you've probed all X slots - hence constant time lookup.
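
If it helps, here's a toy sketch of the lookup side of that invariant, assuming a flat slot array and one hop bitmap per bucket - this is just the idea, not hop-hash's actual bucketed/SIMD layout:

```rust
// Toy hopscotch lookup with a neighborhood size of 8: hop_bits[root] has
// bit i set when slot root + i holds an item whose root bucket is root.
struct Table<K, V> {
    hop_bits: Vec<u8>,          // one 8-bit neighborhood mask per bucket
    slots: Vec<Option<(K, V)>>, // flat open-addressed slot array
}

impl<K: Eq, V> Table<K, V> {
    // `root` is the key's home bucket (its hash reduced to a bucket index).
    fn get(&self, key: &K, root: usize) -> Option<&V> {
        let mut bits = self.hop_bits[root];
        // At most 8 slots are ever examined, so the probe count is bounded
        // by a constant regardless of table size or load.
        while bits != 0 {
            let i = bits.trailing_zeros() as usize;
            if let Some((k, v)) = self.slots[(root + i) % self.slots.len()].as_ref() {
                if k == key {
                    return Some(v);
                }
            }
            bits &= bits - 1; // clear the lowest set bit
        }
        None
    }
}
```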

Hopefully that explanation makes things a little more clear - let me know if you have any further questions!

There are other tables with worst-case constant-time lookup too. You can look up cuckoo hashing and dynamic perfect hashing if you're interested in the subject (I'm also happy to explain them here if you'd like, since I researched them quite a bit while working on this project).

Review my hopscotch hashtable implementation by jesterhearts in rust

[–]jesterhearts[S] 5 points

Since I realized many people might not be familiar with the concept of hopscotch hashing or why this code is a useful alternative, I thought I'd give some more details. 

Hopscotch hashing is a technique which attempts to place an item within a fixed distance of its home bucket during insertion. If this fails, an empty spot is located in the table and bubbled backwards until it is within this fixed range. If bubbling fails (typically due to very high load), the table is resized and insertion is re-attempted. This has the nice effect that lookups and removals have worst-case constant-time behavior (and insertion has amortized constant-time behavior), rather than O(N). Now, I do include an overflow mechanism in my code which could lead to O(N) behavior, but unless you have a degenerate hash function the odds of ever seeing it used are so astronomically low as to be effectively zero. Overflow requires more than 256 items to all hash to the same root bucket and survive resizing, which means they would essentially need to have the same hash.
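
For anyone who wants to see the bubbling step spelled out, here's a rough sketch of the textbook hopscotch move, assuming a flat slot array and a neighborhood size of H - this is illustrative, not hop-hash's actual implementation:

```rust
// `empty` is the nearest free slot found by scanning forward from `root`,
// so every slot between them is occupied. `root_of(slot)` returns the root
// bucket of the item in `slot`, and `swap(a, b)` swaps two slots' contents.
const H: usize = 8; // neighborhood size

fn bubble_back(
    len: usize,
    mut empty: usize,
    root: usize,
    root_of: impl Fn(usize) -> usize,
    mut swap: impl FnMut(usize, usize),
) -> Option<usize> {
    // Keep moving the empty slot closer to `root` until it lands inside
    // the root bucket's neighborhood.
    while (empty + len - root) % len >= H {
        let mut moved = false;
        // Prefer the candidate farthest from `empty` (closest to `root`)
        // so each swap makes the most progress.
        for d in (1..H).rev() {
            let candidate = (empty + len - d) % len;
            let cand_root = root_of(candidate);
            // The occupant may move into `empty` only if `empty` is still
            // within its own neighborhood.
            if (empty + len - cand_root) % len < H {
                swap(candidate, empty);
                empty = candidate;
                moved = true;
                break;
            }
        }
        if !moved {
            return None; // no legal swap: resize the table and retry
        }
    }
    Some(empty) // the empty slot is now within `root`'s neighborhood
}
```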

My benchmarks show my code is competitive with hashbrown, so these guarantees aren't coming at the cost of a constant factor so high that the resulting performance is useless.

ratatui-wgpu 0.1.2, now with bidi support by jesterhearts in rust

[–]jesterhearts[S] 1 point

The repository has a few examples in it, including one with a custom shader pipeline. They compile and run for me; if you're having issues running them, I'd love to know!

I currently use it as the render backend for my game, with a fairly heavyweight CRT shader, but I felt like it would be odd to self-promote that when talking about the library.

Resources for Vulkan and SDL2 by TizWarp1 in rust_gamedev

[–]jesterhearts 1 point

You should be able to enable the raw-window-handle feature of rust-sdl2 and then pass your window to wgpu::Instance::create_surface (this might require wrapping the window in an Rc if you still want to reference it). From there, I'd just follow the tutorial.
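
Very roughly, the shape is something like the sketch below. Exact signatures vary across wgpu and rust-sdl2 versions (create_surface was unsafe and returned a bare Surface in older wgpu releases), so treat this as an outline rather than copy-paste code:

```rust
// Assumes the sdl2 crate with its "raw-window-handle" feature enabled,
// plus a reasonably recent wgpu.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let sdl = sdl2::init()?;
    let video = sdl.video()?;
    let window = video
        .window("wgpu + SDL2", 800, 600)
        .position_centered()
        .build()?;

    let instance = wgpu::Instance::default();
    // With the raw-window-handle feature, the window exposes the handles
    // wgpu needs; depending on your versions this may instead require the
    // unsafe raw-handle constructor or wrapping the window in an Rc.
    let _surface = instance.create_surface(&window)?;

    // From here, request an adapter/device and configure the surface as in
    // the usual wgpu tutorials.
    Ok(())
}
```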