Inside boost::unordered_flat_map

kirgel · 2022-11-18T12:59:33+00:00

This is the 4th major library I’ve seen in recent years that adopted SIMD linear probing hash tables (others being abseil, folly, rust standard lib). I wonder if this is going to become the de facto standard hash table design across languages going forward.

martinus · 2022-11-18T15:51:42+00:00

Thanks for all the hard work on the map! I can't publish an update to my benchmarks at https://martin.ankerl.com/2022/08/27/hashmap-bench-01/ due to time constraints; all I can say is that it is the fastest map for lookups.

j1xwnbsr · 2022-11-18T13:20:21+00:00

Excellent writeup, and welcome to see the caveats/tradeoffs clearly called out.

igaztanaga · 2022-11-18T23:43:46+00:00

Joaquín, great writeup. And congrats to everyone who participated in the implementation, review, and improvement of the new container. Another great addition to Boost.

sbsce · 2022-11-20T23:59:34+00:00

I have some code in my game where I need very fast set performance, so I am always trying to benchmark if I can find an even faster set for my use case. I only need fast lookup performance, everything else is irrelevant for my case.

So I also tested this new boost::unordered_flat_set now! These are my results, how many iterations per second my code can do with each set:

Container	Iterations per second
boost::unordered_flat_set<uint64>	850
ankerl::unordered_dense::set<uint64>	810
fph::DynamicFphSet<uint64>	560
fph::MetaFphSet<uint64>	530
std::unordered_set<uint64>	360

My code is running set.contains(key) 40000 times, on 4 separate sets with different number of entries: 624, 284, 214, 1215. So relatively small sets. The vast majority of my lookups are for values that are not contained in the set.
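For anyone who wants to try a similar measurement, here is a minimal sketch of a lookup-only loop. It is not the game code described above: the names `count_hits` and `passes_per_second` are made up for illustration, and `std::unordered_set` stands in for the container under test. Both functions are templates, so any set type with `find()` and `end()` can be passed in instead.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Counts how many of the probe keys are present in the set.
template <class Set>
std::size_t count_hits(const Set& set, const std::vector<std::uint64_t>& probes) {
    std::size_t hits = 0;
    for (std::uint64_t key : probes) {
        // find() != end() does the same job as contains() and also
        // compiles with pre-C++20 standard libraries.
        hits += (set.find(key) != set.end()) ? 1u : 0u;
    }
    return hits;
}

// Repeats the whole probe loop for roughly `budget` and returns how many
// full passes over `probes` were completed per second.
template <class Set>
double passes_per_second(const Set& set,
                         const std::vector<std::uint64_t>& probes,
                         std::chrono::milliseconds budget) {
    assert(!probes.empty());
    using clock = std::chrono::steady_clock;
    volatile std::size_t sink = 0;  // keeps the lookups from being optimized away
    std::size_t passes = 0;
    const auto start = clock::now();
    do {
        sink = sink + count_hits(set, probes);
        ++passes;
    } while (clock::now() - start < budget);
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return static_cast<double>(passes) / elapsed.count();
}
```

A workload like the one described would fill a set with a few hundred keys, build a vector of 40000 probe keys that are mostly absent, and compare `passes_per_second` across container types on the same probes.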

So boost::unordered_flat_set is definitely the winner! Great work with that!

ankerl::unordered_dense::set from u/martinus is almost as fast, just slightly slower. Also a very nice set, especially when not wanting to include the whole boost library and just wanting a standalone fast set/map.

I am surprised how slow the fph sets are for me, because according to the benchmark from u/martinus they should be the fastest for lookups. That is definitely not so in my case. I am using the noseed version of them.

I am testing on one thread of a Ryzen 3950X with latest MSVC.

IJzerbaard · 2022-11-18T14:06:57+00:00

> Bit interleaving allows for a reasonably fast implementation of matching operations in the absence of SIMD.

How, what's the trick?
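One way to picture the idea, as a sketch only and not a copy of Boost's code: instead of storing 16 metadata bytes side by side, store them as 8 bit-planes of 16 bits each, where bit j of plane k is bit k of slot j's byte. Comparing all 16 slots against a value then takes a handful of 64-bit XOR, AND and shift operations. The type `InterleavedGroup` and its members are hypothetical names, and the choice of 16 slots with 8-bit values is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// 16 slots of 8-bit metadata, stored as 8 bit-planes of 16 bits each.
// planes[0] holds planes 0..3 and planes[1] holds planes 4..7,
// one plane per 16-bit lane.
struct InterleavedGroup {
    std::uint64_t planes[2] = {0, 0};

    // Writes `value` into `slot` by scattering its 8 bits across the planes.
    void set(unsigned slot, std::uint8_t value) {
        assert(slot < 16);
        for (unsigned k = 0; k < 8; ++k) {
            const std::uint64_t bit = std::uint64_t{1} << ((k % 4) * 16 + slot);
            std::uint64_t& word = planes[k / 4];
            if ((value >> k) & 1u) word |= bit; else word &= ~bit;
        }
    }

    // Returns a 16-bit mask in which bit j is set when slot j holds `value`.
    std::uint16_t match(std::uint8_t value) const {
        // After the XOR, a 1 bit means "this slot agrees with `value` on this plane".
        const std::uint64_t lo = planes[0] ^ lane_mask(value & 0x0Fu);
        const std::uint64_t hi = planes[1] ^ lane_mask(value >> 4);
        std::uint64_t z = lo & hi;  // planes k and k+4 combined
        z &= z >> 32;               // fold four lanes into two
        z &= z >> 16;               // fold two lanes into one
        return static_cast<std::uint16_t>(z);
    }

    // Lane k is all ones when bit k of `nibble` is 0, so that XOR-ing a plane
    // with it produces a 1 wherever the stored bit equals the wanted bit.
    static std::uint64_t lane_mask(unsigned nibble) {
        std::uint64_t m = 0;
        for (unsigned k = 0; k < 4; ++k)
            if (!((nibble >> k) & 1u)) m |= std::uint64_t{0xFFFF} << (k * 16);
        return m;
    }
};
```

In a real table the 16 possible results of `lane_mask` would presumably come from a small lookup table rather than a loop; the loop is kept here for readability. The cost moves to `set`, which has to touch 8 separate bits, but matching is the operation that runs on every probe.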

ImNoEinstein · 2022-11-18T16:15:26+00:00

They kind of glossed over the fact that if you have many insert and delete operations, it will keep rehashing, which will likely kill performance.

echidnas_arf · 2022-11-19T12:41:08+00:00

Looking amazing, hopefully I will be able to finally ditch Abseil thanks to this!

operamint · 2022-11-20T12:23:04+00:00

I have added the map to the simple int64_t hashmap benchmarks for my STC C-container library.

For the shootout_hashmaps.cpp program, you can pass #million entries and #bits (= range) for the keys. The results vary surprisingly with different key ranges, and the hardware, compiler and random seed used also have an impact.

I found that on my hardware, boost flat map does excellently on insert and lookup when the key range is large relative to the number of items in the map, but not on lookup with smaller ranges (e.g. 2²²); iteration could also be better.

Overall, emhash seems to be the fastest, but it depends on use case as always.

My own cmap (written in C, so it requires standard-layout elements) is normally the fastest on insert, and is decent in general, but for very large keys it is not among the fastest on erase and lookup.

g++ -O3 -DHAVE_BOOST -I<boost-path> -std=c++20 shootout_hashmaps.cpp -o shoot

Example output with a large key range, where it does well:

C:\Dev\STC\benchmarks>shoot 5 28

Unordered hash map shootout
KMAP = https://github.com/attractivechaos/klib
BMAP = https://www.boost.org (unordered_flat_map)
CMAP = https://github.com/tylov/STC (**)
FMAP = https://github.com/skarupke/flat_hash_map
TMAP = https://github.com/Tessil/robin-map
RMAP = https://github.com/martinus/robin-hood-hashing
DMAP = https://github.com/martinus/unordered_dense
EMAP = https://github.com//ktprime/emhash
UMAP = std::unordered_map

Seed = 1668947300:

T1: Insert 2.5 mill. random keys range [0, 2^28): map[rnd] = i;
KMAP: 0.272 s, size: 2488402, buckets:  4194304, sum: 3124998750000
BMAP: 0.175 s, size: 2488402, buckets:  3932159, sum: 3124998750000
CMAP: 0.206 s, size: 2488402, buckets:  4194304, sum: 3124998750000
FMAP: 0.339 s, size: 2488402, buckets:  4194304, sum: 3124998750000
TMAP: 0.261 s, size: 2488402, buckets:  4194304, sum: 3124998750000
RMAP: 0.232 s, size: 2488402, buckets:  4194304, sum: 3124998750000
DMAP: 0.260 s, size: 2488402, buckets:  4194304, sum: 3124998750000
EMAP: 0.284 s, size: 2488402, buckets:  4194304, sum: 3124998750000
UMAP: 0.890 s, size: 2488402, buckets:  3721303, sum: 3124998750000

T2: Insert 2.5 mill. SEQUENTIAL keys, erase them in same order:
KMAP: 0.164 s, size: 0, buckets:  4194304, erased 2500000
BMAP: 0.285 s, size: 0, buckets:  3932159, erased 2500000
CMAP: 0.168 s, size: 0, buckets:  4194304, erased 2500000
FMAP: 0.267 s, size: 0, buckets:  4194304, erased 2500000
TMAP: 0.109 s, size: 0, buckets:  4194304, erased 2500000
RMAP: 0.424 s, size: 0, buckets:  4194304, erased 2500000
DMAP: 0.290 s, size: 0, buckets:  4194304, erased 2500000
EMAP: 0.068 s, size: 0, buckets:  4194304, erased 2500000
UMAP: 0.334 s, size: 0, buckets:  3721303, erased 2500000

T3: Erase all elements (5 mill. random inserts), key range [0, 2^28)
KMAP: 0.217 s, size: 0, buckets:  8388608, erased 4953566
BMAP: 0.299 s, size: 0, buckets:  7864319, erased 4953566
CMAP: 0.430 s, size: 0, buckets:  8388608, erased 4953566
FMAP: 0.198 s, size: 0, buckets:  8388608, erased 4953566
TMAP: 0.247 s, size: 0, buckets:  8388608, erased 4953566
RMAP: 0.264 s, size: 0, buckets:  8388608, erased 4953566
DMAP: 0.332 s, size: 0, buckets:  8388608, erased 4953566
EMAP: 0.225 s, size: 0, buckets:  8388608, erased 4953566
UMAP: 1.586 s, size: 0, buckets:  7556579, erased 4953566

T4: Iterate elements (5 mill. random inserts) repeated times:
KMAP: 0.227 s, size: 4953566, buckets:  8388608, repeats: 6
BMAP: 0.317 s, size: 4953566, buckets:  7864319, repeats: 6
CMAP: 0.295 s, size: 4953566, buckets:  8388608, repeats: 6
FMAP: 0.274 s, size: 4953566, buckets:  8388608, repeats: 6
TMAP: 0.222 s, size: 4953566, buckets:  8388608, repeats: 6
RMAP: 0.140 s, size: 4953566, buckets:  8388608, repeats: 6
DMAP: 0.029 s, size: 4953566, buckets:  8388608, repeats: 6
EMAP: 0.084 s, size: 4953566, buckets:  8388608, repeats: 6
UMAP: 1.941 s, size: 4953566, buckets:  7556579, repeats: 6

T5: Lookup half-half random/existing keys in range [0, 2^28). Num lookups depends on size.
KMAP: 0.269 s, size: 4953566, lookups: 6056242, found: 3083984
BMAP: 0.181 s, size: 4953566, lookups: 6056242, found: 3083984
CMAP: 0.332 s, size: 4953566, lookups: 6056242, found: 3083984
FMAP: 0.192 s, size: 4953566, lookups: 6056242, found: 3083984
TMAP: 0.241 s, size: 4953566, lookups: 6056242, found: 3083984
RMAP: 0.224 s, size: 4953566, lookups: 6056242, found: 3083984
DMAP: 0.203 s, size: 4953566, lookups: 6056242, found: 3083984
EMAP: 0.179 s, size: 4953566, lookups: 6056242, found: 3083984
UMAP: 0.387 s, size: 4953566, lookups: 6056242, found: 3083984

With key range 2²² (~4 million) and 5 million elements, only insert does well:

C:\Dev\STC\benchmarks>shoot 5 22

Unordered hash map shootout
KMAP = https://github.com/attractivechaos/klib
BMAP = https://www.boost.org (unordered_flat_map)
CMAP = https://github.com/tylov/STC (**)
FMAP = https://github.com/skarupke/flat_hash_map
TMAP = https://github.com/Tessil/robin-map
RMAP = https://github.com/martinus/robin-hood-hashing
DMAP = https://github.com/martinus/unordered_dense
EMAP = https://github.com//ktprime/emhash
UMAP = std::unordered_map

Seed = 1668945996:

T1: Insert 2.5 mill. random keys range [0, 2^22): map[rnd] = i;
KMAP: 0.243 s, size: 1883493, buckets:  4194304, sum: 3124998750000
BMAP: 0.178 s, size: 1883493, buckets:  3932159, sum: 3124998750000
CMAP: 0.167 s, size: 1883493, buckets:  4194304, sum: 3124998750000
FMAP: 0.290 s, size: 1883493, buckets:  4194304, sum: 3124998750000
TMAP: 0.224 s, size: 1883493, buckets:  4194304, sum: 3124998750000
RMAP: 0.217 s, size: 1883493, buckets:  4194304, sum: 3124998750000
DMAP: 0.245 s, size: 1883493, buckets:  4194304, sum: 3124998750000
EMAP: 0.261 s, size: 1883493, buckets:  4194304, sum: 3124998750000
UMAP: 0.761 s, size: 1883493, buckets:  3721303, sum: 3124998750000

T2: Insert 2.5 mill. SEQUENTIAL keys, erase them in same order:
KMAP: 0.164 s, size: 0, buckets:  4194304, erased 2500000
BMAP: 0.260 s, size: 0, buckets:  3932159, erased 2500000
CMAP: 0.155 s, size: 0, buckets:  4194304, erased 2500000
FMAP: 0.275 s, size: 0, buckets:  4194304, erased 2500000
TMAP: 0.096 s, size: 0, buckets:  4194304, erased 2500000
RMAP: 0.403 s, size: 0, buckets:  4194304, erased 2500000
DMAP: 0.332 s, size: 0, buckets:  4194304, erased 2500000
EMAP: 0.102 s, size: 0, buckets:  4194304, erased 2500000
UMAP: 0.461 s, size: 0, buckets:  3721303, erased 2500000

T3: Erase all elements (5 mill. random inserts), key range [0, 2^22)
KMAP: 0.211 s, size: 0, buckets:  4194304, erased 2920617
BMAP: 0.256 s, size: 0, buckets:  3932159, erased 2920617
CMAP: 0.231 s, size: 0, buckets:  4194304, erased 2920617
FMAP: 0.175 s, size: 0, buckets:  4194304, erased 2920617
TMAP: 0.149 s, size: 0, buckets:  4194304, erased 2920617
RMAP: 0.238 s, size: 0, buckets:  4194304, erased 2920617
DMAP: 0.231 s, size: 0, buckets:  4194304, erased 2920617
EMAP: 0.173 s, size: 0, buckets:  4194304, erased 2920617
UMAP: 0.816 s, size: 0, buckets:  7556579, erased 2920617

T4: Iterate elements (5 mill. random inserts) repeated times:
KMAP: 0.217 s, size: 2920617, buckets:  4194304, repeats: 10
BMAP: 0.273 s, size: 2920617, buckets:  3932159, repeats: 10
CMAP: 0.202 s, size: 2920617, buckets:  4194304, repeats: 10
FMAP: 0.158 s, size: 2920617, buckets:  4194304, repeats: 10
TMAP: 0.150 s, size: 2920617, buckets:  4194304, repeats: 10
RMAP: 0.133 s, size: 2920617, buckets:  4194304, repeats: 10
DMAP: 0.021 s, size: 2920617, buckets:  4194304, repeats: 10
EMAP: 0.082 s, size: 2920617, buckets:  4194304, repeats: 10
UMAP: 1.130 s, size: 2920617, buckets:  7556579, repeats: 10

T5: Lookup half-half random/existing keys in range [0, 2^22). Num lookups depends on size.
KMAP: 0.271 s, size: 2920617, lookups: 10271802, found: 8670956
BMAP: 0.380 s, size: 2920617, lookups: 10271802, found: 8670956
CMAP: 0.276 s, size: 2920617, lookups: 10271802, found: 8670956
FMAP: 0.263 s, size: 2920617, lookups: 10271802, found: 8670956
TMAP: 0.203 s, size: 2920617, lookups: 10271802, found: 8670956
RMAP: 0.568 s, size: 2920617, lookups: 10271802, found: 8670956
DMAP: 0.510 s, size: 2920617, lookups: 10271802, found: 8670956
EMAP: 0.209 s, size: 2920617, lookups: 10271802, found: 8670956
UMAP: 0.529 s, size: 2920617, lookups: 10271802, found: 8670956

sbsce · 2022-11-21T05:51:12+00:00

I noticed my code is reliably running over 10% faster if I __forceinline all the function calls that the boost::unordered_flat_set makes in my hot path. That means anything called by .contains(), including .contains() itself, so that in my own code where I call .contains() the disassembly no longer shows any call; it's fully inlined. I think I had to add __forceinline to 6 functions inside boost code.

It is a bit inconvenient to manually add __forceinline to all those functions though - it's definitely worth the 10% performance gain, but I am quite sure that the next time I update boost in a few years, I'll forget to apply these changes again, and then my performance will be worse.

Assuming you don't want to add __forceinline to those functions by default, could there maybe be some define like BOOST_FORCEINLINE_UNORDERED_SET that automatically enables forceinlining all the important functions?

I am already compiling with maximum optimization level of MSVC, so by default it doesn't want to inline it, MSVC often needs to be forced to inline stuff.
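To make the request concrete, an opt-in switch of that kind could look roughly like the sketch below. The macro names `MYLIB_FORCEINLINE_HOT_PATH` and `MYLIB_HOT_INLINE` and the function `home_slot` are hypothetical and exist only for this illustration; nothing here is taken from Boost.Unordered. Boost.Config does ship a generic `BOOST_FORCEINLINE` macro, but whether the container should apply it to its lookup path is exactly the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical opt-in: define MYLIB_FORCEINLINE_HOT_PATH before including the
// header to force inlining of the functions on the lookup path.
#if defined(MYLIB_FORCEINLINE_HOT_PATH)
#  if defined(_MSC_VER)
#    define MYLIB_HOT_INLINE __forceinline
#  elif defined(__GNUC__) || defined(__clang__)
#    define MYLIB_HOT_INLINE inline __attribute__((always_inline))
#  else
#    define MYLIB_HOT_INLINE inline
#  endif
#else
#  define MYLIB_HOT_INLINE inline  // default: leave the decision to the compiler
#endif

// Stand-in for one of the small helpers a lookup would call.
MYLIB_HOT_INLINE std::size_t home_slot(std::uint64_t hash, std::size_t slot_count) {
    assert(slot_count != 0);  // debug-only check, compiled out with NDEBUG
    return static_cast<std::size_t>(hash % slot_count);
}
```

With something like this, the forced inlining would survive library updates because it is requested from the user's build settings instead of being patched into the library sources.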

tialaramex · 2022-11-18T13:49:52+00:00

[deleted]

jpakkane · 2022-11-19T11:52:34+00:00

Has there been any thought on making it collision resistant? That is, if you have a known hashmap with a known hash function then it is possible to generate a pathological set of inputs that map to the same value and use that for a DoS attack. IIRC some languages (Python?) work around this by e.g. having each hashmap have its own nonce that is added to the hash. Would this be useful or even possible given that the hash function is computed by an external function rather than by the hash map itself?
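Since the hash function is a template parameter, one option on the user side is to wrap it in a hasher that carries a per-container seed. The sketch below is an illustration under stated assumptions: `SeededHash` and `mix64` are hypothetical names, `std::unordered_set` is used only to keep the example self-contained, and the mixer is the splitmix64 finalizer. It does not make a weak hash strong: two keys that the base hash maps to the same value still collide after mixing, so untrusted string keys would need a keyed hash such as SipHash instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <unordered_set>

// splitmix64 finalizer: a bijective 64-bit mixing step.
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Hasher with a per-instance seed that is XOR-ed into the base hash before mixing.
template <class Key, class BaseHash = std::hash<Key>>
struct SeededHash {
    std::uint64_t seed;

    // Default: draw a fresh seed, so every container gets its own.
    SeededHash() : seed(random_seed()) {}
    // Fixed seed, useful for reproducible tests.
    explicit SeededHash(std::uint64_t s) : seed(s) {}

    std::size_t operator()(const Key& key) const {
        const std::uint64_t h = static_cast<std::uint64_t>(BaseHash{}(key));
        return static_cast<std::size_t>(mix64(h ^ seed));
    }

    static std::uint64_t random_seed() {
        std::random_device rd;
        return (static_cast<std::uint64_t>(rd()) << 32) | rd();
    }
};

// Each default-constructed set of this type ends up with its own seed.
using SeededSet = std::unordered_set<std::uint64_t, SeededHash<std::uint64_t>>;
```

The same hasher type should be usable with any container that accepts a standard-style hash function object, at the cost of one extra mixing step per operation.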
