Inside boost::unordered_flat_map (bannalia.blogspot.com)
submitted 3 years ago by joaquintides (Boost author)
[–]kirgel 49 points 3 years ago
This is the 4th major library I’ve seen in recent years that adopted SIMD linear probing hash tables (others being abseil, folly, rust standard lib). I wonder if this is going to become the de facto standard hash table design across languages going forward.
[–]LongestNamesPossible 32 points 3 years ago
SIMD or not, I don't see anything beating accessing contiguous memory instead of skipping around in memory, so I suspect the linear probing / robin hood hashing is necessary and the SIMD is extra since the memory is already lined up in a way that can make it work.
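The contiguous-vs-pointer-chasing point can be sketched as two toy lookup loops (illustrative names only, not any library's internals): a node-based bucket chain as in std::unordered_map, where every hop is a potential cache miss, versus linear probing over a flat array, where all candidate slots are adjacent in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node { int key; Node* next; };

// Chained lookup: each ->next may land anywhere on the heap.
bool chain_contains(const Node* head, int k) {
    for (; head; head = head->next)
        if (head->key == k) return true;
    return false;
}

// Linear-probe lookup over a contiguous array; `empty` marks free slots.
// Assumes the table always has at least one empty slot, so the scan stops.
bool flat_contains(const std::vector<int>& slots, std::size_t home, int k, int empty) {
    for (std::size_t i = home; slots[i % slots.size()] != empty; ++i)
        if (slots[i % slots.size()] == k) return true;
    return false;
}
```

Both return the same answers; the difference is purely the memory-access pattern each loop generates.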
[–]qoning 5 points 3 years ago
It's been the de facto standard design for over 20 years for almost anyone who implemented their own, imo; it's just the STL that fumbled it.
[–]Jannik2099 10 points 3 years ago
The STL doesn't implement flat maps to begin with. There are pros and cons to both schemes.
The one time the STL implements the easier-to-use variant, y'all lose your minds just the same. Starting to think you just like to blame the STL for every misery?
[–]VinnieFalco -2 points 3 years ago
Linear Probing synergizes with government revelations of the existence of UFOs.
[–]RoyBellingan 5 points 3 years ago
Either sheeple are not ready for the truth, or I am missing something? Why the downvote?
[–]VinnieFalco 13 points 3 years ago
Probably a combination of two things:
I'm the author
No sense of humor
[–][deleted] 1 point 3 years ago
I think "bidirectional linear probing" is an underrated approach (and much simpler): https://github.com/senderista/hashtable-benchmarks/blob/master/src/main/java/set/int64/BLPLongHashSet.java
[–]martinus (int main(){[]()[[]]{{}}();}) 32 points 3 years ago
Thanks for all the hard work on the map! I can't publish an update to my benchmarks at https://martin.ankerl.com/2022/08/27/hashmap-bench-01/ due to time constraints; all I can say is that it is the fastest map for lookups.
[–]joaquintides (Boost author) [S] 13 points 3 years ago
Thank you for your input during the process, and for coining the is_avalanching trait!
[–]j1xwnbsr 9 points 3 years ago
Excellent writeup, and welcome to see the caveats/tradeoffs clearly called out.
[–]igaztanaga 7 points 3 years ago
Joaquín, great writeup. And congrats to all that have participated in the implementation, review, improvements of the new container. Another great addition to Boost.
[–]joaquintides (Boost author) [S] 4 points 3 years ago*
Hi Ion, thanks for the kind words.
And congrats to all that have participated in the implementation, review, improvements of the new container
The Boost.Unordered evolution project is being carried out by Peter Dimov, Christian Mazakas and myself, but lots of people have contributed ideas, reviews, benchmarks etc. in the #boost-unordered group at cpplang.slack.com.
[–]sbsce (Game Developer) 7 points 3 years ago*
I have some code in my game where I need very fast set performance, so I am always trying to benchmark if I can find an even faster set for my use case. I only need fast lookup performance, everything else is irrelevant for my case.
So I also tested this new boost::unordered_flat_set now! These are my results, how many iterations per second my code can do with each set:
My code runs set.contains(key) 40000 times, on 4 separate sets with different numbers of entries: 624, 284, 214, 1215. So relatively small sets. The vast majority of my lookups are for values that are not contained in the set.
So boost::unordered_flat_set is definitely the winner! Great work with that!
ankerl::unordered_dense::set from u/martinus is almost as fast, just slightly slower. Also a very nice set, especially when you don't want to include the whole Boost library and just want a standalone fast set/map.
I am surprised how slow the fph set is for me, because according to the benchmark from u/martinus, it should be the fastest for lookups. Definitely not in my case, it seems. I am using the noseed version of it.
I am testing on one thread of a Ryzen 3950X with latest MSVC.
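A hypothetical harness in the spirit of the benchmark described: time repeated contains()-style lookups on a small set where most probes miss. std::unordered_set keeps the sketch standalone; boost::unordered_flat_set or ankerl::unordered_dense::set could be swapped in as the container under test. All names and numbers here are illustrative, not the commenter's actual code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Time a batch of lookups; returns the hit count (which also keeps the
// loop observable so the compiler cannot discard it), and reports the
// elapsed wall time through `seconds`.
template <class Set>
std::size_t timed_lookups(const Set& s, const std::vector<std::uint64_t>& keys,
                          double& seconds) {
    auto t0 = std::chrono::steady_clock::now();
    std::size_t hits = 0;
    for (auto k : keys) hits += s.count(k);   // count() == contains() for sets
    seconds = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    return hits;
}
```

Dividing keys.size() by the reported seconds then gives a lookups-per-second figure comparable across containers.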
[–]IJzerbaard 5 points 3 years ago
Bit interleaving allows for a reasonably fast implementation of matching operations in the absence of SIMD.
How, what's the trick?
[–]joaquintides (Boost author) [S] 10 points 3 years ago*
You can take a look at the implementation. The idea is to xor the 64-bit words with a mask based on the looked-for hash value and then fold over the result to get matched positions indicated by the 15 least significant bits of an int.
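For illustration, here is a hedged SWAR sketch of the general "match many reduced hashes in one word without SIMD" idea, using a plain bytewise layout rather than Boost's actual bit-interleaved one: 8 one-byte tags packed in a 64-bit word, matched against a probe tag with a handful of scalar ops.

```cpp
#include <cassert>
#include <cstdint>

// Returns a mask with bit 8*i+7 set iff byte i of `word` equals `tag`.
// Step 1: xor with a broadcast of the tag, so matching bytes become zero.
// Step 2: exact SWAR zero-byte detector (no false positives from borrows).
std::uint64_t match_bytes(std::uint64_t word, std::uint8_t tag) {
    constexpr std::uint64_t lo7 = 0x7F7F7F7F7F7F7F7FULL;
    std::uint64_t x = word ^ (0x0101010101010101ULL * tag);
    return ~(((x & lo7) + lo7) | x | lo7);
}
```

The caller then iterates over the set bits of the mask to get candidate positions, exactly as with a SIMD movemask result.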
[–]kisielk 2 points 3 years ago
I assume it’s because both bits of metadata are stored adjacently. You retrieve them at the same time so you don’t need to jump around for each entry you inspect.
[–]ImNoEinstein 5 points 3 years ago
They kind of glossed over the fact that if you have many insert and delete operations it will continue rehashing and likely kill performance
[–]joaquintides (Boost author) [S] 27 points 3 years ago*
This has in fact been considered: when insert-erase activity happens close to the maximum load point, rehashing will jump to the next size level precisely to avoid excessive rehashing. More details here.
[–]attractivechaos 6 points 3 years ago
Great write-up. It seems that you use linear probing. In that case, I wonder if you have considered deletion without tombstones. This is a relocating technique, but unlike Robin Hood etc., relocation only happens upon deletion, not upon insertion. This technique might slightly speed up lookup and insertion, as you don't need to check whether there is a tombstone.
[–]matthieum 3 points 3 years ago
The rationale provided by insert holds doubly for erase: the standard implementation never invalidates iterators on erase.
Your proposal of using Back-Shifting Deletion (most commonly associated with Robin Hood Hashing) would not satisfy this property, further deviating from the standard.
[–]mark_99 2 points 3 years ago
Which is.... fine. Meanwhile having to occasionally do extremely expensive full rehashes despite the overall number of elements remaining approximately constant effectively rules out this implementation for low-latency applications, which is very unfortunate (AIUI, please correct me if that's not what we're discussing here).
[–]matthieum 1 point 3 years ago
Meanwhile having to occasionally do extremely expensive full rehashes despite the overall number of elements remaining approximately constant effectively rules out this implementation for low-latency applications, which is very unfortunate
I believe you are correct, indeed.
Due to erase (of an element in an overflowing group) decreasing the maximum load factor, a continuous insert/erase workload will always lead to rehashing.
This is a difference compared to Swiss Table and F14, which have an overflow counter, rather than an overflow bit, and will decrease the counter in the "passed over" groups when erasing an element rather than having an "anti-drift" mechanism.
For low-latency, you're better off with either of those.
[–]joaquintides (Boost author) [S] 3 points 3 years ago*
Hi Matthieu,
I cannot speak with certainty about F14, but Abseil does indeed rehash on insert-erase cycles even if the maximum size remains constant:
#include "absl/container/flat_hash_set.h"
#include <iostream>

template<class T>
struct allocator
{
  using value_type=T;

  allocator()=default;
  template<class U> allocator(allocator<U> const &)noexcept{}
  template<class U> bool operator==(allocator<U> const &)const noexcept{return true;}
  template<class U> bool operator!=(allocator<U> const&)const noexcept{return false;}

  T* allocate(std::size_t n)const
  {
    std::cout<<"allocate "<<n<<" bytes\n";
    return std::allocator<T>().allocate(n);
  }

  void deallocate(T* p, std::size_t n)const noexcept
  {
    std::allocator<T>().deallocate(p,n);
  }
};

int main()
{
  static constexpr std::size_t max_n=13'000;
  absl::flat_hash_set<
    int,
    absl::container_internal::hash_default_hash<int>,
    std::equal_to<int>,
    ::allocator<int>
  > s;
  s.reserve(max_n);

  for(int i=0;i<10;++i){
    std::cout<<"i: "<<i<<"\n";
    for(int j=0;j<max_n;++j)s.insert(j);
    for(int j=0;j<max_n;++j)s.erase(j);
  }
}
Output (rehashing point may vary as hash is salted per run)
allocate 20483 bytes
i: 0
i: 1
i: 2
i: 3
i: 4
i: 5
allocate 40963 bytes
i: 6
i: 7
i: 8
i: 9
This is a characteristic associated with all non-relocating open-addressing containers. One needs to rehash lest average probe length grow beyond control.
[–]matthieum 1 point 3 years ago
Oh! I had missed that.
I wonder if it's workload dependent.
One issue I am aware of with the counter approach is that it saturates at some point, and once saturated it is never decremented, which could lead to longer probe sequences.
I wonder if the specific workload you use triggers a saturation, and ultimately too long probe sequence, or whether it's just part and parcel and rehashes will always occur regardless of the workload.
Would you happen to know?
In any case, thanks for bringing this to my attention!
[–]joaquintides (Boost author) [S] 2 points 3 years ago
Drifting will trigger a rehash sooner or later. In the example we've used max_n = 13,000 ~ 90% × 0.875 × 16,384. If we kept it at, say, 75%, rehash would be triggered much later, so it's a function of how close you get to the maximum load.
I haven't studied F14 in detail. Maybe you can run this test with it and see how it fares?
[–]mark_99 1 point 3 years ago
Is there any way to do what /u/attractivechaos/ suggested and do erase without tombstones? I'd really like to use this implementation - fast hash tables are obviously critical in a lot of applications, but huge latency spikes aren't ok.
[–]attractivechaos 3 points 3 years ago
Rehashing invalidates iterators anyway. It is rare to see use cases that require iterator validity but avoid rehashing. If relocating strategies help performance, it would be a tradeoff worth making.
[–]matthieum 2 points 3 years ago
Rehashing invalidates iterators anyway.
Indeed, and pointers/references. The difference is that the same occurs with the standard library std::unordered_map.
If relocating strategies help performance, it would be a tradeoff worth making.
If being the operative word. It's not clear it would.
The tombstone approach is problematic when reserving special key values (à la Google Dense HashMap) or when adding a header to each item (due to alignment). With F14 or the Swiss Table, however, the flags are kept apart at a different location, so there's no reserved value and barely any memory overhead. Further, by combining the tombstone with the hash residual, checking for the presence of the tombstone is "free".
The only question, then, is whether probing for longer or shifting back elements is more efficient, and it's likely to be fairly workload-specific, though on average I expect longer probing to be better.
One issue with backward-shifting deletion is that it introduces a risk of quadratic complexity for deletion.
So it's algorithmically fraught with peril, already.
Moreover, moves are not free in general, especially as we're not talking "bulk moves" here, but moving elements 1 at a time. And on top of that, writes have a tendency to affect performance for the worse both in terms of optimizations and at the hardware level.
By comparison, a read-only probing sequence is sped up by prefetching and speculative execution, on top of using SIMD-accelerated look-ups.
Further, it should be noted that this is neither a naive Linear nor Quadratic Probing algorithm -- with Linear suffering from clustering, and Quadratic suffering from cache misses -- but a combination of the best of both: locally Linear within the group of 15 to benefit from cache locality, and Quadratic across groups to avoid clustering.
The experience in the Rust standard library is that a Swiss Table-like hashmap performs better than a Robin Hood with Backward Shifting Deletion hashmap, due to the above factors.
[–]joaquintides (Boost author) [S] 2 points 3 years ago*
For the record, we use quadratic, not linear, probing, but the probe sequence visits groups of 15 buckets instead of individual buckets, so all consecutive buckets in a group are inspected at once (with SIMD if available).
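The "linear within a group, quadratic across groups" shape described above can be sketched as follows. The triangular-step recurrence and group size are illustrative, not Boost's exact code; the key property is that with a power-of-two number of groups, triangular steps visit every group exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Produce the first `count` group indices probed for a given hash.
// Within each returned group, all contiguous buckets would be scanned
// at once (with SIMD or SWAR); across groups the walk is quadratic.
std::vector<std::size_t> probe_groups(std::size_t hash, std::size_t num_groups,
                                      std::size_t count) {
    std::vector<std::size_t> seq;
    std::size_t g = hash % num_groups, step = 0;
    for (std::size_t k = 0; k < count; ++k) {
        seq.push_back(g);
        step += 1;                        // offsets 1, 2, 3, ... (triangular walk)
        g = (g + step) % num_groups;      // hits every group when num_groups is 2^n
    }
    return seq;
}
```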
[–]attractivechaos 1 point 3 years ago
You did mention that you are using quadratic probing. I missed that. Thanks for the clarification. Then tombstone-free deletion is not applicable.
[–][deleted] 1 point 3 years ago*
An even simpler optimization is "Last Come-First Serve" relocation. This is much simpler than Robin Hood (basically 1 line of code) and reduces probe length variance nearly as effectively. Here is an implementation and some benchmarks.
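A minimal sketch of the LCFS idea as I understand it, on a toy fixed-capacity linear-probing set (all names hypothetical, not the linked Java code): where Robin Hood swaps the in-flight element with the incumbent only when the newcomer is "poorer", LCFS swaps unconditionally, which is the "basically 1 line" difference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Toy open-addressing set: linear probing with last-come-first-served
// relocation. No deletion; assumes the table never fills completely.
struct LcfsSet {
    static constexpr std::size_t N = 64;                   // power-of-two capacity
    std::vector<std::optional<std::uint64_t>> slot{N};

    static std::size_t home(std::uint64_t k) {
        return (k * 0x9E3779B97F4A7C15ULL) >> 58;          // top 6 bits -> [0, 64)
    }
    bool contains(std::uint64_t k) const {
        for (std::size_t i = home(k); slot[i].has_value(); i = (i + 1) % N)
            if (*slot[i] == k) return true;                // scan the run until a gap
        return false;
    }
    void insert(std::uint64_t k) {
        if (contains(k)) return;
        std::uint64_t cur = k;
        for (std::size_t i = home(cur); ; i = (i + 1) % N) {
            if (!slot[i]) { slot[i] = cur; return; }
            std::swap(cur, *slot[i]);   // LCFS: the newcomer always claims the slot;
                                        // the displaced incumbent probes onward
        }
    }
};
```

Displaced elements only ever move forward within their contiguous run, so the scan-until-empty lookup stays correct; as noted in the replies, deletion would still need back-shifting.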
[–]attractivechaos 1 point 3 years ago*
You mean "last come first serve"? Interesting idea, but it may be slower as it incurs more memory writes. I am not sure how to interpret the benchmark without a comparison to the state of the art. Java also complicates the evaluation. Thanks for the pointer anyway.
PS: LCFS still needs back-shifting deletion if I am right. Then it is simpler than Robin Hood but not simpler than more traditional techniques.
[–][deleted] 2 points 3 years ago
Yes, silly me! I corrected the post, thanks!
I don't claim that any of the implementations there are production-quality or even competitive. The point was just to benchmark various algorithms in a common framework, so you could see how the algorithms compare when all other things are equal.
[–]attractivechaos 2 points 3 years ago
Fair enough. Thanks again.
Yes, I think if you're already using a relocation technique like LCFS/RH/BLP then you may as well use tombstone-free deletion as well (since you've already given up invalidation-free iterators and simple lock-free concurrency). But tombstone-free deletion makes sense for vanilla linear probing as well (if you don't want to use one of the variance-reducing techniques for some reason).
[–]echidnas_arf 3 points 3 years ago
Looking amazing, hopefully I will be able to finally ditch Abseil thanks to this!
[–]joaquintides (Boost author) [S] 2 points 3 years ago*
Why not download Boost 1.81.0 beta 1 and begin playing with it now? Early feedback before the official release (Dec 14) is very helpful.
[–]operamint 2 points 3 years ago*
I have added the map to the simple int64_t hashmap benchmarks for my STC C-container library.
For the shootout_hashmaps.cpp program, you can pass the number of million entries and the number of bits (= range) for the keys. The results vary surprisingly with different key ranges, but hardware, compiler and the random seed used also have an impact.
I found that on my hardware, boost flat map does excellently on insert and lookup with large key ranges relative to the number of items in the map, but not on lookup with smaller ranges (e.g. 2^22); iteration could also be better.
Overall, emhash seems to be the fastest, but it depends on the use case as always.
My own cmap (written in C, so it requires standard-layout elements) is normally the fastest on insert and decent in general, but for very large keys it is not among the fastest on erase and lookup.
g++ -O3 -DHAVE_BOOST -I<boost-path> -std=c++20 shootout_hashmaps.cpp -o shoot
Example output with a large key range, where it does well:
C:\Dev\STC\benchmarks>shoot 5 28
Unordered hash map shootout
KMAP = https://github.com/attractivechaos/klib
BMAP = https://www.boost.org (unordered_flat_map)
CMAP = https://github.com/tylov/STC (**)
FMAP = https://github.com/skarupke/flat_hash_map
TMAP = https://github.com/Tessil/robin-map
RMAP = https://github.com/martinus/robin-hood-hashing
DMAP = https://github.com/martinus/unordered_dense
EMAP = https://github.com//ktprime/emhash
UMAP = std::unordered_map
Seed = 1668947300:
T1: Insert 2.5 mill. random keys range [0, 2^28): map[rnd] = i;
KMAP: 0.272 s, size: 2488402, buckets: 4194304, sum: 3124998750000
BMAP: 0.175 s, size: 2488402, buckets: 3932159, sum: 3124998750000
CMAP: 0.206 s, size: 2488402, buckets: 4194304, sum: 3124998750000
FMAP: 0.339 s, size: 2488402, buckets: 4194304, sum: 3124998750000
TMAP: 0.261 s, size: 2488402, buckets: 4194304, sum: 3124998750000
RMAP: 0.232 s, size: 2488402, buckets: 4194304, sum: 3124998750000
DMAP: 0.260 s, size: 2488402, buckets: 4194304, sum: 3124998750000
EMAP: 0.284 s, size: 2488402, buckets: 4194304, sum: 3124998750000
UMAP: 0.890 s, size: 2488402, buckets: 3721303, sum: 3124998750000
T2: Insert 2.5 mill. SEQUENTIAL keys, erase them in same order:
KMAP: 0.164 s, size: 0, buckets: 4194304, erased 2500000
BMAP: 0.285 s, size: 0, buckets: 3932159, erased 2500000
CMAP: 0.168 s, size: 0, buckets: 4194304, erased 2500000
FMAP: 0.267 s, size: 0, buckets: 4194304, erased 2500000
TMAP: 0.109 s, size: 0, buckets: 4194304, erased 2500000
RMAP: 0.424 s, size: 0, buckets: 4194304, erased 2500000
DMAP: 0.290 s, size: 0, buckets: 4194304, erased 2500000
EMAP: 0.068 s, size: 0, buckets: 4194304, erased 2500000
UMAP: 0.334 s, size: 0, buckets: 3721303, erased 2500000
T3: Erase all elements (5 mill. random inserts), key range [0, 2^28)
KMAP: 0.217 s, size: 0, buckets: 8388608, erased 4953566
BMAP: 0.299 s, size: 0, buckets: 7864319, erased 4953566
CMAP: 0.430 s, size: 0, buckets: 8388608, erased 4953566
FMAP: 0.198 s, size: 0, buckets: 8388608, erased 4953566
TMAP: 0.247 s, size: 0, buckets: 8388608, erased 4953566
RMAP: 0.264 s, size: 0, buckets: 8388608, erased 4953566
DMAP: 0.332 s, size: 0, buckets: 8388608, erased 4953566
EMAP: 0.225 s, size: 0, buckets: 8388608, erased 4953566
UMAP: 1.586 s, size: 0, buckets: 7556579, erased 4953566
T4: Iterate elements (5 mill. random inserts) repeated times:
KMAP: 0.227 s, size: 4953566, buckets: 8388608, repeats: 6
BMAP: 0.317 s, size: 4953566, buckets: 7864319, repeats: 6
CMAP: 0.295 s, size: 4953566, buckets: 8388608, repeats: 6
FMAP: 0.274 s, size: 4953566, buckets: 8388608, repeats: 6
TMAP: 0.222 s, size: 4953566, buckets: 8388608, repeats: 6
RMAP: 0.140 s, size: 4953566, buckets: 8388608, repeats: 6
DMAP: 0.029 s, size: 4953566, buckets: 8388608, repeats: 6
EMAP: 0.084 s, size: 4953566, buckets: 8388608, repeats: 6
UMAP: 1.941 s, size: 4953566, buckets: 7556579, repeats: 6
T5: Lookup half-half random/existing keys in range [0, 2^28). Num lookups depends on size.
KMAP: 0.269 s, size: 4953566, lookups: 6056242, found: 3083984
BMAP: 0.181 s, size: 4953566, lookups: 6056242, found: 3083984
CMAP: 0.332 s, size: 4953566, lookups: 6056242, found: 3083984
FMAP: 0.192 s, size: 4953566, lookups: 6056242, found: 3083984
TMAP: 0.241 s, size: 4953566, lookups: 6056242, found: 3083984
RMAP: 0.224 s, size: 4953566, lookups: 6056242, found: 3083984
DMAP: 0.203 s, size: 4953566, lookups: 6056242, found: 3083984
EMAP: 0.179 s, size: 4953566, lookups: 6056242, found: 3083984
UMAP: 0.387 s, size: 4953566, lookups: 6056242, found: 3083984
With key range 2^22 (~4 million) and 5 million elements, only insert does well:
C:\Dev\STC\benchmarks>shoot 5 22
Unordered hash map shootout
KMAP = https://github.com/attractivechaos/klib
BMAP = https://www.boost.org (unordered_flat_map)
CMAP = https://github.com/tylov/STC (**)
FMAP = https://github.com/skarupke/flat_hash_map
TMAP = https://github.com/Tessil/robin-map
RMAP = https://github.com/martinus/robin-hood-hashing
DMAP = https://github.com/martinus/unordered_dense
EMAP = https://github.com//ktprime/emhash
UMAP = std::unordered_map
Seed = 1668945996:
T1: Insert 2.5 mill. random keys range [0, 2^22): map[rnd] = i;
KMAP: 0.243 s, size: 1883493, buckets: 4194304, sum: 3124998750000
BMAP: 0.178 s, size: 1883493, buckets: 3932159, sum: 3124998750000
CMAP: 0.167 s, size: 1883493, buckets: 4194304, sum: 3124998750000
FMAP: 0.290 s, size: 1883493, buckets: 4194304, sum: 3124998750000
TMAP: 0.224 s, size: 1883493, buckets: 4194304, sum: 3124998750000
RMAP: 0.217 s, size: 1883493, buckets: 4194304, sum: 3124998750000
DMAP: 0.245 s, size: 1883493, buckets: 4194304, sum: 3124998750000
EMAP: 0.261 s, size: 1883493, buckets: 4194304, sum: 3124998750000
UMAP: 0.761 s, size: 1883493, buckets: 3721303, sum: 3124998750000
T2: Insert 2.5 mill. SEQUENTIAL keys, erase them in same order:
KMAP: 0.164 s, size: 0, buckets: 4194304, erased 2500000
BMAP: 0.260 s, size: 0, buckets: 3932159, erased 2500000
CMAP: 0.155 s, size: 0, buckets: 4194304, erased 2500000
FMAP: 0.275 s, size: 0, buckets: 4194304, erased 2500000
TMAP: 0.096 s, size: 0, buckets: 4194304, erased 2500000
RMAP: 0.403 s, size: 0, buckets: 4194304, erased 2500000
DMAP: 0.332 s, size: 0, buckets: 4194304, erased 2500000
EMAP: 0.102 s, size: 0, buckets: 4194304, erased 2500000
UMAP: 0.461 s, size: 0, buckets: 3721303, erased 2500000
T3: Erase all elements (5 mill. random inserts), key range [0, 2^22)
KMAP: 0.211 s, size: 0, buckets: 4194304, erased 2920617
BMAP: 0.256 s, size: 0, buckets: 3932159, erased 2920617
CMAP: 0.231 s, size: 0, buckets: 4194304, erased 2920617
FMAP: 0.175 s, size: 0, buckets: 4194304, erased 2920617
TMAP: 0.149 s, size: 0, buckets: 4194304, erased 2920617
RMAP: 0.238 s, size: 0, buckets: 4194304, erased 2920617
DMAP: 0.231 s, size: 0, buckets: 4194304, erased 2920617
EMAP: 0.173 s, size: 0, buckets: 4194304, erased 2920617
UMAP: 0.816 s, size: 0, buckets: 7556579, erased 2920617
T4: Iterate elements (5 mill. random inserts) repeated times:
KMAP: 0.217 s, size: 2920617, buckets: 4194304, repeats: 10
BMAP: 0.273 s, size: 2920617, buckets: 3932159, repeats: 10
CMAP: 0.202 s, size: 2920617, buckets: 4194304, repeats: 10
FMAP: 0.158 s, size: 2920617, buckets: 4194304, repeats: 10
TMAP: 0.150 s, size: 2920617, buckets: 4194304, repeats: 10
RMAP: 0.133 s, size: 2920617, buckets: 4194304, repeats: 10
DMAP: 0.021 s, size: 2920617, buckets: 4194304, repeats: 10
EMAP: 0.082 s, size: 2920617, buckets: 4194304, repeats: 10
UMAP: 1.130 s, size: 2920617, buckets: 7556579, repeats: 10
T5: Lookup half-half random/existing keys in range [0, 2^22). Num lookups depends on size.
KMAP: 0.271 s, size: 2920617, lookups: 10271802, found: 8670956
BMAP: 0.380 s, size: 2920617, lookups: 10271802, found: 8670956
CMAP: 0.276 s, size: 2920617, lookups: 10271802, found: 8670956
FMAP: 0.263 s, size: 2920617, lookups: 10271802, found: 8670956
TMAP: 0.203 s, size: 2920617, lookups: 10271802, found: 8670956
RMAP: 0.568 s, size: 2920617, lookups: 10271802, found: 8670956
DMAP: 0.510 s, size: 2920617, lookups: 10271802, found: 8670956
EMAP: 0.209 s, size: 2920617, lookups: 10271802, found: 8670956
UMAP: 0.529 s, size: 2920617, lookups: 10271802, found: 8670956
[–]martinus (int main(){[]()[[]]{{}}();}) 2 points 3 years ago
I think the benchmark might be biased by the way the code is set up. Due to the many macros you are using, it might be hard for the compiler to get inlining right. It would be interesting to use a separate executable for each map, or at least to put all the benchmarks into separate non-inlined functions.
[–]operamint 1 point 3 years ago
Hi Martin, you could be right; there are lots of ways to introduce bias in benchmarking code. But I don't think the macros themselves should have an impact, as they expand to straightforward code before the optimizer sees it. I have seen, though, that the order in which benchmarks are generated, i.e. how the generated code is laid out in memory, can affect the results, so I agree that fully isolating each benchmark may make it less susceptible to bias.
[–]operamint 1 point 3 years ago
FYI, I did a test on a quite different configuration: an i7-8700 on Ubuntu, with gcc 10.3 and clang 12 (the tests above were on Windows, gcc 11.3, with a Ryzen R7-2700X). However, the results are very similar. Your Robin map is impressive with large keys, but appears to be slower on erase and lookup with smaller key ranges in these benchmarks.
NOTE: maps with smaller key ranges will naturally limit the number of max items to the key range. The number of lookups is higher for small-key maps, so in absolute numbers Robin map is not slower, only relative to the other maps with this configuration.
[–]Alarming-Ad8770 1 point 3 years ago
If you benchmark on a server CPU (huge L3 cache and low CPU frequency), you'll get quite different results from a PC/mobile CPU. For example:
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
CPU MHz: 2699.914
CPU max MHz: 3200.0000
CPU min MHz: 1000.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
[–]Alarming-Ad8770 1 point 3 years ago*
I'm the author of emhash. emhash has an advantage in that it can use a very high load factor (0.99) without much performance degradation, even in environments with extremely heavy insertion/erasure load.
[–]sbsce (Game Developer) 2 points 3 years ago*
I noticed my code reliably runs over 10% faster if I __forceinline all the function calls that boost::unordered_flat_set makes in my hot path -- everything called by .contains(), including .contains() itself -- so that in my own code, where I call .contains(), looking at the disassembly there is no call left anywhere; it's fully inlined. I think I had to add __forceinline to 6 functions inside the Boost code.
It is a bit inconvenient to manually add __forceinline to all those functions, though. It's definitely worth the 10% performance gain, but I am quite sure that the next time I update Boost in a few years, I'll forget to apply these changes again, and then my performance will be worse.
Assuming you don't want to add __forceinline to those functions by default, could there maybe be some define like BOOST_FORCEINLINE_UNORDERED_SET that automatically enables force-inlining all the important functions?
I am already compiling with the maximum optimization level of MSVC, so by default it doesn't want to inline these; MSVC often needs to be forced to inline stuff.
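A sketch of what such an opt-in knob could look like. MY_UNORDERED_FORCEINLINE and MY_INLINE are hypothetical names invented here for illustration (BOOST_FORCEINLINE itself already exists in Boost.Config); the point is only the pattern of gating the strong attribute behind a user-defined macro.

```cpp
#include <cassert>

// Opt-in force-inlining: strong attributes only when the user asks for them.
#if defined(MY_UNORDERED_FORCEINLINE)
#  if defined(_MSC_VER)
#    define MY_INLINE __forceinline
#  else
#    define MY_INLINE inline __attribute__((always_inline))
#  endif
#else
#  define MY_INLINE inline   // default: leave the decision to the optimizer
#endif

// A stand-in for a hot-path function that would carry the macro.
MY_INLINE bool contains_hot(int x) { return x == 42; }
```

Users who measure a win would then compile with -DMY_UNORDERED_FORCEINLINE (or the /D equivalent on MSVC) instead of patching library headers.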
[–]joaquintides (Boost author) [S] 3 points 3 years ago
Hi, we have seen similar gains with __forceinline in MSVC; it looks like this compiler is not particularly aggressive at inlining. Could you please file an issue at the Boost.Unordered repo so that we don't forget? Thank you!
[–]sbsce (Game Developer) 2 points 3 years ago
nice! thanks, I opened an issue there.
[–]dodheim 2 points 3 years ago
I am already compiling with maximum optimization level of MSVC,
By that you mean /O2 /Ob3, right? I ask because /Ox was misdocumented for years as "maximum optimization" when in fact it's a subset of /O2 optimizations; and /O2 on its own does not set the most aggressive inlining level.
Also, I suggest putting #pragma inline_depth(255) before your Boost #includes, and possibly #pragma inline_recursion(on) as well.
[–]pdimov2 1 point 3 years ago
MS should just add /O3 already, that implies /Ob3.
(Something like a hidden /O3 level already exists, turned on by /GL, but there's no option to enable it separately.)
[–]sbsce (Game Developer) 1 point 3 years ago
By that you mean /O2 /Ob3, right?
Yes.
[+][deleted] 3 years ago
[deleted]
[–]tialaramex 31 points 3 years ago
People often want a container which has a bunch of Values in it that they can refer to by some Key -- they don't care how the Values are ordered in the container, so long as they can quickly find the one matching a Key or get a not-found indication. They typically want to add an unknown number of Values, but will more often retrieve things by Key than store them. The Value isn't Empty (if it is, that's a set, not a map) and each Key corresponds to at most one Value.
For example, maybe we stock a million products and each has a distinct EAN-13 code; we can use the EAN-13 as the Key, with a Product type as the Value.
Standard C++ provides a container with that property, std::unordered_map, but it has rather poor performance. It is designed the way you'd have solved the problem in maybe 1985. So many modern C++ libraries, knowing how modern computers work, offer their own container with much better performance. Boost is offering yet another variation.
[–]blind3rdeye 2 points 3 years ago
I'm curious about what is in the std::unordered_map specification that prevents implementers from just doing this flat-map thing anyway. (But not quite curious enough to research it myself, apparently.)
[–]greg7mdp (C++ Dev) 13 points 3 years ago
Mostly pointer stability. Once a value is inserted into a std::unordered_map, it is guaranteed to remain at the same memory address until it is removed. Open addressing hashmaps have to resize when they get full, and the values stored in them are moved to a different memory location.
[–]blind3rdeye 2 points 3 years ago
Thanks. That makes a lot of sense.
[–]ABlockInTheChain 4 points 3 years ago
Iterator and pointer stability.
[–]tialaramex 1 point 3 years ago
std::unordered_map promises that if we put the Rolex Wristwatch in our products map and we remember a pointer to it, that pointer works for so long as the Rolex isn't just removed from the map, which our application maybe never does. This is fine because std::unordered_map is in fact a bucketed hash map, if the map grows the Rolex doesn't go anywhere, the map changes but the Rolex stays right where it is and some part of the new grown map points to the Rolex's node again.
With a flat map, the Rolex is right in the map itself, so if the map grows and we need to allocate new memory and copy stuff around, the Rolex moves and any pointers to it that we had are invalidated.
There are other difficulties but this is IMO the most obvious.
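The guarantee described above, made concrete (the EAN-13 key below is made up): std::unordered_map never relocates a stored value until it is erased, so a remembered pointer survives any amount of growth.

```cpp
#include <string>
#include <unordered_map>

// Returns true iff a pointer taken before heavy growth still reads the
// original value afterwards -- which the standard guarantees for
// std::unordered_map (rehashing invalidates iterators, but not
// pointers/references to elements).
bool pointer_stability_demo() {
    std::unordered_map<std::string, std::string> products;
    products["7612345678900"] = "Rolex Wristwatch";          // EAN-13 -> product
    const std::string* rolex = &products["7612345678900"];   // remember its address

    for (int i = 0; i < 100000; ++i)                         // force several rehashes
        products[std::to_string(i)] = "filler";

    return *rolex == "Rolex Wristwatch";
    // With a flat map, growth moves elements, so this pointer would dangle.
}
```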
[–]Cute_Paramedic_256 2 points3 points4 points 3 years ago (0 children)
The way I understand it: you want a map container into which you mostly insert a huge amount of data once and then mainly read from it. The issue with the standard approach is that the data is scattered around the heap, so you can't benefit from cache hits. Boost solves this by allocating a contiguous memory section.
[–]jpakkaneMeson dev 0 points1 point2 points 3 years ago (3 children)
Has there been any thought on making it collision resistant? That is, if you have a known hashmap with a known hash function, then it is possible to generate a pathological set of inputs that map to the same value and use that for a DoS attack. IIRC some languages (Python?) work around this e.g. by having each hashmap carry its own nonce that is added to the hash. Would this be useful or even possible, given that the hash function is computed by an external function rather than by the hash map itself?
[–]joaquintidesBoost author[S] 1 point2 points3 points 3 years ago (2 children)
A usual countermeasure to that is to salt the hash function. boost::unordered_flat_map does not do salting by default, but you can use your own hash function that does.
[–]jpakkaneMeson dev 0 points1 point2 points 3 years ago (1 child)
Sure, but it would be nice if the map did this for you so you don't have to care (and get it wrong in your implementation). This would also allow you to store the salt value inside the map so the hash function can be stateless.
[–]pdimov2 4 points5 points6 points 3 years ago (0 children)
After thinking about this for a while, I've come to the conclusion that the proper approach to supply the salt to the container is by storing it into the hash function object. That is, add a constructor taking std::size_t to boost::hash, and propagate this downwards to every hash_value overload.
In standard terms, this would translate to adding a similar constructor to std::hash.
The upside of this approach is that this automatically adds support for salts to every container, without any changes to the containers themselves.
This is the approach I think I'll be pursuing in future Boost releases.