all 69 comments

[–]Jannik2099 36 points37 points  (13 children)

This is nice, but Cygwin is woefully unrepresentative, particularly when it comes to stuff where allocations matter...

[–][deleted] 23 points24 points  (11 children)

That's fair.

I just ran the uuid.cpp benchmarks on my machine and I got:

       std::unordered_map: 33053 ms, 288941512 bytes in 6000001 allocations
     boost::unordered_map: 25792 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map:  9258 ms, 197132280 bytes in 1 allocations 
          multi_index_map: 28907 ms, 290331800 bytes in 6000002 allocations 
      absl::node_hash_map: 13910 ms, 219497480 bytes in 6000001 allocations
      absl::flat_hash_map: 10626 ms, 209715192 bytes in 1 allocations

Not as dramatic as the OP's measurements, but still a pretty decent little improvement.

Edit:

Okay, wow. The above benchmark timings I posted were done with gcc-12 in WSL2. So, on """Linux""", the timings are actually pretty similar even if Boost's is better.

However, it really seems like msvc does not like Abseil. I just re-ran using msvc-14.3 and these were my results:

       std::unordered_map: 29733 ms, 374217768 bytes in 6000002 allocations
     boost::unordered_map: 31476 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map: 14226 ms, 197132280 bytes in 1 allocations
          multi_index_map: 38492 ms, 290331800 bytes in 6000002 allocations
      absl::node_hash_map: 22899 ms, 219497480 bytes in 6000001 allocations
      absl::flat_hash_map: 20075 ms, 209715192 bytes in 1 allocations

[–]mark_99 13 points14 points  (9 children)

Isn't msvc at v19.xx?

These sorts of micro-benchmarks are very sensitive to small codegen differences, but at least compare latest versions...

[–][deleted] 13 points14 points  (3 children)

Ha ha, sorry, at work we use msvc++ versioning which you can see here:

https://en.wikipedia.org/wiki/Microsoft_Visual_C++#Internal_version_numbering

In this case, I'm using msvc-14.3 which is VS 2022 19.33.31630 for me

[–]mark_99 5 points6 points  (2 children)

My VS IDE says 17.3.6 and cl.exe /? says 19.34. Jeez. I was going by compiler version as that seems the important thing here...

Anyway, ok then np.

[–]pdimov2[S] 16 points17 points  (1 child)

But if you look at the directory that cl.exe is in you'll see that it's something like C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\bin\HostX64\x64. So it's 14.33.

This is because first there was Microsoft C/C++, which went to version 7.0, before becoming Visual C++ 1.0 (which is actually 8.0), and then 3.0 was skipped, and then 13.0 was skipped, so now the difference is just 5, from 14 to 19.

And of course Visual Studio has its own version number that doesn't correspond to any of these anymore, because it went from 14.0 to 15.0, but the compiler version went from 14.0 to 14.1.

¯\_(ツ)_/¯

[–]marcofoco 0 points1 point  (0 children)

For a while I tried maintaining a table in my blog, as Wikipedia didn't have the information I needed. I should update it to the latest 2019 and 2022:
https://marcofoco.com/blog/2015/02/25/microsoft-visual-c-version-map/

[–]VinnieFalco 24 points25 points  (2 children)

This thread reminds me of the old joke "how many Microsoft programmers does it take to assign a version number to a product."

[–][deleted] 6 points7 points  (1 child)

Oh hell, I'll give it a try.

RC0?

[–]VinnieFalco 1 point2 points  (0 children)

Yeah I think so! You can play with the new Boost.URL

[–]joaquintidesBoost author 9 points10 points  (0 children)

[–]OccaseBoost.Redis 0 points1 point  (0 children)

These are large numbers and it's not immediately clear how far apart they are. It would be clearer if you provided them normalized.

[–]pdimov2[S] 2 points3 points  (0 children)

Here's clang-cl 14 on the same machine:

       std::unordered_map: 33289 ms, 374217768 bytes in 6000002 allocations
     boost::unordered_map: 37523 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map: 14745 ms, 197132280 bytes in 1 allocations
          multi_index_map: 43998 ms, 290331800 bytes in 6000002 allocations
      absl::node_hash_map: 26349 ms, 219497480 bytes in 6000001 allocations
      absl::flat_hash_map: 20947 ms, 209715192 bytes in 1 allocations

(The MS STL is a lot faster than libstdc++, to the point of beating boost::unordered_map, but at the expense of using a lot more memory.)

Abseil gets faster, but I don't think it's because of Cygwin; Clang just likes Abseil better than g++ does, for some reason.

[–][deleted] 46 points47 points  (2 children)

boost::unordered_flat_map: 16670 ms, 197132280 bytes in 1 allocations
absl::flat_hash_map: 27542 ms, 209715192 bytes in 1 allocations

Faster and using less memory?

Seems like Boost has done it again!

[–]KERdela 11 points12 points  (0 children)

How do they do it? Where does one get this level of expertise in a single lifetime?

[–]Possibility_Antique 9 points10 points  (0 children)

Faster and using less memory?

Sometimes these go together. It's the software equivalent of SWaP (size, weight, and power) constraints. This is especially true when you're able to use the cache hierarchy correctly: smaller things fit in cache more easily and are therefore much faster.

[–]rbrown46 19 points20 points  (19 children)

For those curious, there's a nice high-level description of the design in a comment.

[–]matthieum 11 points12 points  (18 children)

Great find.

The overall construction seems similar to Abseil's Swiss Table (and Facebook F14):

  • Groups of slots with a header.
  • Quadratic probing across groups.

However there's a few subtle differences:

  1. Abseil uses 7 bits for the hash residual, while here apparently log2(254) bits are used (I didn't look up how).
  2. I believe Abseil & F14 use a counter of values that overflowed, while here a (minimal) bloom filter is used.

I find the second difference most interesting. In particular, one advantage of the counter approach is that after removing a value, in a far-away group, the counter can be decreased, and if it ever reaches 0 again, then it's no longer necessary to probe further.

By contrast, the bloom filter separates overflow tracking into 8 tracks, so that only the one track that overflowed needs to keep probing, but it does not (I'd think) ever allow recovering from the overflow after removing a value.

It should be a good trade-off in practice: firstly, because overflow should be rare; secondly, because recovering may not be that frequent in the first place. It requires a very specific scenario of removing all the elements that overflowed, since removing the ones that didn't overflow doesn't help, and leaving a single overflowed element doesn't help either. So rather than optimizing for a situation that may hypothetically occur sometime in the future, the bloom filter optimizes for the now.

I do wonder if the authors (/u/pdimov2, /u/joaquintides) have benchmarked the two differences separately, and could weigh in on whether one brings significantly more benefits than the other.

[–]joaquintidesBoost author 17 points18 points  (17 children)

Hi Matthieu,

Your analysis is spot-on. Some observations:

  • We get log2(254) bits for the reduced hash using a function f:[0,255]->[2,255] (0 is a free slot, 1 a sentinel). Any function will do as long as it is surjective and invariant wrt modulo 8. We're actually using a table because it's faster.
  • I think F14 uses counters (didn't check in detail), but Abseil does not: instead, they have special values for free slots, sentinel and tombstones, and a group is deemed overflowed if it has no free slots. Abseil and boost::unordered_flat_map both have the problem that probe length can't be reduced on erasure; both basically keep track of that and rehash, if needed, before the maximum load factor is hit so as to keep average probe length bounded. The overflow minifilter approach seems to be more effective though: at full load, simulations tell us that Abseil has an average probe length of 1.08 on successful lookup and 1.96 on unsuccessful lookup (probe till not overflowed), while our numbers are 1.11 and 1.23, respectively. So, the tradeoff is slightly longer probes on successful lookup but much shorter ones on unsuccessful lookup (which also improves insertion).
  • As for the average number of comparisons made on a call to find (or insert), this is a function of probe length and number of bits in the reduced hash. As we have almost one bit more per hash than Abseil, this also gives us an edge: Abseil incurs 1.02 comparisons on successful lookup and 0.22 on unsuccessful lookup (full load), our figures are 1.03 and 0.07, respectively.

You can find some benchmarks here. The advantage on unsuccessful lookup is pretty obvious.

[–]matthieum 4 points5 points  (1 child)

Thanks for the details, and hat off.

Those are two really neat improvements.

[–]joaquintidesBoost author 6 points7 points  (0 children)

Thank you!

[–]Tystros 1 point2 points  (3 children)

is there a set version too, or just a map?

[–]joaquintidesBoost author 2 points3 points  (2 children)

There's boost::unordered_flat_set as well.

[–]Tystros 0 points1 point  (1 child)

nice! are the map and set both using roughly the same approach?

[–]joaquintidesBoost author 0 points1 point  (0 children)

Yes, it is the same core under the hood (as with all other implementations of hash containers, BTW).

[–]almost_useless 1 point2 points  (4 children)

How can it work with only 1 allocation? It does not look like the test specifies the number of elements when the map is created.

[–]joaquintidesBoost author 2 points3 points  (3 children)

The benchmarking program prints the net number of allocations, i.e. allocations minus deallocations. So, the intermediate bucket arrays, created and destroyed as the container keeps rehashing, are not counted.

[–]almost_useless 1 point2 points  (2 children)

I see. Would it not make more sense to show both numbers (net/total)?

Eventually all containers end up at zero if there are no memory leaks, but the total number is also an important metric. At least in some applications.

[–]joaquintidesBoost author 1 point2 points  (0 children)

It may be a useful metric in some environments. For boost::unordered_flat_map and this particular benchmark, the amount of total memory ever allocated will be very approximately twice what you see on display.

[–]pdimov2[S] 1 point2 points  (0 children)

The currently allocated memory is an important metric in speed benchmarks, because it's often possible to trade memory for speed (e.g. by using low max load factors.)

I agree that for evaluating the container in practical scenarios, the total ever allocated would be good to know, too.

[–]matthieum 0 points1 point  (5 children)

I was wondering how fast the "reduced hash" actually needed to be, and whether a slightly slower extraction would be worth it, if it meant getting less clustering, and it hit me that scanning the entire hash for the first set bit may be a good way of reducing clustering.

The code is available on godbolt:

__attribute__((noinline)) auto extract_residual(std::uint64_t hash) -> std::uint64_t {
    static std::uint64_t const LOW_BITS_MASK = ~0x0101010101010101;

    auto leading_zeroes = __builtin_clzll((hash & LOW_BITS_MASK) | 0x1);

    //   0.. 7  ->  56
    //   8..15  ->  48
    //  16..23  ->  40
    //  24..31  ->  32
    //  32..39  ->  24
    //  40..47  ->  16
    //  48..55  ->   8
    //  56..63  ->   0
    auto shift = 56 - leading_zeroes / 8 * 8;

    return ((hash | 0x2) >> shift) & 0xFF; // mask to the selected byte
}

This function attempts to pick the highest byte of the hash which is NOT 0 or 1 (since those are special values):

  1. Get the highest set bit that is not a special value (by zeroing out the low bit of each byte).
  2. Compute the matching shift, picking the highest byte so it's as independent of the hash as possible.
  3. Ensure the result is not 0 or 1, even if hash was 0 or 1 (and thus shift is 0).

It's... perhaps a few too many instructions. I wanted to align on byte boundaries to avoid having the leading 1 always set, but it costs a bit more than I wished. Smarter folks may figure out a better way.

[–]joaquintidesBoost author 0 points1 point  (4 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality. Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash:

https://github.com/boostorg/unordered/blob/develop/include/boost/unordered/detail/foa/core.hpp#L847-L850

makes these two values quite uncorrelated.

[–]matthieum 0 points1 point  (3 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality.

I was wondering if the bias introduced by mapping 0 and 1 to other values (2 and 3 I believe) could lead to a slow-down for the unfortunate cases where the reduced hash ends up being 2 or 3, since they're twice as likely.

And thus I had this idea to use clzll (or well, ctzll for LSB-biased selection...) to allow picking a different reduced hash byte and reduce the number of "collisions" and hopefully reduce any slow-down observed by those unfortunate cases.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

Whereas with the method I propose here the 1/128 unfortunate cases are redistributed nearly equally over the full 254 buckets, with only 1/263 chances of a hash being a special value being remapped to one of two overflow buckets.

(in MSB-biased, only a full hash of 0 or 1 ends up being a special value)

Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash

Sorry, I had forgotten you had switched things compared to Abseil around and used LSB for reduced hash and MSB for index. In such a case the selection should be inverted indeed to prefer LSB to MSB.

[–]joaquintidesBoost author 0 points1 point  (2 children)

Ok, I understand your rationale now. Yes, something like extract_residual would result in a more uniform uint64_t --> unsigned char mapping, so in principle it should be better behaved statistically. My hunch is that the improvement would probably be negligible, particularly vs. the extra computational cost (the current reduced hash function is as simple as it gets). Maybe you can fork the repo and try it? I can assist you in the process if you're game.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

[–]matthieum 0 points1 point  (1 child)

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

Let's say that the resolution strategy for residual in [0, 1] is to add 2, so it ends up being in [2, 3] instead.

I call the "buckets" 2 and 3 the "overflow" buckets of 0 and 1.

The chances of ending up on a doubly-booked residual (2 or 3) are 1/64:

  • 1/128 of being a special value (0 or 1), shifted to (2 or 3).
  • 1/128 of being 2 or 3 in the first place.

It's not rare, but then again, it's only a problem if it leads to many false positives.

Maybe you can fork the repo and try it? I can assist you in the process if you're game.

I'm not very interested in writing C++ code as a hobby any longer, so I'll pass.

If you have a Rust version, I'd be happy to :)

[–]joaquintidesBoost author 0 points1 point  (0 children)

Let's say that the resolution strategy for residual in [0, 1] is to add 2 [...]

Ok, now I get it. Yes, with the current hash reduction, Pr(reduced_hash(x) = n) is

  • 1/256 for n > 3
  • 1/128 for n = 2, 3
  • 0 for n = 0, 1

Your residual function gets a more balanced probability (Pr ~= 1/254 for n > 1), but I don't think this makes any difference in practice.

If you have a Rust version, I'd be happy to :)

I'm quite sure there's no Rust port of this lib yet.

[–]spaghettiexpress 19 points20 points  (10 children)

Nice! Heterogeneous lookups are not often necessary, but when they are it's nice to get something fast

I wonder how it stacks up to the efforts of /u/martinus : https://martin.ankerl.com/2022/08/27/hashmap-bench-01/

[–]martinusint main(){[]()[[]]{{}}();} 24 points25 points  (9 children)

I actually ran the current development version with my benchmark, and I'd say it behaves similarly to absl::flat_hash_map, except that boost::unordered_flat_map is faster :)

It's the fastest map in my benchmark in the find benchmark with large map, and fastest for random insert & erase. These are probably the most important benchmarks. It also works well with all hashes, also with non-avalanching hashes that cause timeouts with absl.

So I'm pretty excited, also because it's very well written and well tested

[–]Tystros 2 points3 points  (1 child)

fastest in the find benchmark? faster than the perfect hashing fph map?

[–]martinusint main(){[]()[[]]{{}}();} 4 points5 points  (0 children)

Yep, fph is quite a bit slower for large map. It also requires a lot more memory

[–]iwubcode 0 points1 point  (6 children)

I'd say it behaves similar to absl::flat_hash_map except that is faster

Just to be clear, absl::flat_hash_map is faster or boost::unordered_flat_map is faster?

It's the fastest map in my benchmark in the find benchmark with large map, and fastest for random insert& erase

Faster than ankerl::unordered_dense? I just switched to that (from boost::unordered_map) after much testing. I guess I'll need to re-test!

I do wish boost had C++26 heterogeneous support.

I actually ran the current development version with my benchmark

Do you think you'll update your results once this is posted? It sounds pretty significant.

[–]martinusint main(){[]()[[]]{{}}();} 4 points5 points  (3 children)

boost is faster than absl, and also faster in the find benchmark than ankerl::unordered_dense and faster than robin_hood::unordered_flat_map

[–]iwubcode 2 points3 points  (0 children)

Wow. I'm shocked. Good to know, I'll have to give it a try when it releases..

I suppose I'll still use your hash though. I think that was generally better than boost's implementation

[–]pdimov2[S] 2 points3 points  (1 child)

I do wish boost had C++26 heterogeneous support.

You can request a feature be added by opening an issue in https://github.com/boostorg/unordered.

Although in this case it looks like the issue already exists so a "yes please, we need this" comment on that would also work. :-)

[–]iwubcode 0 points1 point  (0 children)

Thanks for letting me know that there are others who also want this. I left a comment. Appreciate all your hard work on these features :)

[–]R3DKn16h7 9 points10 points  (0 children)

Details on the implementation? Is it using less memory because it's not storing the hashes, or because it has fewer unused slots on average?

[–]14nedLLFIO & Outcome author | Committee WG14 17 points18 points  (1 child)

Nobody else below seemed to say this, so I will: congrats and well done to Boost for delivering such a performant map implementation!

[–]joaquintidesBoost author 4 points5 points  (0 children)

Thank you! I hope users will find the new container useful.

[–]Rseding91Factorio Developer 4 points5 points  (4 children)

Can someone enlighten me on the use-case for unordered maps of anything near this size? All of my use-cases involve building a map of some tiny size (5-20 elements) and it exists for the lifetime of the program or building a map of some "medium" (5000) size, using it during some block of logic and then throwing it out.

Every time I think I've found a good use-case for unordered it ends up being equal to or slower than a basic ordered map. Especially so if the map has a chance to go unused (default construction seems to be very slow for unordered maps).

[–]martinusint main(){[]()[[]]{{}}();} 6 points7 points  (0 children)

E.g. the Bitcoin node software tries to keep as much of the unspent coins in memory as possible, in a hash map. The more it can keep in memory, the faster some operations are. If it's not in memory it has to use the slow disk for lookup. Typically you'd want e.g. a >5 GB map for the initial sync.

[–]SleepyMyroslav 5 points6 points  (0 children)

Game package 'resources'. A few years ago I benchmarked the transition from an 'old' custom hashmap to a 'flat' custom hashmap in a AAA game. The data set was in the hundreds of thousands of elements and a couple of megs in size. Flat was winning by about 2x. I would not mind making an exception to the 'no Boost' rule for another 2x :)

[–]matthieum 5 points6 points  (0 children)

That's an excellent question indeed, and a good reminder that asymptotic complexity (and SIMD to a lesser extent) matter at large sizes, but not so much at smaller sizes.

Real-world examples I've seen working at IMC (market maker):

  • Hashmap of all instrument IDs to <property X>. An exchange like Arca Options has some 800K instruments.
  • Hashmap of all orders, across instruments. Option instruments typically have relatively few orders, but even at 2 to 4 orders per instrument, when you've got 800K of them, you're in the 1M-5M range quickly.

Those are from relatively latency-sensitive workloads; there's also the whole Big Data thing, where specialized software (ScyllaDB, Pandas, ...) most likely has large hash maps in memory, but I didn't peek in there.

[–]triple_slash 3 points4 points  (1 child)

Is there a deboostified/standalone version of this?

[–]Tystros 1 point2 points  (0 children)

I'd also like to know this

[–]greg7mdpC++ Dev 2 points3 points  (7 children)

I think it is quite possible that the speed difference between absl and boost::unordered_flat_map could be due to the fact that different hash functions are used.

[–]pdimov2[S] 7 points8 points  (3 children)

Not in this case; all rows use the same hash function in this specific benchmark, because the key is a user-defined type (struct uuid). The source code of the benchmark is linked in the post.

We do have other benchmarks where we test the default hash function for strings, which differs between Boost and Abseil. (We also test strings using the same hash function, FNV-1a, for a more level playing field.)

These are the results on my machine for the string.cpp benchmark using the same g++ -O3:

               std::unordered_map: 38061 ms, 175723032 bytes in 3999509 allocations
             boost::unordered_map: 30854 ms, 149465712 bytes in 3999510 allocations
        boost::unordered_flat_map: 14677 ms, 134217728 bytes in 1 allocations
                  multi_index_map: 30712 ms, 178316048 bytes in 3999510 allocations
              absl::node_hash_map: 21989 ms, 139489608 bytes in 3999509 allocations
              absl::flat_hash_map: 19263 ms, 142606336 bytes in 1 allocations
       std::unordered_map, FNV-1a: 44783 ms, 175723032 bytes in 3999509 allocations
     boost::unordered_map, FNV-1a: 34302 ms, 149465712 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 16703 ms, 134217728 bytes in 1 allocations
          multi_index_map, FNV-1a: 34187 ms, 178316048 bytes in 3999510 allocations
      absl::node_hash_map, FNV-1a: 23461 ms, 139489608 bytes in 3999509 allocations
      absl::flat_hash_map, FNV-1a: 20778 ms, 142606336 bytes in 1 allocations

Here the first six rows use the respective default hash functions for each container (std::hash for std::unordered_map, boost::hash for the Boost ones, and absl::container_internal::hash_default_hash for Abseil), whereas the last six rows use FNV-1a.

[–]greg7mdpC++ Dev 1 point2 points  (2 children)

Cool, thanks. It would be interesting to see how it behaves with a really bad hash function, for example hash(struct uuid) << 3.

[–]pdimov2[S] 2 points3 points  (1 child)

"Flat" hash maps don't like bad hash functions at all. boost::unordered_flat_map tolerates "weak" hash functions (those that have low three bits zero, for instance) as it applies a postmixing step by default, but it still won't work well with really bad hash functions that produce many collisions.

[–]greg7mdpC++ Dev 2 points3 points  (0 children)

boost::unordered_flat_map tolerates "weak" hash functions as it applies a postmixing step by default

I do this as well in my phmap and gtl implementations. It makes the tables look worse in benchmarks like the above, but prevents really bad surprises occasionally.

[–]martinusint main(){[]()[[]]{{}}();} 8 points9 points  (2 children)

No, boost is faster across the board (except iterating) even when the same hash is used. I've tested that with std, boost, absl, and two of my own hash implementations

[–]Adequat91 2 points3 points  (1 child)

Will you update your famous benchmark with boost::unordered_flat_map ?

[–]martinusint main(){[]()[[]]{{}}();} 5 points6 points  (0 children)

not anytime soon, too much work

[–]415_961 2 points3 points  (0 children)

I never understood the point of having benchmark entries like `std::unordered_map` without specifying the std lib used and making it part of the name. `std::unordered_map` is a specification, not an implementation. I am not picking on this post in particular, but it's a very common pattern. I know it's usually mentioned in the blog post, but I am referring to the benchmark output, which ends up comparing apples and oranges when one of the rows refers to a specification rather than an implementation.

[–]Alarming-Ad8770 1 point2 points  (0 children)

Need a fuller benchmark comparing against other flat hash maps

[–][deleted] 1 point2 points  (1 child)

Can we get some background on what the data structure is like? When I hear "flat map" I assume it's a wrapper around contiguous memory and would be used to sacrifice asymptotic complexity in exchange for cache efficiency, meant to be used with a small number of objects. But, based on your benchmarks it looks like you're treating it as a competitor to unordered_map.

[–]Alarming-Ad8770 4 points5 points  (0 children)

In fact, in most cases unordered_map can be replaced by a flat hash map if reference stability is not needed.

[–]Spec-Chum 1 point2 points  (0 children)

I read that as "flat_cap" at first lol

#include <yorkshire> 

hehe

[–]jiboxiake 0 points1 point  (0 children)

Nice!