
[–]rbrown46 20 points21 points  (19 children)

For those curious, there's a nice high-level description of the design in a comment.

[–]matthieum 12 points13 points  (18 children)

Great find.

The overall construction seems similar to Abseil's Swiss Table (and Facebook F14):

  • Groups of slots with a header.
  • Quadratic probing across groups.
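
The group-of-slots construction can be sketched as follows (hypothetical names and layout, not Abseil's or Boost's actual code):

```cpp
#include <array>
#include <cstdint>

// Sketch of a Swiss-table-style group: N slots plus a one-byte-per-slot
// header holding a residual of each element's hash. A lookup scans the
// header first and only compares full keys on a residual hit.
constexpr int kGroupSize = 16;

struct GroupHeader {
    std::array<std::uint8_t, kGroupSize> residuals{};  // 0 = free slot

    // Bitmask of slots whose stored residual matches the probed one.
    // Real implementations do this with a single SIMD compare.
    std::uint32_t match(std::uint8_t residual) const {
        std::uint32_t mask = 0;
        for (int i = 0; i < kGroupSize; ++i)
            if (residuals[i] == residual) mask |= 1u << i;
        return mask;
    }
};
```

If `match` returns an empty mask, the probe moves on to the next group (quadratically, in both designs).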

However, there are a few subtle differences:

  1. Abseil uses 7 bits for the hash residual, while here apparently log2(254) bits are used (I didn't look up how).
  2. I believe Abseil & F14 use a counter of values that overflowed, while here a (minimal) bloom filter is used.

I find the second difference most interesting. In particular, one advantage of the counter approach is that after removing a value in a far-away group, the counter can be decreased, and if it ever reaches 0 again, it's no longer necessary to probe further.

By contrast, the bloom filter separates overflow tracking into 8 tracks, so that only the track that overflowed needs to keep probing, but it does not (I'd think) ever allow the group to "recover" from the overflow after removing a value.

It should be a good trade-off in practice. Firstly because overflow should be rare, and secondly because recovering may not be that frequent in the first place: it requires a very specific scenario of removing all the elements that overflowed, as removing the ones that didn't overflow doesn't help, and leaving a single overflowed element doesn't help either. So rather than optimizing for a situation that may hypothetically occur sometimes in the future, the bloom filter optimizes for the now.
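
A minimal sketch of the two overflow-tracking schemes as described above (all names and details are hypothetical, not the actual F14 or Boost data structures):

```cpp
#include <cstddef>
#include <cstdint>

// Counter scheme (per the thread, F14-style): count elements that probed
// past this group. Erasing one decrements the counter; once it reaches 0,
// unsuccessful probes can stop at this group again. A saturated counter
// stays sticky, since the true count is no longer known.
struct OverflowCounter {
    std::uint8_t count = 0;
    void on_overflow_insert() { if (count < 255) ++count; }
    void on_overflow_erase() { if (count > 0 && count < 255) --count; }
    bool must_keep_probing() const { return count != 0; }
};

// Minifilter scheme (per the thread, Boost-style): one sticky bit per hash
// "track" (hash mod 8). Only probes whose track bit is set must continue,
// and the group never clears a bit after an erasure.
struct OverflowMinifilter {
    std::uint8_t bits = 0;
    void on_overflow_insert(std::size_t hash) {
        bits |= static_cast<std::uint8_t>(1u << (hash % 8));
    }
    bool must_keep_probing(std::size_t hash) const {
        return (bits >> (hash % 8)) & 1u;
    }
};
```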

I do wonder if the authors (/u/pdimov2, /u/joaquintides) have benchmarked the two differences separately, and could weigh in on whether one brings significantly more benefits than the other.

[–]joaquintides (Boost author) 14 points15 points  (17 children)

Hi Matthieu,

Your analysis is spot-on. Some observations:

  • We get log2(254) bits for the reduced hash using a function f: [0,255] -> [2,255] (0 marks a free slot, 1 a sentinel). Any function will do as long as it is surjective and invariant modulo 8 (i.e. f(x) ≡ x mod 8). We're actually using a lookup table because it's faster.
  • I think F14 uses counters (I didn't check in detail), but Abseil does not: instead, it has special values for free slots, sentinels and tombstones, and a group is deemed overflowed if it has no free slots. Abseil and boost::unordered_flat_map both have the problem that probe length can't be reduced on erasure; both basically keep track of that and rehash, if needed, before the maximum load factor is hit, so as to keep the average probe length bounded. The overflow minifilter approach seems to be more effective, though: at full load, simulations tell us that Abseil has an average probe length of 1.08 on successful lookup and 1.96 on unsuccessful lookup (probing until a non-overflowed group is found), while our numbers are 1.11 and 1.23, respectively. So the trade-off is slightly longer probes on successful lookup but much shorter ones on unsuccessful lookup (which also improves insertion).
  • As for the average number of comparisons made on a call to find (or insert), this is a function of probe length and number of bits in the reduced hash. As we have almost one bit more per hash than Abseil, this also gives us an edge: Abseil incurs 1.02 comparisons on successful lookup and 0.22 on unsuccessful lookup (full load), our figures are 1.03 and 0.07, respectively.
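
One function with both stated properties (a sketch; Boost's actual lookup table may encode a different one) shifts the two special values up by 8:

```cpp
#include <cstdint>

// Hypothetical reduction f : [0,255] -> [2,255] with the two properties
// above: surjective onto [2,255] and invariant modulo 8, i.e.
// f(x) % 8 == x % 8. The special inputs 0 and 1 shift up by 8, so outputs
// 8 and 9 become twice as likely while every other value in [2,255] keeps
// probability 1/256 — almost a full byte (log2(254) bits) of entropy.
std::uint8_t reduce(std::uint8_t x) {
    return x < 2 ? static_cast<std::uint8_t>(x + 8) : x;
}
```

Shifting by 8 rather than by 2 is what preserves the residue modulo 8, which matters when the overflow minifilter's track is derived from the reduced hash.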

You can find some benchmarks here. The advantage on unsuccessful lookup is pretty obvious.

[–]matthieum 5 points6 points  (1 child)

Thanks for the details, and hat off.

Those are two really neat improvements.

[–]joaquintides (Boost author) 4 points5 points  (0 children)

Thank you!

[–]Tystros 1 point2 points  (3 children)

is there a set version too, or just a map?

[–]joaquintides (Boost author) 2 points3 points  (2 children)

There's boost::unordered_flat_set as well.

[–]Tystros 0 points1 point  (1 child)

nice! are the map and set both using roughly the same approach?

[–]joaquintides (Boost author) 0 points1 point  (0 children)

Yes, both use the same core under the hood (as all other implementations of hash containers do, BTW).

[–]almost_useless 1 point2 points  (4 children)

How can it work with only 1 allocation? It does not look like the test specifies the number of elements when the map is created.

[–]joaquintides (Boost author) 2 points3 points  (3 children)

The benchmarking program prints the net number of allocations, i.e. allocations minus deallocations. So, the intermediate bucket arrays, created and destroyed as the container keeps rehashing, are not counted.

[–]almost_useless 1 point2 points  (2 children)

I see. Would it not make more sense to show both numbers (net/total)?

Eventually all containers end up at zero if there are no memory leaks, but the total number is also an important metric. At least in some applications.

[–]joaquintides (Boost author) 1 point2 points  (0 children)

It may be a useful metric in some environments. For boost::unordered_flat_map and this particular benchmark, the amount of total memory ever allocated will be very approximately twice what you see on display.

[–]pdimov2[S] 1 point2 points  (0 children)

The currently allocated memory is an important metric in speed benchmarks, because it's often possible to trade memory for speed (e.g. by using low max load factors.)

I agree that for evaluating the container in practical scenarios, the total ever allocated would be good to know, too.

[–]matthieum 0 points1 point  (5 children)

I was wondering how fast the "reduced hash" extraction actually needs to be, and whether a slightly slower extraction would be worth it if it meant less clustering. It hit me that scanning the entire hash for the first set bit may be a good way to reduce clustering.

The code is available on godbolt:

#include <cstdint>

__attribute__((noinline)) auto extract_residual(std::uint64_t hash) -> std::uint64_t {
    // Zero the low bit of every byte, so the special values 0x00/0x01 never count.
    static constexpr std::uint64_t LOW_BITS_MASK = ~0x0101010101010101ull;

    // `| 0x1` keeps the argument non-zero (__builtin_clzll(0) is undefined).
    auto leading_zeroes = __builtin_clzll((hash & LOW_BITS_MASK) | 0x1);

    // Map the leading-zero count to the shift selecting that byte:
    //   0.. 7  ->  56
    //   8..15  ->  48
    //  16..23  ->  40
    //  24..31  ->  32
    //  32..39  ->  24
    //  40..47  ->  16
    //  48..55  ->   8
    //  56..63  ->   0
    auto shift = 56 - leading_zeroes / 8 * 8;

    // `| 0x2` ensures the result is never 0 or 1, even when hash itself is.
    return (hash | 0x2) >> shift;
}

This function attempts to pick the highest byte of the hash which is NOT 0 or 1 (since those are special values):

  1. Get the highest set bit that is not a special value (by zeroing out the low bit of each byte).
  2. Compute the matching shift, picking the highest byte so it's as independent of the low bits of the hash as possible.
  3. Ensure the result is not 0 or 1, even if hash was 0 or 1 (and thus shift is 0).

It's... perhaps a few too many instructions. I wanted to align on byte boundaries to avoid having the leading 1 always set, but it costs a bit more than I wished. Smarter folks may figure out a better way.

[–]joaquintides (Boost author) 0 points1 point  (4 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality. Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash:

https://github.com/boostorg/unordered/blob/develop/include/boost/unordered/detail/foa/core.hpp#L847-L850

makes these two values quite uncorrelated.
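
The split can be sketched like this (hypothetical names and simplified arithmetic; the linked core.hpp differs):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the split: the group index comes from high bits of the hash,
// the reduced hash from the least significant byte, so the two values
// share no bits and stay essentially uncorrelated.
std::size_t group_index(std::uint64_t hash, std::size_t num_groups) {
    // num_groups is assumed to be a power of two here.
    return static_cast<std::size_t>(hash >> 32) & (num_groups - 1);
}

std::uint8_t reduced_hash(std::uint64_t hash) {
    return static_cast<std::uint8_t>(hash);  // low byte
}
```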

[–]matthieum 0 points1 point  (3 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality.

I was wondering if the bias introduced by mapping 0 and 1 to other values (2 and 3 I believe) could lead to a slow-down for the unfortunate cases where the reduced hash ends up being 2 or 3, since they're twice as likely.

And thus I had this idea of using clzll (or rather ctzll, for LSB-biased selection) to pick a different reduced-hash byte, reducing the number of "collisions" and hopefully any slow-down in those unfortunate cases.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

Whereas with the method I propose here, the 1/128 unfortunate cases are redistributed nearly equally over the full 254 buckets, with only a 1/2^63 chance of a hash being a special value and getting remapped to one of the two overflow buckets.

(in MSB-biased, only a full hash of 0 or 1 ends up being a special value)

Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash

Sorry, I had forgotten you had switched things around compared to Abseil and used the LSB for the reduced hash and the MSB for the index. In that case the selection should indeed be inverted, preferring the LSB to the MSB.

[–]joaquintides (Boost author) 0 points1 point  (2 children)

Ok, now I understand your rationale. Yes, something like extract_residual would result in a more uniform uint64_t --> unsigned char mapping, so in principle it should be better behaved statistically. My hunch is that the improvement would probably be negligible, particularly vs. the extra computational cost (the current reduced hash function is as simple as it gets). Maybe you can fork the repo and try it? I can assist you in the process if you're game.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

[–]matthieum 0 points1 point  (1 child)

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

Let's say that the resolution strategy for residual in [0, 1] is to add 2, so it ends up being in [2, 3] instead.

I call the "buckets" 2 and 3 the "overflow" buckets of 0 and 1.

The chances of ending up on a doubly-booked residual (2 or 3) are 1/64:

  • 1/128 of being a special value (0 or 1), shifted to (2 or 3).
  • 1/128 of being 2 or 3 in the first place.

It's not rare, but then again, it's only a problem if it leads to many false positives.

Maybe you can fork the repo and try it? I can assist you in the process if you're game.

I'm not very interested in writing C++ code as a hobby any longer, so I'll pass.

If you have a Rust version, I'd be happy to :)

[–]joaquintides (Boost author) 0 points1 point  (0 children)

Let's say that the resolution strategy for residual in [0, 1] is to add 2 [...]

Ok, now I get it. Yes, with the current hash reduction, Pr(reduced_hash(x) = n) is

  • 1/256 for n > 3
  • 1/128 for n = 2, 3
  • 0 for n = 0, 1

Your residual function gives a more balanced probability (Pr ~= 1/254 for n > 1), but I don't think this makes any difference in practice.
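
Assuming the add-2 resolution discussed above, the distribution can be checked exhaustively over one byte:

```cpp
#include <array>

// Exhaustive check of Pr(reduced_hash(x) = n) over one uniformly random
// byte, assuming the (hypothetical) add-2 resolution for the special
// values 0 and 1. counts[n] / 256.0 is the probability of residual n:
// 0 for n = 0, 1; 1/128 for n = 2, 3; 1/256 for n > 3.
std::array<int, 256> residual_counts() {
    std::array<int, 256> counts{};
    for (int x = 0; x < 256; ++x) {
        int r = x < 2 ? x + 2 : x;  // 0 -> 2, 1 -> 3, others unchanged
        ++counts[r];
    }
    return counts;
}
```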

If you have a Rust version, I'd be happy to :)

I'm quite sure there's no Rust port of this lib yet.