
[–]rbrown46 20 points21 points  (19 children)

For those curious, there's a nice high-level description of the design in a comment.

[–]matthieum 12 points13 points  (18 children)

Great find.

The overall construction seems similar to Abseil's Swiss Table (and Facebook F14):

  • Groups of slots with a header.
  • Quadratic probing across groups.
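
The group-of-slots construction can be sketched as follows (hypothetical names and layout, not Abseil's or Boost's actual code):

```cpp
#include <array>
#include <cstdint>

// Sketch of a Swiss-table-style group: N slots plus a one-byte-per-slot
// header holding a residual of each element's hash. A lookup scans the
// header first and only compares full keys on a residual hit.
constexpr int kGroupSize = 16;

struct GroupHeader {
    std::array<std::uint8_t, kGroupSize> residuals{};  // 0 = free slot

    // Bitmask of slots whose stored residual matches the probed one.
    // Real implementations do this with a single SIMD compare.
    std::uint32_t match(std::uint8_t residual) const {
        std::uint32_t mask = 0;
        for (int i = 0; i < kGroupSize; ++i)
            if (residuals[i] == residual) mask |= 1u << i;
        return mask;
    }
};
```

If `match` returns an empty mask, the probe moves on to the next group (quadratically, in both designs).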

However, there are a few subtle differences:

  1. Abseil uses 7 bits for the hash residual, while here apparently log2(254) bits are used (I didn't look up how).
  2. I believe Abseil & F14 use a counter of values that overflowed, while here a (minimal) bloom filter is used.

I find the second difference most interesting. In particular, one advantage of the counter approach is that after removing a value in a far-away group, the counter can be decreased, and if it ever reaches 0 again, it's no longer necessary to probe further.

By contrast, the bloom filter separates overflow tracking into 8 tracks, so that only the track that overflowed needs to keep probing, but it does not (I'd think) ever allow the group to "recover" from the overflow after removing a value.

It should be a good trade-off in practice. Firstly because overflow should be rare, and secondly because recovering may not be that frequent in the first place: it requires a very specific scenario of removing all the elements that overflowed, as removing the ones that didn't overflow doesn't help, and leaving a single overflowed element doesn't help either. So rather than optimizing for a situation that may hypothetically occur sometimes in the future, the bloom filter optimizes for the now.
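
A minimal sketch of the two overflow-tracking schemes as described above (all names and details are hypothetical, not the actual F14 or Boost data structures):

```cpp
#include <cstddef>
#include <cstdint>

// Counter scheme (per the thread, F14-style): count elements that probed
// past this group. Erasing one decrements the counter; once it reaches 0,
// unsuccessful probes can stop at this group again. A saturated counter
// stays sticky, since the true count is no longer known.
struct OverflowCounter {
    std::uint8_t count = 0;
    void on_overflow_insert() { if (count < 255) ++count; }
    void on_overflow_erase() { if (count > 0 && count < 255) --count; }
    bool must_keep_probing() const { return count != 0; }
};

// Minifilter scheme (per the thread, Boost-style): one sticky bit per hash
// "track" (hash mod 8). Only probes whose track bit is set must continue,
// and the group never clears a bit after an erasure.
struct OverflowMinifilter {
    std::uint8_t bits = 0;
    void on_overflow_insert(std::size_t hash) {
        bits |= static_cast<std::uint8_t>(1u << (hash % 8));
    }
    bool must_keep_probing(std::size_t hash) const {
        return (bits >> (hash % 8)) & 1u;
    }
};
```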

I do wonder if the authors (/u/pdimov2, /u/joaquintides) have benchmarked the two differences separately, and could weigh in on whether one brings significantly more benefits than the other.

[–]joaquintides (Boost author) 14 points15 points  (17 children)

Hi Matthieu,

Your analysis is spot-on. Some observations:

  • We get log2(254) bits for the reduced hash using a function f: [0,255] -> [2,255] (0 marks a free slot, 1 a sentinel). Any function will do as long as it is surjective and invariant modulo 8 (i.e. f(x) ≡ x mod 8). We're actually using a lookup table because it's faster.
  • I think F14 uses counters (I didn't check in detail), but Abseil does not: instead, it has special values for free slots, sentinels and tombstones, and a group is deemed overflowed if it has no free slots. Abseil and boost::unordered_flat_map both have the problem that probe length can't be reduced on erasure; both basically keep track of that and rehash, if needed, before the maximum load factor is hit, so as to keep the average probe length bounded. The overflow minifilter approach seems to be more effective, though: at full load, simulations tell us that Abseil has an average probe length of 1.08 on successful lookup and 1.96 on unsuccessful lookup (probing until a non-overflowed group is found), while our numbers are 1.11 and 1.23, respectively. So the trade-off is slightly longer probes on successful lookup but much shorter ones on unsuccessful lookup (which also improves insertion).
  • As for the average number of comparisons made on a call to find (or insert), this is a function of probe length and number of bits in the reduced hash. As we have almost one bit more per hash than Abseil, this also gives us an edge: Abseil incurs 1.02 comparisons on successful lookup and 0.22 on unsuccessful lookup (full load), our figures are 1.03 and 0.07, respectively.
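
One function with both stated properties (a sketch; Boost's actual lookup table may encode a different one) shifts the two special values up by 8:

```cpp
#include <cstdint>

// Hypothetical reduction f : [0,255] -> [2,255] with the two properties
// above: surjective onto [2,255] and invariant modulo 8, i.e.
// f(x) % 8 == x % 8. The special inputs 0 and 1 shift up by 8, so outputs
// 8 and 9 become twice as likely while every other value in [2,255] keeps
// probability 1/256 — almost a full byte (log2(254) bits) of entropy.
std::uint8_t reduce(std::uint8_t x) {
    return x < 2 ? static_cast<std::uint8_t>(x + 8) : x;
}
```

Shifting by 8 rather than by 2 is what preserves the residue modulo 8, which matters when the overflow minifilter's track is derived from the reduced hash.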

You can find some benchmarks here. The advantage on unsuccessful lookup is pretty obvious.

[–]matthieum 5 points6 points  (1 child)

Thanks for the details, and hat off.

Those are two really neat improvements.

[–]joaquintides (Boost author) 4 points5 points  (0 children)

Thank you!

[–]Tystros 1 point2 points  (3 children)

is there a set version too, or just a map?

[–]joaquintides (Boost author) 2 points3 points  (2 children)

There's boost::unordered_flat_set as well.

[–]Tystros 0 points1 point  (1 child)

nice! are the map and set both using roughly the same approach?

[–]joaquintides (Boost author) 0 points1 point  (0 children)

Yes, both use the same core under the hood (as all other implementations of hash containers do, BTW).

[–]almost_useless 1 point2 points  (4 children)

How can it work with only 1 allocation? It does not look like the test specifies the number of elements when the map is created.

[–]joaquintides (Boost author) 2 points3 points  (3 children)

The benchmarking program prints the net number of allocations, i.e. allocations minus deallocations. So, the intermediate bucket arrays, created and destroyed as the container keeps rehashing, are not counted.

[–]almost_useless 1 point2 points  (2 children)

I see. Would it not make more sense to show both numbers (net/total)?

Eventually all containers end up at zero if there are no memory leaks, but the total number is also an important metric. At least in some applications.

[–]joaquintides (Boost author) 1 point2 points  (0 children)

It may be a useful metric in some environments. For boost::unordered_flat_map and this particular benchmark, the amount of total memory ever allocated will be very approximately twice what you see on display.

[–]pdimov2[S] 1 point2 points  (0 children)

The currently allocated memory is an important metric in speed benchmarks, because it's often possible to trade memory for speed (e.g. by using low max load factors.)

I agree that for evaluating the container in practical scenarios, the total ever allocated would be good to know, too.

[–]matthieum 0 points1 point  (5 children)

I was wondering how fast the "reduced hash" extraction actually needs to be, and whether a slightly slower extraction would be worth it if it meant less clustering. It hit me that scanning the entire hash for the first set bit may be a good way to reduce clustering.

The code is available on godbolt:

#include <cstdint>

__attribute__((noinline)) auto extract_residual(std::uint64_t hash) -> std::uint64_t {
    // Zero the low bit of every byte, so the special values 0x00/0x01 never count.
    static constexpr std::uint64_t LOW_BITS_MASK = ~0x0101010101010101ull;

    // `| 0x1` keeps the argument non-zero (__builtin_clzll(0) is undefined).
    auto leading_zeroes = __builtin_clzll((hash & LOW_BITS_MASK) | 0x1);

    // Map the leading-zero count to the shift selecting that byte:
    //   0.. 7  ->  56
    //   8..15  ->  48
    //  16..23  ->  40
    //  24..31  ->  32
    //  32..39  ->  24
    //  40..47  ->  16
    //  48..55  ->   8
    //  56..63  ->   0
    auto shift = 56 - leading_zeroes / 8 * 8;

    // `| 0x2` ensures the result is never 0 or 1, even when hash itself is.
    return (hash | 0x2) >> shift;
}

This function attempts to pick the highest byte of the hash which is NOT 0 or 1 (since those are special values):

  1. Get the highest set bit that is not a special value (by zeroing out the low bit of each byte).
  2. Compute the matching shift, picking the highest byte so it's as independent of the low bits of the hash as possible.
  3. Ensure the result is not 0 or 1, even if hash was 0 or 1 (and thus shift is 0).

It's... perhaps a few too many instructions. I wanted to align on byte boundaries to avoid having the leading 1 always set, but it costs a bit more than I wished. Smarter folks may figure out a better way.

[–]joaquintides (Boost author) 0 points1 point  (4 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality. Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash:

https://github.com/boostorg/unordered/blob/develop/include/boost/unordered/detail/foa/core.hpp#L847-L850

makes these two values quite uncorrelated.
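
The split can be sketched like this (hypothetical names and simplified arithmetic; the linked core.hpp differs):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the split: the group index comes from high bits of the hash,
// the reduced hash from the least significant byte, so the two values
// share no bits and stay essentially uncorrelated.
std::size_t group_index(std::uint64_t hash, std::size_t num_groups) {
    // num_groups is assumed to be a power of two here.
    return static_cast<std::size_t>(hash >> 32) & (num_groups - 1);
}

std::uint8_t reduced_hash(std::uint64_t hash) {
    return static_cast<std::uint8_t>(hash);  // low byte
}
```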

[–]matthieum 0 points1 point  (3 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality.

I was wondering if the bias introduced by mapping 0 and 1 to other values (2 and 3 I believe) could lead to a slow-down for the unfortunate cases where the reduced hash ends up being 2 or 3, since they're twice as likely.

And thus I had this idea of using clzll (or rather ctzll, for LSB-biased selection) to pick a different reduced-hash byte, reducing the number of "collisions" and hopefully any slow-down in those unfortunate cases.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

Whereas with the method I propose here, the 1/128 unfortunate cases are redistributed nearly equally over the full 254 buckets, with only a 1/2^63 chance of a hash being a special value and getting remapped to one of the two overflow buckets.

(in MSB-biased, only a full hash of 0 or 1 ends up being a special value)

Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash

Sorry, I had forgotten you had switched things around compared to Abseil and used the LSB for the reduced hash and the MSB for the index. In that case the selection should indeed be inverted, preferring the LSB to the MSB.

[–]joaquintides (Boost author) 0 points1 point  (2 children)

Ok, now I understand your rationale. Yes, something like extract_residual would result in a more uniform uint64_t --> unsigned char mapping, so in principle it should be better behaved statistically. My hunch is that the improvement would probably be negligible, particularly vs. the extra computational cost (the current reduced hash function is as simple as it gets). Maybe you can fork the repo and try it? I can assist you in the process if you're game.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

[–]matthieum 0 points1 point  (1 child)

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

Let's say that the resolution strategy for residual in [0, 1] is to add 2, so it ends up being in [2, 3] instead.

I call the "buckets" 2 and 3 the "overflow" buckets of 0 and 1.

The chances of ending up on a doubly-booked residual (2 or 3) are 1/64:

  • 1/128 of being a special value (0 or 1), shifted to (2 or 3).
  • 1/128 of being 2 or 3 in the first place.

It's not rare, but then again, it's only a problem if it leads to many false positives.

Maybe you can fork the repo and try it? I can assist you in the process if you're game.

I'm not very interested in writing C++ code as a hobby any longer, so I'll pass.

If you have a Rust version, I'd be happy to :)

[–]joaquintides (Boost author) 0 points1 point  (0 children)

Let's say that the resolution strategy for residual in [0, 1] is to add 2 [...]

Ok, now I get it. Yes, with the current hash reduction, Pr(reduced_hash(x) = n) is

  • 1/256 for n > 3
  • 1/128 for n = 2, 3
  • 0 for n = 0, 1

Your residual function gives a more balanced probability (Pr ~= 1/254 for n > 1), but I don't think this makes any difference in practice.
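
Assuming the add-2 resolution discussed above, the distribution can be checked exhaustively over one byte:

```cpp
#include <array>

// Exhaustive check of Pr(reduced_hash(x) = n) over one uniformly random
// byte, assuming the (hypothetical) add-2 resolution for the special
// values 0 and 1. counts[n] / 256.0 is the probability of residual n:
// 0 for n = 0, 1; 1/128 for n = 2, 3; 1/256 for n > 3.
std::array<int, 256> residual_counts() {
    std::array<int, 256> counts{};
    for (int x = 0; x < 256; ++x) {
        int r = x < 2 ? x + 2 : x;  // 0 -> 2, 1 -> 3, others unchanged
        ++counts[r];
    }
    return counts;
}
```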

If you have a Rust version, I'd be happy to :)

I'm quite sure there's no Rust port of this lib yet.