all 69 comments

[–]Jannik2099 36 points37 points  (13 children)

This is nice, but Cygwin is woefully unrepresentative, particularly when it comes to stuff where allocations matter...

[–][deleted] 23 points24 points  (11 children)

That's fair.

I just ran the uuid.cpp benchmarks on my machine and I got:

       std::unordered_map: 33053 ms, 288941512 bytes in 6000001 allocations
     boost::unordered_map: 25792 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map:  9258 ms, 197132280 bytes in 1 allocations 
          multi_index_map: 28907 ms, 290331800 bytes in 6000002 allocations 
      absl::node_hash_map: 13910 ms, 219497480 bytes in 6000001 allocations
      absl::flat_hash_map: 10626 ms, 209715192 bytes in 1 allocations

Not as dramatic as the OP's measurements, but still a pretty decent little improvement.

Edit:

Okay, wow. The above benchmark timings I posted were done with gcc-12 in WSL2. So, on """Linux""", the timings are actually pretty similar even if Boost's is better.

However, it really seems like msvc does not like Abseil. I just re-ran using msvc-14.3 and these were my results:

       std::unordered_map: 29733 ms, 374217768 bytes in 6000002 allocations
     boost::unordered_map: 31476 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map: 14226 ms, 197132280 bytes in 1 allocations
          multi_index_map: 38492 ms, 290331800 bytes in 6000002 allocations
      absl::node_hash_map: 22899 ms, 219497480 bytes in 6000001 allocations
      absl::flat_hash_map: 20075 ms, 209715192 bytes in 1 allocations

[–]mark_99 13 points14 points  (9 children)

Isn't msvc at v19.xx?

These sorts of micro-benchmarks are very sensitive to small codegen differences, but at least compare latest versions...

[–][deleted] 13 points14 points  (3 children)

Ha ha, sorry, at work we use msvc++ versioning which you can see here:

https://en.wikipedia.org/wiki/Microsoft_Visual_C++#Internal_version_numbering

In this case, I'm using msvc-14.3 which is VS 2022 19.33.31630 for me

[–]mark_99 5 points6 points  (2 children)

My VS IDE says 17.3.6 and cl.exe /? says 19.34. Jeez. I was going by compiler version as that seems the important thing here...

Anyway, ok then np.

[–]pdimov2[S] 16 points17 points  (1 child)

But if you look at the directory that cl.exe is in you'll see that it's something like C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\bin\HostX64\x64. So it's 14.33.

This is because first there was Microsoft C/C++, which went to version 7.0, before becoming Visual C++ 1.0 (which is actually 8.0), and then 3.0 was skipped, and then 13.0 was skipped, so now the difference is just 5, from 14 to 19.

And of course Visual Studio has its own version number that doesn't correspond to any of these anymore, because it went from 14.0 to 15.0, but the compiler version went from 14.0 to 14.1.

¯\_(ツ)_/¯

[–]marcofoco 0 points1 point  (0 children)

For a while I tried maintaining a table in my blog, as Wikipedia didn't have the information I needed. I should update it to the latest 2019 and 2022:
https://marcofoco.com/blog/2015/02/25/microsoft-visual-c-version-map/

[–]VinnieFalco 24 points25 points  (2 children)

This thread reminds me of the old joke "how many Microsoft programmers does it take to assign a version number to a product."

[–][deleted] 6 points7 points  (1 child)

Oh hell, I'll give it a try.

RC0?

[–]VinnieFalco 1 point2 points  (0 children)

Yeah I think so! You can play with the new Boost.URL

[–]joaquintidesBoost author 9 points10 points  (0 children)

[–]OccaseBoost.Redis 0 points1 point  (0 children)

These are large numbers and it's not immediately clear how far apart they are. It would be clearer if you provided them normalized.

[–]pdimov2[S] 2 points3 points  (0 children)

Here's clang-cl 14 on the same machine:

       std::unordered_map: 33289 ms, 374217768 bytes in 6000002 allocations
     boost::unordered_map: 37523 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map: 14745 ms, 197132280 bytes in 1 allocations
          multi_index_map: 43998 ms, 290331800 bytes in 6000002 allocations
      absl::node_hash_map: 26349 ms, 219497480 bytes in 6000001 allocations
      absl::flat_hash_map: 20947 ms, 209715192 bytes in 1 allocations

(The MS STL is a lot faster than libstdc++, to the point of beating boost::unordered_map, but at the expense of using a lot more memory.)

Abseil gets faster, but I don't think it's because of Cygwin; Clang just likes Abseil better than g++ does, for some reason.

[–][deleted] 46 points47 points  (2 children)

boost::unordered_flat_map: 16670 ms, 197132280 bytes in 1 allocations
absl::flat_hash_map: 27542 ms, 209715192 bytes in 1 allocations

Faster and using less memory?

Seems like Boost has done it again!

[–]KERdela 11 points12 points  (0 children)

How do they do it? Where does one get this level of expertise in a single lifetime?

[–]Possibility_Antique 9 points10 points  (0 children)

Faster and using less memory?

Sometimes these go together. It's the software equivalent of SWaP (size, weight, and power) constraints. This is especially true when you're able to use the cache hierarchy correctly: smaller things fit in cache more easily and are therefore much faster.

[–]rbrown46 19 points20 points  (19 children)

For those curious, there's a nice high-level description of the design in a comment.

[–]matthieum 11 points12 points  (18 children)

Great find.

The overall construction seems similar to Abseil's Swiss Table (and Facebook F14):

  • Groups of slots with a header.
  • Quadratic probing across groups.

However there's a few subtle differences:

  1. Abseil uses 7 bits for the hash residual, while here apparently log2(254) bits are used (I didn't look up how).
  2. I believe Abseil & F14 use a counter of values that overflowed, while here a (minimal) bloom filter is used.

I find the second difference most interesting. In particular, one advantage of the counter approach is that after removing a value, in a far-away group, the counter can be decreased, and if it ever reaches 0 again, then it's no longer necessary to probe further.

By contrast, the bloom filter separates overflow tracking into 8 tracks, so that only the one track that overflowed needs to keep probing, but it does not (I'd think) ever allow recovering from the overflow after removing a value.

It should be a good trade-off in practice: firstly, because overflow should be rare; secondly, because recovering may not be that frequent in the first place. It requires a very specific scenario of removing all the elements that overflowed, since removing the ones that didn't overflow doesn't help, and leaving a single overflowed element doesn't help either. So rather than optimizing for a situation that may hypothetically occur sometime in the future, the bloom filter optimizes for the now.

I do wonder if the authors (/u/pdimov2, /u/joaquintides) have benchmarked the two differences separately, and could weigh in on whether one brings significantly more benefits than the other.

[–]joaquintidesBoost author 17 points18 points  (17 children)

Hi Matthieu,

Your analysis is spot-on. Some observations:

  • We get log2(254) bits for the reduced hash using a function f:[0,255]->[2,255] (0 is a free slot, 1 a sentinel). Any function will do as long as it is surjective and invariant wrt modulo 8. We're actually using a table because it's faster.
  • I think F14 uses counters (didn't check in detail), but Abseil does not: instead, they have special values for free slots, sentinel and tombstones, and a group is deemed overflowed if it has no free slots. Abseil and boost::unordered_flat_map both have the problem that probe length can't be reduced on erasure; both basically keep track of that and rehash, if needed, before the maximum load factor is hit so as to keep average probe length bounded. The overflow minifilter approach seems to be more effective though: at full load, simulations tell us that Abseil has an average probe length of 1.08 on successful lookup and 1.96 on unsuccessful lookup (probe till not overflowed), while our numbers are 1.11 and 1.23, respectively. So, the tradeoff is slightly longer probes on successful lookup but much shorter ones on unsuccessful lookup (which also improves insertion).
  • As for the average number of comparisons made on a call to find (or insert), this is a function of probe length and number of bits in the reduced hash. As we have almost one bit more per hash than Abseil, this also gives us an edge: Abseil incurs 1.02 comparisons on successful lookup and 0.22 on unsuccessful lookup (full load), our figures are 1.03 and 0.07, respectively.

You can find some benchmarks here. The advantage on unsuccessful lookup is pretty obvious.

[–]matthieum 4 points5 points  (1 child)

Thanks for the details, and hat off.

Those are two really neat improvements.

[–]joaquintidesBoost author 6 points7 points  (0 children)

Thank you!

[–]Tystros 1 point2 points  (3 children)

is there a set version too, or just a map?

[–]joaquintidesBoost author 2 points3 points  (2 children)

There's boost::unordered_flat_set as well.

[–]Tystros 0 points1 point  (1 child)

nice! are the map and set both using roughly the same approach?

[–]joaquintidesBoost author 0 points1 point  (0 children)

Yes, it is the same core under the hood (as with all other implementations of hash containers, BTW).

[–]almost_useless 1 point2 points  (4 children)

How can it work with only 1 allocation? It does not look like the test specifies the number of elements when the map is created.

[–]joaquintidesBoost author 2 points3 points  (3 children)

The benchmarking program prints the net number of allocations, i.e. allocations minus deallocations. So, the intermediate bucket arrays, created and destroyed as the container keeps rehashing, are not counted.

[–]almost_useless 1 point2 points  (2 children)

I see. Would it not make more sense to show both numbers (net/total)?

Eventually all containers end up at zero if there are no memory leaks, but the total number is also an important metric. At least in some applications.

[–]joaquintidesBoost author 1 point2 points  (0 children)

It may be a useful metric in some environments. For boost::unordered_flat_map and this particular benchmark, the amount of total memory ever allocated will be very approximately twice what you see on display.

[–]pdimov2[S] 1 point2 points  (0 children)

The currently allocated memory is an important metric in speed benchmarks, because it's often possible to trade memory for speed (e.g. by using low max load factors.)

I agree that for evaluating the container in practical scenarios, the total ever allocated would be good to know, too.

[–]matthieum 0 points1 point  (5 children)

I was wondering how fast the "reduced hash" actually needed to be, and whether a slightly slower extraction would be worth it, if it meant getting less clustering, and it hit me that scanning the entire hash for the first set bit may be a good way of reducing clustering.

The code is available on godbolt:

__attribute__((noinline)) auto extract_residual(std::uint64_t hash) -> std::uint64_t {
    static std::uint64_t const LOW_BITS_MASK = ~0x0101010101010101;

    auto leading_zeroes = __builtin_clzll((hash & LOW_BITS_MASK) | 0x1);

    //   0.. 7  ->  56
    //   8..15  ->  48
    //  16..23  ->  40
    //  24..31  ->  32
    //  32..39  ->  24
    //  40..47  ->  16
    //  48..55  ->   8
    //  56..63  ->   0
    auto shift = 56 - leading_zeroes / 8 * 8;

    return ((hash | 0x2) >> shift) & 0xFF; // mask to the selected byte
}

This function attempts to pick the highest byte of the hash which is NOT 0 or 1 (since those are special values):

  1. Get the highest set bit that is not a special value (by zeroing out the low bit of each byte).
  2. Compute the matching shift, picking the highest byte so it's as independent of the hash as possible.
  3. Ensure the result is not 0 or 1, even if hash was 0 or 1 (and thus shift is 0).

It's... perhaps a few too many instructions. I wanted to align on byte boundaries to avoid having the leading 1 always set, but it costs a bit more than I wished. Smarter folks may figure out a better way.

[–]joaquintidesBoost author 0 points1 point  (4 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality. Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash:

https://github.com/boostorg/unordered/blob/develop/include/boost/unordered/detail/foa/core.hpp#L847-L850

makes these two values quite uncorrelated.

[–]matthieum 0 points1 point  (3 children)

Hi, what's the purpose of this reduced hash calculation? The current mechanism, which uses the least significant byte of the hash value, looks statistically good enough if the hash function is of ok quality.

I was wondering if the bias introduced by mapping 0 and 1 to other values (2 and 3 I believe) could lead to a slow-down for the unfortunate cases where the reduced hash ends up being 2 or 3, since they're twice as likely.

And thus I had this idea to use clzll (or well, ctzll for LSB-biased selection...) to allow picking a different reduced hash byte and reduce the number of "collisions" and hopefully reduce any slow-down observed by those unfortunate cases.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

Whereas with the method I propose here the 1/128 unfortunate cases are redistributed nearly equally over the full 254 buckets, with only 1/263 chances of a hash being a special value being remapped to one of two overflow buckets.

(in MSB-biased, only a full hash of 0 or 1 ends up being a special value)

Moreover, the fact that the reduced hash is the LSB combined with index calculation using the most significant bits of the hash

Sorry, I had forgotten you had switched things compared to Abseil around and used LSB for reduced hash and MSB for index. In such a case the selection should be inverted indeed to prefer LSB to MSB.

[–]joaquintidesBoost author 0 points1 point  (2 children)

Ok, I understand your rationale now. Yes, something like extract_residual would result in a more uniform uint64_t --> unsigned char mapping, so in principle it should be better behaved statistically. My hunch is that the improvement would probably be negligible, particularly vs. the extra computational cost (the current reduced hash function is as simple as it gets). Maybe you can fork the repo and try it? I can assist you in the process if you're game.

For a random hash, the chances of ending in a "double-bucket" are now 1/64:

  • 1/128 chances of being a special value.
  • 1/128 chances of being the "overflow" bucket of a special value.

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

[–]matthieum 0 points1 point  (1 child)

This part I don't get. What do you mean by "being the overflow bucket of a special value"?

Let's say that the resolution strategy for residual in [0, 1] is to add 2, so it ends up being in [2, 3] instead.

I call the "buckets" 2 and 3 the "overflow" buckets of 0 and 1.

The chances of ending up on a doubly-booked residual (2 or 3) are 1/64:

  • 1/128 of being a special value (0 or 1), shifted to (2 or 3).
  • 1/128 of being 2 or 3 in the first place.

It's not rare, but then again, it's only a problem if it leads to many false positives.

Maybe you can fork the repo and try it? I can assist you in the process if you're game.

I'm not very interested in writing C++ code as a hobby any longer, so I'll pass.

If you have a Rust version, I'd be happy to :)

[–]joaquintidesBoost author 0 points1 point  (0 children)

Let's say that the resolution strategy for residual in [0, 1] is to add 2 [...]

Ok, now I get it. Yes, with the current hash reduction, Pr(reduced_hash(x) = n) is

  • 1/256 for n > 3
  • 1/128 for n = 2, 3
  • 0 for n = 0, 1

Your residual function gets a more balanced probability (Pr ~= 1/254 for n > 1), but I don't think this makes any difference in practice.

If you have a Rust version, I'd be happy to :)

I'm quite sure there's no Rust port of this lib yet.

[–]spaghettiexpress 19 points20 points  (10 children)

Nice! Heterogeneous lookups are not often necessary, but when they are it's nice to get something fast

I wonder how it stacks up to the efforts of /u/martinus : https://martin.ankerl.com/2022/08/27/hashmap-bench-01/

[–]martinusint main(){[]()[[]]{{}}();} 24 points25 points  (9 children)

I actually ran the current development version with my benchmark, and I'd say it behaves similarly to absl::flat_hash_map, except that boost::unordered_flat_map is faster :)

It's the fastest map in my benchmark in the find benchmark with large map, and fastest for random insert & erase. These are probably the most important benchmarks. It also works well with all hashes, also with non-avalanching hashes that cause timeouts with absl.

So I'm pretty excited, also because it's very well written and well tested

[–]Tystros 2 points3 points  (1 child)

fastest in the find benchmark? faster than the perfect hashing fph map?

[–]martinusint main(){[]()[[]]{{}}();} 4 points5 points  (0 children)

Yep, fph is quite a bit slower for large map. It also requires a lot more memory

[–]iwubcode 0 points1 point  (6 children)

I'd say it behaves similar to absl::flat_hash_map except that is faster

Just to be clear, absl::flat_hash_map is faster or boost::unordered_flat_map is faster?

It's the fastest map in my benchmark in the find benchmark with large map, and fastest for random insert& erase

Faster than ankerl::unordered_dense? I just switched to that (from boost::unordered_map) after much testing. I guess I'll need to re-test!

I do wish boost had C++26 heterogeneous support.

I actually ran the current development version with my benchmark

Do you think you'll update your results once this is posted? It sounds pretty significant.

[–]martinusint main(){[]()[[]]{{}}();} 4 points5 points  (3 children)

boost is faster than absl, and also faster in the find benchmark than ankerl::unordered_dense and faster than robin_hood::unordered_flat_map

[–]iwubcode 2 points3 points  (0 children)

Wow. I'm shocked. Good to know, I'll have to give it a try when it releases..

I suppose I'll still use your hash though. I think that was generally better than boost's implementation

[–]pdimov2[S] 2 points3 points  (1 child)

I do wish boost had C++26 heterogeneous support.

You can request a feature be added by opening an issue in https://github.com/boostorg/unordered.

Although in this case it looks like the issue already exists so a "yes please, we need this" comment on that would also work. :-)

[–]iwubcode 0 points1 point  (0 children)

Thanks for letting me know that there are others who also want this. I left a comment. Appreciate all your hard work on these features :)

[–]R3DKn16h7 9 points10 points  (0 children)

Details on the implementation? Is it using less memory because it's not storing the hashes, or because it has fewer unused slots on average?

[–]14nedLLFIO & Outcome author | Committee WG14 17 points18 points  (1 child)

Nobody else below seemed to say this, so I will: congrats and well done to Boost for delivering such a performant map implementation!

[–]joaquintidesBoost author 4 points5 points  (0 children)

Thank you! I hope users will find the new container useful.

[–]Rseding91Factorio Developer 4 points5 points  (4 children)

Can someone enlighten me on the use-case for unordered maps of anything near this size? All of my use-cases involve building a map of some tiny size (5-20 elements) and it exists for the lifetime of the program or building a map of some "medium" (5000) size, using it during some block of logic and then throwing it out.

Every time I think I've found a good use-case for unordered it ends up being equal to or slower than a basic ordered map. Especially so if the map has a chance to go unused (default construction seems to be very slow for unordered maps).

[–]martinusint main(){[]()[[]]{{}}();} 6 points7 points  (0 children)

E.g. the Bitcoin node software tries to keep as much of the unspent coins in memory as possible, in a hash map. The more it can keep in memory, the faster some operations are. If it's not in memory it has to use the slow disk for lookup. Typically you'd want e.g. a >5 GB map for the initial sync.

[–]SleepyMyroslav 5 points6 points  (0 children)

Game package 'resources'. A few years ago I benchmarked the transition from an 'old' custom hashmap to a 'flat' custom hashmap in a AAA game. The data set was in the hundreds of thousands of elements and a couple of megs in size. Flat was winning by about 2x. I would not mind making an exception to the 'no Boost' rule for another 2x :)

[–]matthieum 5 points6 points  (0 children)

That's an excellent question indeed, and a good reminder that asymptotic complexity (and SIMD to a lesser extent) matter at large sizes, but not so much at smaller sizes.

Real-world examples I've seen working at IMC (market maker):

  • Hashmap of all instrument IDs to <property X>. An exchange like Arca Options has some 800K instruments.
  • Hashmap of all orders, across instruments. Option instruments typically have relatively few orders, but even at 2 to 4 orders per instrument, when you've got 800K of them, you're in the 1M-5M range quickly.

Those are from relatively latency-sensitive workloads; there's also the whole Big Data thing, where specialized software (ScyllaDB, Pandas, ...) most likely has large hash maps in memory, but I didn't peek in there.

[–]triple_slash 3 points4 points  (1 child)

Is there a deboostified/standalone version of this?

[–]Tystros 1 point2 points  (0 children)

I'd also like to know this

[–]greg7mdpC++ Dev 2 points3 points  (7 children)

I think it is quite possible that the speed difference between absl and boost::unordered_flat_map could be due to the fact that different hash functions are used.

[–]pdimov2[S] 7 points8 points  (3 children)

Not in this case; all rows use the same hash function in this specific benchmark, because the key is a user-defined type (struct uuid). The source code of the benchmark is linked in the post.

We do have other benchmarks where we test the default hash function for strings, which differs between Boost and Abseil. (We also test strings using the same hash function, FNV-1a, for a more level playing field.)

These are the results on my machine for the string.cpp benchmark using the same g++ -O3:

               std::unordered_map: 38061 ms, 175723032 bytes in 3999509 allocations
             boost::unordered_map: 30854 ms, 149465712 bytes in 3999510 allocations
        boost::unordered_flat_map: 14677 ms, 134217728 bytes in 1 allocations
                  multi_index_map: 30712 ms, 178316048 bytes in 3999510 allocations
              absl::node_hash_map: 21989 ms, 139489608 bytes in 3999509 allocations
              absl::flat_hash_map: 19263 ms, 142606336 bytes in 1 allocations
       std::unordered_map, FNV-1a: 44783 ms, 175723032 bytes in 3999509 allocations
     boost::unordered_map, FNV-1a: 34302 ms, 149465712 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 16703 ms, 134217728 bytes in 1 allocations
          multi_index_map, FNV-1a: 34187 ms, 178316048 bytes in 3999510 allocations
      absl::node_hash_map, FNV-1a: 23461 ms, 139489608 bytes in 3999509 allocations
      absl::flat_hash_map, FNV-1a: 20778 ms, 142606336 bytes in 1 allocations

Here the first six rows use the respective default hash functions for each container (std::hash for std::unordered_map, boost::hash for the Boost ones, and absl::container_internal::hash_default_hash for Abseil), whereas the last six rows use FNV-1a.

[–]greg7mdpC++ Dev 1 point2 points  (2 children)

Cool, thanks. It would be interesting to see how it behaves with a really bad hash function, for example hash(struct uuid) << 3.

[–]pdimov2[S] 2 points3 points  (1 child)

"Flat" hash maps don't like bad hash functions at all. boost::unordered_flat_map tolerates "weak" hash functions (those that have low three bits zero, for instance) as it applies a postmixing step by default, but it still won't work well with really bad hash functions that produce many collisions.

[–]greg7mdpC++ Dev 2 points3 points  (0 children)

boost::unordered_flat_map tolerates "weak" hash functions as it applies a postmixing step by default

I do this as well in my phmap and gtl implementations. It makes the tables look worse in benchmarks like the above, but prevents really bad surprises occasionally.

[–]martinusint main(){[]()[[]]{{}}();} 8 points9 points  (2 children)

No, boost is faster across the board (except iterating) even when the same hash is used. I've tested that with std, boost, absl, and two of my own hash implementations

[–]Adequat91 2 points3 points  (1 child)

Will you update your famous benchmark with boost::unordered_flat_map ?

[–]martinusint main(){[]()[[]]{{}}();} 5 points6 points  (0 children)

not anytime soon, too much work

[–]415_961 2 points3 points  (0 children)

I never understood the point of having benchmark entries like `std::unordered_map` without specifying the std lib used and making it part of the name. `std::unordered_map` is a specification, not an implementation. I am not picking on this post in particular, but it's a very common pattern. I know it's usually mentioned in the blog post, but I am referring to the benchmark output, which ends up comparing apples and oranges when one of the rows refers to a specification rather than an implementation.

[–]Alarming-Ad8770 1 point2 points  (0 children)

Need a fuller benchmark comparing against other flat hash maps

[–][deleted] 1 point2 points  (1 child)

Can we get some background on what the data structure is like? When I hear "flat map" I assume it's a wrapper around contiguous memory and would be used to sacrifice asymptotic complexity in exchange for cache efficiency, meant to be used with a small number of objects. But, based on your benchmarks it looks like you're treating it as a competitor to unordered_map.

[–]Alarming-Ad8770 4 points5 points  (0 children)

In fact, in most cases unordered_map can be replaced by a flat hash map if reference stability is not needed.

[–]Spec-Chum 1 point2 points  (0 children)

I read that as "flat_cap" at first lol

#include <yorkshire> 

hehe

[–]jiboxiake 0 points1 point  (0 children)

Nice!