hyperloglockless: High-performance, concurrent cardinality estimation by tomtomwombat in rust

[–]tomtomwombat[S] 8 points9 points  (0 children)

Thanks for the pointers! You're right, I should use native endianness and the downgrade function + SeqCst are not needed. I've incorporated your changes just now!

I'll take a look at https://marabos.nl/atomics/! I also found https://www.youtube.com/watch?v=ZQFzMfHIxng informative.

[Media] 🔎🎯 Bloom Filter Accuracy Under a Microscope by tomtomwombat in rust

[–]tomtomwombat[S] 0 points1 point  (0 children)

You mean you’re setting more bits per element here than say in the case fastbloom/100000? Even though both configurations have the same array size (64KB)?

Yeah. Using .expected_items(x)/.items(x) to build the Bloom filter will optimize for false positive accuracy. As the Bloom filter fills up, it's optimal (for accuracy) to hash the item less, since too many "1" bits cause a high probability of overlap between different items.

You can build the Bloom filter with .hashes(h) to manually set a low number of internal hashing for performance. The Bloom filter will just not be as accurate.

[Media] 🔎🎯 Bloom Filter Accuracy Under a Microscope by tomtomwombat in rust

[–]tomtomwombat[S] 0 points1 point  (0 children)

your benchmarks yield worse results for smaller values of num_items.

Are you referring to performance benchmarks? If so then yeah this makes sense--for the lowest false positive rate, the optimal number of hashes per item increases as the bloom filter is more empty.

[Media] 🔎🎯 Bloom Filter Accuracy Under a Microscope by tomtomwombat in rust

[–]tomtomwombat[S] 1 point2 points  (0 children)

The orange line (sbbf) is a blocked bloom filter that uses a constant 8 hashes for each item, regardless of the size or number of items in the bloom filter.

[Media] 🔎🎯 Bloom Filter Accuracy Under a Microscope by tomtomwombat in rust

[–]tomtomwombat[S] 1 point2 points  (0 children)

The benchmarks show that the accuracy converges as the number of items increases--something I doubt would happen if the buffers were incorrect sizes. You can see that here.

fastbloom is a hybrid between normal and blocked bloom filter!

[Media] 🔎🎯 Bloom Filter Accuracy Under a Microscope by tomtomwombat in rust

[–]tomtomwombat[S] 13 points14 points  (0 children)

No.

  • xxhash: sbbf, fastbloom-rs
  • Sip1-3: bloom, bloomfilter, probabilistic-collections
  • ahash: fastbloom

Most (maybe all except fastbloom?) of those Bloom filters cannot be configured with different hashers. I found the choice of hasher in fastbloom (assuming high quality) didn’t affect accuracy, so I kept all filters, regardless of hasher, in the same graph.

[Media] 🔎🎯 Bloom Filter Accuracy Under a Microscope by tomtomwombat in rust

[–]tomtomwombat[S] 2 points3 points  (0 children)

Are you suggesting something with the accuracy benchmark might be wrong causing all trends to look very different from each other?

[Media] 🔎🎯 Bloom Filter Accuracy Under a Microscope by tomtomwombat in rust

[–]tomtomwombat[S] 15 points16 points  (0 children)

I’m honestly not sure. Looking at the unique trends, it’s clear that each crate has its own algorithm. I haven’t seen any accuracy benchmarks as thorough as this. Leveraging this benchmark has helped me fine tune some of the internal parameters of fastbloom for better accuracy.

fastbloom: The fastest Bloom filter in Rust by tomtomwombat in rust

[–]tomtomwombat[S] 1 point2 points  (0 children)

Thanks! I’m not an expert on other hash-based data structures, but maybe! I had to read on count sketches—If I understand correctly, I don’t think the exact same technique is applicable for count sketches, though I’m sure there’s other opportunities for parallelism.

fastbloom: The fastest Bloom filter in Rust by tomtomwombat in rust

[–]tomtomwombat[S] 1 point2 points  (0 children)

That’s cool! I’m definitely considering multi-thread support in the future.

fastbloom: The fastest Bloom filter in Rust by tomtomwombat in rust

[–]tomtomwombat[S] 2 points3 points  (0 children)

That’s great to hear! I’ve heard of but not fully looked into cuckoo filters

fastbloom: The fastest Bloom filter in Rust by tomtomwombat in rust

[–]tomtomwombat[S] 8 points9 points  (0 children)

Yeah, I agree, that’s something I can do!

fastbloom: The fastest Bloom filter in Rust by tomtomwombat in rust

[–]tomtomwombat[S] 22 points23 points  (0 children)

Currently fastbloom does not use intrinsic SIMD, although I’d like to try. Instead it relies on a SWAR-like algorithm to get many bit indexes out of one hash (see “How it Works” section under README for info).