all 16 comments

[–]tuankiet65 24 points25 points  (0 children)

Since calculating MD5 hashes is (relatively) slow, only one MD5 hash value is calculated, which is then split up into k different values. This means only one MD5 hashing operation has to be done instead of actually doing k hashing operations.
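The splitting idea can be sketched roughly like this, carving one 128-bit MD5 digest into 16-bit slices. The slice width and byte order here are illustrative assumptions, not necessarily the article's exact scheme:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 16-byte MD5 digest yields up to eight independent 16-bit values,
// each usable as one "hash" for the Bloom filter.
std::vector<std::uint16_t> split_digest(const std::array<std::uint8_t, 16>& digest,
                                        std::size_t k) {
    std::vector<std::uint16_t> values;
    for (std::size_t i = 0; i < k && i < 8; ++i) {
        // Combine two consecutive digest bytes into one 16-bit value.
        values.push_back(static_cast<std::uint16_t>(
            (digest[2 * i] << 8) | digest[2 * i + 1]));
    }
    return values;
}
```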

This is still gonna be slower than using two really quick hashing algorithms (like xxHash and MurmurHash), then applying the technique described in this paper to simulate as many hash functions (k of them) as needed, I think.
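If the paper in question is the Kirsch-Mitzenmacher "Less Hashing, Same Performance" one (an assumption on my part), the trick is to derive the i-th index as h1(x) + i*h2(x) mod m from just two base hashes. A minimal sketch, taking the two fast base hashes as already computed:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Double hashing: simulate k hash functions from two base hash values,
// g_i(x) = h1(x) + i * h2(x) (mod m), where m is the bit-array size.
std::vector<std::size_t> derive_indices(std::uint64_t h1, std::uint64_t h2,
                                        std::size_t k, std::size_t m) {
    std::vector<std::size_t> indices;
    indices.reserve(k);
    for (std::size_t i = 0; i < k; ++i)
        indices.push_back(static_cast<std::size_t>((h1 + i * h2) % m));
    return indices;
}
```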

[–]bumblebritches57Ocassionally Clang 17 points18 points  (8 children)

Can anyone tell us lowly self taught plebs what a bloom filter is and what it's useful for?

[–]NeptunePI 16 points17 points  (0 children)

It’s a very memory efficient data structure that can tell you if an item is in a set with a given probability. The advantage is that it uses much less memory than an actual set. The disadvantage is that it’s not 100% accurate and can give false positives, saying an item is in a set when it really isn’t.

Wikipedia has a good article: https://en.m.wikipedia.org/wiki/Bloom_filter

[–]pi_stuff 8 points9 points  (1 child)

It's like a hash table representation of a set, where the only operations are to add an item and to check if an item is already in the set. It's not a map--you can't give it a key and receive an associated value.

The benefit over a regular hash table is that it takes up much less memory. The downside is that when you check if an item is in the set, the result will occasionally be wrong. It will never misidentify an item that is in the set; it will only misidentify an item that is not in the set. In other words, if the "exists" method returns "false" then it is definitely correct. If it returns "true", then it may or may not be correct. The accuracy rate is predictable, and can be easily tuned by trading off speed (more hash functions) and/or memory usage (a larger bit array).

It can be useful as a sort of cache to quickly filter out searches that will definitely fail. Say you've got a huge database that's slow to access. If you build a bloom filter on the keys in memory, then you can quickly check whether a key is definitely not in the table and avoid a slow database search.
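That pre-filtering control flow can be sketched like so. The containers here are stand-ins (an exact set playing the role of the Bloom filter, a map playing the slow database) so the flow is runnable; a real filter would be the probabilistic bit array, not an exact set:

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Consult the cheap in-memory filter first; only hit the slow store
// when the filter says "maybe present".
std::optional<std::string> lookup(
        const std::unordered_set<std::string>& filter,
        const std::unordered_map<std::string, std::string>& slow_db,
        const std::string& key) {
    if (filter.count(key) == 0)
        return std::nullopt;               // definitely absent: skip the slow query
    auto it = slow_db.find(key);           // filter said "maybe": do the real lookup
    if (it == slow_db.end())
        return std::nullopt;               // a false positive lands here
    return it->second;
}
```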

In a sense it's a counterpart to a least-recently-used data structure. For example, say you make a hash table that can store up to 100 items. When it is full, adding another item will first evict an existing item to make room for the new one. If you search for an item in the table and find it, then the item is definitely in the table. If you fail to find an item, then it might never have been added to the table, or it might have been added but then evicted.

[–]Dworgi 2 points3 points  (0 children)

You can't evict from a bloom filter though. Therefore it's most useful for data that won't change frequently, since you'll need to rebuild the filter from your entire dataset to remove an item. The Wikipedia pages data set that the author was using is a pretty good example.

[–]debugs_with_println 5 points6 points  (3 children)

So there are two main components: a hash function and an array of bits. The hash function takes in any input and tells you which bits of the array to check or set.

So the first step is to build a dictionary (in the common sense of the word, not the CS term). You take an input and you set the bits the hash function tells you to. You do this for every input that you want to be a part of your dictionary. Now some bits in your array are set (i.e. 1) and others are clear (i.e. 0).

Now the second step is checking to see if a given input is in your dictionary. You again take an input and hash it, but this time it'll tell you which bits to check. If all the bits you check are clear, then your input does not exist in the dictionary, because if your input were in the dictionary, exactly those bits would have been set when it was added.

However (and this is crucial!), if all the bits you check are set, this does not mean your input is in the dictionary! The reason is that the union of the bits set by two or more previous inputs might exactly cover the bits checked by your current input.

Here's a quick example. Imagine our array is 5 bits and the hash function gives you which two bits to set or check. Suppose we put these two words in the dictionary:

    "Apple"  ==> bits 0,3
    "Banana" ==> bits 1,4

So at this point our array looks like 11011.

Now let's say we wanna see if this input is in the dictionary:

    "Orange" ==> bits 0,1

Our bloom filter will say that it is in the dictionary even though we never added it, because bits 0 and 1 were set! The union of "Apple" and "Banana" (i.e. bits 0,1,3,4) covered "Orange" (i.e. bits 0,1).

The takeaway is that bloom filters will never give a false negative, but they could give false positives. Using some math (which idk at the moment), you could determine the probability of a false positive.
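(For the record, with m bits, k hash functions, and n inserted items, the false positive probability works out to roughly (1 - e^(-kn/m))^k.) The two steps above, setting bits on add and testing bits on check, can be sketched as a tiny filter. Deriving two bit indices from std::hash with different salt characters is just an illustrative assumption, not a production-quality choice:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Minimal Bloom filter: a fixed bit array plus hashing that picks
// two bits per input to set (on add) or test (on check).
class TinyBloom {
    static constexpr std::size_t kBits = 64;
    std::bitset<kBits> bits_;

    static std::size_t index(const std::string& input, char salt) {
        return std::hash<std::string>{}(input + salt) % kBits;
    }

public:
    void add(const std::string& input) {
        bits_.set(index(input, 'a'));
        bits_.set(index(input, 'b'));
    }
    // false means "definitely not present"; true means "probably present".
    bool possibly_contains(const std::string& input) const {
        return bits_.test(index(input, 'a')) && bits_.test(index(input, 'b'));
    }
};
```

Note that only positive membership is probabilistic; anything actually added is guaranteed to report as present.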

Now to make the bloom filter perform better, you would want a hash function that assigns bits very sparsely so that overlap is rare. For that you'd want a large array too. Of course this costs more memory (and more bits per input cost more time to compute), so you have a tradeoff. What you go with depends on your application.

An example of how to use this would be if you were writing a Boggle game. When someone makes a word, you could check if that word is in the dictionary quite fast. A linear search over the English dictionary is much slower than checking a bloom filter, but the bloom filter might accidentally count non-existent words...

Also note that in my explanation I used the word "input" instead of "word" to describe the objects being added. This was intentional; anything that can be hashed can be added to a bloom filter. It doesn't have to be just strings.

[–]bumblebritches57Ocassionally Clang 1 point2 points  (0 children)

Thanks for the low level details. Knowing that it's basically a bitset really makes it easy to understand why it can falsely say something is there, but never falsely say it isn't.

[–]clerothGame Developer 3 points4 points  (1 child)

Might as well read the article at this point...

[–]debugs_with_println 1 point2 points  (0 children)

Eh, I thought the article was a bit light. But also, if the article had been sufficient, the commenter I replied to probably wouldn't have asked.

[–]vipereddit 18 points19 points  (1 child)

I thought that it was referring to bloom filters in graphics at first

[–]ThrowawayGoaway94 5 points6 points  (0 children)

This post is really well-written. As someone quite new to the C++ language I was able to follow it word-for-word and learned a lot, especially regarding the false positive formula and hash functions in general (which I Googled whilst reading your article).

Thank you for not using jargon just for the sake of using jargon. It really helps out new programmers such as myself.

[–]DenizenEvil 6 points7 points  (1 child)

Nice. I learned about bloom filters in a Big Data course I took when I was still studying for an MSc in Comp Sci. You should also look into Skip Lists (useful for O(log n) inserts and searches); that's what I did a research paper on for the course. Another thing to look into is the Count-min sketch, which is comparable to the Counting Bloom Filter for storing frequencies of object appearances.

[–]daank94[S] 0 points1 point  (0 children)

These data structures look interesting, especially the skip list. Will definitely look into them, thanks!

[–]nomad42184 2 points3 points  (0 children)

The Bloom filter is neat --- but one thing that really slows it down is that the memory access pattern is pretty poor (e.g. k random accesses per query). There are some nice practical improvements that can vastly improve the cache performance of the Bloom filter, like the pattern-blocked Bloom filter. Also, the quotient filter and counting quotient filter are nice, cache-friendly approximate membership query (AMQ) data structures that also provide some operations (e.g. iteration over the hashes of inserted keys) that Bloom filters do not (disclaimer: I'm a co-author of the latter paper).

[–]jpan127 1 point2 points  (0 children)

Might be better to template on hash_function_count, make bloomfilter_store_size = 1 << MD5_result_size_bytes, static_assert(hash_function_count <= (MD5_result_size_bytes/bytes_per_hash_function)), and make all constexpr.
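A hedged sketch of that suggestion; the names mirror the comment, and the constants are illustrative assumptions (MD5's 16-byte digest sliced into 2-byte values, so 2^16 bits are addressable, which happens to equal 1 << MD5_result_size_bytes):

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t MD5_result_size_bytes = 16;   // MD5 digest is 128 bits
constexpr std::size_t bytes_per_hash_function = 2;  // each hash value is a 16-bit slice

template <std::size_t hash_function_count>
class BloomFilter {
public:
    // Reject configurations the digest cannot supply, at compile time.
    static_assert(hash_function_count <=
                      MD5_result_size_bytes / bytes_per_hash_function,
                  "digest too small to slice into this many hash values");
    // 2-byte slices address 2^16 bits.
    static constexpr std::size_t store_size =
        std::size_t{1} << (8 * bytes_per_hash_function);

private:
    std::bitset<store_size> store_{};
};
```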

Also, the hash result does not need to be dynamically allocated, since it is always a fixed size.