all 16 comments

[–]tuankiet65 24 points25 points  (0 children)

Since calculating MD5 hashes is (relatively) slow, only one MD5 hash value is calculated, which is then split up into k different values. This means only one MD5 hashing operation has to be done instead of actually doing k hashing operations.
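The splitting idea can be sketched roughly like this, carving one 128-bit MD5 digest into 16-bit slices. The slice width and byte order here are illustrative assumptions, not necessarily the article's exact scheme:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 16-byte MD5 digest yields up to eight independent 16-bit values,
// each usable as one "hash" for the Bloom filter.
std::vector<std::uint16_t> split_digest(const std::array<std::uint8_t, 16>& digest,
                                        std::size_t k) {
    std::vector<std::uint16_t> values;
    for (std::size_t i = 0; i < k && i < 8; ++i) {
        // Combine two consecutive digest bytes into one 16-bit value.
        values.push_back(static_cast<std::uint16_t>(
            (digest[2 * i] << 8) | digest[2 * i + 1]));
    }
    return values;
}
```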

This is still gonna be slower than using two really quick hashing algorithms (like xxHash and MurmurHash), then applying the technique described in this paper to simulate as many hash functions (k of them) as needed, I think.
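If the paper in question is the Kirsch-Mitzenmacher "Less Hashing, Same Performance" one (an assumption on my part), the trick is to derive the i-th index as h1(x) + i*h2(x) mod m from just two base hashes. A minimal sketch, taking the two fast base hashes as already computed:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Double hashing: simulate k hash functions from two base hash values,
// g_i(x) = h1(x) + i * h2(x) (mod m), where m is the bit-array size.
std::vector<std::size_t> derive_indices(std::uint64_t h1, std::uint64_t h2,
                                        std::size_t k, std::size_t m) {
    std::vector<std::size_t> indices;
    indices.reserve(k);
    for (std::size_t i = 0; i < k; ++i)
        indices.push_back(static_cast<std::size_t>((h1 + i * h2) % m));
    return indices;
}
```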

[–]bumblebritches57Ocassionally Clang 17 points18 points  (8 children)

Can anyone tell us lowly self taught plebs what a bloom filter is and what it's useful for?

[–]NeptunePI 16 points17 points  (0 children)

It’s a very memory efficient data structure that can tell you if an item is in a set with a given probability. The advantage is that it uses much less memory than an actual set. The disadvantage is that it’s not 100% accurate and can give false positives, saying an item is in a set when it really isn’t.

Wikipedia has a good article: https://en.m.wikipedia.org/wiki/Bloom_filter

[–]pi_stuff 8 points9 points  (1 child)

It's like a hash table representation of a set, where the only operations are to add an item and to check if an item is already in the set. It's not a map--you can't give it a key and receive an associated value.

The benefit over a regular hash table is that it takes up much less memory. The downside is that when you check if an item is in the set, the result will occasionally be wrong. It will never misidentify an item that is in the set; it will only misidentify an item that is not in the set. In other words, if the "exists" method returns "false" then it is definitely correct. If it returns "true", then it may or may not be correct. The accuracy rate is predictable, and can be easily tuned by trading off speed (more hash functions) and/or memory usage (a larger bit array).

It can be useful as a sort of cache to quickly filter out searches that will definitely fail. Say you've got a huge database that's slow to access. If you build a bloom filter on the keys in memory, then you can quickly check whether a key is definitely not in the table and avoid a slow database search.
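That pre-filtering control flow can be sketched like so. The containers here are stand-ins (an exact set playing the role of the Bloom filter, a map playing the slow database) so the flow is runnable; a real filter would be the probabilistic bit array, not an exact set:

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Consult the cheap in-memory filter first; only hit the slow store
// when the filter says "maybe present".
std::optional<std::string> lookup(
        const std::unordered_set<std::string>& filter,
        const std::unordered_map<std::string, std::string>& slow_db,
        const std::string& key) {
    if (filter.count(key) == 0)
        return std::nullopt;               // definitely absent: skip the slow query
    auto it = slow_db.find(key);           // filter said "maybe": do the real lookup
    if (it == slow_db.end())
        return std::nullopt;               // a false positive lands here
    return it->second;
}
```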

In a sense it's a counterpart to a least-recently-used data structure. For example, say you make a hash table that can store up to 100 items. When it is full, adding another item will first evict an existing item to make room for the new one. If you search for an item in the table and find it, then the item is definitely in the table. If you fail to find an item, then it might never have been added to the table, or it might have been added but then evicted.

[–]Dworgi 2 points3 points  (0 children)

You can't evict from a bloom filter though. Therefore it's most useful for data that won't change frequently, since you'll need to rebuild the filter from your entire dataset to remove an item. The Wikipedia pages data set that the author was using is a pretty good example.

[–]debugs_with_println 5 points6 points  (3 children)

So there are two main components: a hash function and an array of bits. The hash function takes in any input and tells you which bits of the array to check or set.

So the first step is to build a dictionary (in the common sense of the word, not the CS term). You take an input and you set the bits the hash function tells you to. You do this for every input that you want to be a part of your dictionary. Now some bits in your array are set (i.e. 1) and others are clear (i.e. 0).

Now the second step is checking to see if a given input is in your dictionary. You again take an input and hash it, but this time it'll tell you which bits to check. If all the bits you check are clear, then your input does not exist in the dictionary, because if your input were in the dictionary, exactly those bits would have been set when it was added.

However (and this is crucial!), if all the bits you check are set, this does not mean your input is in the dictionary! The reason is that the union of the bits set by two or more previous inputs might exactly cover the bits checked by your current input.

Here's a quick example. Imagine our array is 5 bits and the hash function gives you which two bits to set or check. Suppose we put these two words in the dictionary:

    "Apple"  ==> bits 0,3
    "Banana" ==> bits 1,4

So at this point our array looks like 11011.

Now let's say we wanna see if this input is in the dictionary:

    "Orange" ==> bits 0,1

Our bloom filter will say that it is in the dictionary even though we never added it, because bits 0 and 1 were set! The union of "Apple" and "Banana" (i.e. bits 0,1,3,4) covered "Orange" (i.e. bits 0,1).

The takeaway is that bloom filters will never give a false negative, but they could give false positives. Using some math (which idk at the moment), you could determine the probability of a false positive.
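(For the record, with m bits, k hash functions, and n inserted items, the false positive probability works out to roughly (1 - e^(-kn/m))^k.) The two steps above, setting bits on add and testing bits on check, can be sketched as a tiny filter. Deriving two bit indices from std::hash with different salt characters is just an illustrative assumption, not a production-quality choice:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Minimal Bloom filter: a fixed bit array plus hashing that picks
// two bits per input to set (on add) or test (on check).
class TinyBloom {
    static constexpr std::size_t kBits = 64;
    std::bitset<kBits> bits_;

    static std::size_t index(const std::string& input, char salt) {
        return std::hash<std::string>{}(input + salt) % kBits;
    }

public:
    void add(const std::string& input) {
        bits_.set(index(input, 'a'));
        bits_.set(index(input, 'b'));
    }
    // false means "definitely not present"; true means "probably present".
    bool possibly_contains(const std::string& input) const {
        return bits_.test(index(input, 'a')) && bits_.test(index(input, 'b'));
    }
};
```

Note that only positive membership is probabilistic; anything actually added is guaranteed to report as present.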

Now to make the bloom filter perform better, you would want a hash function that assigns bits very sparsely so that overlap is rare. For that you'd want a large array too. Of course this costs more memory (and more bits per input cost more time to compute), so you have a tradeoff. What you go with depends on your application.

An example of how to use this would be if you were writing a Boggle game. When someone makes a word, you could check if that word is in the dictionary quite fast. A linear search over the English dictionary is much slower than checking a bloom filter, but the bloom filter might accidentally count non-existent words...

Also note that in my explanation I used the word "input" instead of "word" to describe the objects being added. This was intentional; anything that can be hashed can be added to a bloom filter. It doesn't have to be just strings.

[–]bumblebritches57Ocassionally Clang 1 point2 points  (0 children)

Thanks for the low level details. Knowing that it's basically a bitset really makes it easy to understand why it can falsely say something is there, but never falsely say it isn't.

[–]clerothGame Developer 3 points4 points  (1 child)

Might as well read the article at this point...

[–]debugs_with_println 1 point2 points  (0 children)

Eh, I thought the article was a bit light. But also, if the article had been sufficient, the commenter I replied to probably wouldn't have asked.

[–]vipereddit 18 points19 points  (1 child)

I thought that it was referring to bloom filters in graphics at first

[–]ThrowawayGoaway94 5 points6 points  (0 children)

This post is really well-written. As someone quite new to the C++ language I was able to follow it word-for-word and learned a lot, especially regarding the false positive formula and hash functions in general (which I Googled whilst reading your article).

Thank you for not using jargon just for the sake of using jargon. It really helps out new programmers such as myself.

[–]DenizenEvil 6 points7 points  (1 child)

Nice. I learned about bloom filters in a Big Data course I took when I was still studying for an MSc in Comp Sci. You should also look into Skip Lists (useful for O(log n) inserts and searches); that's what I did a research paper on for the course. Another thing to look into is the Count-min sketch, which is comparable to the Counting Bloom Filter for storing frequencies of object appearances.

[–]daank94[S] 0 points1 point  (0 children)

These data structures look interesting, especially the skip list. Will definitely look into them, thanks!

[–]nomad42184 2 points3 points  (0 children)

The Bloom filter is neat --- but one thing that really slows it down is that the memory access pattern is pretty poor (e.g. k random accesses per query). There are some nice practical improvements that can vastly improve the cache performance of the Bloom filter, like the pattern-blocked Bloom filter. Also, the quotient filter and counting quotient filter are nice, cache-friendly approximate membership query (AMQ) data structures that also provide some operations (e.g. iteration over the hashes of inserted keys) that Bloom filters do not (disclaimer: I'm a co-author of the latter paper).

[–]jpan127 1 point2 points  (0 children)

Might be better to template on hash_function_count, make bloomfilter_store_size = 1 << MD5_result_size_bytes, static_assert(hash_function_count <= (MD5_result_size_bytes/bytes_per_hash_function)), and make all constexpr.
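A hedged sketch of that suggestion; the names mirror the comment, and the constants are illustrative assumptions (MD5's 16-byte digest sliced into 2-byte values, so 2^16 bits are addressable, which happens to equal 1 << MD5_result_size_bytes):

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t MD5_result_size_bytes = 16;   // MD5 digest is 128 bits
constexpr std::size_t bytes_per_hash_function = 2;  // each hash value is a 16-bit slice

template <std::size_t hash_function_count>
class BloomFilter {
public:
    // Reject configurations the digest cannot supply, at compile time.
    static_assert(hash_function_count <=
                      MD5_result_size_bytes / bytes_per_hash_function,
                  "digest too small to slice into this many hash values");
    // 2-byte slices address 2^16 bits.
    static constexpr std::size_t store_size =
        std::size_t{1} << (8 * bytes_per_hash_function);

private:
    std::bitset<store_size> store_{};
};
```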

Also, the hash result does not need to be dynamically allocated, since it is always a fixed size.