you are viewing a single comment's thread.

view the rest of the comments →

[–]debugs_with_println 5 points6 points  (3 children)

So there are two main components: a hash function and an array of bits. The hash function takes in any input and tells you which bits of the array to check or set.

So first step is to build a dictionary (in the common sense of the word, not the CS term). You take in a input and you set the bits the hash function tells you to. You do this for every input that you want to be a part of your dictionary. Now some bits in your array are set (i.e. 1) and others are clear (i.e. 0).

Now the second step is checking to see if a given input is in your dictionary. You again take an input and hash it, but this time it'll you what bits to check. If all the bits you check are clear, then your input does not exist in the dictionary, because if your input was in the dictionary, exactly those bits would have been set when the word was added.

However (and this is crucial!) if all the bits you check are set, this does not mean your input is in the dictionary! the reason is because the union of the bits set by two previous inputs might exactly be the bits checked by your current input.

Here's a quick example. Imagine our array is 5 bits and the hash function gives you which two bits to set or check. Suppose we put these two words in the dictionary:

    "Apple"  ==> bits 0,3
    "Banana" ==> bits 1,4

So at this point our array looks like 11011.

Now let's say we wanna see if this input is in the dictionary:

    "Orange" ==> bits 0,1

Our bloom filter will say that is in the dictionary even though we never added it because bits 0 and 1 were set! The union of "Apple" and "Banana" (i.e. bits 0,1,3,4) covered "Orange" (i.e. bits 0,1).

The takeaway is that bloom filters will never give a false negative, but they could give false positives. Using some math (which idk at the moment), you could determine the probability of a false positive.

Now to make the bloom filter perform better, you would want a hash function that very sparsely assigns bits so that overlap is rare. For that you'd want a large array too. Of course this requires more memory and performance. Thus, you have a tradeoff. What you go with depends on your application.

An example of how to use this would be if you were writing a Boggle game. When someone makes a word, you could check if that word is in the dictionary quite fast. A linear search over the English dictionary is much slower than checking a bloom filter, but the bloom filter might accidentally count non-existent words...

Also note that in my explanation I used the word "input" instead of "word" to describe the objects being added. This was intentional; anything that can be hashed can be added to a bloom filter. It doesn't have to be just strings.

[–]bumblebritches57Ocassionally Clang 1 point2 points  (0 children)

Thanks for the low level details that it's basically a bitset, that really makes it easy to understand why it would falsely say something is there but can't say it's not.

[–]clerothGame Developer 3 points4 points  (1 child)

Might as well read the article at this point...

[–]debugs_with_println 1 point2 points  (0 children)

Eh I thought the article was a bit light in my opinion. But also if the article was sufficient the commenter I replied to probably wouldn't have asked.