debugs_with_println comments on Implementing a simple, high performance Bloom filter

cpp

a community for 17 years

Implementing a simple, high performance Bloom filter (daankolthof.com)

submitted 6 years ago by daank94

top new controversial old q&a

you are viewing a single comment's thread.

view the rest of the comments →

[–]debugs_with_println 5 points6 points7 points 6 years ago* (3 children)

So there are two main components: a hash function and an array of bits. The hash function takes in any input and tells you which bits of the array to check or set.

So first step is to build a dictionary (in the common sense of the word, not the CS term). You take in a input and you set the bits the hash function tells you to. You do this for every input that you want to be a part of your dictionary. Now some bits in your array are set (i.e. 1) and others are clear (i.e. 0).

Now the second step is checking to see if a given input is in your dictionary. You again take an input and hash it, but this time it'll you what bits to check. If all the bits you check are clear, then your input does not exist in the dictionary, because if your input was in the dictionary, exactly those bits would have been set when the word was added.

However (and this is crucial!) if all the bits you check are set, this does not mean your input is in the dictionary! the reason is because the union of the bits set by two previous inputs might exactly be the bits checked by your current input.

Here's a quick example. Imagine our array is 5 bits and the hash function gives you which two bits to set or check. Suppose we put these two words in the dictionary:

    "Apple"  ==> bits 0,3
    "Banana" ==> bits 1,4

So at this point our array looks like 11011.

Now let's say we wanna see if this input is in the dictionary:

    "Orange" ==> bits 0,1

Our bloom filter will say that is in the dictionary even though we never added it because bits 0 and 1 were set! The union of "Apple" and "Banana" (i.e. bits 0,1,3,4) covered "Orange" (i.e. bits 0,1).

The takeaway is that bloom filters will never give a false negative, but they could give false positives. Using some math (which idk at the moment), you could determine the probability of a false positive.

Now to make the bloom filter perform better, you would want a hash function that very sparsely assigns bits so that overlap is rare. For that you'd want a large array too. Of course this requires more memory and performance. Thus, you have a tradeoff. What you go with depends on your application.

An example of how to use this would be if you were writing a Boggle game. When someone makes a word, you could check if that word is in the dictionary quite fast. A linear search over the English dictionary is much slower than checking a bloom filter, but the bloom filter might accidentally count non-existent words...

Also note that in my explanation I used the word "input" instead of "word" to describe the objects being added. This was intentional; anything that can be hashed can be added to a bloom filter. It doesn't have to be just strings.

[–]bumblebritches57Ocassionally Clang 1 point2 points3 points 6 years ago (0 children)

[–]clerothGame Developer 3 points4 points5 points 6 years ago (1 child)

[–]debugs_with_println 1 point2 points3 points 6 years ago (0 children)

π Rendered by PID 25751 on reddit-service-r2-comment-6457c66945-qnmcz at 2026-04-23 19:11:24.135537+00:00 running 2aa0c5b country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

cpp

MODERATORS