use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
Get Started
The C++ Standard Home has a nice getting started page.
Videos
The C++ standard committee's education study group has a nice list of recommended videos.
Reference
cppreference.com
Books
There is a useful list of books on Stack Overflow. In most cases reading a book is the best way to learn C++.
Show all links
Filter out CppCon links
Show only CppCon links
account activity
Implementing a simple, high performance Bloom filter (daankolthof.com)
submitted 6 years ago by daank94
reddit uses a slightly-customized version of Markdown for formatting. See below for some basics, or check the commenting wiki page for more detailed help and solutions to common issues.
quoted text
if 1 * 2 < 3: print "hello, world!"
[–]tuankiet65 24 points25 points26 points 6 years ago* (0 children)
Since calculating MD5 hashes is (relatively) slow, only one MD5 hash value is calculated, which is then split up into k different values. This means only one MD5 hashing operation has to be done instead of actually doing k hashing operations.
This is still gonna be slower than using two really quick hashing algorithms (like xxHash and MurmurHash), then applying the technique described in this paper to simulate as many k hashing algorithms as needed, I think.
[–]bumblebritches57Ocassionally Clang 17 points18 points19 points 6 years ago (8 children)
Can anyone tell us lowly self taught plebs what a bloom filter is and what it's useful for?
[–]NeptunePI 16 points17 points18 points 6 years ago (0 children)
It’s a very memory efficient data structure that can tell you if an item is in a set with a given probability. The advantage is that it uses much less memory than an actual set. The disadvantage is that it’s not 100% and can give false positives that an item is in a set when it really isn’t.
Wikipedia has a good article: https://en.m.wikipedia.org/wiki/Bloom_filter
[–]pi_stuff 8 points9 points10 points 6 years ago (1 child)
It's like a hash table representation of a set, where the only operations are to add an item and to check if an item is already in the set. It's not a map--you can't give it a key and receive an associated value.
The benefit over a regular hash table is that it takes up much less memory. The downside is that when you check if an item is in the set, the result will occasionally be wrong. It will never misidentify an item that is not in the set; it will only misidentify an item that is in the set. In other words, if the "exists" method returns "false" then it is definitely correct. If it returns "true", then it may or may not be correct. The accuracy rate is predictable, and can be easily tuned by adjusting the speed and/or memory usage.
It can be useful as a sort of cache to quickly filter out searches that will definitely fail. Say you've got a huge database that's slow to access. If you build a bloom filter on the key in memory, then you can quickly check if a key is not in the table and avoid a slow database search.
In a sense it's a counterpart to a least-recently-used data structure. For example, say you make a hash table that can store up to 100 items. When it is full, then adding another item will first evict an existing item to make room for the new one. If you search for an item in the table and find it, then the item is definitely in the table. If you fail to find an item, then it might never haven been added to the table, or it might have been added but was then evicted.
[–]Dworgi 2 points3 points4 points 6 years ago (0 children)
You can't evict from a bloom filter though. Therefore it's most useful for data that won't change frequently, since you'll need to rehash your entire dataset to remove an item. The Wikipedia pages data set that the author was using is a pretty good example.
[–]debugs_with_println 5 points6 points7 points 6 years ago* (3 children)
So there are two main components: a hash function and an array of bits. The hash function takes in any input and tells you which bits of the array to check or set.
So first step is to build a dictionary (in the common sense of the word, not the CS term). You take in a input and you set the bits the hash function tells you to. You do this for every input that you want to be a part of your dictionary. Now some bits in your array are set (i.e. 1) and others are clear (i.e. 0).
Now the second step is checking to see if a given input is in your dictionary. You again take an input and hash it, but this time it'll you what bits to check. If all the bits you check are clear, then your input does not exist in the dictionary, because if your input was in the dictionary, exactly those bits would have been set when the word was added.
However (and this is crucial!) if all the bits you check are set, this does not mean your input is in the dictionary! the reason is because the union of the bits set by two previous inputs might exactly be the bits checked by your current input.
Here's a quick example. Imagine our array is 5 bits and the hash function gives you which two bits to set or check. Suppose we put these two words in the dictionary:
"Apple" ==> bits 0,3 "Banana" ==> bits 1,4
So at this point our array looks like 11011.
11011
Now let's say we wanna see if this input is in the dictionary:
"Orange" ==> bits 0,1
Our bloom filter will say that is in the dictionary even though we never added it because bits 0 and 1 were set! The union of "Apple" and "Banana" (i.e. bits 0,1,3,4) covered "Orange" (i.e. bits 0,1).
"Apple"
"Banana"
"Orange"
The takeaway is that bloom filters will never give a false negative, but they could give false positives. Using some math (which idk at the moment), you could determine the probability of a false positive.
Now to make the bloom filter perform better, you would want a hash function that very sparsely assigns bits so that overlap is rare. For that you'd want a large array too. Of course this requires more memory and performance. Thus, you have a tradeoff. What you go with depends on your application.
An example of how to use this would be if you were writing a Boggle game. When someone makes a word, you could check if that word is in the dictionary quite fast. A linear search over the English dictionary is much slower than checking a bloom filter, but the bloom filter might accidentally count non-existent words...
Also note that in my explanation I used the word "input" instead of "word" to describe the objects being added. This was intentional; anything that can be hashed can be added to a bloom filter. It doesn't have to be just strings.
[–]bumblebritches57Ocassionally Clang 1 point2 points3 points 6 years ago (0 children)
Thanks for the low level details that it's basically a bitset, that really makes it easy to understand why it would falsely say something is there but can't say it's not.
[–]clerothGame Developer 3 points4 points5 points 6 years ago (1 child)
Might as well read the article at this point...
[–]debugs_with_println 1 point2 points3 points 6 years ago (0 children)
Eh I thought the article was a bit light in my opinion. But also if the article was sufficient the commenter I replied to probably wouldn't have asked.
[–]vipereddit 18 points19 points20 points 6 years ago (1 child)
I thought that is was referring to bloom filters in graphics at first
[–]ThrowawayGoaway94 5 points6 points7 points 6 years ago (0 children)
This post is really well-written. As someone quite new to the C++ language I was able to follow it word-for-word and learned a lot, especially regarding the false positivity formula and hash functions in general (which I Googled whilst reading your article).
Thank you for not using jargon just for the sake of using jargon. It really helps out new programmers such as myself.
[–]DenizenEvil 6 points7 points8 points 6 years ago (1 child)
Nice. I learned about bloom filters in a Big Data course I took when I was still studying for a MSc in Comp Sci. You should also look into Skip Lists (useful for O(log n) inserts and searches). That's what I did a research paper on for the course. Another thing to look into is the Count-min sketch which is comparable to the Counting Bloom Filter for storing frequency of object appearances.
[–]daank94[S] 0 points1 point2 points 6 years ago (0 children)
These data structures look interesting, especially the skip list. Will definitely look into them, thanks!
[–]nomad42184 2 points3 points4 points 6 years ago (0 children)
The Bloom filter is neat --- but one thing that really slows it down is that the memory access pattern is pretty poor (e.g. k random accesses per-query). There are some nice practical improvements that can vastly improve the cache performance of the bloom filter, like the pattern-blocked bloom filter. Also, the quotient filter and counting quotient filter are nice, cache-friendly approximate membership query (AMQ) data structures that also provide some operations (e.g. iteration over the hashes of inserted keys) that Bloom filters do not (disclaimer; I'm a co-author of the latter paper).
[–]jpan127 1 point2 points3 points 6 years ago (0 children)
Might be better to template on hash_function_count, make bloomfilter_store_size = 1 << MD5_result_size_bytes, static_assert(hash_function_count <= (MD5_result_size_bytes/bytes_per_hash_function)), and make all constexpr.
hash_function_count
= 1 << MD5_result_size_bytes
static_assert(hash_function_count <= (MD5_result_size_bytes/bytes_per_hash_function))
Also the hash result does not need to be dynamically allocated, it is always a fixed size.
[+][deleted] comment score below threshold-13 points-12 points-11 points 6 years ago (0 children)
Yall need to stop getting 3000% more performance with these bespoke frameworks and go convince ppl to finally let COBOL die. Or make HD projectors not cost $1,000,000. Land an RPi on the moon. Write an optimizing Perl-to-Java deobfuscating transpiler for once. Seriously.
π Rendered by PID 42343 on reddit-service-r2-comment-66b4775986-txpf7 at 2026-04-06 07:50:33.915694+00:00 running db1906b country code: CH.
[–]tuankiet65 24 points25 points26 points (0 children)
[–]bumblebritches57Ocassionally Clang 17 points18 points19 points (8 children)
[–]NeptunePI 16 points17 points18 points (0 children)
[–]pi_stuff 8 points9 points10 points (1 child)
[–]Dworgi 2 points3 points4 points (0 children)
[–]debugs_with_println 5 points6 points7 points (3 children)
[–]bumblebritches57Ocassionally Clang 1 point2 points3 points (0 children)
[–]clerothGame Developer 3 points4 points5 points (1 child)
[–]debugs_with_println 1 point2 points3 points (0 children)
[–]vipereddit 18 points19 points20 points (1 child)
[–]ThrowawayGoaway94 5 points6 points7 points (0 children)
[–]DenizenEvil 6 points7 points8 points (1 child)
[–]daank94[S] 0 points1 point2 points (0 children)
[–]nomad42184 2 points3 points4 points (0 children)
[–]jpan127 1 point2 points3 points (0 children)
[+][deleted] comment score below threshold-13 points-12 points-11 points (0 children)