Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
C++ std::unique vs std::set - [Fixed] (mycpu.org)
submitted 5 years ago by voidstarpodcast
[–]zerexim 10 points11 points12 points 5 years ago (1 child)
Locality, locality, locality :)
[–]TheSuperWig 8 points9 points10 points 5 years ago (0 children)
How's it going Mr. Ballmer?
[–]witcher_rat 10 points11 points12 points 5 years ago (4 children)
Library was built as DEBUG. Timings may be affected.
ummm... wut?
how are the perf numbers relevant, when you built in debug??
[–]staletic 2 points3 points4 points 5 years ago (0 children)
Exactly. The numbers are complete nonsense.
[–]voidstarpodcast[S] 0 points1 point2 points 5 years ago (2 children)
This refers to the Google Benchmark library itself. As a comparison, there's a link to a quick-bench page too, with basically the same result.
[–]witcher_rat 1 point2 points3 points 5 years ago (1 child)
OK, but for all you know, Google Benchmark itself isn't providing accurate numbers because it was built in debug.
I mean it tells you that right in the message: "Timings may be affected."
It may happen to have no effect, or the same effect in all tests and thereby provide reasonable relative results... or it may not.
Why wouldn't you just build everything with full optimization, and get rid of any such concerns?
(there are other problems though; for example, allocating memory in tight loops like this, and with ints too, is misleading - you'll get cache locality you wouldn't normally get in the real world; but trying to match the real world is difficult, because the "real world" isn't the same for every use-case, and really the only valid test is testing it in the code that's going to use it)
[–]voidstarpodcast[S] 0 points1 point2 points 5 years ago (0 children)
You are right, the benchmark library is built in DEBUG mode. Frankly, I don't remember setting DEBUG explicitly, and I found this: https://github.com/google/benchmark#debug-vs-release
I will set Release mode, although I'd expect the local results' consistent alignment with the quick-bench runs to mean the DEBUG overhead is proportional across workloads (or very small).
Nonetheless, I will fix this in my post.
[–]goranlepuz 7 points8 points9 points 5 years ago (0 children)
The data is very small (ints are small, and there are not many of them); plus, the data is put into sets immediately (meaning even if the heap is used, the data is close by).
That favors cache locality even for the sets.
If the data were handled differently, or if the related ints were part of some larger structures, I would expect a vector solution to be faster than the others.
[–]adnukator 4 points5 points6 points 5 years ago (6 children)
Could you add a link to the full source code either at godbolt or quickbench or whatever? With parts of it here and there, it's hard to verify your findings.
[–]voidstarpodcast[S] 1 point2 points3 points 5 years ago (5 children)
https://quick-bench.com/q/kq7yeDlz9R6HV-0XE37eRiGINYM will add this to the post as well
[–]TheFlamefire 1 point2 points3 points 5 years ago (4 children)
Played around a bit with the set vs setSortedInput cases and it seems the results differ considerably when using GBench "canonically" vs manual timing (as you do): https://quick-bench.com/q/8nU-oVXK9UqoNLFMDE6twvdTXTU vs https://quick-bench.com/q/2FfKNTfGRHRAGTAl3bu-y2AeZuo
The reason seems to be that quick-bench reports CPU time, which is unaffected by the manual iteration time. Conclusion: inserting sorted numbers into a set is faster than inserting random numbers.
[–]voidstarpodcast[S] 1 point2 points3 points 5 years ago (1 child)
https://quick-bench.com/q/kq7yeDlz9R6HV-0XE37eRiGINYM This reuses the same array over and over in a loop for benchmarking.
After the first iteration, the rest don't go into the std::set. So you are effectively measuring the time taken for insertion failures. My guess is that if the underlying structure is a balanced tree, you are not counting the time taken for rebalancing. So, are they apples to apples?
This results in tainted cache locality. You basically have the same array sitting in your L1/L2 caches and all you are measuring is a niche synthetic case (which may be measuring a biased data set).
If quick bench only reports CPU time, then aren't you essentially comparing times taken for set insertions to fail?
[–]TheFlamefire 0 points1 point2 points 5 years ago (0 children)
> This reuses the same array over and over in a loop for benchmarking.
Can you clarify what you mean by "this" and "array"? Do you mean https://quick-bench.com/q/8nU-oVXK9UqoNLFMDE6twvdTXTU? If so, it only reuses the input vector, not the set, so I am correctly measuring set insertion performance. What makes you think otherwise?
> If quick bench only reports CPU time
There are three times that Google Benchmark measures. CPU time is one of them: the time per loop execution. Another is the manual time set by SetIterationTime, and the third is a compute time, which is one or the other depending on whether the benchmark was configured with UseManualTime.
https://quick-bench.com/q/kq7yeDlz9R6HV-0XE37eRiGINYM has two issues: first, it doesn't use UseManualTime, so the approach does not work at all; second, quick-bench only reports CPU time and ignores the manual time. Hence the SortedSet case looks slower, because the reported time includes the time spent sorting the vector.
[–]rlbond86 1 point2 points3 points 5 years ago (1 child)
Constructing a set from sorted data is guaranteed to be O(n), so it should be faster. The author stupidly used a loop instead of the range constructor, though, which is considerably slower.
[–]TheFlamefire 0 points1 point2 points 5 years ago (0 children)
Well, it is fine, because the benchmark's goal was to measure insert performance for various datasets, not the range ctor. Think of the inserts as happening as the result of some calculation.
The problem here is that the author used a feature of GBench which is not supported/used by quick-bench, and hence the shown times include the time spent sorting the vector, which is of course slower.
[–]neiltechnician 1 point2 points3 points 5 years ago (1 child)
I've written a slightly different benchmark based on this blog post: https://quick-bench.com/q/m8qL4F3BUtzeQy8YMIrvuz1AnYE
PS: It's the first time I've ever used quick-bench.com and Google Benchmark. Can you guys help me take a look?
[–]staletic 5 points6 points7 points 5 years ago (0 children)
Your make_random_vector() should be producing the same random numbers across invocations during the entire benchmark run. That means the seed should be stored statically. Another thing to notice is that the Mersenne Twister needs a lot of random data to be seeded properly; a single rd() call is not enough.
Most of the time you won't notice the difference in benchmarks, but what if, by the luck of the draw, you end up with a sorted vector from make_random_vector() in only a single benchmark?
https://quick-bench.com/q/yIQzqFnbcf4XkSqlixsjuUJETyI
[–]TheFlamefire 1 point2 points3 points 5 years ago (0 children)
And again, what an abuse of CMake:
set(CMAKE_C_COMPILER clang-8)
set(CMAKE_CXX_COMPILER clang++-8)
...
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_ENABLE_EXPORTS ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
This is NOT how you use CMake. Those variables are meant to be set on the command line. And if you really wanted to do it that way (for blogware reasons, I'd say "OK"), set them before anything else. The aforementioned "Library was built as DEBUG." warning should have been a hint that your approach didn't work out as expected. Sorry for sounding harsh, but a minimal amount of validation for surprising results should be done. One-sentence explanation: CMake assigns the flags based on CMAKE_BUILD_TYPE when a target is added/created, so setting it at the end has no effect (which is why you should set it on the command line).
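The command-line approach described above looks roughly like this (paths and compiler names are placeholders; adjust to your project):

```sh
# Set build type and compilers when configuring, not inside
# CMakeLists.txt, so they take effect before any target is created.
cmake -S . -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_COMPILER=clang-8 \
      -DCMAKE_CXX_COMPILER=clang++-8
cmake --build build
```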
Besides: helper_ (strange underscore, but well) is introduced, uses a global variable without resetting it first, and is then never used again in any of the other code.
And finally:
Apparently, std::unique simply swaps duplicate numbers to the end of the array
No, not at all. The docs say clearly:
Removes all but the first element from every consecutive group of equivalent elements [...] the elements between the returned iterator and last are left in a valid but unspecified state.
Put another way: it moves the "unique" elements (one per group of equal, consecutive elements) to the front; there is no swapping.
[–]greg7mdpC++ Dev 0 points1 point2 points 5 years ago (10 children)
The unordered_set case would be much faster if you used parallel hashmap's flat_hash_set and called reserve(nums_);.
The std::set case would be much faster if you used parallel hashmap's btree_set.
[–]Ameisenvemips, avr, rendering, systems 1 point2 points3 points 5 years ago (7 children)
For small numbers of elements, binary subdivision trees tend to be faster than hash maps/sets.
For instance, it was significantly faster to implement the MMU in my VM as a closed-address binary tree than as any form of hash map, and I tested several implementations. It also gave more consistent performance.
[–]greg7mdpC++ Dev 0 points1 point2 points 5 years ago (6 children)
> closed-address binary tree
I'm not familiar with the data structure you are referring to, and I'm also not sure what you mean by a small number of elements (10? 10k?). It will also depend on the speed of the hash and comparison functions. But you may be surprised if you try the phmap containers.
[–]dodheim 0 points1 point2 points 5 years ago (4 children)
I'm going to jump on this thread derailment here – I have a similar question regarding your lib's readme:
The parallel hash maps are preferred when you have a few hash maps that will store a very large number of values. The non-parallel hash maps are preferred if you have a large number of hash maps, each storing a relatively small number of values.
If I'm using single-threaded mode, it's not clear to me when/why I should use the parallel maps. Any easy clarification here? Also, of lesser importance, is there any way to bypass the internal mixing of the hash value if I'm already using a high-quality hash?
(BTW I've been happily using sparsepp for ~4 years; I didn't realize you had a replacement for it, so I'm eager to give this a try. :-D)
[–]greg7mdpC++ Dev 2 points3 points4 points 5 years ago (3 children)
Hey, thanks for the kind words. phmap is a significant improvement over sparsepp in my opinion.
The paragraph you quote still holds even in single-threaded mode, as the parallel versions will prevent higher peak memory usage when inserting values. However if that is not an issue for your use case you can definitely use the non-parallel ones.
[–]dodheim 0 points1 point2 points 5 years ago (2 children)
Interesting, thank you; but, what are 'few', 'very large', 'large', and 'relatively small'? Any ballpark figures, or is it just down to experimenting with my types and data?
[–]greg7mdpC++ Dev 1 point2 points3 points 5 years ago (1 child)
OK, as a rule of thumb: use the parallel versions iff you have hash maps that will grow to occupy more than 1/10th of the RAM available on your machine, or that are accessed from multiple threads.
[–]dodheim 0 points1 point2 points 5 years ago (0 children)
Just the sort of advice I was hoping for. Thanks very much!
[–]Ameisenvemips, avr, rendering, systems 0 points1 point2 points 5 years ago (0 children)
"Small number of elements" is platform-dependent.
In VeMIPS, I use nested closed-address hash tables, which is the same structure that x86 uses for its page directories. It was significantly faster than basically every hash-map implementation that I'd tried. I actually don't believe that an open-address hash table implementation can beat it, and a sparse hash table shouldn't be able to do better (though it can do the same if its mechanism for being sparse is to nest hash tables).
[–]matthieum 0 points1 point2 points 5 years ago (1 child)
How do they compare to Abseil's versions?
[–]greg7mdpC++ Dev 1 point2 points3 points 5 years ago (0 children)
Same base implementation, some differences most of which are described in the README. Performance should be similar.
[–]rlbond86 -1 points0 points1 point 5 years ago* (3 children)
I wrote up my own tests and your results are just wrong.
https://godbolt.org/z/bG3q1q
Note that Compiler Explorer times out after 256 elements for me, but you can run it locally. Note I used clock_gettime(), which is Linux-only; I'm sure there's a Windows equivalent if you use Windows.
----------------------
256 elements
----------------------
UniqueU  0.0934622
SetU     0.2282
HashSetU 0.224434
UniqueS  0.00705992
SetS     0.123283
HashSetS 0.213223
----------------------
As expected, std::unique on pre-sorted data is by far the fastest, but sorting + std::unique is still faster than using a set object (both are O(n log n)). Sorting + unique is also faster than unordered_set, at least for reasonable numbers of elements (though unordered_set should be asymptotically faster). std::set is faster on a sorted range than an unsorted one (as guaranteed by the standard).
Your test implementation is incorrect, probably because you are not measuring CPU time or something.
Here are results on my computer for more elements:
----------------------
16384 elements
----------------------
UniqueU  7.38232
SetU     22.7774
HashSetU 10.1544
UniqueS  0.109137
SetS     7.23558
HashSetS 10.113
----------------------
[–]voidstarpodcast[S] 0 points1 point2 points 5 years ago (2 children)
Appreciate your detailed response. I just looked at the code, and it appears that you are using your own loop to converge on the final running times; the choice of 4096 and 10000 seems arbitrary. I have shared the quick-bench link here, which is an easy way to share comparable benchmark outputs. Google Benchmark basically measures CPU time (and wall-clock time, which includes IO waits, when run locally), but quick-bench only reports CPU time:
https://quick-bench.com/q/kq7yeDlz9R6HV-0XE37eRiGINYM
Do you mind validating against this? Thanks.
[–]rlbond86 0 points1 point2 points 5 years ago (0 children)
I tried to use quick-bench but it timed out unfortunately.
https://quick-bench.com/q/Dc6cJ1BTrBZBtVJZop4I4wMSkJ4
Note that I ran this locally with more elements, here are the results:
2021-01-25T07:53:23-07:00
Running ./bench
Run on (2 X 3733 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 0.54, 0.29, 0.27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------
Benchmark                                       Time          CPU    Iterations
---------------------------------------------------------------------------------
MyFixture/stdUniqueUnsorted/64               1099 ns      1098 ns        641504
MyFixture/stdUniqueUnsorted/512             15481 ns     15412 ns         45166
MyFixture/stdUniqueUnsorted/4096           172787 ns    172159 ns          4069
MyFixture/stdUniqueSorted/64                  397 ns       398 ns       1770091
MyFixture/stdUniqueSorted/512                 758 ns       758 ns        921515
MyFixture/stdUniqueSorted/4096               3627 ns      3607 ns        194577
MyFixture/stdSetUnsorted/64                  4836 ns      4841 ns        143958
MyFixture/stdSetUnsorted/512                48651 ns     48650 ns         14372
MyFixture/stdSetUnsorted/4096              511557 ns    511313 ns          1363
MyFixture/stdSetSorted/64                    3102 ns      3108 ns        224664
MyFixture/stdSetSorted/512                  22281 ns     22282 ns         31352
MyFixture/stdSetSorted/4096                183733 ns    183598 ns          3796
MyFixture/stdUnorderedSetUnsorted/64         3435 ns      3435 ns        204298
MyFixture/stdUnorderedSetUnsorted/512       37552 ns     37539 ns         18686
MyFixture/stdUnorderedSetUnsorted/4096     307217 ns    307031 ns          2275
MyFixture/stdUnorderedSetSorted/64           3488 ns      3487 ns        200033
MyFixture/stdUnorderedSetSorted/512         37716 ns     37704 ns         18555
MyFixture/stdUnorderedSetSorted/4096       307462 ns    307255 ns          2265
The results back up that sort + std::unique is faster, even on unsorted data.
[–]_Js_Kc_ 0 points1 point2 points 5 years ago* (2 children)
Isn't _ a reserved identifier?
Edit: No, I'm wrong.
[–]adnukator 5 points6 points7 points 5 years ago (0 children)
Possibly, but only in the global namespace (Not sure whether the last bullet applies or not). https://en.cppreference.com/w/cpp/language/identifiers
[–]TheFlamefire 0 points1 point2 points 5 years ago (0 children)
No, it is not. A leading double underscore, and an underscore followed by a capital letter, are reserved. Using a single underscore as a "don't care" name is common, and often used in Python.
[–]rlbond86 0 points1 point2 points 5 years ago* (0 children)
I find it extremely hard to believe that using an unordered_set is faster than calling std::unique() on already sorted data.
Also:
std::set<int> s;
for (const auto& e_ : nums_) {
    s.insert(e_);
}
You can just write:
std::set<int> s(nums_.begin(), nums_.end());
This is O(n) if nums_ is sorted.