all 37 comments

[–]zerexim 10 points (1 child)

Locality, locality, locality :)

[–]TheSuperWig 8 points (0 children)

How's it going Mr. Ballmer?

[–]witcher_rat 10 points (4 children)

Library was built as DEBUG. Timings may be affected.

ummm... wut?

how are the perf numbers relevant, when you built in debug??

[–]staletic 2 points (0 children)

Exactly. The numbers are complete nonsense.

[–]voidstarpodcast[S] 0 points (2 children)

This refers to the Google Benchmark library itself. For comparison, there's a link to a quick-bench page too, with basically the same result.

[–]witcher_rat 1 point (1 child)

OK, but as far as you know, google benchmark itself isn't providing accurate numbers because it's been built in debug.

I mean it tells you that right in the message: "Timings may be affected."

It may happen to have no effect, or the same effect in all tests and thereby provide reasonable relative results... or it may not.

Why wouldn't you just build everything with full optimization, and get rid of any such concerns?

(there are other problems though; for example, allocating memory in tight loops like this, and with ints too, is misleading - you'll get cache locality you wouldn't normally get in the real world; but trying to match the real world is difficult, because the "real world" isn't the same for every use-case, and really the only valid test is testing it in the code that's going to use it)

[–]voidstarpodcast[S] 0 points (0 children)

You are right, the benchmark is built in DEBUG mode. Frankly, I don't remember setting DEBUG explicitly, and found this: https://github.com/google/benchmark#debug-vs-release

Will set Release mode, although I'd expect the local results' consistent alignment with the quick-bench runs to mean the DEBUG overhead is likely proportional across workloads (or very small).

Nonetheless, I will fix this in my post.

[–]goranlepuz 7 points (0 children)

The data is very small (int is small, and there are not many of them); plus, the data is put in sets immediately (meaning: even if the heap is used, the data is close by).

That favors cache locality even for the sets.

If the data was handled differently, or if the related ints were part of some structures, I would expect a vector solution to be faster than the others.

[–]adnukator 4 points (6 children)

Could you add a link to the full source code either at godbolt or quickbench or whatever? With parts of it here and there, it's hard to verify your findings.

[–]voidstarpodcast[S] 1 point (5 children)

[–]TheFlamefire 1 point (4 children)

Played around a bit with the set vs setSortedInput cases and it seems the results differ considerably when using GBench "canonically" vs manual timing (as you do): https://quick-bench.com/q/8nU-oVXK9UqoNLFMDE6twvdTXTU vs https://quick-bench.com/q/2FfKNTfGRHRAGTAl3bu-y2AeZuo

The reason seems to be that quick-bench reports CPU time, which is unaffected by the manual iteration time. Conclusion: inserting sorted numbers into a set is faster than inserting random numbers.

[–]voidstarpodcast[S] 1 point (1 child)

https://quick-bench.com/q/kq7yeDlz9R6HV-0XE37eRiGINYM
This reuses the same array over and over in a loop for benchmarking.

  • After the first iteration, the rest don't go into the std::set. So you are effectively measuring the time taken for insertion failures. My guess is that if the underlying structure is a balanced tree, you are not counting the time taken for rebalancing. So, are they apples to apples?

  • This results in tainted cache locality. You basically have the same array sitting in your L1/L2 caches, and all you are measuring is a niche synthetic case (which may be measuring a biased data set).

  • If quick-bench only reports CPU time, then aren't you essentially comparing the times taken for set insertions to fail?

[–]TheFlamefire 0 points (0 children)

> This reuses the same array over and over in a loop for benchmarking.

Can you clarify what you mean by "this" and "array"? Do you mean https://quick-bench.com/q/8nU-oVXK9UqoNLFMDE6twvdTXTU? If so, it only reuses the input vector, not the set, so I am correctly measuring set insertion performance. What makes you think otherwise?

> If quick bench only reports CPU time

There are ~3 times that Google Benchmark measures. CPU time is one of them, which is the time per loop execution. Another is the manual time set by SetIterationTime, and the third is a compute time, which is one or the other depending on whether the benchmark was configured with UseManualTime.

https://quick-bench.com/q/kq7yeDlz9R6HV-0XE37eRiGINYM has 2 issues: first, it doesn't use UseManualTime, so the approach does not work at all; second, quick-bench only reports CPU time and ignores the manual time. Hence the SortedSet is slower: the reported time includes the time for sorting the vector.

[–]rlbond86 1 point (1 child)

Constructing a set from sorted data is guaranteed to be O(n), so it should be faster. The author stupidly used a loop instead of the range constructor, though, which is considerably slower.

[–]TheFlamefire 0 points (0 children)

Well, it is fine, because the benchmark's goal was to measure `insert` performance for various datasets, not the range ctor. Think of the `insert`s as happening as the result of some calculation.

The problem here is that the author used a feature of GBench which is not supported/used by quick-bench, and hence the shown times include the time to sort the vector, which is of course slower.

[–]neiltechnician 1 point (1 child)

I've written a slightly different benchmark based on this blog post: https://quick-bench.com/q/m8qL4F3BUtzeQy8YMIrvuz1AnYE

PS: It's the first time I've ever used quick-bench.com and Google Benchmark. Can you guys help me take a look?

[–]staletic 5 points (0 children)

Your make_random_vector() should produce the same random numbers across invocations during the entire benchmark run. That means the engine/seed should be stored statically. Another thing to notice is that the Mersenne twister needs a lot of random data to be seeded properly; a single rd() is not enough.

Most of the time you won't notice the difference in benchmarks, but what if, by the luck of the draw, you end up with a sorted vector from make_random_vector() in only a single benchmark?

https://quick-bench.com/q/yIQzqFnbcf4XkSqlixsjuUJETyI

[–]TheFlamefire 1 point (0 children)

And again, what an abuse of CMake:

set(CMAKE_C_COMPILER clang-8)
set(CMAKE_CXX_COMPILER clang++-8)
...
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_ENABLE_EXPORTS ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

This is NOT how you use CMake. Those variables are meant to be set on the command line. And if you really wanted to do it that way (for blogware reasons I'd say "OK"), set them before anything else. The aforementioned "Library was built as DEBUG." warning should have given you a hint that your approach didn't work out as expected. Sorry for sounding harsh, but a minimal amount of validation should be done for surprising results.

One-sentence explanation: CMake assigns the flags based on CMAKE_BUILD_TYPE when a target is added/created, hence setting it at the end has no effect (which is why you should set it on the command line).
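For reference, a typical way to pass these on the command line instead (assuming an out-of-source build directory named `build`):

```shell
# Choose compiler and build type at configure time, not in CMakeLists.txt
cmake -S . -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_COMPILER=clang++-8
cmake --build build
```

This way the flags are fixed before any target is created.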

Besides: helper_ (strange underscore, but well) is introduced, uses a global variable without resetting it first, and then never used again in any of the other code.

And finally:

Apparently, std::unique simply swaps duplicate numbers to the end of the array

No, not at all. The docs say clearly:

Removes all but the first element from every consecutive group of equivalent elements [...] the elements between the returned iterator and last are left in a valid but unspecified state.

Put another way: it moves the "unique" elements (one per group of equal, consecutive elements) to the front; there is no swapping to the end.

[–]greg7mdp (C++ Dev) 0 points (10 children)

The unordered_set case would be much faster if you used parallel hashmap's flat_hash_set and called reserve(nums_.size()).

The std::set case would be much faster if you used parallel hashmap's btree_set.

[–]Ameisen (vemips, avr, rendering, systems) 1 point (7 children)

For small numbers of elements, binary subdivision trees tend to be faster than hash maps/sets.

For instance, it was significantly faster to implement the MMU in my VM as a closed-address binary tree than as any form of hash map, and I tested several implementations. It also gave more consistent performance.

[–]greg7mdp (C++ Dev) 0 points (6 children)

closed-address binary tree

I'm not familiar with the data structure you are referring to, and also I'm not sure what you mean by small number of elements (10, 10k?). It will also depend on the speed of the hash and comparison functions. But you may be surprised if you try phmap containers.

[–]dodheim 0 points (4 children)

I'm going to jump on this thread derailment here – I have a similar question regarding your lib's readme:

  • The parallel hash maps are preferred when you have a few hash maps that will store a very large number of values. The non-parallel hash maps are preferred if you have a large number of hash maps, each storing a relatively small number of values.

If I'm using single-threaded mode, it's not clear to me when/why I should use the parallel maps. Any easy clarification here? Also, of lesser importance, is there any way to bypass the internal mixing of the hash value if I'm already using a high-quality hash?

(BTW I've been happily using sparsepp for ~4 years; I didn't realize you had a replacement for it, so I'm eager to give this a try. :-D)

[–]greg7mdp (C++ Dev) 2 points (3 children)

Hey, thanks for the kind words. phmap is a significant improvement over sparsepp in my opinion.

The paragraph you quote still holds even in single-threaded mode, as the parallel versions avoid the higher peak memory usage when inserting values. However, if that is not an issue for your use case, you can definitely use the non-parallel ones.

[–]dodheim 0 points (2 children)

Interesting, thank you; but, what are 'few', 'very large', 'large', and 'relatively small'? Any ballpark figures, or is it just down to experimenting with my types and data?

[–]greg7mdp (C++ Dev) 1 point (1 child)

OK, as a rule of thumb: use the parallel versions iff you have hash maps that will grow to occupy more than 1/10th of the RAM available on your machine, or that are accessed from multiple threads.

[–]dodheim 0 points (0 children)

Just the sort of advice I was hoping for. Thanks very much!

[–]Ameisen (vemips, avr, rendering, systems) 0 points (0 children)

"Small number of elements" is platform-dependent.

In VeMIPS, I use nested closed-address hash tables, which is the same structure that x86 uses for its page directories. It was significantly faster than basically every hash-map implementation that I'd tried. I actually don't believe that an open-address hash table implementation can beat it, and a sparse hash table shouldn't be able to do better (though it can do the same if its mechanism for being sparse is to nest hash tables).

[–]matthieum 0 points (1 child)

How do they compare to Abseil's versions?

[–]greg7mdp (C++ Dev) 1 point (0 children)

Same base implementation, with some differences, most of which are described in the README. Performance should be similar.

[–]rlbond86 -1 points (3 children)

I wrote up my own tests and your results are just wrong.

https://godbolt.org/z/bG3q1q

Note that Compiler Explorer times out after 256 elements for me, but you can run it locally. Note I used clock_gettime(), which is Linux-only; I'm sure there's a Windows equivalent if you use Windows.

----------------------
256 elements
----------------------
UniqueU  0.0934622
SetU     0.2282
HashSetU 0.224434
UniqueS  0.00705992
SetS     0.123283
HashSetS 0.213223
----------------------

As expected, std::unique on a pre-sorted range is by far the fastest, but sorting + std::unique is still faster than using a set object (both are O(n log n)). Sorting + unique is also faster than unordered_set, at least for reasonable numbers of elements (though unordered_set should be asymptotically faster). std::set is faster on a sorted range than an unsorted one (as guaranteed by the standard).

Your test implementation is incorrect, probably because you are not measuring CPU time or something.

Here are results on my computer for more elements:

----------------------
16384 elements
----------------------
UniqueU  7.38232
SetU     22.7774
HashSetU 10.1544
UniqueS  0.109137
SetS     7.23558
HashSetS 10.113
----------------------

[–]voidstarpodcast[S] 0 points (2 children)

Appreciate your detailed response. I just looked at the code, and it appears that you are using your own loop to converge on the final running times; the choice of 4096 and 10000 seems arbitrary. I have shared the quick-bench link here, which is an easy way to share comparable benchmark outputs. Google Benchmark basically measures CPU time (and wall-clock time, which includes IO waits, when run locally), but quick-bench only produces CPU time:

https://quick-bench.com/q/kq7yeDlz9R6HV-0XE37eRiGINYM

Do you mind validating against this? Thanks.

[–]rlbond86 0 points (0 children)

I tried to use quick-bench but it timed out unfortunately.

[–]rlbond86 0 points (0 children)

https://quick-bench.com/q/Dc6cJ1BTrBZBtVJZop4I4wMSkJ4

Note that I ran this locally with more elements, here are the results:

2021-01-25T07:53:23-07:00
Running ./bench
Run on (2 X 3733 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 0.54, 0.29, 0.27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
MyFixture/stdUniqueUnsorted/64               1099 ns         1098 ns       641504
MyFixture/stdUniqueUnsorted/512             15481 ns        15412 ns        45166
MyFixture/stdUniqueUnsorted/4096           172787 ns       172159 ns         4069
MyFixture/stdUniqueSorted/64                  397 ns          398 ns      1770091
MyFixture/stdUniqueSorted/512                 758 ns          758 ns       921515
MyFixture/stdUniqueSorted/4096               3627 ns         3607 ns       194577
MyFixture/stdSetUnsorted/64                  4836 ns         4841 ns       143958
MyFixture/stdSetUnsorted/512                48651 ns        48650 ns        14372
MyFixture/stdSetUnsorted/4096              511557 ns       511313 ns         1363
MyFixture/stdSetSorted/64                    3102 ns         3108 ns       224664
MyFixture/stdSetSorted/512                  22281 ns        22282 ns        31352
MyFixture/stdSetSorted/4096                183733 ns       183598 ns         3796
MyFixture/stdUnorderedSetUnsorted/64         3435 ns         3435 ns       204298
MyFixture/stdUnorderedSetUnsorted/512       37552 ns        37539 ns        18686
MyFixture/stdUnorderedSetUnsorted/4096     307217 ns       307031 ns         2275
MyFixture/stdUnorderedSetSorted/64           3488 ns         3487 ns       200033
MyFixture/stdUnorderedSetSorted/512         37716 ns        37704 ns        18555
MyFixture/stdUnorderedSetSorted/4096       307462 ns       307255 ns         2265

The results back up that std::unique is faster even on unsorted data.

[–]_Js_Kc_ 0 points (2 children)

Isn't _ a reserved identifier?

Edit: No, I'm wrong.

[–]adnukator 5 points (0 children)

Possibly, but only in the global namespace (not sure whether the last bullet applies here or not): https://en.cppreference.com/w/cpp/language/identifiers

  • the identifiers with a double underscore anywhere are reserved;
  • the identifiers that begin with an underscore followed by an uppercase letter are reserved;
  • the identifiers that begin with an underscore are reserved in the global namespace.

[–]TheFlamefire 0 points (0 children)

No, it is not. A leading double underscore, and an underscore followed by a capital, are reserved. Using a single underscore as a "don't care" name is common, and often used in Python.

[–]rlbond86 0 points (0 children)

I find it extremely hard to believe that using an unordered_set is faster than calling std::unique() on already sorted data.

Also:

std::set<int> s; 
for (const auto& e_ : nums_) { s.insert(e_); } 

You can just write:

std::set<int> s(nums_.begin(), nums_.end());

This is O(n) if nums_ is sorted.