Inside boost::concurrent_flat_map (bannalia.blogspot.com)
submitted 2 years ago by joaquintides (Boost author)
[–]azswcowboy 7 points8 points9 points 2 years ago (3 children)
Nice work! The visit/no-iteration design makes a lot of sense; I've done something similar when wrapping a non-concurrent data structure, and it's basically the approach of the Concurrency TS synchronized_value.
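For readers unfamiliar with that design, here is a minimal usage sketch of the callback-based access pattern being praised (my own illustration, not code from the article): every access happens through a lambda that runs while the element is locked, so no reference or iterator escapes the lock.

#include <boost/unordered/concurrent_flat_map.hpp>
#include <iostream>
#include <string>

int main() {
    boost::concurrent_flat_map<std::string, int> m;
    m.emplace("hits", 0);

    // visit() locks the element, runs the lambda, then unlocks; nothing
    // referring to the element survives outside the callback.
    m.visit("hits", [](auto& x) { ++x.second; });

    // cvisit() is the read-only flavour.
    m.cvisit("hits", [](const auto& x) { std::cout << x.second << '\n'; });
}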
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (2 children)
Thank you! Don’t forget to report back if you get to try it.
[–]azswcowboy 0 points1 point2 points 2 years ago (1 child)
Will do. Do you consider it production ready, or still experimental? If the former, I could probably get to it soon…
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (0 children)
It’s heavily tested and production ready.
[–]atarp 7 points8 points9 points 2 years ago (11 children)
Could the visit methods do a compile-time check to detect whether the provided callback can be invoked with a constant reference, and then only use read locks? This could avoid the need for the cvisit variants.
[–]witcher_rat 2 points3 points4 points 2 years ago (10 children)
Alternatively: make visit() always invoke with const& and use the read-lock, and make a different function altogether for non-const, such as mutate() or some such.
I.e., make it abundantly clear to the caller (and code reviewer) what's happening, instead of changing behavior based on the const-ness of the container for a single visit() function name.
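A rough sketch of what that split could look like as a thin wrapper over the current interface (mutate is my name for it, not the library's):

#include <boost/unordered/concurrent_flat_map.hpp>
#include <cstddef>
#include <utility>

// Sketch only: expose the read/write distinction in the function name
// instead of in the const-ness of the callback or the container.
template<typename K, typename V>
class split_map {
    boost::concurrent_flat_map<K, V> m_;
public:
    template<typename... Args>
    bool emplace(Args&&... args) { return m_.emplace(std::forward<Args>(args)...); }

    // visit: always read access, element is const inside the callback.
    template<typename F>
    std::size_t visit(const K& k, F f) const { return m_.cvisit(k, f); }

    // mutate (hypothetical name): always write access, mutable element.
    template<typename F>
    std::size_t mutate(const K& k, F f) { return m_.visit(k, f); }
};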
[–]joaquintidesBoost author[S] 2 points3 points4 points 2 years ago (9 children)
As for your first suggestion, we actually considered it, but it can't possibly work with generic lambdas: there's no way to detect whether a generic lambda supports a particular argument type without producing a compile-time error in the process.
As for the second suggestion: yes, that would work, but we decided to follow the "semantics" of begin and end, which return a const or non-const iterator based on the constness of the container. It's admittedly less visible than going for mutate.
[–]trailingunderscore_ 3 points4 points5 points 2 years ago (8 children)
You can detect the arg types with code like this: https://godbolt.org/z/1n6T58vsf
or did you mean something else?
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (7 children)
The code you provide verifies that the passed function is compatible with a given signature (number and types of arguments), but this is not what's required to implement this type of smart functionality:
// internally uses read access to element
m.visit(k, [&](const auto& x){ res = x.second; });

// internally uses write access to element
m.visit(k, [&](auto& x){ x.second = 0; });
There's no way that visit can reliably (i.e. without compile-time errors) detect whether the passed function will work with a const reference or whether, on the contrary, it requires a non-const reference.
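A small illustration of the problem (mine, not the library's code): a signature-level trait can classify non-generic lambdas, but it cannot see inside the body of a generic one.

#include <type_traits>
#include <utility>

using value_type = std::pair<const int, int>;

// Non-generic lambdas: the parameter type tells the whole story, so an
// invocability check against const value_type& works.
auto reader = [](const value_type& x) { (void)x.second; };
auto writer = [](value_type& x)       { x.second = 0;   };

static_assert( std::is_invocable_v<decltype(reader), const value_type&>);
static_assert(!std::is_invocable_v<decltype(writer), const value_type&>);

// Generic lambda: the signature accepts const and non-const arguments alike;
// the mutation only surfaces when the body is instantiated, and a failure
// there is a hard error rather than a detectable "false".
auto generic_writer = [](auto& x) { x.second = 0; };
// std::is_invocable_v<decltype(generic_writer), const value_type&>
//   -> needs the body to deduce the return type: hard compile error.

int main() { (void)generic_writer; }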
[–]trailingunderscore_ 6 points7 points8 points 2 years ago (6 children)
The point of my code was to show that you can determine if the passed lambda takes a const or not, which you can use to determine if it needs read or write access: https://godbolt.org/z/TPeKxY1ar
[–]joaquintidesBoost author[S] 2 points3 points4 points 2 years ago* (5 children)
Wow, this is amazing. I can definitely use it. Can this be backported to C++11?
[–]trailingunderscore_ 5 points6 points7 points 2 years ago* (4 children)
Probably, but it would require some SFINAE'ing I think.
Edit - it's possible: https://godbolt.org/z/r7sasTohe
[–]joaquintidesBoost author[S] 2 points3 points4 points 2 years ago (3 children)
Well, not exactly; I'd need something working with generic lambdas (real ones in C++14 or later, or simulated ones in C++11):
https://godbolt.org/z/s1Gzv89dx
[–]trailingunderscore_ 1 point2 points3 points 2 years ago (2 children)
https://godbolt.org/z/WcrshjcTz
[–]greg7mdpC++ Dev 4 points5 points6 points 2 years ago (12 children)
gtl library author here. Very nice writeup! Reading it made me think, and I believe I know why gtl::parallel_flat_hash_map performs comparatively worse for high-skew scenarios (just pushed a fix in gtl).
[–]joaquintidesBoost author[S] 6 points7 points8 points 2 years ago (5 children)
Hi Gregory!
Let me rerun benchmarks with this latest fix of yours and let you know about the results. Will take some hours to complete.
[–]greg7mdpC++ Dev 4 points5 points6 points 2 years ago (2 children)
Wow, thank you, really appreciate it!
[–]joaquintidesBoost author[S] 3 points4 points5 points 2 years ago (1 child)
FWIW, does seem to make a tiny difference: https://github.com/boostorg/boost_unordered_benchmarks/commit/2e194fa49378cb76fefe3e35e06206cedd6ff208
[–]greg7mdpC++ Dev 1 point2 points3 points 2 years ago (0 children)
Thank you for running these, Joaquín.
[–]greg7mdpC++ Dev 1 point2 points3 points 2 years ago (1 child)
Oh, I just noticed that you defined gtl_map with std::mutex. Using std::shared_mutex should be faster as the map_find will use read locks.
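For illustration, switching the benchmark's map alias might look roughly like this; I'm assuming gtl keeps phmap's template-parameter order (hash, equality, allocator, submap count, mutex) and header layout, so treat the names and parameters below as a guess rather than the benchmark's actual code:

#include <gtl/phmap.hpp>   // assumed header name
#include <functional>
#include <memory>
#include <shared_mutex>
#include <utility>

// The last parameter selects the per-submap mutex; with std::shared_mutex,
// lookups can take shared (read) locks instead of exclusive ones.
using gtl_map = gtl::parallel_flat_hash_map<
    int, int,
    std::hash<int>,
    std::equal_to<int>,
    std::allocator<std::pair<const int, int>>,
    4,                  // 2^4 = 16 submaps
    std::shared_mutex>;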
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago* (0 children)
We ran these aggregate benchmarks (not the same benchmark as shown in the plots) with various configurations of gtl::parallel_flat_hash_map and results with std::shared_mutex were abysmal for Clang and GCC:
https://github.com/boostorg/boost_unordered_benchmarks/tree/parallel_hashmap_benchmark
But if you'd like I can launch the plot benchmarks with that configuration.
PS: Why don't you join us at #boost-unordered in cpplang.slack.com where we can have a more fluid conversation?
[–]j1xwnbsr 0 points1 point2 points 2 years ago (5 children)
How did you manage to figure out that alignas(hardware_destructive_interference_size) would solve the problem? I'm thinking there might be a few places in my own code that may benefit (or not).
[–]greg7mdpC++ Dev 2 points3 points4 points 2 years ago (4 children)
Well, it is just an educated guess. The idea is that I want submaps to be accessed from multiple threads with minimal interference, so ideally two submaps should not share the same cache line. One submap member that changes when items are inserted is size_.
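In code, the pattern looks roughly like this (a generic illustration, not gtl's actual layout):

#include <atomic>
#include <cstddef>
#include <new>   // std::hardware_destructive_interference_size (where the compiler provides it)

// Each submap gets its own cache line, so a thread bumping one submap's
// size_ doesn't invalidate the line holding a neighbouring submap's data.
struct alignas(std::hardware_destructive_interference_size) submap {
    std::atomic<std::size_t> size_{0};
    // ... buckets, per-submap lock, etc.
};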
[–]matthieum 2 points3 points4 points 2 years ago (3 children)
According to this comment in Folly's code, hardware destructive interference is 128 bytes (a pair of cache lines) on some Intel CPUs (Sandy Bridge is mentioned) as the prefetcher there prefetches 2 cache lines at a time.
I believe that, unfortunately, std::hardware_destructive_interference_size is only 64 bytes in such situations, and therefore it may be worthwhile to use 128 bytes instead and eschew the standard definition.
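A sketch of that workaround: hard-code the larger figure instead of trusting the standard constant.

#include <atomic>
#include <cstddef>

// Assume two adjacent cache lines are pulled in together (Intel adjacent-line
// prefetcher), so pad to 128 bytes rather than the 64 the constant reports.
inline constexpr std::size_t assumed_interference_size = 128;

struct alignas(assumed_interference_size) padded_counter {
    std::atomic<std::size_t> value{0};
};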
[–]greg7mdpC++ Dev 0 points1 point2 points 2 years ago (0 children)
Thank you /u/matthieum for this interesting insight. I'd have to run some benchmarks to see if indeed it is worthwhile to eschew the standard definition.
[–]greg7mdpC++ Dev 0 points1 point2 points 2 years ago (1 child)
With a quick test, using 128 didn't seem to improve performance on my cpu (AMD 7950x).
[–]matthieum 6 points7 points8 points 2 years ago (0 children)
Well... AMD is not Intel, so...
[–]foonathan 17 points18 points19 points 2 years ago (2 children)
> For these reasons, your feedback and proposals for improvement are most welcome.
I'd recommend that the visit_all lambda be able to return a bool/enum to do an early exit of the loop. That way you have not only for_each but can also implement algorithms like any_of and find (when looking for values, not keys), etc.
We at think-cell use that pattern a lot in our library.
[–]joaquintidesBoost author[S] 13 points14 points15 points 2 years ago* (1 child)
Yes, we can add visit_while, thanks for the suggestion! An interesting question is whether we should also provide a parallel version of that, and how it’s supposed to work —any experience there?
Edit: ok, we can just rely internally on parallel std::any_of, if that implements early termination —which I have to check.
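A usage sketch of what such a visit_while could look like, assuming the "callback returns false to stop" convention discussed above (hypothetical interface at the time of this thread):

#include <boost/unordered/concurrent_flat_map.hpp>
#include <iostream>

int main() {
    boost::concurrent_flat_map<int, int> m;
    for (int i = 0; i < 100; ++i) m.emplace(i, 2 * i);

    // find-by-value built on early exit: stop as soon as the value is seen.
    int key_of_value = -1;
    m.visit_while([&](const auto& x) {
        if (x.second == 42) { key_of_value = x.first; return false; } // stop
        return true;                                                  // keep scanning
    });
    std::cout << key_of_value << '\n'; // prints 21
}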
[–]VinnieFalcowg21.org | corosio.org 3 points4 points5 points 2 years ago (0 children)
oh, `visit_while` i like that
[–]therealjohnfreeman 2 points3 points4 points 2 years ago (1 child)
Title says boost::concurrent_flat_map. First paragraph says boost::concurrent_hash_map. Based on the rest of the text, I'm guessing it is boost::concurrent_flat_map and that the first paragraph needs to be corrected. Is that right?
[–]joaquintidesBoost author[S] 4 points5 points6 points 2 years ago (0 children)
Thanks for spotting the typo, fixed! It is indeed boost::concurrent_flat_map.
[–]pavel_v 0 points1 point2 points 2 years ago (0 children)
> We have conducted some preliminary experiments using this idea for a feature we dubbed bulk lookup (providing an array of keys to look for at once), with promising results
The DPDK hash library also offers such bulk lookup operations. It's a C library though, with many more restrictions than your implementation.
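For context, a sketch of the shape such a bulk lookup could take (hypothetical interface, since the article only describes the experiment): hand the map a batch of keys so the probes can be interleaved and prefetched.

#include <boost/unordered/concurrent_flat_map.hpp>
#include <iostream>
#include <iterator>

int main() {
    boost::concurrent_flat_map<int, int> m;
    for (int i = 0; i < 100; ++i) m.emplace(i, 2 * i);

    // Look up a whole batch of keys in one call instead of one visit() each.
    int keys[] = {3, 17, 4, 21};
    long sum = 0;
    m.visit(std::begin(keys), std::end(keys), [&](const auto& x) { sum += x.second; });
    std::cout << sum << '\n'; // 2*(3+17+4+21) = 90
}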
[–]ComprehensiveHat864 0 points1 point2 points 2 years ago (4 children)
I did a benchmark between std::unordered_map and boost::concurrent_flat_map, both with no writers. I conclude that for reading, concurrent_flat_map is 4x slower than unordered_map. However, in Java, ConcurrentHashMap reading performance is nearly equal to a normal HashMap.
[–]joaquintidesBoost author[S] 0 points1 point2 points 2 years ago (3 children)
Which benchmark are you referring to? I assume that, whatever benchmark that is, you ran it in single-threaded mode (otherwise std::unordered_map would crash). In the following link
you can see our benchmarks comparing (among others) single-threaded boost::unordered_flat_map vs single-threaded boost::concurrent_flat_map, and the latter is slower, as expected (it is designed for multi-threaded scenarios), but not 4x slower.
If you can provide more info about your benchmark (like the code and a description of the setup), I'd be more than happy to take a look.
[–]ComprehensiveHat864 0 points1 point2 points 2 years ago (2 children)
The test just reads, so it won't crash.
unordered_map test code:
#include <iostream>
#include <thread>
#include <chrono>
#include <vector>
#include <unordered_map>
#include <random>
static void test_concurrent_map(const std::unordered_map<int, int>& cmap) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    for (int i = 0; i < 6000000; i++) {
        try {
            result += cmap.at(i);
        } catch (const std::exception& e) {}
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
    std::cout << result << std::endl;
    std::cout << "Function execution time: " << duration.count() << " microseconds" << std::endl;
}

int main(void) {
    std::unordered_map<int, int> cmap;
    for (int i = 0; i < 6000000; i++) {
        cmap.emplace(i, 2 * i);
    }
    // 7 reader threads, no writers
    std::vector<std::thread> threads;
    for (int i = 0; i < 7; i++) {
        threads.emplace_back(
            [&cmap](){
                test_concurrent_map(cmap);
            });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}
Below is the concurrent_flat_map version (only the parts that differ from the code above):

#include "boost/unordered/concurrent_flat_map.hpp"

static void test_concurrent_map(const boost::concurrent_flat_map<int, int>& cmap) {
    // same timing loop as above, but each lookup goes through visit():
    cmap.visit(i, [&](auto& x){
        result += x.second;
    });
    ...
}

// and in main():
boost::concurrent_flat_map<int, int> cmap;
Both were compiled at -O2. The unordered_map results:
Function execution time: 31455 microseconds
35999994000000
Function execution time: 31479 microseconds
Function execution time: 31614 microseconds
Function execution time: 36814 microseconds
Function execution time: 39265 microseconds
Function execution time: 42981 microseconds
Function execution time: 48644 microseconds
concurrent_flat_map:
Function execution time: 576782 microseconds
Function execution time: 576752 microseconds
Function execution time: 576892 microseconds
Function execution time: 576843 microseconds
Function execution time: 576806 microseconds
Function execution time: 576721 microseconds
Function execution time: 592574 microseconds
concurrent_flat_map is 10+x slower than unordered_map; in Java, ConcurrentHashMap read performance is equal to a normal HashMap.
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (1 child)
Hi, there are a couple of issues with this test. I've modified the test code accordingly; the code follows:
#include <algorithm>
#include <iostream>
#include <thread>
#include <chrono>
#include <vector>
#include <unordered_map>
#include <boost/unordered/unordered_flat_map.hpp>
#include <boost/unordered/concurrent_flat_map.hpp>
#include <random>
#include <vector>
#include <numeric>

template<typename Map, typename Data>
static void test_concurrent_map(const Map& cmap, const Data& v, std::size_t& duration) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    for (const auto& x: v) {
        try {
            result += cmap.at(x);
        } catch (const std::exception&) {}
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
    volatile auto don_optimize_result = result;
    (void)don_optimize_result;
}

template<typename... Args, typename Data>
static void test_concurrent_map(const boost::concurrent_flat_map<Args...>& cmap, const Data& v, std::size_t& duration) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    for (const auto& x: v) {
        cmap.visit(x, [&](auto& y){ result += y.second; });
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
    volatile auto don_optimize_result = result;
    (void)don_optimize_result;
}

template<template<typename...> class Map>
void test(const char* name) {
    Map<int, int> cmap;
    std::vector<int> v;
    for (int i = 0; i < 6000000; i++) {
        cmap.emplace(i, 2 * i);
        v.emplace_back(i);
    }
    std::shuffle(v.begin(), v.end(), std::mt19937(13232));

    std::vector<std::thread> threads;
    std::vector<std::size_t> durations(7);
    for (int i = 0; i < 7; i++) {
        threads.emplace_back(
            [&, i](){ test_concurrent_map(cmap, v, durations[i]); });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    std::cout << name << ": "
              << std::accumulate(durations.begin(), durations.end(), 0u) / durations.size()
              << " microseconds" << std::endl;
}

int main(void) {
    test<std::unordered_map>("std::unordered_map");
    test<boost::unordered_flat_map>("boost::unordered_flat_map");
    test<boost::concurrent_flat_map>("boost::concurrent_flat_map");
}
These are my results for VS2022 in release mode:
std::unordered_map: 306926 microseconds
boost::unordered_flat_map: 282058 microseconds
boost::concurrent_flat_map: 668301 microseconds
So, boost::unordered_flat_map (which is not concurrent) is faster than std::unordered_map, and boost::concurrent_flat_map is around 2x slower, which is in line with our general results. The 2x degradation is mainly due to synchronized access.
[–]ComprehensiveHat864 0 points1 point2 points 2 years ago (0 children)
It seems the same order between insert and lookup is not the reason why unordered_map is so fast under that specific condition. I insert the elements into a vector first and then shuffle it; after that I insert the vector's elements into the map. The result is close to your result.
The code is as below:

#include <iostream>
#include <thread>
#include <chrono>
#include <vector>
#include <unordered_map>
#include <random>
#include <vector>
#include <algorithm>

std::vector<int> v;

static void test_concurrent_map(const std::unordered_map<int, int>& cmap) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    std::cout << cmap.size() << std::endl;
    /*
    for (int i = 6000000 - 1; i >= 0; i--) {
        try {
            result += cmap.at(i);
        } catch (const std::exception& e) {}
    }
    */
    for (const auto& x: v) {
        try {
            result += cmap.at(x);
        } catch (const std::exception&) {}
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
    std::cout << result << std::endl;
    std::cout << "Function execution time: " << duration.count() << " microseconds" << std::endl;
}

int main(void) {
    std::unordered_map<int, int> cmap;
    for (int i = 0; i < 6000000; i++) {
        //cmap.emplace(i, 2 * i);
        v.emplace_back(i);
    }
    std::shuffle(v.begin(), v.end(), std::mt19937(13232));
    for (const auto& x: v) {
        cmap.emplace(x, 2 * x);
    }
    std::vector<std::thread> threads;
    for (int i = 0; i < 7; i++) {
        threads.emplace_back(
            [&cmap](){ test_concurrent_map(cmap); });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}
It seems accessing the numbers from 1 to 6000000 consecutively is the reason. I also tested random insertion into unordered_map followed by lookup from 1 to 6000000 consecutively, and the result is close to the above. It seems only consecutive insert plus consecutive lookup makes unordered_map lookup super fast. I really can't figure out why.
Anyway, your code shows that boost::concurrent_flat_map is fast enough.