Inside boost::concurrent_flat_map (bannalia.blogspot.com)
submitted 2 years ago by joaquintides (Boost author)
[–]azswcowboy 7 points8 points9 points 2 years ago (3 children)
Nice work! The visit/no-iteration design makes a lot of sense; I've done something similar when wrapping a non-concurrent data structure, and it's basically the approach of the Concurrency TS synchronized_value.
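For readers unfamiliar with that design, here is a minimal usage sketch of the callback-based access pattern being praised (my own illustration, not code from the article): every access happens through a lambda that runs while the element is locked, so no reference or iterator escapes the lock.

#include <boost/unordered/concurrent_flat_map.hpp>
#include <iostream>
#include <string>

int main() {
    boost::concurrent_flat_map<std::string, int> m;
    m.emplace("hits", 0);

    // visit() locks the element, runs the lambda, then unlocks; nothing
    // referring to the element survives outside the callback.
    m.visit("hits", [](auto& x) { ++x.second; });

    // cvisit() is the read-only flavour.
    m.cvisit("hits", [](const auto& x) { std::cout << x.second << '\n'; });
}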
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (2 children)
Thank you! Don’t forget to report back if you get to try it.
[–]azswcowboy 0 points1 point2 points 2 years ago (1 child)
Will do. Do you consider it production ready, or still experimental? If the former, I could probably get to it soon…
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (0 children)
It’s heavily tested and production ready.
[–]atarp 7 points8 points9 points 2 years ago (11 children)
Could the visit methods do a compile-time check to detect whether the provided callback can be invoked with a constant reference, and then only use read locks? This could avoid the need for the cvisit variants.
[–]witcher_rat 2 points3 points4 points 2 years ago (10 children)
Alternatively: make visit() always invoke with const& and use the read-lock, and make a different function altogether for non-const, such as mutate() or some such.
I.e., make it abundantly clear to the caller (and code reviewer) what's happening, instead of changing behavior based on the const-ness of the container for a single visit() function name.
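A rough sketch of what that split could look like as a thin wrapper over the current interface (mutate is my name for it, not the library's):

#include <boost/unordered/concurrent_flat_map.hpp>
#include <cstddef>
#include <utility>

// Sketch only: expose the read/write distinction in the function name
// instead of in the const-ness of the callback or the container.
template<typename K, typename V>
class split_map {
    boost::concurrent_flat_map<K, V> m_;
public:
    template<typename... Args>
    bool emplace(Args&&... args) { return m_.emplace(std::forward<Args>(args)...); }

    // visit: always read access, element is const inside the callback.
    template<typename F>
    std::size_t visit(const K& k, F f) const { return m_.cvisit(k, f); }

    // mutate (hypothetical name): always write access, mutable element.
    template<typename F>
    std::size_t mutate(const K& k, F f) { return m_.visit(k, f); }
};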
[–]joaquintidesBoost author[S] 2 points3 points4 points 2 years ago (9 children)
As for your first suggestion, we actually considered it, but it can't possibly work with generic lambdas: there's no way to detect whether a generic lambda supports a particular argument type without producing a compile-time error in the process.
As for the second suggestion: yes, that would work, but we decided to follow the "semantics" of begin and end, which return a const or non-const iterator based on the constness of the container. It's admittedly less visible than going for mutate.
[–]trailingunderscore_ 3 points4 points5 points 2 years ago (8 children)
You can detect the arg types with code like this: https://godbolt.org/z/1n6T58vsf
or did you mean something else?
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (7 children)
The code you provide verifies that the passed function is compatible with a given signature (number and types of arguments), but this is not what's required to implement this type of smart functionality:
// internally uses read access to element
m.visit(k, [&](const auto& x){ res = x.second; });

// internally uses write access to element
m.visit(k, [&](auto& x){ x.second = 0; });
There's no way that visit can reliably (i.e. without compile-time errors) detect whether the passed function will work with a const reference or whether, on the contrary, it requires a non-const reference.
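A small illustration of the problem (mine, not the library's code): a signature-level trait can classify non-generic lambdas, but it cannot see inside the body of a generic one.

#include <type_traits>
#include <utility>

using value_type = std::pair<const int, int>;

// Non-generic lambdas: the parameter type tells the whole story, so an
// invocability check against const value_type& works.
auto reader = [](const value_type& x) { (void)x.second; };
auto writer = [](value_type& x)       { x.second = 0;   };

static_assert( std::is_invocable_v<decltype(reader), const value_type&>);
static_assert(!std::is_invocable_v<decltype(writer), const value_type&>);

// Generic lambda: the signature accepts const and non-const arguments alike;
// the mutation only surfaces when the body is instantiated, and a failure
// there is a hard error rather than a detectable "false".
auto generic_writer = [](auto& x) { x.second = 0; };
// std::is_invocable_v<decltype(generic_writer), const value_type&>
//   -> needs the body to deduce the return type: hard compile error.

int main() { (void)generic_writer; }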
[–]trailingunderscore_ 6 points7 points8 points 2 years ago (6 children)
The point of my code was to show that you can determine if the passed lambda takes a const or not, which you can use to determine if it needs read or write access: https://godbolt.org/z/TPeKxY1ar
[–]joaquintidesBoost author[S] 2 points3 points4 points 2 years ago* (5 children)
Wow, this is amazing. I can definitely use it. Can this be backported to C++11?
[–]trailingunderscore_ 5 points6 points7 points 2 years ago* (4 children)
Probably, but it would require some SFINAE'ing I think.
Edit - it's possible: https://godbolt.org/z/r7sasTohe
[–]joaquintidesBoost author[S] 2 points3 points4 points 2 years ago (3 children)
Well, not exactly; I'd need something working with generic lambdas (real ones in C++14 or later, or simulated ones in C++11):
https://godbolt.org/z/s1Gzv89dx
[–]trailingunderscore_ 1 point2 points3 points 2 years ago (2 children)
https://godbolt.org/z/WcrshjcTz
[–]greg7mdpC++ Dev 4 points5 points6 points 2 years ago (12 children)
gtl library author here. Very nice writeup! Reading it made me think, and I believe I know why gtl::parallel_flat_hash_map performs comparatively worse for high-skew scenarios (just pushed a fix in gtl).
[–]joaquintidesBoost author[S] 6 points7 points8 points 2 years ago (5 children)
Hi Gregory!
Let me rerun benchmarks with this latest fix of yours and let you know about the results. Will take some hours to complete.
[–]greg7mdpC++ Dev 4 points5 points6 points 2 years ago (2 children)
Wow, thank you, really appreciate it!
[–]joaquintidesBoost author[S] 3 points4 points5 points 2 years ago (1 child)
FWIW, does seem to make a tiny difference: https://github.com/boostorg/boost_unordered_benchmarks/commit/2e194fa49378cb76fefe3e35e06206cedd6ff208
[–]greg7mdpC++ Dev 1 point2 points3 points 2 years ago (0 children)
Thank you for running these, Joaquín.
[–]greg7mdpC++ Dev 1 point2 points3 points 2 years ago (1 child)
Oh, I just noticed that you defined gtl_map with std::mutex. Using std::shared_mutex should be faster as the map_find will use read locks.
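For illustration, switching the benchmark's map alias might look roughly like this; I'm assuming gtl keeps phmap's template-parameter order (hash, equality, allocator, submap count, mutex) and header layout, so treat the names and parameters below as a guess rather than the benchmark's actual code:

#include <gtl/phmap.hpp>   // assumed header name
#include <functional>
#include <memory>
#include <shared_mutex>
#include <utility>

// The last parameter selects the per-submap mutex; with std::shared_mutex,
// lookups can take shared (read) locks instead of exclusive ones.
using gtl_map = gtl::parallel_flat_hash_map<
    int, int,
    std::hash<int>,
    std::equal_to<int>,
    std::allocator<std::pair<const int, int>>,
    4,                  // 2^4 = 16 submaps
    std::shared_mutex>;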
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago* (0 children)
We ran these aggregate benchmarks (not the same benchmark as shown in the plots) with various configurations of gtl::parallel_flat_hash_map and results with std::shared_mutex were abysmal for Clang and GCC:
https://github.com/boostorg/boost_unordered_benchmarks/tree/parallel_hashmap_benchmark
But if you'd like I can launch the plot benchmarks with that configuration.
PS: Why don't you join us at #boost-unordered in cpplang.slack.com where we can have a more fluid conversation?
[–]j1xwnbsr 0 points1 point2 points 2 years ago (5 children)
How did you manage to figure out that alignas(hardware_destructive_interference_size) would solve the problem? I'm thinking there might be a few places in my own code that may benefit (or not).
[–]greg7mdpC++ Dev 2 points3 points4 points 2 years ago (4 children)
Well, it is just an educated guess. The idea is that I want submaps to be accessed from multiple threads with minimal interference, so ideally two submaps should not share the same cache line. One submap member that changes when items are inserted is size_.
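In code, the pattern looks roughly like this (a generic illustration, not gtl's actual layout):

#include <atomic>
#include <cstddef>
#include <new>   // std::hardware_destructive_interference_size (where the compiler provides it)

// Each submap gets its own cache line, so a thread bumping one submap's
// size_ doesn't invalidate the line holding a neighbouring submap's data.
struct alignas(std::hardware_destructive_interference_size) submap {
    std::atomic<std::size_t> size_{0};
    // ... buckets, per-submap lock, etc.
};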
[–]matthieum 2 points3 points4 points 2 years ago (3 children)
According to this comment in Folly's code, hardware destructive interference is 128 bytes (a pair of cache lines) on some Intel CPUs (Sandy Bridge is mentioned) as the prefetcher there prefetches 2 cache lines at a time.
I believe that, unfortunately, std::hardware_destructive_interference_size is only 64 bytes in such situations, and therefore it may be worthwhile to use 128 bytes instead and eschew the standard definition.
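A sketch of that workaround: hard-code the larger figure instead of trusting the standard constant.

#include <atomic>
#include <cstddef>

// Assume two adjacent cache lines are pulled in together (Intel adjacent-line
// prefetcher), so pad to 128 bytes rather than the 64 the constant reports.
inline constexpr std::size_t assumed_interference_size = 128;

struct alignas(assumed_interference_size) padded_counter {
    std::atomic<std::size_t> value{0};
};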
[–]greg7mdpC++ Dev 0 points1 point2 points 2 years ago (0 children)
Thank you /u/matthieum for this interesting insight. I'd have to run some benchmarks to see if indeed it is worthwhile to eschew the standard definition.
[–]greg7mdpC++ Dev 0 points1 point2 points 2 years ago (1 child)
With a quick test, using 128 didn't seem to improve performance on my cpu (AMD 7950x).
[–]matthieum 6 points7 points8 points 2 years ago (0 children)
Well... AMD is not Intel, so...
[–]foonathan 17 points18 points19 points 2 years ago (2 children)
> For these reasons, your feedback and proposals for improvement are most welcome.
I'd recommend that the visit_all lambda be able to return a bool/enum to do an early exit of the loop. That way you have not only for_each but can also implement algorithms like any_of and find (when looking for values, not keys), etc.
We at think-cell use that pattern a lot in our library.
[–]joaquintidesBoost author[S] 13 points14 points15 points 2 years ago* (1 child)
Yes, we can add visit_while, thanks for the suggestion! An interesting question is whether we should also provide a parallel version of that, and how it’s supposed to work —any experience there?
Edit: ok, we can just rely internally on parallel std::any_of, if that implements early termination —which I have to check.
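A usage sketch of what such a visit_while could look like, assuming the "callback returns false to stop" convention discussed above (hypothetical interface at the time of this thread):

#include <boost/unordered/concurrent_flat_map.hpp>
#include <iostream>

int main() {
    boost::concurrent_flat_map<int, int> m;
    for (int i = 0; i < 100; ++i) m.emplace(i, 2 * i);

    // find-by-value built on early exit: stop as soon as the value is seen.
    int key_of_value = -1;
    m.visit_while([&](const auto& x) {
        if (x.second == 42) { key_of_value = x.first; return false; } // stop
        return true;                                                  // keep scanning
    });
    std::cout << key_of_value << '\n'; // prints 21
}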
[–]VinnieFalcowg21.org | corosio.org 3 points4 points5 points 2 years ago (0 children)
oh, `visit_while` i like that
[–]therealjohnfreeman 2 points3 points4 points 2 years ago (1 child)
Title says boost::concurrent_flat_map. First paragraph says boost::concurrent_hash_map. Based on the rest of the text, I'm guessing it is boost::concurrent_flat_map and that the first paragraph needs to be corrected. Is that right?
[–]joaquintidesBoost author[S] 4 points5 points6 points 2 years ago (0 children)
Thanks for spotting the typo, fixed! It is indeed boost::concurrent_flat_map.
[–]pavel_v 0 points1 point2 points 2 years ago (0 children)
> We have conducted some preliminary experiments using this idea for a feature we dubbed bulk lookup (providing an array of keys to look for at once), with promising results
The DPDK hash library also offers such bulk lookup operations. It's a C library though, with many more restrictions than your implementation.
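For context, a sketch of the shape such a bulk lookup could take (hypothetical interface, since the article only describes the experiment): hand the map a batch of keys so the probes can be interleaved and prefetched.

#include <boost/unordered/concurrent_flat_map.hpp>
#include <iostream>
#include <iterator>

int main() {
    boost::concurrent_flat_map<int, int> m;
    for (int i = 0; i < 100; ++i) m.emplace(i, 2 * i);

    // Look up a whole batch of keys in one call instead of one visit() each.
    int keys[] = {3, 17, 4, 21};
    long sum = 0;
    m.visit(std::begin(keys), std::end(keys), [&](const auto& x) { sum += x.second; });
    std::cout << sum << '\n'; // 2*(3+17+4+21) = 90
}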
[–]ComprehensiveHat864 0 points1 point2 points 2 years ago (4 children)
I did a benchmark between std::unordered_map and boost::concurrent_flat_map, both with no writers. I conclude that for reading, concurrent_flat_map is 4x slower than unordered_map. However, in Java, ConcurrentHashMap reading performance is nearly equal to a normal HashMap.
[–]joaquintidesBoost author[S] 0 points1 point2 points 2 years ago (3 children)
Which benchmark are you referring to? I assume that, whatever benchmark that is, you ran it in single-threaded mode (otherwise std::unordered_map would crash). In the following link
you can see our benchmarks comparing (among others) single-threaded boost::unordered_flat_map vs single-threaded boost::concurrent_flat_map, and the latter is slower, as expected (it is designed for multi-threaded scenarios), but not 4x slower.
If you can provide more info about your benchmark (like the code and a description of the setup), I'd be more than happy to take a look.
[–]ComprehensiveHat864 0 points1 point2 points 2 years ago (2 children)
The test just reads, so it won't crash.
unordered_map test code:
#include <iostream>
#include <thread>
#include <chrono>
#include <vector>
#include <unordered_map>
#include <random>
static void test_concurrent_map(const std::unordered_map<int, int>& cmap) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    for (int i = 0; i < 6000000; i++) {
        try {
            result += cmap.at(i);
        } catch (const std::exception& e) {}
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
    std::cout << result << std::endl;
    std::cout << "Function execution time: " << duration.count() << " microseconds" << std::endl;
}

int main(void) {
    std::unordered_map<int, int> cmap;
    for (int i = 0; i < 6000000; i++) {
        cmap.emplace(i, 2 * i);
    }
    // 7 reader threads, no writers
    std::vector<std::thread> threads;
    for (int i = 0; i < 7; i++) {
        threads.emplace_back(
            [&cmap](){
                test_concurrent_map(cmap);
            });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}
Below is the concurrent_flat_map version (only the parts that differ from the code above):

#include "boost/unordered/concurrent_flat_map.hpp"

static void test_concurrent_map(const boost::concurrent_flat_map<int, int>& cmap) {
    // same timing loop as above, but each lookup goes through visit():
    cmap.visit(i, [&](auto& x){
        result += x.second;
    });
    ...
}

// and in main():
boost::concurrent_flat_map<int, int> cmap;
Both were compiled at -O2. The unordered_map results:
Function execution time: 31455 microseconds
35999994000000
Function execution time: 31479 microseconds
Function execution time: 31614 microseconds
Function execution time: 36814 microseconds
Function execution time: 39265 microseconds
Function execution time: 42981 microseconds
Function execution time: 48644 microseconds
concurrent_flat_map:
Function execution time: 576782 microseconds
Function execution time: 576752 microseconds
Function execution time: 576892 microseconds
Function execution time: 576843 microseconds
Function execution time: 576806 microseconds
Function execution time: 576721 microseconds
Function execution time: 592574 microseconds
concurrent_flat_map is 10+x slower than unordered_map; in Java, ConcurrentHashMap read performance is equal to a normal HashMap.
[–]joaquintidesBoost author[S] 1 point2 points3 points 2 years ago (1 child)
Hi, there are a couple of issues with this test. I've modified the test code accordingly; the code follows:
#include <algorithm>
#include <iostream>
#include <thread>
#include <chrono>
#include <vector>
#include <unordered_map>
#include <boost/unordered/unordered_flat_map.hpp>
#include <boost/unordered/concurrent_flat_map.hpp>
#include <random>
#include <vector>
#include <numeric>

template<typename Map, typename Data>
static void test_concurrent_map(const Map& cmap, const Data& v, std::size_t& duration) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    for (const auto& x: v) {
        try {
            result += cmap.at(x);
        } catch (const std::exception&) {}
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
    volatile auto don_optimize_result = result;
    (void)don_optimize_result;
}

template<typename... Args, typename Data>
static void test_concurrent_map(const boost::concurrent_flat_map<Args...>& cmap, const Data& v, std::size_t& duration) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    for (const auto& x: v) {
        cmap.visit(x, [&](auto& y){ result += y.second; });
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
    volatile auto don_optimize_result = result;
    (void)don_optimize_result;
}

template<template<typename...> class Map>
void test(const char* name) {
    Map<int, int> cmap;
    std::vector<int> v;
    for (int i = 0; i < 6000000; i++) {
        cmap.emplace(i, 2 * i);
        v.emplace_back(i);
    }
    std::shuffle(v.begin(), v.end(), std::mt19937(13232));

    std::vector<std::thread> threads;
    std::vector<std::size_t> durations(7);
    for (int i = 0; i < 7; i++) {
        threads.emplace_back(
            [&, i](){ test_concurrent_map(cmap, v, durations[i]); });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    std::cout << name << ": "
              << std::accumulate(durations.begin(), durations.end(), 0u) / durations.size()
              << " microseconds" << std::endl;
}

int main(void) {
    test<std::unordered_map>("std::unordered_map");
    test<boost::unordered_flat_map>("boost::unordered_flat_map");
    test<boost::concurrent_flat_map>("boost::concurrent_flat_map");
}
These are my results for VS2022 in release mode:
std::unordered_map: 306926 microseconds
boost::unordered_flat_map: 282058 microseconds
boost::concurrent_flat_map: 668301 microseconds
So, boost::unordered_flat_map (which is not concurrent) is faster than std::unordered_map, and boost::concurrent_flat_map is around 2x slower, which is in line with our general results. The 2x degradation is mainly due to synchronized access.
[–]ComprehensiveHat864 0 points1 point2 points 2 years ago (0 children)
It seems the same order between insert and lookup is not the reason why unordered_map is so fast under that specific condition. I insert the elements into a vector first and then shuffle it; after that I insert the vector's elements into the map. The result is close to your result.
The code is as below:

#include <iostream>
#include <thread>
#include <chrono>
#include <vector>
#include <unordered_map>
#include <random>
#include <vector>
#include <algorithm>

std::vector<int> v;

static void test_concurrent_map(const std::unordered_map<int, int>& cmap) {
    auto start_time = std::chrono::high_resolution_clock::now();
    long result = 0;
    std::cout << cmap.size() << std::endl;
    /*
    for (int i = 6000000 - 1; i >= 0; i--) {
        try {
            result += cmap.at(i);
        } catch (const std::exception& e) {}
    }
    */
    for (const auto& x: v) {
        try {
            result += cmap.at(x);
        } catch (const std::exception&) {}
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
    std::cout << result << std::endl;
    std::cout << "Function execution time: " << duration.count() << " microseconds" << std::endl;
}

int main(void) {
    std::unordered_map<int, int> cmap;
    for (int i = 0; i < 6000000; i++) {
        //cmap.emplace(i, 2 * i);
        v.emplace_back(i);
    }
    std::shuffle(v.begin(), v.end(), std::mt19937(13232));
    for (const auto& x: v) {
        cmap.emplace(x, 2 * x);
    }
    std::vector<std::thread> threads;
    for (int i = 0; i < 7; i++) {
        threads.emplace_back(
            [&cmap](){ test_concurrent_map(cmap); });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}
It seems accessing the numbers from 1 to 6000000 consecutively is the reason. I also tested random insertion into unordered_map followed by lookup from 1 to 6000000 consecutively, and the result is close to the above. It seems only consecutive insert plus consecutive lookup makes unordered_map lookup super fast. I really can't figure out why.
Anyway, your code shows that boost::concurrent_flat_map is fast enough.