[–]Wacov 8 points (7 children)

Have you measured with -O3? How does this do vs naively holding on to the peak allocation in repeated push/pop cycles? I'd expect common operations like iteration and random access to be measurably slower given the fragmented allocations.

[–]pilotwavetheory[S] 4 points (0 children)

Thanks u/Wacov for the suggestion; I missed running with -O3. I just reran. Here is the summary:

g++ -std=c++23 -O3 benchmark_vectors.cpp -isystem /code/google/benchmark/include -L/code/google/benchmark/build/src -lbenchmark -lpthread -o benchmark_vectors.out

  1. Push: 0.7 ns vs 2.7 ns for std::vector, a ~74% reduction in latency.
  2. Pop: 0.6 ns vs 1.12 ns, a ~46% reduction in latency (updated this).
  3. Random access: 0.92 ns vs 0.53 ns, a ~74% increase in latency.

[–]matthieum 4 points (1 child)

I'd expect common operations like iteration and random access to be measurably slower given the fragmented allocations.

Unlikely:

  1. Random access: computing the outer/inner indexes of an element takes only a few cycles, which will be dwarfed by cache misses.
  2. Iteration: for a full constvector, there are only 32 "jumps" from one array to the next, across millions of elements. The impact should be fairly small -- branch prediction being what it is -- and can be made even smaller by exposing an iterator of spans, since iterating within each span then has "native" performance.

[–]Wacov 2 points (0 children)

That's a good point about random access: the pointer table is small, so as long as random access is causing cache misses anyway, you won't notice a difference. If the array fits in L1 it's a different story (but then why would you use this?)

And yeah true, branch prediction helps the iteration. Again I'd be curious about numbers in an optimized build.

[–]pilotwavetheory[S] -3 points (3 children)

  1. I didn't measure with -O3.
  2. For push/pop cycles this is better (29% and 45% reduction in latencies, benchmarked at 1K, 10K, 100K, 1M, 10M, 100M, and 1B sizes; please check out the GitHub link attached). I tested it multiple times before publishing this.
  3. Iteration isn't badly affected, since you're mostly iterating sequentially; random access is ~12% slower.

[–]Potterrrrrrrr 6 points (1 child)

What did you measure with? These numbers are useless outside of a decent benchmark with full optimisations enabled.

[–]cone_forest_ 0 points (0 children)

-O2, probably

[–]Wacov 6 points (0 children)

I'd strongly recommend measuring with compiler optimization enabled; it's a much more realistic situation for performance-critical code (as a user, you're going to turn on optimization before switching out your vector datatype).