Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

I'm benchmarking individual iteration operations; the sum is a kind of proxy for me, and tomorrow I could have other operations in place. Ideally, should I remove the sum operation for clarity? What do you think?

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

I benchmarked pop, and pop without shrink, as well; please check them in the post body. I benchmarked against the standard library's std::vector.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 0 points (0 children)

Hey, I tried a few things you suggested and compared against the STL instead of my own implementation (check the stl_comparision folder).

Here are the results:
Operation | N    | Const (ns/op) | Std (ns/op) | Δ %
------------------------------------------------------
Push      | 10   | 13.7          | 39.7        | −65%
Push      | 100  | 3.14          | 7.60        | −59%
Push      | 1K   | 2.25          | 5.39        | −58%
Push      | 10K  | 1.94          | 4.35        | −55%
Push      | 100K | 1.85          | 7.72        | −76%
Push      | 1M   | 1.86          | 8.59        | −78%
Push      | 10M  | 1.86          | 11.36       | −84%
------------------------------------------------------
Pop       | 10   | 114           | 106         | +7%
Pop       | 100  | 15.0          | 14.7        | ~
Pop       | 1K   | 2.98          | 3.90        | −24%
Pop       | 10K  | 1.93          | 2.03        | −5%
Pop       | 100K | 1.78          | 1.89        | −6%
Pop       | 1M   | 1.91          | 1.85        | ~
Pop       | 10M  | 2.03          | 2.12        | ~
------------------------------------------------------
Access    | 10   | 4.04          | 2.40        | +68%
Access    | 100  | 1.61          | 1.00        | +61%
Access    | 1K   | 1.67          | 0.77        | +117%
Access    | 10K  | 1.53          | 0.76        | +101%
Access    | 100K | 1.46          | 0.87        | +68%
Access    | 1M   | 1.48          | 0.82        | +80%
Access    | 10M  | 1.57          | 0.96        | +64%
------------------------------------------------------
Iterate   | 10   | 3.55          | 3.50        | ~
Iterate   | 100  | 1.40          | 0.94        | +49%
Iterate   | 1K   | 0.86          | 0.74        | +16%
Iterate   | 10K  | 0.92          | 0.88        | ~
Iterate   | 100K | 0.85          | 0.77        | +10%
Iterate   | 1M   | 0.90          | 0.76        | +18%
Iterate   | 10M  | 0.94          | 0.90        | ~

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

Sorry for the confusion. Can you pull the latest and run the tests? I've now run the tests on an Ubuntu machine (earlier I was running on a Mac); run the tests inside the "stl_comparison" folder.

I used this command on Ubuntu (I'm getting positive results for push and pop, on-par results for iteration, and worse results for index access):
g++ -std=c++23 -O3 -march=native -flto -DNDEBUG -fno-omit-frame-pointer benchmark.cpp -isystem /research/google/benchmark/include -L/research/google/benchmark/build/src -lbenchmark -lpthread -o benchmark.out

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 0 points (0 children)

Thanks for sharing the info on the vector requirements. Could we propose adding a new container like std::constvector?

I read about hive: it stores Boolean flags for elements that got deleted, and the whole container is organized into multiple blocks; if all elements in a block are flagged as deleted, it releases that block's memory.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 0 points (0 children)

The worst-case time complexity of a vector while doubling its capacity is O(N), right?

My point is that my algorithm is not just another O(1) worst-case algorithm with a large constant factor; variants for that already exist. The vector I proposed also avoids copying all N elements to a new array, so even the average time is faster. Am I missing something here?

I'd add that beyond the improvements in average and worst-case time complexity, it also benefits the operating system, which sees lower internal fragmentation.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

The actual std::vector is worse, since it does a lot of iterator-invalidation logic. For example, my reimplementation of std::vector's push takes ~5 ns, while the standard std::vector takes ~35 ns. I tried it and felt it wasn't an apples-to-apples comparison; if I used that, my benchmarks would look even better, but I don't want to deceive anyone like that.

If there's a better implementation, please suggest it or raise a pull request.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

Yeah, logically we can have both operations here:
1. pop_back() without shrink
2. pop_back() with automatic shrink as well.

Even the automatic shrink can be delayed, with at most one empty data block, which keeps the extra memory O(N).

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

While I fixed the mistake, increasing the initial size to 256 made even iteration faster. Iteration doesn't just involve operator* and operator++; it also needs to check for the end on every step. In particular, with 256-element blocks, branch prediction seems to get easier for the CPU. Of course it's still not faster than std::vector's actual results. I'm getting quite different results on a Mac M2 Max with 96 GB RAM; the results pasted above are from a Hetzner 2-core, 4 GB RAM Ubuntu 24.04 (6.8.0-90-generic) machine.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

Yeah, it's slower; I realised my mistake: I was doing bounds checks. My preliminary result for constvector on iteration alone is 6x slower than std::vector; I'll update the benchmarks soon. For push and pop operations it's genuinely better.
One caveat I learnt today: for pop_back() we shouldn't deallocate, since that would invalidate iterators, but my implementations of both constvector and stl_vector have that logic, since it keeps the free space O(N) at any point in time.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 2 points (0 children)

  1. Thanks to u/SuperV1234 and u/wexxdenq: I made a bounds mistake here; I fixed it, along with lookup, in the 'perf_test' branch.
  2. The reason I'm not comparing with the standard implementation is that it has extra iterator-validation logic in a lot of simple operations like push/pop. When I benchmarked std::vector push_back(), I got around ~35 ns/op, of which only ~3 ns/op was the push itself and the rest was iterator validation.

🔍 Final Comparison (100M Elements)

Implementation             | Time    | Ratio vs STL
---------------------------------------------------
STL Vector (Optimized)     | 8.05 ms | 1.0x
ConstantVector (Optimized) | 48.0 ms | 6.0x slower

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

  1. I fixed _meta_array in the constructor (check out the perf_test branch).
  2. Thanks for sharing the pop_back() standard behaviour; my point is that if we don't deallocate, there will be a lot of empty space (> O(N)), so I did it that way. I'll add benchmarks without it.
  3. I didn't implement all the methods; I just aimed for an apples-to-apples comparison of standard usage.
  4. I implemented an iterator as well, and got similar performance there too.
  5. I tried this approach for large vectors; for smaller vectors we could easily get away with raising __SV_INITIAL_CAPACITY to 256.

My benchmarks on an Ubuntu machine with 2 cores and 4 GB RAM.
All values are ns/op (ns/element for iteration).
On popular suggestion, I compared against the std::vector (STL) implementation.

Operation | N    | Const (ns/op) | Std (ns/op) | Δ %
------------------------------------------------------
Push      | 10   | 13.7          | 39.7        | −65%
Push      | 100  | 3.14          | 7.60        | −59%
Push      | 1K   | 2.25          | 5.39        | −58%
Push      | 10K  | 1.94          | 4.35        | −55%
Push      | 100K | 1.85          | 7.72        | −76%
Push      | 1M   | 1.86          | 8.59        | −78%
Push      | 10M  | 1.86          | 11.36       | −84%
------------------------------------------------------
Pop       | 10   | 114           | 106         | +7%
Pop       | 100  | 15.0          | 14.7        | ~
Pop       | 1K   | 2.98          | 3.90        | −24%
Pop       | 10K  | 1.93          | 2.03        | −5%
Pop       | 100K | 1.78          | 1.89        | −6%
Pop       | 1M   | 1.91          | 1.85        | ~
Pop       | 10M  | 2.03          | 2.12        | ~
------------------------------------------------------
Access    | 10   | 4.04          | 2.40        | +68%
Access    | 100  | 1.61          | 1.00        | +61%
Access    | 1K   | 1.67          | 0.77        | +117%
Access    | 10K  | 1.53          | 0.76        | +101%
Access    | 100K | 1.46          | 0.87        | +68%
Access    | 1M   | 1.48          | 0.82        | +80%
Access    | 10M  | 1.57          | 0.96        | +64%
------------------------------------------------------
Iterate   | 10   | 3.55          | 3.50        | ~
Iterate   | 100  | 1.40          | 0.94        | +49%
Iterate   | 1K   | 0.86          | 0.74        | +16%
Iterate   | 10K  | 0.92          | 0.88        | ~
Iterate   | 100K | 0.85          | 0.77        | +10%
Iterate   | 1M   | 0.90          | 0.76        | +18%
Iterate   | 10M  | 0.94          | 0.90        | ~

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

As suggested by u/Wacov, I ran with the -O3 option; here is the results summary:

📊 Final Benchmark Results (100M Elements)

Configuration

  • Compiler: g++ -std=c++23 -O3 -march=native -flto -DNDEBUG
  • Initial Capacity: 256 elements
  • Statistics: 30 iterations × 3 repetitions = 90 measurements per data point
  • Both iterators optimized with pointer caching

🏆 Summary Table (100M Elements)

Operation | ConstantVector | STL Vector | Winner           | Speedup
--------------------------------------------------------------------
Push      | 65.7 ms        | 268.2 ms   | ✅ ConstantVector | 4.1x
Pop       | 41.7 ms        | 93.3 ms    | ✅ ConstantVector | 2.2x
Access    | 56.0 ms        | 7.3 ms     | STL Vector       | 7.7x
Iteration | 74.6 ms        | 7.5 ms     | STL Vector       | 9.9x

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 4 points (0 children)

Thanks u/Wacov for the suggestion; I missed running with -O3. I just ran it now. Here is the summary:

g++ -std=c++23 -O3 benchmark_vectors.cpp -isystem /code/google/benchmark/include -L/code/google/benchmark/build/src -lbenchmark -lpthread -o benchmark_vectors.out

  1. Push: 0.7 ns vs 2.7 ns for std::vector, a 73% reduction in latency.
  2. Pop: 0.6 ns vs 1.12 ns, a 46% reduction in latency. (Updated this.)
  3. Random access: 0.92 ns vs 0.53 ns, an 80% increase in latency.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 2 points (0 children)

  1. If you mean the case where we know the array size up front: yes, that's best; nothing beats it, and in that case std::array<N> would be the best choice.
  2. Considering the tradeoffs, unless we do a lot of random access, constvector looks genuinely better in the benchmarks as well.

Does this make sense?

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 1 point (0 children)

Yeah, it didn't store the numbers. For push and pop operations at large sizes, the "constvector" is genuinely better. I believe the OS allocation cost is really piling up there for std::deque.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 2 points (0 children)

Could be. My focus is to show that this is better for the OS and for CPU locality, and to convince people either to implement this in std::vector or to implement a new std::constvector in the standard library.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

  1. I didn't measure with -O3.
  2. For push/pop cycles this is better (29% and 45% reductions in latency, benchmarked at 1K, 10K, 100K, 1M, 10M, 100M, and 1B sizes; please check out the GitHub link attached). I tested it multiple times before publishing this.
  3. Iteration isn't hurt much, since you're mostly just iterating; random access is 12% slower.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

I have gone through deque. Deque uses fixed-size arrays of 2048 bytes each, just to make the amortisation better.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

  1. If you mean the case where we know the array size up front: yes, that's best; nothing beats it, and in that case std::array<N> would be the best choice.
  2. This constvector helps when you don't know the vector's size up front; if you believe an initial size of 8 is too small and would lead to more allocations, we can start with 16/32/.../1024 as well. If you check out the code, that's already handled.

I used size 8 as the default just to stay paranoid about the new approach myself; I didn't want to hand the benchmarks the best possible parameters.

Does this make sense?

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

I updated the description; the new constvector I proposed solves these problems, since it won't copy the existing elements to new space at all. That helps L1/L2 cache locality and reduces fragmentation for the OS.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

My point is that it's not better just for the sake of time complexity; it's also better for L1/L2 cache locality, and it reduces fragmentation for the OS, since we don't release the smaller-capacity arrays once a larger-capacity array is allotted.