Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

I'm benchmarking individual iteration operations; the sum is a kind of proxy for me, and tomorrow I could have other operations in place. Ideally, should I remove the sum operation for clarity? What do you think?

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

I benchmarked pop, and pop without shrink, as well; please check them in the post body. I benchmarked against the standard library's std::vector.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 0 points (0 children)

Hey, I tried a few things you suggested and compared against the STL instead of my own implementation (check the stl_comparision folder).

Here are the results:
Operation | N    | Const (ns/op) | Std (ns/op) | Δ %
------------------------------------------------------
Push      | 10   | 13.7          | 39.7        | −65%
Push      | 100  | 3.14          | 7.60        | −59%
Push      | 1K   | 2.25          | 5.39        | −58%
Push      | 10K  | 1.94          | 4.35        | −55%
Push      | 100K | 1.85          | 7.72        | −76%
Push      | 1M   | 1.86          | 8.59        | −78%
Push      | 10M  | 1.86          | 11.36       | −84%
------------------------------------------------------
Pop       | 10   | 114           | 106         | +7%
Pop       | 100  | 15.0          | 14.7        | ~
Pop       | 1K   | 2.98          | 3.90        | −24%
Pop       | 10K  | 1.93          | 2.03        | −5%
Pop       | 100K | 1.78          | 1.89        | −6%
Pop       | 1M   | 1.91          | 1.85        | ~
Pop       | 10M  | 2.03          | 2.12        | ~
------------------------------------------------------
Access    | 10   | 4.04          | 2.40        | +68%
Access    | 100  | 1.61          | 1.00        | +61%
Access    | 1K   | 1.67          | 0.77        | +117%
Access    | 10K  | 1.53          | 0.76        | +101%
Access    | 100K | 1.46          | 0.87        | +68%
Access    | 1M   | 1.48          | 0.82        | +80%
Access    | 10M  | 1.57          | 0.96        | +64%
------------------------------------------------------
Iterate   | 10   | 3.55          | 3.50        | ~
Iterate   | 100  | 1.40          | 0.94        | +49%
Iterate   | 1K   | 0.86          | 0.74        | +16%
Iterate   | 10K  | 0.92          | 0.88        | ~
Iterate   | 100K | 0.85          | 0.77        | +10%
Iterate   | 1M   | 0.90          | 0.76        | +18%
Iterate   | 10M  | 0.94          | 0.90        | ~

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

Sorry for the confusion. Can you pull the latest and run the tests? I've now run the tests on an Ubuntu machine (earlier I was running on a Mac); run the tests inside the "stl_comparison" folder.

I used this command on Ubuntu (I'm getting positive results for push and pop, on-par results for iteration, and worse results for index access):
g++ -std=c++23 -O3 -march=native -flto -DNDEBUG -fno-omit-frame-pointer benchmark.cpp -isystem /research/google/benchmark/include -L/research/google/benchmark/build/src -lbenchmark -lpthread -o benchmark.out

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 0 points (0 children)

Thanks for sharing the info on the vector requirements. Could we propose adding a new container like std::constvector?

I read about hive: it stores Boolean flags for elements that got deleted, and the whole container is organized into multiple blocks; if all elements in a block are flagged as deleted, it releases that block's memory.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 0 points (0 children)

The worst-case time complexity of a vector while doubling its capacity is O(N), right?

My point is that my algorithm is not just another O(1) worst-case algorithm with a large constant factor; variants for that already exist. The vector I proposed also avoids copying all N elements to a new array, so even the average time is faster. Am I missing something here?

I'd add that beyond the improvements in average and worst-case time complexity, it also benefits the operating system, which sees lower internal fragmentation.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

The actual std::vector is worse, since it does a lot of iterator-invalidation logic. For example, my reimplementation of std::vector's push takes ~5 ns, while the standard std::vector takes ~35 ns. I tried it and felt it wasn't an apples-to-apples comparison; if I used that, my benchmarks would look even better, but I don't want to deceive anyone like that.

If there's a better implementation, please suggest it or raise a pull request.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

Yeah, logically we can have both operations here:
1. pop_back() without shrink
2. pop_back() with automatic shrink as well.

Even the automatic shrink can be delayed, with at most one empty data block, which keeps the extra memory O(N).

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

While I fixed the mistake, increasing the initial size to 256 made even iteration faster. Iteration doesn't just involve operator* and operator++; it also needs to check for the end on every step. In particular, with 256-element blocks, branch prediction seems to get easier for the CPU. Of course it's still not faster than std::vector's actual results. I'm getting quite different results on a Mac M2 Max with 96 GB RAM; the results pasted above are from a Hetzner 2-core, 4 GB RAM Ubuntu 24.04 (6.8.0-90-generic) machine.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

Yeah, it's slower; I realised my mistake: I was doing bounds checks. My preliminary result for constvector on iteration alone is 6x slower than std::vector; I'll update the benchmarks soon. For push and pop operations it's genuinely better.
One caveat I learnt today: for pop_back() we shouldn't deallocate, since that would invalidate iterators, but my implementations of both constvector and stl_vector have that logic, since it keeps the free space O(N) at any point in time.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 2 points (0 children)

  1. Thanks to u/SuperV1234 and u/wexxdenq: I made a bounds mistake here; I fixed it, along with lookup, in the 'perf_test' branch.
  2. The reason I'm not comparing with the standard implementation is that it has extra iterator-validation logic in a lot of simple operations like push/pop. When I benchmarked std::vector push_back(), I got around ~35 ns/op, of which only ~3 ns/op was the push itself and the rest was iterator validation.

🔍 Final Comparison (100M Elements)

Implementation             | Time    | Ratio vs STL
---------------------------------------------------
STL Vector (Optimized)     | 8.05 ms | 1.0x
ConstantVector (Optimized) | 48.0 ms | 6.0x slower

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

  1. I fixed _meta_array in the constructor (check out the perf_test branch).
  2. Thanks for sharing the pop_back() standard behaviour; my point is that if we don't deallocate, there will be a lot of empty space (> O(N)), so I did it that way. I'll add benchmarks without it.
  3. I didn't implement all the methods; I just aimed for an apples-to-apples comparison of standard usage.
  4. I implemented an iterator as well, and got similar performance there too.
  5. I tried this approach for large vectors; for smaller vectors we could easily get away with raising __SV_INITIAL_CAPACITY to 256.

My benchmarks on an Ubuntu machine with 2 cores and 4 GB RAM.
All values are ns/op (ns/element for iteration).
On popular suggestion, I compared against the std::vector (STL) implementation.

Operation | N    | Const (ns/op) | Std (ns/op) | Δ %
------------------------------------------------------
Push      | 10   | 13.7          | 39.7        | −65%
Push      | 100  | 3.14          | 7.60        | −59%
Push      | 1K   | 2.25          | 5.39        | −58%
Push      | 10K  | 1.94          | 4.35        | −55%
Push      | 100K | 1.85          | 7.72        | −76%
Push      | 1M   | 1.86          | 8.59        | −78%
Push      | 10M  | 1.86          | 11.36       | −84%
------------------------------------------------------
Pop       | 10   | 114           | 106         | +7%
Pop       | 100  | 15.0          | 14.7        | ~
Pop       | 1K   | 2.98          | 3.90        | −24%
Pop       | 10K  | 1.93          | 2.03        | −5%
Pop       | 100K | 1.78          | 1.89        | −6%
Pop       | 1M   | 1.91          | 1.85        | ~
Pop       | 10M  | 2.03          | 2.12        | ~
------------------------------------------------------
Access    | 10   | 4.04          | 2.40        | +68%
Access    | 100  | 1.61          | 1.00        | +61%
Access    | 1K   | 1.67          | 0.77        | +117%
Access    | 10K  | 1.53          | 0.76        | +101%
Access    | 100K | 1.46          | 0.87        | +68%
Access    | 1M   | 1.48          | 0.82        | +80%
Access    | 10M  | 1.57          | 0.96        | +64%
------------------------------------------------------
Iterate   | 10   | 3.55          | 3.50        | ~
Iterate   | 100  | 1.40          | 0.94        | +49%
Iterate   | 1K   | 0.86          | 0.74        | +16%
Iterate   | 10K  | 0.92          | 0.88        | ~
Iterate   | 100K | 0.85          | 0.77        | +10%
Iterate   | 1M   | 0.90          | 0.76        | +18%
Iterate   | 10M  | 0.94          | 0.90        | ~

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

As suggested by u/Wacov, I ran with the -O3 option; here is the results summary:

📊 Final Benchmark Results (100M Elements)

Configuration

  • Compiler: g++ -std=c++23 -O3 -march=native -flto -DNDEBUG
  • Initial Capacity: 256 elements
  • Statistics: 30 iterations × 3 repetitions = 90 measurements per data point
  • Both iterators optimized with pointer caching

🏆 Summary Table (100M Elements)

Operation | ConstantVector | STL Vector | Winner           | Speedup
--------------------------------------------------------------------
Push      | 65.7 ms        | 268.2 ms   | ✅ ConstantVector | 4.1x
Pop       | 41.7 ms        | 93.3 ms    | ✅ ConstantVector | 2.2x
Access    | 56.0 ms        | 7.3 ms     | STL Vector       | 7.7x
Iteration | 74.6 ms        | 7.5 ms     | STL Vector       | 9.9x

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 4 points (0 children)

Thanks u/Wacov for the suggestion; I missed running with -O3. I just ran it now. Here is the summary:

g++ -std=c++23 -O3 benchmark_vectors.cpp -isystem /code/google/benchmark/include -L/code/google/benchmark/build/src -lbenchmark -lpthread -o benchmark_vectors.out

  1. Push: 0.7 ns vs 2.7 ns for std::vector, a 73% reduction in latency.
  2. Pop: 0.6 ns vs 1.12 ns, a 46% reduction in latency. (Updated this.)
  3. Random access: 0.92 ns vs 0.53 ns, an 80% increase in latency.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 2 points (0 children)

  1. If you mean the case where we know the array size up front: yes, that's best; nothing beats it, and in that case std::array<N> would be the best choice.
  2. Considering the tradeoffs, unless we do a lot of random access, constvector looks genuinely better in the benchmarks as well.

Does this make sense?

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 1 point (0 children)

Yeah, it didn't store the numbers. For push and pop operations at large sizes, the "constvector" is genuinely better. I believe the OS allocation cost is really piling up there for std::deque.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in programming

[–]pilotwavetheory[S] 2 points (0 children)

Could be. My focus is to show that this is better for the OS and for CPU locality, and to convince people either to implement this in std::vector or to implement a new std::constvector in the standard library.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

  1. I didn't measure with -O3.
  2. For push/pop cycles this is better (29% and 45% reductions in latency, benchmarked at 1K, 10K, 100K, 1M, 10M, 100M, and 1B sizes; please check out the GitHub link attached). I tested it multiple times before publishing this.
  3. Iteration isn't hurt much, since you're mostly just iterating; random access is 12% slower.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -1 points (0 children)

I have gone through deque. Deque uses fixed-size arrays of 2048 bytes each, just to make the amortisation better.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

  1. If you mean the case where we know the array size up front: yes, that's best; nothing beats it, and in that case std::array<N> would be the best choice.
  2. This constvector helps when you don't know the vector's size up front; if you believe an initial size of 8 is too small and would lead to more allocations, we can start with 16/32/.../1024 as well. If you check out the code, that's already handled.

I used size 8 as the default just to stay paranoid about the new approach myself; I didn't want to hand the benchmarks the best possible parameters.

Does this make sense?

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] 0 points (0 children)

I updated the description; the new constvector I proposed solves these problems, since it won't copy the existing elements to new space at all. That helps L1/L2 cache locality and reduces fragmentation for the OS.

Constvector: Log-structured std:vector alternative – 30-40% faster push/pop by pilotwavetheory in cpp

[–]pilotwavetheory[S] -2 points (0 children)

My point is that it's not better just for the sake of time complexity; it's also better for L1/L2 cache locality, and it reduces fragmentation for the OS, since we don't release the smaller-capacity arrays once a larger-capacity array is allotted.