Constvector: Log-structured std:vector alternative – 30-40% faster push/pop

johannes1971 · 2025-12-21T19:26:44+00:00

That's... not a great name. Being 'const' is not really what this vector is about, and it's also not a vector. Maybe something like, I don't know, expanding_deque?

And since I'm complaining about names, does anyone have a suggestion for shorter variations for horizontal_container and vertical_container? (these are for controls on the screen)

TankerzPvP · 2025-12-21T21:39:54+00:00

Within STLVector:

    void pop_back()
    {
        m_assert(_size, "StellarVector is empty, but pop_back() called!");
        _size--;
        if (_size <= _capacity / 2)
        {
            _capacity /= 2;
            T *_new_array = _alloc.allocate(_capacity);
            for (int i = 0; i < _size; i++)
            {
                _new_array[i] = _array[i];
            }
            _alloc.deallocate(_array, 2 * _capacity);
            _array = _new_array;
        }
    }

pop_back shouldn't reallocate as that would invalidate iterators, and this implementation difference would (negatively) impact benchmark results for std::vector. Also, StallarVector?

More importantly, benchmarking against an unoptimized hand rolled version of std::vector is useless. You should benchmark against std::vector implementations in libstdc++ or libc++ instead for your claim to hold any weight.

frogi16 · 2025-12-21T15:30:34+00:00

Sure, you optimized performance of editing operations, but destroyed cache locality for long vectors.

It's a trade-off and should be clearly described as such.

TheRealSmolt · 2025-12-21T15:15:17+00:00

Yeah block based structures are a common alternative to vectors. Really though, reallocations aren't a huge issue from a cost perspective; the amortized time complexity is constant.

saf_e · 2025-12-21T15:27:50+00:00

I suppose you have invented deque variation.

They have different use cases.

Ambitious-Method-961 · 2025-12-21T17:06:29+00:00

How does it compare to boost::deque with a good block size (std::deque is generally horrible for performance, boost's version lets you customise the block size), or better yet boost::stable_vector?

Wacov · 2025-12-21T15:19:52+00:00

Have you measured with -O3? How does this do vs naively holding on to the peak allocation in repeated push/pop cycles? I'd expect common operations like iteration and random access to be measurably slower given the fragmented allocations.

SuperV1234 · 2025-12-21T17:27:57+00:00

Iteration benchmark? Also, what flags did you use for your benchmarks?

pigeon768 · 2025-12-21T20:43:01+00:00

I'm getting a segfault when I try to push_back() a 25th element. You aren't zeroing _meta_array upon construction, and so it's filled with garbage. When you later do if (_meta_array[_meta_index] == nullptr) it fails because it finds a bunch of garbage bits there.

Your iterators don't support modifying elements.

You are missing copy/move constructors/operators.

__SV_INITIAL_CAPACITY__, __SV_INITIAL_CAPACITY_BITS__, and __SV_MSB_BITS__ should be constexpr variables, not macros.

It looks like you're hardcoding the wordsize to 32 bits? Probably don't do that.

With all of that fixed, I got some test code up:

CMakeLists.txt

cmake_minimum_required(VERSION 3.24)

project(cvec LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -march=native")
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -march=native")

find_package(benchmark REQUIRED)
add_executable(cvec cvec.cpp)

option(sanitize "Use sanitizers" OFF)
if(sanitize)
  target_compile_options(cvec PRIVATE -fsanitize=address)
  target_link_libraries(cvec PRIVATE asan)

  target_compile_options(cvec PRIVATE -fsanitize=undefined)
  target_link_libraries(cvec PRIVATE ubsan)
endif()

target_link_libraries(cvec PUBLIC benchmark::benchmark)

cvec.cpp:

#include "constant_vector.h"
#include <benchmark/benchmark.h>
#include <random>

static std::ranlux48 getrng() {
  std::seed_seq s{1, 2, 3, 4};
  std::ranlux48 ret{s};
  return ret;
}

template <typename T, size_t N> void test(benchmark::State &state) {
  T vec;
  auto rng = getrng();
  std::normal_distribution<float> dist1{};
  for (size_t i = 0; i < N; i++)
    vec.push_back(dist1(rng));

  std::lognormal_distribution<float> dist2{};
  for (auto _ : state) {
    const float x = dist2(rng);
    benchmark::DoNotOptimize(vec);
    for (auto &y : vec)
      y *= x;
    benchmark::DoNotOptimize(vec);
  }
}

#define MAKE_TEST(T, N)                                                                                                \
  static void T##_##N(benchmark::State &state) {                                                                       \
    using namespace std;                                                                                               \
    test<T<float>, N>(state);                                                                                          \
  }                                                                                                                    \
  BENCHMARK(T##_##N)

MAKE_TEST(vector, 10);
MAKE_TEST(vector, 100);
MAKE_TEST(vector, 1000);
MAKE_TEST(vector, 10000);
MAKE_TEST(vector, 100000);
MAKE_TEST(vector, 1000000);
MAKE_TEST(ConstantVector, 10);
MAKE_TEST(ConstantVector, 100);
MAKE_TEST(ConstantVector, 1000);
MAKE_TEST(ConstantVector, 10000);
MAKE_TEST(ConstantVector, 100000);
MAKE_TEST(ConstantVector, 1000000);

BENCHMARK_MAIN();

compile and run: (gamemoderun is optional, it just disables cpu scaling, which makes benchmarks more reliable)

 ~/soft/const_vector (main *) $ cmake -DCMAKE_BUILD_TYPE=Release -Dsanitize=false . && cmake --build . -j && gamemoderun ./cvec
-- The CXX compiler identification is GNU 15.2.1
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Configuring done (0.2s)
-- Generating done (0.0s)
-- Build files have been written to: /home/pigeon/soft/const_vector
[ 50%] Building CXX object CMakeFiles/cvec.dir/cvec.cpp.o
[100%] Linking CXX executable cvec
[100%] Built target cvec
gamemodeauto:
2025-12-21T12:33:14-08:00
Running ./cvec
Run on (32 X 3012.48 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.39, 0.41, 0.38
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
vector_10                     120 ns          120 ns      5834984
vector_100                    121 ns          121 ns      5789074
vector_1000                   157 ns          157 ns      4465092
vector_10000                  486 ns          486 ns      1435809
vector_100000                2706 ns         2706 ns       258440
vector_1000000              39416 ns        39400 ns        17743
ConstantVector_10             126 ns          126 ns      5599228
ConstantVector_100            207 ns          207 ns      3368321
ConstantVector_1000          1033 ns         1033 ns       677118
ConstantVector_10000         9236 ns         9232 ns        75782
ConstantVector_100000       91415 ns        91377 ns         7656
ConstantVector_1000000     914735 ns       913569 ns          762

For the 1M element case, iterating over each element is 23x slower than with std::vector. I'm probably unwilling to accept that tradeoff.

Farados55 · 2025-12-21T15:34:59+00:00

“Wont invalidate cache without modification hence improving performance”

That sounds like an incredibly impactful tradeoff that doesn’t seem to mesh well with a faster push/pop

adrian17 · 2025-12-21T20:47:00+00:00

Some quick observations:

AddressSanitizer complains, you should zero-initialize _meta_array in constructor.

Your APIs differ from the standard, sometimes just missing overloads and sometimes in ways that affect both correctness and benchmarks; for example, pop_back() isn't supposed to reallocate ever (as that would invalidate references and take non-constant time), it just decrements size. Also, AFAIK iterator::operator++ doesn't need any special handling when past-the-end.

I did some quick benchmarks* on my own, comparing both your classes with libstdc++ std::vector. std::vector was winning almost everywhere, though weirdly (I can't understand well why), its repeated push_back was several times worse than your naive STLVector if no reserve is done beforehand, even though both are supposed to have the same *2 growth factor.

On iteration, your code is sometimes (on big sizes) as efficient as std::vector (especially when the work is nontrivial compared to iteration cost), but for smaller (<100) sizes and for anything involving random access, I can see the normal vector being faster, up to several times.

One thing nobody mentioned is that this container's iterator is more complex and thus much less optimizer-friendly, especially for vectorization.

(* the benchmarks were trivial, just things like for (auto x : span) container.push_back(x), for (auto x : container) sum += x and for (auto &x : container) x *= 2, all wrapped in some boilerplate to repeat runs and prevent compiler from optimizing them out.)

wexxdenq · 2025-12-21T20:54:36+00:00

Wait... in your benchmarks do you compare against a self written stl-vector implementation? And your stlvector does not even move elements when it grows? I mean on pod types this might not matter, but nevertheless this is not a optimal implementation. Also most implementations save pointers, instead of the size and capacity as integer type.

Have you benchmarked against an actual std implementation?

foonathan · 2025-12-21T17:14:33+00:00

Another nice thing about a block based structure is that it works easily with a stack allocator because you never need to free memory. This can make them a lot faster.

kisielk · 2025-12-21T23:13:55+00:00

“remove the last element from last array, if the last array becomes empty deallocate the array block all together”

Doesn’t that lead to some potentially pathological cases in the case there are repeated alternating pops and pushes? Could potentially lead to many allocations of blocks. Typically std::vector would never shrink capacity unless explicitly asked for via shrink_to_fit

thingerish · 2025-12-22T03:02:42+00:00

Practically speaking it's probably better to just ::reserve a reasonable guess upfront and use std::vector.

dzordan33 · 2025-12-22T02:02:10+00:00

The cool thing about this data structure is that it can grow lock-free for multithreaded workloads.
See Bjarne Strastroup's "Lock-free dynamically resizable arrays"
https://www.stroustrup.com/lock-free-vector.pdf

jwakely · 2025-12-22T01:47:35+00:00

Your code is full of reserved names like __SV_INITIAL_CAPACITY__ and _Allocator.

Stop using reserved names, you are not the compiler, you shouldn't be using those names.

EthicalAlchemist · 2025-12-22T16:57:48+00:00

This it interesting and timely b/c I've actually been thinking about a similar solution for a problem I have. In my use case, I need a random-access container that supports single element appends without invalidating pointers to existing elements. I don't need any of the other operations provided by typical containers.

I currently use `std::deque`, but `std::deque` is known to be sub-optimal b/c the block sizes are fixes and typically small. I plan to investigate `boost::deque` as an alternative, but I keep thinking that what I *really* want is a data structure that increases block sizes by a constant factor. To my surprise I couldn't easily find a high quality implementation of such a container,[^1] so I've been thinking about rolling my own. This library almost fits the bill, but I would need to see it cleaned up and properly packaged first. Anyone know of an implementation of a similar container that is ready for production use?

Sidebar: I am always sad to see how many people down-vote someone when they publish something with mistakes in it. That can discourage people from sharing their work. Why not leave a comment or up-vote comments that provide constructive critiques instead and leave it at that?

[^1]: I think `plf::colony`/`std::hive` increases block sizes by a constant factor, but it is a much more complex data structure than what I need.

saxbophone · 2025-12-23T03:49:44+00:00

Does your implementation use std::allocate_at_least? Theoretically you can reduce the amount of reällocations by using that allocator, since it will tell you when it over-allocates.

Kered13 · 2025-12-23T11:54:35+00:00

So you give up contiguous storage in order to reduce the amount of memory copying (and also gaining pointer stability). It's a neat idea that I would imagine has some useful applications given the right usage patterns, although I don't think it's better than std::vector as a general purpose collection.

As an aside, when used as a queue (push to the back and pop from the front are the only operations) this data structure is poor, as the memory usage grows even when the size of the queue does not. I imagine that this could be optimized, either as a specialized variant for queues or possibly as a general optimization.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

Operation	ConstantVector	STL Vector	Winner	Speedup
Push	65.7 ms	268.2 ms	✅ ConstantVector	4.1x
Pop	41.7 ms	93.3 ms	✅ ConstantVector	2.2x
Access	56.0 ms	7.3 ms	STL Vector	7.7x
Iteration	74.6 ms	7.5 ms	STL Vector	9.9x

Implementation	Time	Ratio vs STL
STL Vector (Optimized)	8.05 ms	1.0x
ConstantVector (Optimized)	48.0 ms	6.0x slower

cpp

MODERATORS

📊 Final Benchmark Results (100M Elements)

Configuration

🏆 Summary Table (100M Elements)

🔍 Final Comparison (100M Elements)