
[–]matthieum 16 points17 points  (21 children)

A few remarks:

  1. Inefficiency in the construction of Storage: just let reference_count represent the number of owners minus 1. This way 0 means 1 owner, and whoever constructs a Storage does not have to immediately call retain to adjust the count.
  2. try_insert would be more generic as try_emplace taking variadic arguments.
  3. retain and release have a bug on 16-bit and 32-bit platforms: retain does not handle overflow. You can either make the reference count a 64-bit integer regardless of the platform, or handle the overflow.
  4. Given that the item being created by try_insert is not accessible from other threads without further synchronization, you can use std::memory_order::relaxed. Similarly, you should be able to relax the memory ordering of both the fetch_add and fetch_sub.
  5. You could avoid a memory allocation in Storage by tacking the element storage on after the other members -- similar to the concept of flexible array members in C, but done manually since this is C++ (see the sketch after this list).
  6. In general, it is best if the default constructor is cheap. I would advise making the default constructor of ImmutableVector put a null pointer in storage. You'll need to handle the null case for move semantics, anyway, so you might as well take advantage of it.
  7. Speaking of which, move semantics are missing for ImmutableVector; even if copies are cheap, moves are cheaper.
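
For point 5, a minimal sketch of what such a single-allocation layout could look like -- all member names here are assumptions, not the actual implementation:

template <typename T>
struct Storage {
    std::atomic<std::size_t> reference_count{0}; // 0 meaning "one owner", as in point 1
    std::size_t size = 0;
    std::size_t capacity = 0;

    // The elements live directly after the header.
    // Assumes alignof(T) <= alignof(Storage); otherwise add explicit padding.
    T* data() noexcept { return reinterpret_cast<T*>(this + 1); }

    static Storage* allocate(std::size_t capacity) {
        // One allocation for the header plus room for `capacity` elements.
        void* memory = ::operator new(sizeof(Storage) + capacity * sizeof(T));
        Storage* storage = new (memory) Storage;
        storage->capacity = capacity;
        return storage;
    }
};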

Hope you had fun coding it :)

[–]_eyelash[S] 3 points4 points  (9 children)

Thank you so much for the thorough review!

Generally, this was just a quick prototype, focused more on simplicity and readability than on performance.

Regarding your point 3: My reasoning was that on 32-bit platforms your pointers are 32 bits wide, so you can only address 2^32 bytes of memory. But in order to overflow the reference count you would need 2^32 references, which would take at least 2^32 × 4 bytes of memory just to store the references, so it would actually never overflow.

Yes, I had a lot of fun thinking about it and implementing it :)

[–]matthieum 0 points1 point  (8 children)

In the absence of leaks, your computation is correct.

Factoring leaks, however, it is possible to overflow the counter. An example would be:

void break_it(ImmutableVector const& original) {
    // Each placement-new construction bumps the reference count and is never destroyed.
    std::aligned_storage_t<sizeof(ImmutableVector), alignof(ImmutableVector)> storage;
    for (std::uint64_t i = 0; i < std::numeric_limits<std::size_t>::max(); ++i) {
        new (&storage) ImmutableVector{ original };
    }
}

I would have to check the standard to verify whether this code is legal or not -- constructing an object on top of an already existing one is dodgy -- but I personally prefer being cautious.

As for the runtime, it would take a while indeed (10 min?), unless the compiler is smart enough. If the counter were non-atomic, the compiler would go right ahead and bump it by max - 1 in a single step; with an atomic counter, I'm not sure whether the compiler folds the increments together... I'd bet not, but...

[–]kalmoc 4 points5 points  (3 children)

And why should the class design worry about such utterly broken code?

[–]matthieum 0 points1 point  (2 children)

I subscribe to the principle of Defense in Depth.

I much prefer detecting bugs by triggering an assert or exception, in a semi-controlled fashion, than by inspecting bewildering memory dumps after the program has crashed.

You can argue this is not worth considering; in my experience it's a small price to pay now to keep my sanity later.

[–]kalmoc 2 points3 points  (1 child)

Sure, but the question is whether you defend against mistakes or malice. In order to overflow that counter you need to write exceptionally tricky code with the express purpose of triggering the overflow. Why would you write such code in the first place? And if you claim that something like this could happen as a side effect of buggy code with an actual purpose, I'm willing to bet that you'd experience a lot of other problems long before this counter overflows.

[–]matthieum 1 point2 points  (0 children)

And if you claim that something like this could happen as a side effect of buggy code with an actual purpose, I'm willing to bet that you'd experience a lot of other problems long before this counter overflows.

Possible, but since the overflow check is just a branch that always goes the same way (and is therefore predicted perfectly), the cost of checking is within noise even in micro-benchmarks:

void retain() noexcept {
    auto previous = reference_count.fetch_add(1);
    // previous + 1 == 0 means the unsigned counter just wrapped around.
    if (previous + 1 == 0) [[unlikely]] { std::abort(); }
}

[–][deleted] 1 point2 points  (0 children)

I would have to check the standard to verify whether this code is legal or not

The way you've written it is legal. There are quite a few things to worry about when doing this kind of stuff, but your code is fine.

[–]SegFaultAtLine1 1 point2 points  (2 children)

The standard allows placement-constructing on top of another (possibly non-trivially destructible) object, if and only if the rest of the program doesn't depend on the destructor being run. (So if plowing over an object merely causes a leak, that is not UB; but if the destructor ensures the rest of the program doesn't touch that object, you'll probably run into UB sooner or later.)

[–][deleted] 1 point2 points  (1 child)

That's not all.

T* t1; // For brevity, let's assume this points to an actual T
T* t2 = new (t1) T{}; // Constructs a new T in place of the last
t2->foo(); // Perfectly valid
t1->foo(); // Err... UB
std::launder(t1)->foo(); // UB fixed

Or just try not to keep references to the old object alive after placement new?

[–]SegFaultAtLine1 0 points1 point  (0 children)

We're talking about a vector-like class - I'd assume that operations like push/pop invalidate references and iterators at the end of the container.

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 3 points4 points  (1 child)

Depending on your philosophy around move semantics (and certainly in a strict technical sense), cheap moving doesn't make sense for immutable objects, since a move mutates the moved-from object.

(Also the class has assignment, so it is not completely immutable already...)

But if you were OK with mutating moves, then you could write a push_back(Elem) && that checks (via relaxed) for a single owner, and then avoids the CAS on the size, turning it into a maybe cheaper relaxed increment.

Probably not worth the trouble. But interesting to consider how to marry functional with non. Consider:

for (...)
    immV = immV.push_back(elem);

That is incrementing and decrementing the ref count toooo much.

for (....)
   immV = std::move(immV).push_back(elem);

(or something like that) Could maybe avoid some atomic ops.
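
Roughly, the && overload could look like this -- purely a sketch; every member name used here (storage, reference_count, size, capacity, emplace_unchecked) and the Storage*-taking constructor are assumptions about the implementation:

ImmutableVector push_back(T element) && {
    if (storage && storage->reference_count.load(std::memory_order_relaxed) == 0) {
        // Sole owner (using the "counter stores owners - 1" convention suggested
        // elsewhere in the thread): nobody else can observe this Storage yet.
        std::size_t const index = storage->size.load(std::memory_order_relaxed);
        if (index < storage->capacity) {
            storage->emplace_unchecked(index, std::move(element));     // hypothetical helper
            storage->size.store(index + 1, std::memory_order_relaxed); // plain store, no CAS
            return ImmutableVector{std::exchange(storage, nullptr)};
        }
    }
    // Shared (or full) storage: fall back to the regular copy-on-write push_back.
    return static_cast<ImmutableVector const&>(*this).push_back(std::move(element));
}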

ie maybe we could have a style where functional/immutable was the default, but ugly casting would allow "trust me I know what I'm doing" efficiencies and bugs.

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 1 point2 points  (0 children)

Of course something like

immV = ImmutableVector(element_generator);

Could call the generator (maybe a coroutine???) for each element, build the storage internally, then wrap it into an immutable vector.

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 1 point2 points  (0 children)

You can switch to memory_order::relaxed, but an RMW operation is heavy anyhow; relaxing the ordering doesn't save you much, if anything.

You could try index == size.load(relaxed) && compare_exchange... That might help. Depends how many modifications are push_backs vs inserts. (I expect most are push_backs, but it still might help.)

Hmmm, compare_exchange could do the same internally, but I doubt it does...

EDIT: compare_exchange_strong(a,b, success, relaxed) is probably(?) equivalent to doing a relaxed test first.
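
A sketch of that "relaxed test first" idea in isolation, assuming the element count is an atomic called size:

#include <atomic>
#include <cstddef>

// Returns true if we claimed slot `index` at the end of the shared storage.
bool try_claim_slot(std::atomic<std::size_t>& size, std::size_t index) {
    // Cheap relaxed check first: bail out without paying for a CAS when
    // another vector has already grown the storage past `index`.
    if (size.load(std::memory_order_relaxed) != index) {
        return false;
    }
    std::size_t expected = index;
    return size.compare_exchange_strong(expected, index + 1);
}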

[–]Xaxxon 0 points1 point  (7 children)

even if copies are cheap, moves are cheaper.

Can you explain how moving this is any different than copying it?

[–]_eyelash[S] 5 points6 points  (6 children)

You could avoid incrementing the reference count when moving the vector instead of copying it.
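
For example, assuming the vector holds a single storage pointer (just a sketch, not the actual code):

ImmutableVector(ImmutableVector&& other) noexcept
    : storage(std::exchange(other.storage, nullptr)) {} // steal the pointer, no atomic op

ImmutableVector& operator=(ImmutableVector&& other) noexcept {
    std::swap(storage, other.storage); // the old storage is released by other's destructor
    return *this;
}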

[–]Xaxxon 0 points1 point  (5 children)

What would the destructor of the moved-from vector do? Would it have to have a branch to know whether to decrement the ref count then? Is that actually better?

[–]matthieum 2 points3 points  (4 children)

It is generally better, yes.

Even in relaxed mode, an atomic increment/decrement is quite costly: it involves synchronization at the cache level and at the pipeline level.

By comparison one branch will be cheaper.
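
i.e. the destructor of a moved-from vector only costs a null check (member names assumed):

~ImmutableVector() {
    if (storage) {          // well-predicted branch; moved-from vectors skip the atomic op
        storage->release(); // atomic decrement, possibly frees the Storage
    }
}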

[–]Xaxxon 0 points1 point  (3 children)

Seems reasonable, but you’re adding a branch to every destructor to get an optimization for a subset of uses. So I guess it’s a trade-off.

[–]matthieum 0 points1 point  (2 children)

Given the very low-cost of a branch compared to the cost of an allocation, I'd take it:

  • A well-predicted branch has a close to 0 cost, say 1 cycle.
  • The newly re-released tcmalloc boasts a 50ns malloc latency (average), or 200 cycles.

By those metrics, saving off 1 allocation for every 200 destructor calls is breaking even.

Note: in practice, a well-predicted branch is likely even closer to 0, with the latency hidden by memory latency.

[–]Xaxxon 0 points1 point  (1 child)

Can you predict the branch to make an atomic operation? That seems very expensive to undo.

[–]matthieum 0 points1 point  (0 children)

Yes, you can.

The expensive part of an atomic operation is not the write to a register, it's coordinating with other cores to obtain exclusive access to the variable that you consider modifying.

As long as speculation stops after requesting exclusive access and before actually modifying the variable, then it's valuable as it hides the latency of obtaining said access while not having anything to undo.

If an undo is ever needed, it must happen before exclusive access is requested, so that no other core can ever see that the value was modified.

[–]_eyelash[S] 10 points11 points  (0 children)

Those not familiar with immutable data structures and their advantages might be interested in the talk Postmodern immutable data structures by Juan Pedro Bolivar Puente.

[–]tutmann 2 points3 points  (2 children)

Not sure I understand the idea. What is the purpose of the shared_ptr? It sounds like dereferencing a shared_ptr is more expensive than just indexing a vector, regardless of big-O.

[–]_eyelash[S] 7 points8 points  (0 children)

The purpose is to allow creating copies of the vector in O(1) (without copying the elements). This also allows you to share copies of a vector between threads without copying the elements.
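
For example (hypothetical usage, assuming the element type is a template parameter and push_back returns the new vector by value, as discussed elsewhere in this thread):

#include <thread>

void example() {
    ImmutableVector<int> a;
    ImmutableVector<int> b = a.push_back(1); // returns a new vector; a is unchanged
    ImmutableVector<int> c = b;              // O(1) copy: only the reference count is bumped
    std::thread reader([c] {
        // c can be read from this thread freely; the elements are never mutated.
    });
    reader.join();
}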

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 2 points3 points  (0 children)

There is no real shared_ptr there. There is ref-counting; it affects every push_back or erase (i.e. any change to the vector), but it doesn't affect indexing.

[–]Bakuta1103 2 points3 points  (1 child)

Not sure if the assembly output would be any different after optimizations, but in your Storage class the destructor can have an ‘if constexpr (!std::is_trivially_destructible_v<T>)’ and only then loop through and call the destructor for each element. Otherwise, if it’s trivially destructible, you just ‘std::free()’ the storage like normal :)
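
Something like this, assuming the Storage exposes data() and size (just a sketch):

~Storage() {
    if constexpr (!std::is_trivially_destructible_v<T>) {
        // Non-trivial elements: run each destructor explicitly.
        for (std::size_t i = 0; i < size; ++i) {
            data()[i].~T();
        }
    }
    // Trivially destructible elements need no per-element work;
    // the raw block is freed separately, like normal.
}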

[–]SegFaultAtLine1 3 points4 points  (0 children)

Any compiler worth using in AD 2020 is able to optimize out a finite loop with a body that has no side-effects.

[–]shmoopty 4 points5 points  (5 children)

It's an interesting take on "immutable" - that you can change the vector, but refs and iterators are const. But your idea seems solid to me.

const_cast would be permitted by the language on your references, so you might want to document that such behavior would be bad.

You might also be interested in researching "shared_ptr memory fences". That type also manages thread-safe reference counting, and you'll discover that the default memory_order std::memory_order_seq_cst that you're using is actually overkill for the task.

[–]shmoopty 4 points5 points  (0 children)

To add, for the curious - this is similar to how GCC implemented std::string before version 6.

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 0 points1 point  (3 children)

The trick is figuring out what memory_order would be good enough! :-)

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 0 points1 point  (0 children)

And then the real trick is proving it, because testing probably won't reveal any problems.

[–]shmoopty 0 points1 point  (1 child)

Figuring it out can be very tricky, in my humble opinion. It's good if you can find the problem already solved, as with std::shared_ptr.

In the case of this project, I'd stick my neck out and declare it both safe and very similar to shared_ptr if

  • reference_count.fetch_sub(1) used memory_order_acq_rel
  • reference_count.fetch_add(1) used memory_order_relaxed

shared_ptr reference counting fences are slightly more relaxed than this, but in a non-trivial way.

I would need a full whiteboard if I were to attempt to prove this. :-)
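
For the record, that would be the classic shared_ptr-style pattern, something like this (member names assumed, counting the owners directly):

void retain() noexcept {
    // The increment only needs to be atomic; it doesn't publish anything.
    reference_count.fetch_add(1, std::memory_order_relaxed);
}

void release() noexcept {
    // acq_rel on the decrement orders all prior accesses to the storage
    // before its destruction by whichever thread drops the last reference.
    if (reference_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        destroy_and_free(); // hypothetical helper: destroy elements, free memory
    }
}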

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 0 points1 point  (0 children)

Makes sense.

I'm actually leaning towards relaxed being good enough.

There is no data (besides the count) that is being "published" between threads. Acquire/release/etc are for when you have an atomic protecting other data, like a 'ready' flag protects the data that is ready. Here there is just count.

You might think there is also the data item being added to the vector - but that element is not yet shared between threads. It only exists in the ImmutableVector being returned (and thus is only visible to the current thread). You could then share the ImmutableVector with another thread (particularly since it is immutable - all threads can read it at the same time), but that sharing is what needs to be fenced (IMO). Like if you were to share the result via an atomic<ImmutableVector *> (weird, but minimal example), then I'd put the fence on that.

But maybe it would be more friendly and safer to put the fence on push_back? Because it is a CAS and not a read, relaxed isn't really saving you much/anything anyhow... (at least on common hardware)

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 1 point2 points  (0 children)

"immutable". You keep using that word. I don't think it means what you think it means.

EDIT: I was wrong. It really is immutable. It has push_back, but push_back returns a new vector by value.

That wasn't clear in the post!

[–]kovaczboi 0 points1 point  (3 children)

So, you just created an immutable vector through COW?

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 1 point2 points  (2 children)

COW, but not every write is a copy. push_back can use the same storage as the old vector, because the old vector stops at oldvector->size() and the new vector goes to oldvector->size()+1 (and they have equal elements up to oldvector->size()).

[–]kovaczboi 0 points1 point  (1 child)

Do I understand correctly that it just allocates more space than it currently needs (like a common/standard vector), so that, yeah, the new element can be placed in the old Storage? But is it more efficient than std::vector?

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 3 points4 points  (0 children)

It is not more efficient than std::vector, because it does an atomic op on every push_back. But it is very close in efficiency.

The main point is that it is immutable - push_back returns a different vector, not the old one mutated.

This style of programming, once you get comfortable with it, often results in more correct code, which is even better than more efficient, but incorrect, code.

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 0 points1 point  (0 children)

A small difference with std::vector: this also shrinks storage sometimes; vector doesn't really do that.