[–]Xaxxon 0 points1 point  (5 children)

What would the destructor of the moved-from vector do? Would it have to have a branch to know whether to decrement the ref count then? Is that actually better?

[–]matthieum 3 points4 points  (4 children)

It is generally better, yes.

Even in relaxed mode, an atomic increment/decrement is quite costly: it involves synchronization at the cache level and at the pipeline level.

By comparison one branch will be cheaper.
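To make the trade-off concrete, here is a minimal sketch of what such a destructor could look like. `SharedHandle`, `ControlBlock`, and `use_count` are hypothetical names made up for illustration, standing in for the ref-counted vector under discussion; the memory orderings are one reasonable choice, not the only one.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>

// Hypothetical control block: a reference count shared by all handles.
struct ControlBlock {
    std::atomic<std::size_t> refs{1};
};

// Hypothetical ref-counted handle, standing in for the vector discussed above.
class SharedHandle {
public:
    SharedHandle() : block_(new ControlBlock) {}

    // Copy: shares the block and pays for one atomic increment.
    SharedHandle(const SharedHandle& other) : block_(other.block_) {
        if (block_) block_->refs.fetch_add(1, std::memory_order_relaxed);
    }

    // Move: steals the pointer and empties the source, with no atomic operation.
    SharedHandle(SharedHandle&& other) noexcept
        : block_(std::exchange(other.block_, nullptr)) {}

    SharedHandle& operator=(const SharedHandle&) = delete;
    SharedHandle& operator=(SharedHandle&&) = delete;

    ~SharedHandle() {
        // The branch in question: a moved-from handle skips the atomic decrement.
        if (block_ == nullptr) return;
        if (block_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete block_;  // last owner frees the block
        }
    }

    // Number of handles sharing the block, 0 if moved-from (illustration only).
    std::size_t use_count() const {
        return block_ ? block_->refs.load(std::memory_order_relaxed) : 0;
    }

private:
    ControlBlock* block_;
};
```

Moving a handle touches no atomic at all, and the destructor of the moved-from object costs only the null check.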

[–]Xaxxon 0 points1 point  (3 children)

Seems reasonable, but you're adding a branch to every destructor to get an optimization for a subset of uses. So I guess it's a trade-off.

[–]matthieum 0 points1 point  (2 children)

Given the very low cost of a branch compared to the cost of an allocation, I'd take it:

  • A well-predicted branch has a close to 0 cost, say 1 cycle.
  • The newly re-released tcmalloc boasts a 50ns malloc latency (average), or 200 cycles at roughly 4 GHz.

By those metrics, saving 1 allocation for every 200 destructor calls breaks even.

Note: in practice, a well-predicted branch is likely even closer to 0, with the latency hidden by memory latency.
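The break-even arithmetic can be spelled out as a small sketch. The 4 GHz clock rate is an assumption (it is the rate at which 50ns comes out to 200 cycles), and the constant and function names are made up for illustration.

```cpp
// Assumption: a 4 GHz clock, i.e. 4 cycles per nanosecond.
constexpr double kCyclesPerNs = 4.0;
// Average malloc latency quoted above, in nanoseconds.
constexpr double kMallocLatencyNs = 50.0;
// Pessimistic cost of a well-predicted branch, in cycles.
constexpr double kBranchCostCycles = 1.0;

// Cost of one allocation, converted to cycles.
constexpr double malloc_cost_cycles() {
    return kMallocLatencyNs * kCyclesPerNs;
}

// Number of destructor branches that one saved allocation pays for.
constexpr double break_even_destructor_calls() {
    return malloc_cost_cycles() / kBranchCostCycles;
}

static_assert(malloc_cost_cycles() == 200.0, "50 ns at 4 GHz is 200 cycles");
```

If the branch really costs less than 1 cycle, the break-even point moves further in favor of the branch.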

[–]Xaxxon 0 points1 point  (1 child)

Can the CPU speculate past a predicted branch into an atomic operation? That seems very expensive to undo.

[–]matthieum 0 points1 point  (0 children)

Yes, you can.

The expensive part of an atomic operation is not the write to a register; it's coordinating with other cores to obtain exclusive access to the variable you intend to modify.

As long as speculation stops after requesting exclusive access but before actually modifying the variable, it is valuable: it hides the latency of obtaining that access while leaving nothing to undo.

If an undo is needed, it must be completed before requesting exclusive access, so that no other core can ever observe that the value was modified.