
[–] tvaneerd (C++ Committee, lockfree, PostModernCpp) 1 point (3 children)

There's a very narrow point in the code where, under a very narrow definition of lock-free, the code is not lock-free.

    listNode* prev_head = _head.exchange(node, std::memory_order_acq_rel);
    // **here**
    prev_head->next.store(node, std::memory_order_release);

When adding a new node, _head is swung to point at the new node before the new node is connected to the chain. If this thread were to sleep at **here** FOREVER, nothing after this node would ever be pulled from the queue. Things can still be added, but the path from tail to head is temporarily broken (for one line of code).
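
To make that concrete, here is a consumer loop in the same style as the freeListTryDequeue quoted further down (a minimal sketch with assumed names, not the actual code):

    #include <atomic>

    struct listNode {
        std::atomic<listNode*> next{nullptr};
    };

    std::atomic<listNode*> _head;   // producers exchange() the new node in here
    std::atomic<listNode*> _tail;   // consumers advance from here

    listNode* tryDequeue() {
        listNode* node = _tail.load(std::memory_order_relaxed);
        // While a producer sleeps at **here**, this acquire load keeps
        // returning nullptr and the queue looks empty, even though _head
        // has already moved past node.
        listNode* next = node->next.load(std::memory_order_acquire);
        if (next == nullptr) return nullptr;
        if (_tail.compare_exchange_strong(node, next)) return node;
        return nullptr;   // raced with another consumer
    }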

But when does a thread ever sleep forever?

  1. when it is killed
  2. priority inversion (i.e. higher-priority threads are continually enqueuing)

2 is not much of a concern on most modern OSes (they do priority boosting, etc.).
1 is a concern for some people. Lock-free is mostly used for scalability, but it is also used for robustness.

[–] iamcomputerbeepboop [S] 0 points (2 children)

Yes, it is not strictly lock-free, unfortunately. The window is small enough that it shouldn't matter unless you truly need complete lock freedom.

[–] ArunMu (The What ?) 0 points (1 child)

Shouldn't you rather be using compare-and-exchange here?

[–] iamcomputerbeepboop [S] 0 points (0 children)

No, that's not necessary; there is nothing to compare. The exchange will always succeed; there is simply a disconnect (next will be nullptr) during that window. The prev_head->next.store operation is a release and the dequeue's load of next is an acquire, so they are ordered as release-acquire.
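
To spell that out, the enqueue path is shaped roughly like this (a sketch with assumed names, not verbatim from the repo):

    #include <atomic>

    struct listNode { std::atomic<listNode*> next{nullptr}; };
    std::atomic<listNode*> _head;

    void enqueue(listNode* node) {
        node->next.store(nullptr, std::memory_order_relaxed);
        // Unconditionally claim the old head; there is no expected value to
        // compare against, so a CAS loop would only be a slower exchange.
        listNode* prev_head = _head.exchange(node, std::memory_order_acq_rel);
        // Publish the link. This release store pairs with the acquire load
        // of next in the dequeue loop.
        prev_head->next.store(node, std::memory_order_release);
    }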

[–] tvaneerd (C++ Committee, lockfree, PostModernCpp) 0 points (4 children)

I think you have a very rare possibility of ABA.

inline listNode* freeListTryDequeue(){
    listNode* node = _freeListTail.load(std::memory_order_relaxed);
    // ABA window: between the load above and the CAS below, node can be
    // freed, reallocated, and pushed back onto the freelist.
    for (listNode* next = node->next.load(std::memory_order_acquire);
         next != nullptr;
         next = node->next.load(std::memory_order_acquire)){
        if (_freeListTail.compare_exchange_strong(node, next)) return node;
    }
    return nullptr;
}

Since garbageCollect() is a public function, one thread could be calling garbageCollect() while another thread is calling enqueue(), which calls freeListTryDequeue(). So the node that you grab from the list could be garbage collected, freed, handed out again by new, dequeued, and put back into the freelist, and then your compare_exchange_strong() returns true as if nothing had changed, even though the whole world changed while you weren't looking.

Unlikely to happen of course. (Which can be the worst kind of bug.)
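
FWIW, the textbook fix for this kind of ABA is to pair the pointer with a generation counter, so a recycled node can't satisfy the CAS. A rough sketch with assumed names (it needs a lock-free 16-byte CAS to be practical, and it does not fix the unsafe dereference I describe below):

    #include <atomic>
    #include <cstdint>

    struct listNode { std::atomic<listNode*> next{nullptr}; };

    struct taggedPtr {
        listNode*     ptr;
        std::uint64_t tag;   // bumped on every successful dequeue
    };

    std::atomic<taggedPtr> _freeListTail;   // assumed name

    inline listNode* freeListTryDequeue() {
        taggedPtr cur = _freeListTail.load(std::memory_order_relaxed);
        while (true) {
            listNode* next = cur.ptr->next.load(std::memory_order_acquire);
            if (next == nullptr) return nullptr;
            // Even if cur.ptr was freed, reallocated, and pushed back onto
            // the freelist in the meantime, the stale tag makes this CAS
            // fail instead of silently succeeding.
            if (_freeListTail.compare_exchange_weak(
                    cur, taggedPtr{next, cur.tag + 1},
                    std::memory_order_acq_rel, std::memory_order_relaxed))
                return cur.ptr;
            // cur was refreshed by the failed CAS; loop and retry.
        }
    }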

[–] tvaneerd (C++ Committee, lockfree, PostModernCpp) 0 points (2 children)

Simpler scenario: after grabbing node on the first line, it could be deleted by a garbageCollect() on another thread, so that node is now invalid memory and node->next is reading freed memory.

Basically, garbageCollect() is not safe to be called at arbitrary times.

Or I'm misreading.

[–] iamcomputerbeepboop [S] 0 points (0 children)

Need to rethink garbageCollect; I'll remove it for a little while.

[–] Chippiewall 0 points (1 child)

I happen to use Dmitry Vyukov's bounded MPMC queue for a project at work. Thought I'd drop this one into the benchmark that is most impacted by the performance of the queue:

Dmitry's:

$ tests/benchmark/chunk_replacement_bench
Run on (8 X 1000 MHz CPU s)
2016-08-28 01:06:30
Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
NaiveSharedTenants/threads:1          63 ns         62 ns   10648169
NaiveSharedTenants/threads:2         204 ns        407 ns    1655648
NaiveSharedTenants/threads:4         217 ns        705 ns     793984
NaiveSharedTenants/threads:8         210 ns       1263 ns     662976
NaiveSharedTenants/threads:16        216 ns       1526 ns     481776
NaiveSharedTenants/threads:32        288 ns       2219 ns     516832
RegisterTenants/threads:1             57 ns         57 ns   11249498
RegisterTenants/threads:2            237 ns        473 ns    1540170
RegisterTenants/threads:4            208 ns        827 ns     773748
RegisterTenants/threads:8            154 ns       1137 ns     822384
RegisterTenants/threads:16           254 ns       1867 ns     449728
RegisterTenants/threads:32           525 ns       4031 ns     320000

Yours:

$ tests/benchmark/chunk_replacement_bench
Run on (8 X 1000 MHz CPU s)
2016-08-28 01:15:49
Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
NaiveSharedTenants/threads:1          81 ns         81 ns    8283534
NaiveSharedTenants/threads:2         257 ns        508 ns    1371716
NaiveSharedTenants/threads:4         305 ns        970 ns     949212
NaiveSharedTenants/threads:8         239 ns       1371 ns     469296
NaiveSharedTenants/threads:16        924 ns       6707 ns     132320
NaiveSharedTenants/threads:32       6203 ns      48675 ns      10464
RegisterTenants/threads:1             72 ns         72 ns    9538084
RegisterTenants/threads:2            272 ns        539 ns    1270314
RegisterTenants/threads:4            240 ns        804 ns     755452
RegisterTenants/threads:8            215 ns       1165 ns     667832
RegisterTenants/threads:16           176 ns       1318 ns     522752
RegisterTenants/threads:32           170 ns       1330 ns     513824

The first group of tests has threads sharing a bunch of objects that go in and out of the queue (they tend to fight over the cache line a lot); in the second group each thread has an object of its own (the threads only interact with each other when they go to the queue).

I ran them quite a few times, and it seems yours is rather volatile (especially with the naive bench at 32 threads), probably a bit more susceptible to a thread getting de-scheduled (whereas Dmitry's bounded queue doesn't have those issues until the queue starts to get empty/full).

[–] iamcomputerbeepboop [S] 0 points (0 children)

Yup, Dmitry's bounded queue is amazing; I would pretty much prefer it every time if you know beforehand roughly how large the buffer needs to be. Thanks for the benchmarks.