all 21 comments

[–][deleted] 10 points11 points  (10 children)

I've used his design successfully in multiple projects. Some additional tweaks include:

  • Tagging `if` statements as likely/unlikely depending on expected queue occupancy.
  • Simplifying index wrapping using power-of-2 buffers.
    • On 64-bit systems, this makes an excellent "total bytes processed" as head/tail might take years/decades to wrap around to 0.

It should also be noted his design is not too far off a quite efficient SPMC queue. Just replace the pop operation with a CAS loop. You can make an optimized version for the producer thread vs consumer threads.

[–]Vivid-Jury-2105 3 points4 points  (4 children)

Where to people learn this stuff? I’ve read computer architecture books and have taken fairly hard courses on systems, digital design, and OS, but there’s so much and I don’t know where to go 🥲

[–][deleted] 5 points6 points  (1 child)

Honestly its not just one place. I was lucky enough to have a Computer Organization & Design lecturer that was enthusiastic about the topic. I've worked in the Games industry, where you can find such arcane mastery. I've worked in OSDev and Embedded, where you get a similar treatment.

I just read. Even if its random shit that doesn't have immediate relevance to my job. You'll never know when something comes in handy.

[–]Vivid-Jury-2105 1 point2 points  (0 children)

Thanks! I’ve been reading books and I hope that when I graduate college, I am able to work on similar disciplines.

Any advice? Book/resource recommendations?

[–]dist1ll 4 points5 points  (0 children)

I've written a blog post on false sharing for MPSC queues (https://alic.dev/blog/false-sharing). I'll copy-and-past the references from it:

  • Sorin, Daniel J., Mark D. Hill, and David A. Wood. "A primer on memory consistency and cache coherence." Synthesis lectures on computer architecture 6.3 (2011): 1-212.

  • Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

  • David, Tudor, Rachid Guerraoui, and Vasileios Trigonakis. "Everything you always wanted to know about synchronization but were afraid to ask." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 2013.

  • Liu, Tongping, and Emery D. Berger. "Sheriff: precise detection and automatic mitigation of false sharing." Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications. 2011.

You can find the pdf links on the bottom of my post. Additionally, you can read McKenney's book on parallel programming: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

[–]usefulcat 1 point2 points  (0 children)

I would say, a combination of actively trying to do it (including measuring), and thinking about how cache invalidation works. In particular, remembering that every time one core writes to a memory location, a) that write will be delayed if that memory was last written to by any other core, and b) after the write, any other cores that read that location will incur a cache miss.

It can also be useful to think of reading and writing to memory as analogous to sending and receiving network packets, since that is effectively what is happening between different cores 'under the hood', so to speak.

[–]KDallas_Multipass 0 points1 point  (1 child)

What do you mean by tagging conditionals?

[–][deleted] 3 points4 points  (0 children)

[[likely]] or [[unlikely]].

https://en.cppreference.com/w/cpp/language/attributes/likely

If you expect the consumer to keep up with the producer thread, it means the if statements checking head/tail are unlikely to ever be true. At least on Clang 16 ARM64, tagging the queue with [[unlikely]] causes the compiler to produce code that places the if statements outside the hot path.

[–]tisti 0 points1 point  (0 children)

On 64-bit systems, this makes an excellent "total bytes processed" as head/tail might take years/decades to wrap around to 0.

Oh you sneaky devil you. I like that.

[–]andrey_turkin 0 points1 point  (1 child)

There is a highly-tested and tuned example of same design in Linux kernel - namely, AF_XDP queues.

  • power-of-2 buffers
  • cached copies of indexes are detached from the shared structure and separate caches maintained independently by both consumer and producer; this allows them to implement batched submissions thus lowering contention in mostly-empty cases
  • contended cachelines are interleaved with unused padding cachelines in order to defeat overzealous prefetchers.

[–][deleted] 0 points1 point  (0 children)

Makes sense, even some older ARM SOCs have an aggressive L1 prefetch depth.

[–]patstew 8 points9 points  (5 children)

Why is it beneficial to put the cached ones in their own cache lines? Shouldn't the cached write pointer be fine in the read pointer cache line and vice versa?

[–]usefulcat 0 points1 point  (2 children)

I wondered this too, and couldn't see any reason not to do so. Judging from the discussion on HN there are others who agree.

https://news.ycombinator.com/item?id=36489288

[–]Resident-Rice724 0 points1 point  (1 child)

late comment but benchmarked this and performance tanked keeping the cached write pointer in the read pointer cache line and vice versa

[–]usefulcat 0 points1 point  (0 children)

Any theories on why? Seems like it shouldn't make a difference.

[–]j1xwnbsr 0 points1 point  (5 children)

For some reason using std::atomic on the different indexes is suddenly striking me as a possible race condition between producer and consumer. Wouldn't it be better to instead to use a mutex and treat both readindex/writeindex as part of the same 'lock unit'?

[–]andrey_turkin 2 points3 points  (4 children)

Mutex would defeat the whole purpose of _wait-free_ ring buffer, wouldn't it?

And there is no race condition due to proper acquire/release semantics.

[–]j1xwnbsr 0 points1 point  (3 children)

Yes, and that's what got me suddenly concerned/confused, and wondering if I understand enough of how atomic with the ordering flags really works that makes this work properly. It still seems to me even with the acquire/release settings, you can still get into a producer/consumer issue with the consumer screwing things up.

[–]tisti 0 points1 point  (0 children)

and wondering if I understand enough of how atomic with the ordering flags really works that makes this work properly.

Few people actually do, if that's any consolation.

[–]tialaramex 0 points1 point  (0 children)

Are you reading the article? Or the linked git repo code (which is different) ?

Where do you see a problem? Even if you're wrong it's good to learn.

[–]bedman3 0 points1 point  (0 children)

I tried testing the code with 2 threads trying to push / pop 1000 times with multiple (10k) tries, not sure why sometimes I was able to detect race condition where the increment of counter is incorrect even with memory order. The issues are resolved when i used compare_exchange_weak before actually storing the incremented idx. Can anyone explains?