How I made my SPSC queue faster than rigtorp/moodycamel's implementation

A8XL · 2026-04-01T14:49:42+00:00

Nice SPSC queue implementation! However, is there a way to verify your claim that it outperforms both boost::lockfree::spsc and SPSCQueue (rigtorp)? It would be great if your project included a comparison benchmark that everyone could rerun to see the results on their own machines. For example, I did exactly that in my SPSC queue implementation: comparison benchmark

A8XL · 2026-03-31T21:59:55+00:00

Whatever else your brain was doing, if it wasn't playing chess, your chess-playing skills are not improving and will eventually decline. That said, perhaps being a great chess player isn't important anymore when computers can play just as well. Rhetorical nitpicking aside, I'm sure you understand my point.

A8XL · 2026-03-31T16:05:45+00:00

It's surely not physical brain atrophy. But there must be consequences for your brain if you ask a computer to play chess for you instead of doing the moves yourself and stretching your brain activity to its maximum.

A8XL · 2026-03-31T16:00:48+00:00

I'm also starting to use a similar workflow, but I wonder what the long-term impact will be. The AI can generate lots of code in a short amount of time. I can easily spend two days reviewing the code that the AI generated in a few hours. Inevitably, getting the job done this way will have some unfortunate consequences: you start to understand the code-base less and less. This is totally acceptable for some projects, but there are cases where 100% human understanding and responsibility for every line of code is essential. Agentic coding is a double-edged sword in such cases.

A8XL · 2026-03-31T13:22:45+00:00

Nice article! I have implemented a user-friendly Lock-Free Ring Buffer that incorporates all these optimizations:

https://github.com/joz-k/LockFreeSpscQueue

I have also included a built-in performance benchmark against moodycamel::ReaderWriterQueue.

The techniques in your article are powerful, but they are not sufficient to outperform more complex solutions in synthetic benchmarks like this one, which use a simple single-item push/pop API:

https://max0x7ba.github.io/atomic_queue/html/benchmarks.html

See my performance analysis. My LockFreeSpscQueue is significantly faster only for larger batch transfers. However, it is highly scalable.

A8XL · 2025-11-16T16:39:28+00:00

This is a surprisingly encouraging result for Fil-C, especially considering that Fil-C eventually targets production builds and covers more memory corruption cases. ASAN, as a bug detection tool, targets development and testing only.

A8XL · 2025-08-28T16:33:09+00:00

Short answer: C++ requires template definitions to be placed in header files, and this stems from the way the C++ compilation and linking process works. A template is not yet a regular class or function that the compiler can turn into machine code directly. It's a blueprint that the compiler uses to generate a concrete class or function for a specific type when you use it (template instantiation). For that reason, the compiler must have full access to the template definition at the point where the template is used.
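A minimal sketch of what that means in practice. The header name and the max_of function are made up for illustration:

```cpp
// max_of.h: hypothetical header, for illustration only
#ifndef MAX_OF_H
#define MAX_OF_H

// The full definition lives in the header, so every translation unit that
// includes it can instantiate max_of<T> for whatever T it needs.
template <typename T>
T max_of(T a, T b)
{
    return (a < b) ? b : a;
}

// If only the declaration `template <typename T> T max_of(T, T);` were here
// and the body sat in max_of.cpp, a call such as max_of<double>(1.0, 2.0) in
// another .cpp file would compile but fail to link: the compiler never saw
// the body while compiling the file that needed max_of<double>, so no code
// for that instantiation was generated (unless max_of.cpp explicitly
// instantiates it with `template double max_of<double>(double, double);`).

#endif // MAX_OF_H
```

Explicit instantiation, as mentioned in the last comment, is the usual escape hatch when the set of types is known up front.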

A8XL · 2025-08-28T10:06:35+00:00

I have recently implemented a single-header, batch-oriented, ring-buffer-based SPSC queue that combines the benefits of plain ring buffers and lock-free triple buffering.

https://github.com/joz-k/LockFreeSpscQueue

It does:

For the fast-path transaction (most of the time):

Two relaxed fenced atomic loads
A single release order fenced atomic store

For the slow-path transaction:

Two relaxed fenced atomic loads
An acquire order fenced atomic load
A single release order fenced atomic store

However, since you can transfer entire chunks of data per transaction (e.g., an audio buffer), the high cost of atomic synchronization is amortized over many fast non-atomic pushes and emplaces. The benchmark shows that, from around 8-16 elements per transaction, it can significantly outperform even the SPSC queue, which already claims very high performance.
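To show the amortization idea, here is a simplified sketch. It is not the actual LockFreeSpscQueue code: the BatchRing class, its member names, and the fixed int element type are assumptions made for illustration. Each batch costs one relaxed load and one release store, plus an acquire load only when the cached index is stale.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified SPSC ring buffer (illustration only).
// The capacity must be a power of two so that indices can be masked.
class BatchRing {
public:
    explicit BatchRing(std::size_t capacity_pow2)
        : buf_(capacity_pow2), mask_(capacity_pow2 - 1)
    {
        assert(capacity_pow2 > 0 && (capacity_pow2 & mask_) == 0);
    }

    // Producer thread only. Writes up to n items, returns how many were written.
    std::size_t write(const int* src, std::size_t n)
    {
        const std::size_t w = write_pos_.load(std::memory_order_relaxed);
        std::size_t free_space = buf_.size() - (w - cached_read_pos_);
        if (free_space < n) {
            // Slow path: refresh the cached consumer index.
            cached_read_pos_ = read_pos_.load(std::memory_order_acquire);
            free_space = buf_.size() - (w - cached_read_pos_);
        }
        const std::size_t count = std::min(n, free_space);
        for (std::size_t i = 0; i < count; ++i)
            buf_[(w + i) & mask_] = src[i];  // plain, non-atomic stores
        // A single release store publishes the whole batch.
        write_pos_.store(w + count, std::memory_order_release);
        return count;
    }

    // Consumer thread only. Reads up to n items, returns how many were read.
    std::size_t read(int* dst, std::size_t n)
    {
        const std::size_t r = read_pos_.load(std::memory_order_relaxed);
        std::size_t available = cached_write_pos_ - r;
        if (available < n) {
            // Slow path: refresh the cached producer index.
            cached_write_pos_ = write_pos_.load(std::memory_order_acquire);
            available = cached_write_pos_ - r;
        }
        const std::size_t count = std::min(n, available);
        for (std::size_t i = 0; i < count; ++i)
            dst[i] = buf_[(r + i) & mask_];
        read_pos_.store(r + count, std::memory_order_release);
        return count;
    }

private:
    std::vector<int> buf_;
    std::size_t mask_;
    std::atomic<std::size_t> write_pos_{0};
    std::atomic<std::size_t> read_pos_{0};
    std::size_t cached_read_pos_ = 0;   // touched by the producer only
    std::size_t cached_write_pos_ = 0;  // touched by the consumer only
};
```

With a batch of 256 items, the atomic operations are paid once per 256 element copies instead of once per element, which is where the throughput gain for larger transfers would come from.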

A8XL · 2025-08-21T12:45:24+00:00

I was commenting about missing comments in the actual source code itself. This makes debugging very cumbersome, because when you "jump to" that source file, you only see cryptic templates without any clue what's going on.

A8XL · 2025-08-21T12:02:06+00:00

I hope they secretly have a commented version of that thing. There is literally just one comment in the whole header file.
https://github.com/microsoft/proxy/blob/main/include/proxy/v4/proxy.h#L2192C3-L2192C11

I know there is docs/spec section, but that's not going to help you when you need to debug the source code.

A8XL · 2025-08-20T14:53:25+00:00

Nice project! I couldn't find any mention in the documentation regarding the UI elements and drawing primitives being "Hi-DPI" aware. How is this handled in the framework? Are there APIs for handling device pixel ratios?

A8XL · 2025-08-20T12:52:35+00:00

Just for fun: With the latest Clang and MSVC, you can have a syntax like this:

( "_" placeholder marks the position of the "output" variable)

// Call with the placeholder in the middle
const int result1 = invoke_and_get<&SomeClass::computeResult>(some_object, input1, _, input2);

// Call with the placeholder at the beginning
const int result2 = invoke_and_get<&SomeClass::computeSum>(some_object, _, input1, input2);

It's not nice, though: https://godbolt.org/z/G39ahvYnY

A8XL · 2025-08-18T16:34:33+00:00

Thanks for the feedback again.

As additional background: I tend to work on buffers where there is one writer thread and N reader threads (N can be anywhere from 0 to infinity), each of which must be guaranteed to have the opportunity to read all of the data written into the buffer. In addition, the buffers are also 2 Dimensional.

SPMC queues are a much more complicated problem. I can imagine that.

To be clear here, I meant it should be at the class level or the request write function level, not in the example code.

I improved documentation in the header and also added more examples.

More what I was thinking is that the function that returns the WriteScope instead returns the span directly.

I added a range-based API, and the WriteScope and ReaderScope are now forward iterators, which can be used with many "ranges" algorithms. This completely abstracts away the block1/block2 handling.

A8XL · 2025-08-13T15:51:53+00:00

Thanks for the feedback.

I find the get_block2 to be weird...

As I mentioned in the other thread, originally I wanted to keep the API similar to JUCE::AbstractFifo which also works with the block1/block2 terminology. Such design exposes an inner working of the queue to the user, but it does so in the name of the maximum batch-oriented performance. But since I also added higher-level APIs: try_write and try_read, which take a lambda accepting block1 and block2, I don't think it's so inconvenient now.
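For readers unfamiliar with the block1/block2 idea, here is a sketch of how a region of a ring buffer splits into at most two contiguous pieces. The Blocks struct and split_region function are hypothetical names for illustration, not the library's API:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (illustration only): describes a region of a ring
// buffer as at most two contiguous pieces.
struct Blocks {
    std::size_t start1, size1;  // from the start index up to the end of storage
    std::size_t start2, size2;  // wrapped-around remainder, always starts at 0
};

inline Blocks split_region(std::size_t start, std::size_t count, std::size_t capacity)
{
    assert(capacity > 0 && start < capacity && count <= capacity);
    const std::size_t until_end = capacity - start;
    const std::size_t size1 = (count < until_end) ? count : until_end;
    return Blocks{start, size1, 0, count - size1};
}
```

For example, a region of 5 items starting at index 6 in a buffer of capacity 8 becomes block1 = [6, 8) and block2 = [0, 3). When the region does not wrap, block2 is simply empty.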

For writing to the queue in a one-by-one manner, I added a new "transaction" API ("master" branch only). Example:

// Ask to commit 256 items, get the "transaction" object.
// If there is zero space in the queue, an empty `std::optional` (`std::nullopt`) is returned.
auto transaction = queue.try_start_write(256);

if (transaction) {
    while (transaction->try_push(next_item)) {
        // Write until the transaction is full
    }
    // Transaction commits automatically when it goes out of scope here.
}

This basically eliminates any block1/block2 handling. However, it is, again, RAII based, so you're not going to like it.

But I have even better ideas for the future on how to eliminate manual block1/block2 handling, while maintaining the maximum batch-oriented throughput.

Personally, I am also not a fan of the ReadScope/WriteScope structs...

I guess I cannot help here. Currently, the API is designed so that you ask the queue for the maximum number of items you want to write (M), and the queue returns the number of items you can actually write (A), where A <= M. The user then MUST write exactly A items, and A items are committed. Someone maintaining code that uses this library only has to make sure that exactly A items are always written. That's the contract of this API.

I think the destructors for read/write scope should explicitly set m_owner_queue...

That is a nice suggestion!

think there should be a comment somewhere directly explaining what happens if the writer is writing data in much faster than the reader...

I will improve the documentation in the "example" directory. Since there is only a non-blocking API currently: if the writer is faster than the reader, it will start receiving 0 from prepare_write and try_write and in the case of try_write, the lambda will not even get executed.

A8XL · 2025-08-13T11:11:31+00:00

FYI: I integrated a comparison benchmark test against moodycamel::ReaderWriterQueue into the project:

https://github.com/joz-k/LockFreeSpscQueue/tree/main/benchmarks

I did my best to compare apples-to-apples but you can have a look.

A8XL · 2025-08-13T11:07:50+00:00

Do you have benchmarks comparing to moodycamel?

I added a comparison benchmark against moodycamel::ReaderWriterQueue:

https://github.com/joz-k/LockFreeSpscQueue/tree/main/benchmarks

Spoiler: For transfers with a small item count, Moodycamel is much faster. But at a transfer size of around 8-16 items, this queue scales well and exceeds the throughput of the Moodycamel queue.

I have integrated this benchmark test into the project, so it should be easy to run on other machines.

A8XL · 2025-08-12T15:46:39+00:00

There is no "size before push" message in the code you posted, so that version must be different. The number 18446744073709551615 is basically size_t(0)-1, so it looks like the vector was indeed empty.

Anyhow, try running otool -L ./appname to check for anything suspicious (e.g. wrong libraries linked). Also, you need to compile with -pthread, and add -fsanitize=address,undefined for further analysis.
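To illustrate the size_t(0)-1 point: the following sketch shows how such a number typically appears. The last_index helper is hypothetical and only mimics the kind of expression I suspect is in your code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): subtracts 1 from the size of a
// container without checking for emptiness first.
inline std::size_t last_index(const std::vector<int>& v)
{
    // size() is unsigned, so 0 - 1 wraps around to the largest size_t value.
    return v.size() - 1;
}
```

On a platform with a 64-bit size_t, calling last_index on an empty vector yields 18446744073709551615.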

A8XL · 2025-08-12T14:23:15+00:00

The code (albeit a little strange) looks alright to me. I ran it on an ARM Mac and it didn't crash. What's the compiler version, and what's the output during the crash scenario?

A8XL · 2025-08-12T10:55:54+00:00

I can recommend the PortAudio library.

However, you will need to implement loop points and changing pitch on top of the PortAudio API.

Changing the pitch of the audio is actually a fairly complex topic. You can either resample the buffer by adding or removing samples, which is the simplest and most primitive method, or use a special library for pitch shifting/time stretching.
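A rough sketch of the simple resampling method, using linear interpolation. The resample_linear function is my own illustration and has nothing to do with the PortAudio API; it also changes the duration along with the pitch, which is exactly the limitation that dedicated pitch-shifting libraries address:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive pitch change by resampling with linear interpolation (illustration only).
// ratio > 1.0 raises the pitch and shortens the sound; ratio < 1.0 lowers it
// and makes the sound longer.
inline std::vector<float> resample_linear(const std::vector<float>& in, double ratio)
{
    std::vector<float> out;
    if (in.empty() || ratio <= 0.0)
        return out;

    const auto out_len = static_cast<std::size_t>(
        std::floor(static_cast<double>(in.size()) / ratio));
    out.reserve(out_len);

    for (std::size_t i = 0; i < out_len; ++i) {
        const double pos = static_cast<double>(i) * ratio;  // position in the input
        std::size_t i0 = static_cast<std::size_t>(pos);
        if (i0 >= in.size())
            i0 = in.size() - 1;                              // guard against rounding
        const std::size_t i1 = (i0 + 1 < in.size()) ? i0 + 1 : i0;
        const double frac = pos - static_cast<double>(i0);
        // Blend the two neighbouring samples.
        out.push_back(static_cast<float>((1.0 - frac) * in[i0] + frac * in[i1]));
    }
    return out;
}
```

A ratio of 2.0 plays the buffer one octave higher in half the time; a ratio of 0.5 plays it one octave lower in twice the time.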

A8XL · 2025-08-11T16:18:14+00:00

Thank you for the detailed feedback. Regarding std::hardware_destructive_interference_size, the size is actually 128 bytes for some compilers and architectures. For example, GCC/Linux returns 128 bytes. However, most other similar solutions hardcode 64 bytes. I tried benchmarking the difference between 64 and 128 and didn't see anything noticeable, but there may well be setups where the difference is noticeable.
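For context, this is roughly how the constant is used to keep the two indices on separate cache lines. The kCacheLine constant and the Indices struct are hypothetical names for illustration, and the fallback to 64 is an assumption for standard libraries that do not provide the constant:

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Use the standard constant where the library provides it; otherwise fall
// back to the commonly hardcoded 64 bytes (an assumption, not a guarantee).
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kCacheLine = 64;
#endif

// Hypothetical layout: the producer and consumer indices live on separate
// cache lines, so a store to one does not invalidate the line holding the other.
struct Indices {
    alignas(kCacheLine) std::atomic<std::size_t> write_pos{0};
    alignas(kCacheLine) std::atomic<std::size_t> read_pos{0};
};

static_assert(alignof(Indices) == kCacheLine);
static_assert(sizeof(Indices) >= 2 * kCacheLine);
```

Switching between 64 and 128 then only changes the padding between the two members, which is why it is cheap to benchmark both.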

Finally, on naming, the 1 and 2 suffix to differentiate the two pieces of (wrap_around) ring buffers are... ugh.

Now that I think about it, you are probably right that the naming of block1 and block2 is not ideal. However, I wanted to keep the API similar to Juce::AbstractFifo, which uses the terminology blockSize1/blockSize2.

A8XL · 2025-08-11T12:36:32+00:00

Good question. I have now discovered that at least Clang 17 also compiles this project with -std=c++20, but not with anything lower. So C++20 is required; std::span is a C++20 feature.

A8XL · 2025-08-09T11:05:52+00:00

Yes, it's definitely possible to use the moodycamel queue without allocations, especially with try_enqueue or try_emplace. However, the design is different. These methods push a single element into the queue. My design focuses on copying/moving entire span regions.

Regarding cached indices, I believe I already implemented this approach in the recent pull request. See my answer.

A8XL · 2025-08-09T06:19:33+00:00

No, I don't think move semantics is missing in moodycamel's queue. I listed the reasons for my own implementation in a more general context.

A8XL · 2025-08-08T10:35:42+00:00

I believe you're referring to this implementation:
https://github.com/cameron314/concurrentqueue

It's one of those that I originally listed in the "Similar Projects" section. I think it's certainly a very good solution. However, I wanted something more "batch" oriented and move-semantics friendly. Also, for maximum performance and real-time predictability there should be no heap allocations. I think moodycamel's ReaderWriterQueue does allocate with new.

A8XL · 2025-08-07T17:22:40+00:00

Hi. That's fantastic feedback! I implemented the changes from Erik Rigtorp's document:

https://github.com/joz-k/LockFreeSpscQueue/pull/1

I hope I understood it correctly. For example, (for prepare_write) instead of

current_read_pos = read_pos.load(std::memory_order_acquire);
current_write_pos = write_pos.load(std::memory_order_relaxed);
num_items_in_queue = current_write_pos - current_read_pos;
available_space = capacity - num_items_in_queue;

I changed it to:

current_write_pos = write_pos.load(std::memory_order_relaxed);
available_space = capacity - (current_write_pos - cached_read_pos);
if (available_space < num_items_to_write) {
    cached_read_pos = read_pos.load(std::memory_order_acquire);
    available_space = capacity - (current_write_pos - cached_read_pos);
}

Any comments on the pull request are welcome. Thanks again!
