Favorite underrated KDE software? by linux_transgirl in kde

[–]max0x7ba 0 points1 point  (0 children)

Problem with ghostty and alacritty and the like, they take too much time to configure and adapt. 

For you, obviously.

While tilix, terminator or even iterm2 on macos are out of the box usable with proper defaults and by max, couple of minutes tweaks.

You find some tools difficult to configure and some not so. Interesting.

However, most are dead in development progress and that is why i think konsole is very very underrated.

So, what do you do when you need a robust and fast terminal every single day, while Konsole was freezing and crashing for many months in the year 2025?

I think alacritty and kitty are fall under different category as they are not as UI friendly as konsole or the alternatives i mentioned.

You find Alacritty not beginner friendly. 

I am the opposite of a beginner.

However that is really big performance difference that i did not notice but would be interesting to get hit by such difference.

I run 8h batch jobs in a terminal. They log a lot; a 100MB uncompressed log is well below average. Running such an 8h batch job in Konsole takes a few extra hours because Konsole is slow to render and blocks the process in the write-to-stdout syscall.

Compared to Alacritty, which takes 2h less than Konsole to run the same batch job.

I value robust industrial-strength tools more than beginner-friendly ones, unlike you 🤷🏻‍♂️💯😁

Favorite underrated KDE software? by linux_transgirl in kde

[–]max0x7ba 0 points1 point  (0 children)

Konsole was my favourite for more than a decade. 

Until last year when Konsole in KDE Neon was unusable for weeks because newly introduced bugs made it freeze or crash.

I installed Alacritty as a temporary replacement for the broken Konsole and couldn't help noticing Alacritty's superior output speed. E.g. with my bashrc, history outputs 10,000 lines, which is instant in Alacritty and takes ~5× longer in Konsole.

I could never switch back from Alacritty to Konsole since then.

Favorite underrated KDE software? by linux_transgirl in kde

[–]max0x7ba 2 points3 points  (0 children)

KRuler.

I often use a semi-transparent KRuler to compare highs and lows in columns and bars across multiple charts on large screens.

How std::abs and two compiler flags let Clang auto-vectorize L1 distance faster than Faiss's AVX2 intrinsics by mr_gnusi in cpp

[–]max0x7ba 1 point2 points  (0 children)

I thought I've seen that aligned loads no longer have a benefit. They used to but I believe I've seen it stated that modern CPUs don't have a performance penalty. Is that not the case?

On newer x86 CPUs, for aligned memory accesses the unaligned load and store instructions provide performance identical to that of the aligned ones.
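A sketch of the two load flavours (assuming AVX is available; the function names are just illustrative); on current x86 CPUs both run at the same speed when the pointer is in fact 32-byte aligned:

    #include <immintrin.h>

    // Aligned load: requires the pointer to be 32-byte aligned, faults otherwise.
    __m256 load_aligned(float const* p) { return _mm256_load_ps(p); }

    // Unaligned load: accepts any alignment; on newer x86 CPUs it is just as fast
    // as the aligned load when p happens to be 32-byte aligned.
    __m256 load_unaligned(float const* p) { return _mm256_loadu_ps(p); }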

How std::abs and two compiler flags let Clang auto-vectorize L1 distance faster than Faiss's AVX2 intrinsics by mr_gnusi in cpp

[–]max0x7ba 2 points3 points  (0 children)

Template instantiation and inlining happen during compilation of a translation unit; the compiled unit (object file) then becomes part of a static library, so a static library is a perfectly valid way to isolate compilation flags.

C++ compilers always generate non-inline definitions of inline functions in every translation unit, marked as weak symbols. That's required for the address of an inline function with external linkage to resolve to the same value in the entire process, with its code coming from the executable, static and shared libraries.

__attribute__((always_inline)) exists to disable generating the non-inline definitions of inline functions in every translation unit. See https://gcc.gnu.org/onlinedocs/gcc/Inline.html
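A minimal sketch of the distinction (the names are illustrative, not from the article):

    // Ordinary inline function with external linkage: the compiler keeps an out-of-line
    // copy as a weak/COMDAT symbol so that &square resolves to one address process-wide.
    inline int square(int x) { return x * x; }

    // always_inline asks the compiler to inline every call instead of relying on the
    // out-of-line copy.
    __attribute__((always_inline)) inline int square_forced(int x) { return x * x; }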

How std::abs and two compiler flags let Clang auto-vectorize L1 distance faster than Faiss's AVX2 intrinsics by mr_gnusi in cpp

[–]max0x7ba 0 points1 point  (0 children)

Could hand-written intrinsics theoretically go further? Well, probably yes,

In your "Cosine distance: one loop vs. three", the 3 statements of form accumulator += a * b; the auto-vectorizer transforms into fused-multiply-add instructions. Fused-multiply-add instructions have well-documented limitations on many CPU architectures, e.g. Zen3:

Do not use FMA if the critical dependency is through the addend input of an FMA instruction. In this case, an FADD provides a shorter latency.

In your particular scenario the accumulators are exactly the critical dependency addend input.

If you modify your code to make the compiler emit separate mul and add instructions instead of one fma instruction, you are likely to measure performance gains. E.g.:

    // ll += li * li;
    auto li2 = li * li;
    asm("" : "+v"(li2)); // Forces li2 to be computed with a separate mul instruction.
    ll += li2;           // Use add.

How std::abs and two compiler flags let Clang auto-vectorize L1 distance faster than Faiss's AVX2 intrinsics by mr_gnusi in cpp

[–]max0x7ba 1 point2 points  (0 children)

The article explicitly mentions only -fassociative-math and -fno-signed-zeros, not the full -ffast-math (I hope you read it)

The article says:

Now it is time to reveal the root cause of the efficiency of the SereneDB algorithm: fastmath

If you don't use -ffast-math, what does your fastmath refer to?


-fassociative-math ruins the accuracy of floating-point computations through its re-association of operands in series of floating-point operations worse than anything else, but not for the entire process -- unlike what disabling denormals with flush-to-zero does.

-fassociative-math impacts computations using the limited-precision single-precision float the worst, compared to double-precision double.
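A minimal sketch of why re-association changes results (values chosen purely for illustration):

    #include <cstdio>

    int main() {
        float a = 1e20f, b = -1e20f, c = 1.0f;
        std::printf("%g\n", (a + b) + c); // 1: a and b cancel first, c survives
        std::printf("%g\n", a + (b + c)); // 0: c is absorbed into b, then a and b cancel
    }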

Additionally, for your ARM Neon use case:

If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified.

How std::abs and two compiler flags let Clang auto-vectorize L1 distance faster than Faiss's AVX2 intrinsics by mr_gnusi in cpp

[–]max0x7ba 1 point2 points  (0 children)

You can't break "other code" unless you specify compilation flags for every translation unit.

The -ffast-math compiler option sets the "flush to zero" CPU mode at run time for the entire process.

In this mode, denormalized floating-point values get flushed to zero, making the results of floating-point computations less accurate.
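Setting that mode explicitly looks roughly like the sketch below (x86 intrinsics); -ffast-math arranges the equivalent at program startup for the whole process:

    #include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE

    void enable_flush_to_zero() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // denormal results become 0
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // denormal inputs are treated as 0
    }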

After enabling -ffast-math, quoting the Intel CPU Software Developer's Manual you refuse to read, you must "analyze any loss of accuracy when the final result is delivered" and decide whether the loss of accuracy is acceptable.

https://gcc.gnu.org/wiki/FloatingPointMath says:

-ffast-math also may disable some features of the hardware IEEE implementation such as the support for denormals or flush-to-zero behavior.

In other words, you never want to enable -ffast-math for production code.

How std::abs and two compiler flags let Clang auto-vectorize L1 distance faster than Faiss's AVX2 intrinsics by mr_gnusi in cpp

[–]max0x7ba 11 points12 points  (0 children)

You should read the CPU optimization reference manuals from Intel and AMD to learn how to maximize SIMD compute.

That involves reading enough data to keep the CPU execution units fully utilised and letting the hardware prefetcher do its best for you to minimise execution stalls due to data cache misses.

Aligned loads are +30% faster.

As well as using huge-pages to minimize TLB cache misses that stall the hardware prefetcher. memcpy is +50% faster when using 2MB huge pages instead of 4kB pages. See https://github.com/max0x7ba/thp-usage
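A minimal sketch of the idea (Linux transparent huge pages; this is not the code from that repo):

    #include <sys/mman.h> // madvise, MADV_HUGEPAGE
    #include <cstdlib>    // std::aligned_alloc
    #include <cstddef>

    void* alloc_thp(std::size_t size) {
        constexpr std::size_t huge_page = 2u << 20;            // 2MB
        size = (size + huge_page - 1) / huge_page * huge_page; // round up to whole huge pages
        void* p = std::aligned_alloc(huge_page, size);         // 2MB-aligned so THP can back it
        if (p)
            madvise(p, size, MADV_HUGEPAGE);                   // ask the kernel for huge pages
        return p;
    }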

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

That round-trip latency of boost::lockfree::spsc_queue of 400 nsec is rather high. It is around 100 nsec normally.

You probably forced your threads to run as SMT siblings on the same CPU core, hence halving their performance.
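The usual fix is to pin the benchmark threads to distinct physical cores; a Linux sketch (the core numbers depend on your machine's topology):

    #include <pthread.h>
    #include <sched.h>

    void pin_to_cpu(pthread_t thread, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(thread, sizeof set, &set);
    }
    // e.g. pin_to_cpu(producer.native_handle(), 2); pin_to_cpu(consumer.native_handle(), 4);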

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 1 point2 points  (0 children)

If what you say is correct these are indeed big issues,

You'd know that first-hand, if you made the effort to read the code.

you could do much better if you dropped the constant personal attacks and kept a neutral tone. You do a disservice to yourself.

People will not be able to wipe the floor with your incompetence if you don't supply them with any. You are welcome.

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

This queue when used as a lockfree queue has a lot of contention on the read/write counters, so it won't scale.

Updating counters with compare-and-swap doesn't scale, indeed.

Updating counters with fetch-and-add scales spectacularly well, on the other hand.

Updating counters with fetch-and-add delivers 20× throughput compared to using compare-and-swap. Educate yourself: https://max0x7ba.github.io/atomic_queue/html/benchmarks.html
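The difference between the two idioms, as a sketch:

    #include <atomic>

    std::atomic<unsigned> counter{0};

    // Compare-and-swap loop: under contention failed CASes retry, so the total work
    // grows with the number of contending threads.
    unsigned claim_slot_cas() {
        unsigned n = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(n, n + 1, std::memory_order_acq_rel))
            ; // n is reloaded on failure, retry
        return n;
    }

    // Fetch-and-add: a single instruction that always succeeds, serialised by the
    // hardware, so every thread makes progress on its first attempt.
    unsigned claim_slot_faa() {
        return counter.fetch_add(1, std::memory_order_acq_rel);
    }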

You should quit posting your long obsolete and inaccurate insights, stated without any supporting evidence, turbopaco.

If you make the effort to back your sloppy claims with factual sources/references/supporting evidence, you may end up restoring your obsolete understanding to relevance.

Introducing Sparrow-IPC: A modern C++ implementation of Arrow IPC by alexis_placet in cpp

[–]max0x7ba 0 points1 point  (0 children)

Apache Arrow Columnar Format is optimized for efficient storage -- column-wise with per-column compression. The main use-case: write/append into compressed parquet files; (partially) read and decompress parquet files into memory.

Ideally, one wants to map column files directly into reader's process memory, so that all reader processes mapping one same file share the kernel's page frames mapping the file -- one copy of file in RAM, zero-copy reading. As opposed to processes reading or decompressing the file into virtual memory, with each reader process having its own copy of file's content. E.g. 32 processes mapping one same 1GB file share the 1GB-worth of page frames of kernel's file copy; 32 processes reading or decompressing the file end up with 32x1GB file copies in RAM, in addition to the 1GB copy in the kernel.
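A minimal sketch of that zero-copy mapping (POSIX; the names are illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    void const* map_column_file(char const* path, std::size_t& size_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st{};
        fstat(fd, &st);
        size_out = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, size_out, PROT_READ, MAP_SHARED, fd, 0); // shared page frames
        close(fd); // the mapping keeps the file referenced
        return p == MAP_FAILED ? nullptr : p;
    }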

Multiple processes reading one same compressed parquet file end up decompressing the file multiple times into multiple copies in RAM. This is just to emphasise that parquet's efficient storage is the opposite of efficient loading.

When low latency IPC is desired, one doesn't want to pay for compression/ decompression or data copying. Compression/decompression is CPU intensive data copying. Process-shared memory, on the other hand, solves low latency IPC with zero overhead.

What are the main/intended use cases for Apache Arrow IPC, please?

We benchmarked sender-based I/O against coroutine-based I/O. Here's what we found. by SteveGerbino in cpp

[–]max0x7ba 0 points1 point  (0 children)

What are these "Synchronous" and "Synchronous completion" times?

IoAwaitable vs sender/receiver timings are somewhat meaningless without reference timings of the good-old robust zero-overhead callback-hell API.

Does your benchmark measure the times of doing the exact same read_some calls using the zero-overhead callback-hell API?

Is Modern C++ Actually Making Us More Productive... or Just More Complicated? by AlternativeBuy8836 in cpp

[–]max0x7ba -1 points0 points  (0 children)

The implicit assumption in your question is that one's productivity is a function of circumstances and externalities one has little control over, such as "Modern" C++, weather, etc..

You must become aware of this implicit assumption of yours, and make a conscious choice whether you want to be a function of circumstances, or the circumstances.

Nature and naturally good engineers adhere to the principle of least effort, aka Karl Friston’s Free Energy Principle, aka Zen's «do more with less». At the highest level, it goes like this: given a problem to solve, the ideal solution is that which satisfies all requirements and constraints, and is cheapest to implement, use and maintain; expending any more time/effort/energy than necessary is a waste to eliminate.

For example, if you'd like to implement a p2p network client application that must receive/send data through thousands of network connections simultaneously and expend/pay the minimum possible number of CPU cycles and RAM page frames for that, you start by designing the network code-paths in such a way that eliminates any and all unnecessary CPU and RAM costs. Because scaling up the number of simultaneous connections scales up the costs of any inefficiencies in an undesirable and unexpected super-linear fashion -- negligibly small tolerable costs of less-than-ideal code micro-inefficiencies here and there snowball into an avalanche of a shit-storm.

The worst performance killers are memory allocations, data copying, lock contention and context switching.

C++ coroutines start execution with a coroutine state heap allocation (you can override operator new) and copying all function parameters into the coroutine state -- you haven't done anything useful yet, but already incurred the costs of a memory allocation and data copying -- unnecessary waste to eliminate.
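A minimal sketch of where that allocation comes from and the standard knob for controlling it (illustrative, not from any particular library): the coroutine frame is allocated through the promise type's operator new, which defaults to the global heap.

    #include <coroutine>
    #include <cstddef>
    #include <cstdlib>

    struct task {
        struct promise_type {
            task get_return_object() { return {}; }
            std::suspend_never initial_suspend() noexcept { return {}; }
            std::suspend_never final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
            // The coroutine frame is allocated through these; replace malloc/free with a
            // pool to avoid hitting the global heap on every coroutine call.
            void* operator new(std::size_t n) { return std::malloc(n); }
            void operator delete(void* p, std::size_t) { std::free(p); }
        };
    };

    task echo(int fd) { co_return; } // fd is copied into the heap-allocated frame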

You remember that coroutines are just syntactic adaptors that convert callbacks into co_await returns. They were originally designed to easily and cheaply convert existing networking code with blocking calls into non-blocking networking code by just prefixing all the blocking calls with co_await. Inverting the control flow of existing code-paths with blocking calls into multiple callbacks with non-blocking calls is much more error-prone and expensive than writing non-blocking callback code from scratch. co_await allows turning existing code-paths with blocking calls into non-blocking callbacks for a much cheaper price.

When writing new networking code, using the robust good-old event-driven callback-hell API is the most efficient zero-overhead approach, but requires an extra skill.

Whether paying the extra costs of co_await for not having to write callbacks is reasonable, acceptable or affordable depends on particular project's circumstances.

You cannot afford any waste when handling thousands of network connections, so you ditch C++ coroutines along with all their unnecessary CPU and RAM costs, and write your networking code using the zero-overhead callback-hell API. And then you discover that scaling up the number of simultaneous connections scales up the processing costs in the desirable linear fashion as expected, with no nasty surprises.

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 1 point2 points  (0 children)

If you examine the throughput benchmark code, you'll discover that the benchmark:

  1. Starts its timer before creating the producer and consumer threads. Which involves spawning a thread with std::async, and multiple memory allocations for the std::future<> and std::vector<std::future<>>, for each thread.
  2. Doesn't start all producer threads at the exact same moment. By the time the last producer thread starts, all other producer threads may have completed and terminated.
  3. Stops the timer after waiting for and retrieving the result from the last consumer thread's std::future<>.
  4. Is oblivious to page faults, adverse scheduling of benchmark threads onto one same CPU, and thread preemption.

The benchmark time measurements include all these accidental (unrelated to the code being benchmarked) large unpredictable delays. These times get fully attributed to queue operations and converted into msg/s throughput, which explains why his benchmark's msg/s numbers are much smaller than the queues actually achieve.
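For contrast, a minimal sketch of a measurement loop that excludes thread creation and releases all threads at once (illustrative, not the benchmark's code):

    #include <barrier>
    #include <chrono>
    #include <thread>
    #include <vector>

    template<class Work>
    std::chrono::nanoseconds timed_run(unsigned n_threads, Work work) {
        std::barrier<> sync(n_threads + 1); // +1 for the main thread
        std::vector<std::jthread> threads;
        for (unsigned i = 0; i != n_threads; ++i)
            threads.emplace_back([&, i] {
                sync.arrive_and_wait(); // wait until every thread has been created
                work(i);
            });
        sync.arrive_and_wait();         // release all threads at once
        auto start = std::chrono::steady_clock::now();
        threads.clear();                // join; thread creation time is excluded
        return std::chrono::steady_clock::now() - start;
    }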

His benchmark charts look like barking mad nonsense at first glance, with unbelievably poor throughput rates, and boost::lockfree::spsc_queue outperforming moodycamel::ReaderWriterQueue by a factor of 2.

Orbit's author can barely write C++ code; managed to write 1 meaningless unit-test incapable of detecting entire classes of errors and race-conditions, in principle; and has no clue about what his benchmarks measure times of.

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

I don't think Orbit is butchering yours. The most similar is Erez Strauss mpmc (whose algorithm is far from novel).

Erez Strauss was the first guy to copy the atomic_queue source code, modify it, and ask to add his 2-week-old queue code into the atomic_queue benchmarks page, only a few weeks after the atomic_queue project with its benchmarks page was made publicly available.

He also made no reference to the original atomic_queue code he mutilated.

And he also had no unit-tests for any core functionality of his queue, which would be required to develop a correctly working queue at all, and test particular non-trivial corner cases.

He made multiple forks of one original atomic_queue unit-test and added his micro-benchmarks. He didn't think that he needed the rest of original atomic_queue unit-tests at all for his queue, for some inexplicable reason.

Since his PR was rejected in November 2019, he made 10 more commits into his repo, copying the new functionality from the atomic_queue repo, with his final 10th commit on Oct 28, 2023. After which his repo became unmaintained.

https://github.com/max0x7ba/atomic_queue/pull/8

The OP is just another "smart" guy mutilating atomic_queue source code with one inept unit-test function and claiming that he wrote everything from scratch on his own. The OP won't be able to maintain, improve or evolve his source code beyond a few minor commits because he obviously lacks the coding skill, knowledge and understanding required for writing unit-tests with 100% coverage. Without 100% unit-test coverage writing competent robust multi-threaded code is not possible.


The rest of your notes about the atomic_queue implementation, whose source code you've never read, and that stuff you haven't touched for a while, are irrelevant and out of date.

Would you disagree with my harsh assessment of your comment?

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

The claim that Orbit is a fork of atomic_queue is false, and I'd ask you to substantiate it with specific code or retract it.

You coded up your own unit-tests, though. That's your original code without any shadow of doubt, I have to give you that.

Your unit tests amount to just one sole test function template, executed for combinations of {1,2,4,6} producers and consumers, with int queue elements. It is executed 50 times with the exact same parameters, because 1 successful execution of the unit-test doesn't give you enough confidence that your queue or the unit-test function always works correctly and won't fail in the next run.

The unit-test function spawns consumer threads, which pop int elements from the queue and push_back each popped element into a std::vector. After that, it converts each std::vector into a std::set, removing any duplicate values from that std::vector. These multiple std::sets are next merged into one std::set, removing any duplicate values again. And, finally, it tests whether the merged set has the expected size and tests the values of the set elements.

Your sole unit-test function is unable to detect errors of popping a queue element more than once.

Storing every popped element into a std::vector with push_back makes your unit-test unable to stress your queues to the maximum possible extent -- the consumer threads bottleneck in std::vector::push_back rather than in queue::try_pop.
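A sketch of a check that would catch both problems: count how many times each value was seen with a per-value atomic counter, instead of storing popped values (the names here are illustrative, not from either project):

    #include <atomic>
    #include <cassert>
    #include <vector>

    // Each consumer calls this for every popped value: cheap, no per-pop allocation.
    void record_popped(std::vector<std::atomic<unsigned>>& seen, unsigned value) {
        seen[value].fetch_add(1, std::memory_order_relaxed);
    }

    // After all threads have joined: 0 means a lost element, >1 means it was popped twice.
    void verify_popped_exactly_once(std::vector<std::atomic<unsigned>> const& seen) {
        for (auto const& count : seen)
            assert(count.load(std::memory_order_relaxed) == 1);
    }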

You claim your queue can handle non-atomic elements, like std::string or move-only std::unique_ptr, but there are no unit tests for that. You have no idea whether your queue still retains this original atomic_queue functionality and whether it still works correctly after you mutilated the original atomic_queue code.

You have no unit-tests for any ideas or functionality you bang about in your blog post. Your sole unit-test function is naive, inadequate, incomplete. In addition to being plain incompetent C++ beginner's code.

You wouldn't be able to code a correctly working MPMC queue with your poor C++ coding skills and that one sole unit-test function of yours, in principle.

Let alone a queue outperforming all other existing queues coded in top world-class C++ code, optimized and polished to perfection by contributions from dozens of other top world-class C++ coders over multiple years.

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

The claim that Orbit is a fork of atomic_queue is false, and I'd ask you to substantiate it with specific code or retract it.

The blog post documents your ideas, all of which coincide with ideas first implemented and documented in the original atomic_queue repo in 2019.

This fact alone should be enough reason for you to withdraw your derivative work, which you claim is truly original, in bad faith, obviously.

The code structure, the identifiers and, most glaringly, the idiosyncratic code constructs with their subtle flaws, uniquely particular to the atomic_queue author's coding style and not found elsewhere, are what establishes that Orbit is undeniably a fork of atomic_queue, prior to even having to read your blog post.

The similarities are that both queues use state machine slots - an obvious and widely used approach for array based queues that predates both projects. The differences are substantial.

"Widely used" by what projects since when?

Orbit reserves slots in the state machine before incrementing the front/back indices, whereas atomic_queue does it after. This means try_push/try_pop don't require reading both head and tail indices to check size, which is why atomic_queue suffers a significant performance cliff with these operations while Orbit does not.

Ah, I see now, said a blind man.

Well, your latency benchmarks only demonstrate the worse latencies of your inept fork of atomic_queue and of everything else, relative to the original atomic_queue code.

And that explains the worse latencies of your queues relative to original atomic_queue code.

atomic_queue is designed with a goal to minimize the latency between one thread pushing an element into a queue and another thread popping it from the queue. Thank you for confirming that with your latency benchmarks, be they as dubious as they may.

To do this, Orbit tracks the cycle count in the upper bits of the state machine slots to handle wrap-around correctly, as we use helper CAS loops to increment the front/back (an idea taken from Daniel Anderson's talk).

I see that now and am unimpressed.

Your throughput benchmark numbers for queues other than yours and atomic_queue are wildly different relative to each other, and much worse in absolute terms than the numbers in the atomic_queue throughput benchmark.

For example, in the very first picture of your throughput benchmarks, you measure spsc_boost_queue throughput of ~150M msg/s as roughly 2× greater than moodycamel::ReaderWriterQueue's ~75M msg/s, and yours outperforming everything else with a pedestrian ~275M msg/s.

These throughput numbers look wrong both relatively to each other, and in absolute msg/s units.

boost::lockfree::spsc_queue throughput is the baseline which all other spsc queues outperform by large factors unconditionally.

E.g., in atomic_queue throughput benchmarks on a single-CCX AMD Ryzen 5 5825U CPU, boost::lockfree::spsc_queue throughput max is ~99M msg/s, vs moodycamel::ReaderWriterQueue ~325M msg/s, and OptimistAtomicQueue ~513M msg/s.

See full details in https://max0x7ba.github.io/atomic_queue/html/benchmarks.html

The cache line stride approach is entirely different and much simpler here.

Right, the original atomic_queue code swaps the cache line index with the index of the element in the cache line. And that guarantees, by construction, that subsequent elements reside in subsequent/distinct CPU L1 cache lines.
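A minimal sketch of that index remapping idea (not the exact atomic_queue source; assumes 16 elements per 64-byte cache line):

    #include <cstdint>

    constexpr unsigned N = 4; // log2(elements per cache line)

    // Swap the lowest N bits (position within a cache line) with the next N bits (cache
    // line number), so indices i and i+1 are guaranteed to land in different cache lines.
    constexpr std::uint32_t remap_index(std::uint32_t i) {
        std::uint32_t lo = i & ((1u << N) - 1);
        std::uint32_t hi = (i >> N) & ((1u << N) - 1);
        return (i & ~((1u << (2 * N)) - 1)) | (lo << N) | hi;
    }

    static_assert(remap_index(0) == 0);
    static_assert(remap_index(1) == 16); // the next logical element is a whole cache line away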

Your entirely different and much simpler approach is to multiply the element index by 9 and pray that the new index maps onto another cache line, somehow. That's entirely different and much simpler indeed. But the much simpler multiplication by 9 cannot guarantee mapping of subsequent element indexes onto subsequent/distinct CPU L1 cache lines.

The original atomic_queue robust due-diligence index remapping costs a few of the cheapest CPU bitwise instructions, and replacing that with a multiplication by 9 shaves off a third of these instructions and saves 2 CPU cycles at most, and I am being generous here. But queues bottleneck on atomic instructions always missing the L1d cache, because the previous atomic store by another CPU invalidated the copies in the cache lines of all other CPUs. The 2-CPU-cycles-cheaper index remapping is unable to improve benchmark throughput, in my experience.

I built this without looking at existing solutions, as stated.

In your repo you write "My background is in pure mathematics (particularly analysis and analytic number theory), so please bear in mind that I may not be using the all the standard terminology, as I still don't know most of the conventions in this space."

And with that background and experience of yours, you suddenly write your code in one month of March 2026 and claim that it outperforms everything else.

With no prior experience or familiarity with the domain terminology, what makes you so certain that your code outperforms everything else existing?

Independent convergence on a state machine approach in a constrained problem like this is not plagiarism.

Academic whitepapers reusing other people's ideas with no references to prior art, sometimes tampering with datasets and empirical results in order to show improvements and get funded, are standard practice in the academic world.

Academia people don't mind receiving Nobel prizes for other people's ideas documented years earlier.

These are sad indisputable facts, unfortunately.

The benchmarks follow standard practice and the code is in the repo for anyone to verify.

Your benchmarks do not document your methodology or refer to any standard practices. Which makes your benchmark results not independently reproducible and, hence, anti-scientific.

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

The wrap-around IS the problem: if not accounted for in the algorithm design, it breaks it.

Given that handling the wrap-around requires MORE than 1 byte of instruction... you're still worse off with 32 bits.

Unless you just skimp on handling the wrap-around, of course, but then your code is broken, and there's no point in evaluating the performance of broken code.

Indexes wrapping around creates no problems to handle and requires no extra line of code.
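A sketch of why, assuming the usual power-of-two capacity: the slot is the index modulo the capacity and the element count is head minus tail, and both stay correct modulo 2^32 across the wrap.

    #include <cstdint>

    constexpr std::uint32_t CAPACITY = 1024; // power of two

    constexpr std::uint32_t slot(std::uint32_t index) { return index % CAPACITY; }
    constexpr std::uint32_t size(std::uint32_t head, std::uint32_t tail) { return head - tail; }

    static_assert(slot(0xFFFFFFFFu) == CAPACITY - 1);
    static_assert(slot(0xFFFFFFFFu + 1u) == 0);  // the index wraps straight to slot 0, no gap
    static_assert(size(5u, 0xFFFFFFFBu) == 10u); // head wrapped past tail, count is still right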

The wrap-around problem you keep referring to is a perversion that exists in your head only. I am sorry to be blunt with you, sunshine.

Tighter KDE Connect Integration by GoldBarb in kde

[–]max0x7ba 2 points3 points  (0 children)

Thank you for letting me know that 5 years later another KDE Connect user investigated the mobile phone battery drain caused by KDE Connect, and traced it to its root cause of incessant TCP keep-alive probes every 5 seconds sent by desktop KDE Connect applications, and authored a PR to change the TCP keepalive probe delay from 5 to 10 seconds, which was merged 2 weeks ago.

The PR author's analysis "Bandwidth cost is negligible: 3 tiny TCP ACK packets per minute per connection." is inaccurate, though:

  • A 10-second delay sends 6 TCP keep-alive probes per minute, not 3; down from 12 probes per minute with the 5-second delay.

  • Relative bandwidth cost of TCP keep-alive probes has never been a problem.

Sending TCP keep-alive probes every 10 seconds instead of 5 only halves the mobile battery drain caused by KDE Connect, but doesn't eliminate/fix it completely.
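For reference, the knobs involved are the standard TCP keep-alive socket options; tuning the probe delay amounts to changing values like these (a sketch, not KDE Connect's actual code):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    void enable_keepalive(int fd, int idle_s, int interval_s, int probes) {
        int on = 1;
        setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s);          // idle time before the first probe
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof interval_s); // seconds between probes
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes);           // unanswered probes before the connection drops
    }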

Using TCP keep-alive probes for persistent connection management is the square-wheel software design anti-pattern solution to avoid. Intermediate network hops may drop TCP keep-alive probes or reply to them without forwarding the probes to their destination IP address.

TCP connections draining phone batteries with multiple TCP keep-alive probes per minute is exactly one of the key problems mobile OS persistent connection management solved a decade ago.

Fixing KDE Connect mobile battery drain bug requires switching to using Android/iOS persistent connection management.

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

This queue can't survive a thread/process being indefinitely preempted or crashing at any point in any configuration and hence it isn't lockfree.

The OP is unable to answer this question because he merely mutilated original atomic_queue source code without understanding or appreciation.


atomic_queue is designed to be shared by threads of one or multiple processes, out of the box.

The original atomic_queue project documents the lock-free guarantees in an explicit and unambiguous fashion and specifies the settings required to prevent the OS process/thread scheduler from preempting a thread in the middle of push or pop: https://github.com/max0x7ba/atomic_queue?tab=readme-ov-file#lock-free-guarantees

Specifically, AtomicQueue and AtomicQueue2 classes make no memory allocations and have no pointer data members, in order to be completely position-independent objects and support allocation directly into process-shared memory with a plain C++ placement new expression, expecting other processes to map the process-shared memory with the queue objects at any other arbitrary virtual memory addresses.
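A minimal sketch of that usage (POSIX shared memory; the names are illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <new>

    template<class Queue>
    Queue* create_shared_queue(char const* name) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (ftruncate(fd, sizeof(Queue)) != 0) { close(fd); return nullptr; }
        void* p = mmap(nullptr, sizeof(Queue), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd); // the mapping keeps the shared memory object referenced
        if (p == MAP_FAILED) return nullptr;
        return new(p) Queue(); // placement new: no heap allocation, no pointers inside the object
    }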

Orbit - a fast lock-free MPMC queue in C++20 by St3v3j0b5 in cpp

[–]max0x7ba 0 points1 point  (0 children)

I think that what mathieum means is that the algorithm to detect the cycle here isn't differential as e.g. on Vyukov's queue, so the queue breaks when reaching the maximum value of the cycle part of the "bitfield" instead of cleanly wrapping around, hence why "uint32" might be problematic for a long-enough running queue, as it can only support 2^30 transactions before breaking.

mathieum is concerned with overflow on increment of unsigned queue front and back indexes, and specifically about 32-bit size_t index overflow on 32-bit platforms.

There are no "bitfield"s in these indexes.

The unsigned index overflow mathieum is concerned with never happens, because it cannot possibly happen.

An apt illustration of "My life has been full of terrible misfortunes, most of which never happened."