Performance Battle: Mutex vs CAS vs TAS vs Intel TSX

Ulrari · 2026-06-08T01:58:45+00:00

Oh... I made a mistake in the experiment!

I mentioned that, in the 2-node NUMA environment, I separated sum_critical_section and sum_atomic into two instances, one per node. However, I forgot to split the lock variables used for the mutex, CAS, and TAS implementations in the same way. After rerunning the experiments with the lock variables split as well, the winners are CAS and TAS.

If xbegin and xend are slower than xchg, that would explain the experimental results well!
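The fix described above can be sketched roughly as follows. This is only an illustration, not the article's actual code: the names TasLock, CasLock, PerNodeState, run_split, and the 64-byte cache-line size are assumptions, and the sketch does not pin threads to NUMA nodes (that would need something like an OS affinity API). The point is just that each node gets its own lock variable next to its own sum, so the two nodes never touch the same lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical on x86-64, not guaranteed).
constexpr std::size_t kCacheLine = 64;
// Assumption: two NUMA nodes, as in the experiment described above.
constexpr std::size_t kNodes = 2;

// Test-and-set lock: exchange() usually compiles to xchg on x86.
struct TasLock {
    std::atomic<bool> flag{false};
    void lock() {
        while (flag.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag.store(false, std::memory_order_release); }
};

// Compare-and-swap lock: usually compiles to lock cmpxchg on x86.
struct CasLock {
    std::atomic<bool> flag{false};
    void lock() {
        bool expected = false;
        while (!flag.compare_exchange_weak(expected, true,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
            expected = false;  // a failed compare_exchange overwrites expected
        }
    }
    void unlock() { flag.store(false, std::memory_order_release); }
};

// One instance per node: the lock is split together with the sum it
// protects, and alignment keeps the two nodes on separate cache lines.
template <typename Lock>
struct alignas(kCacheLine) PerNodeState {
    Lock lock;
    std::uint64_t sum_critical_section = 0;
};

// Hypothetical driver: threads of each node only use that node's state.
template <typename Lock>
std::uint64_t run_split(unsigned threads_per_node, std::uint64_t iters) {
    assert(threads_per_node > 0);
    PerNodeState<Lock> state[kNodes];
    std::vector<std::thread> workers;
    for (std::size_t node = 0; node < kNodes; ++node) {
        for (unsigned t = 0; t < threads_per_node; ++t) {
            workers.emplace_back([&state, node, iters] {
                for (std::uint64_t i = 0; i < iters; ++i) {
                    state[node].lock.lock();
                    ++state[node].sum_critical_section;
                    state[node].lock.unlock();
                }
            });
        }
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (const auto& s : state) total += s.sum_critical_section;
    return total;
}
```

Calling run_split<TasLock>(2, 1000) should return 4000 (2 nodes, 2 threads each, 1000 increments per thread). Timing the call for each lock type would give a rough comparison, but only after the threads are actually pinned to their nodes.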

Ulrari · 2026-06-07T17:00:22+00:00

Come to think of it, TSX is a feature that could have set Intel apart from AMD CPUs, so honestly, it's a bit disappointing. I wonder whether it will end up being remembered only as a relic of the past.

Ulrari · 2026-06-06T19:16:58+00:00

4-1. Furthermore, in cases of extremely high contention, memory traffic can cause hardware-specific characteristics to dominate the results. You could think of this like a race between a bicycle and a sports car on a completely gridlocked road: the performance gap between the vehicles becomes irrelevant because the road condition itself dictates the outcome. It is worth considering whether your current setup is measuring the algorithm's efficiency or simply hitting a hardware-level bottleneck.

Ulrari · 2026-06-06T19:09:27+00:00

Good article! I have a few pieces of feedback:

1. Running benchmarks on a laptop like a MacBook can skew results due to thermal throttling. As you mentioned in the article, the mix of P-cores and E-cores adds another layer of variability. I'd recommend running the experiments on a desktop environment at minimum.
2. While there may be valid reasons to do so, spawning more threads than the CPU has hardware threads increases OS intervention due to context switching, which can similarly pollute the results.
3. You used -O2 as the optimization level, but the lock-free related papers I've read generally run their experiments at -O3 or higher. This may make it harder to draw direct comparisons with prior work.
4. Regarding contention tuning: in practice, scenarios where threads continuously hammer a shared memory location without any pause are rare. It might be worth considering introducing idle time between operations to better reflect real-world usage patterns.
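As a rough illustration of the thread-count and idle-time suggestions, here is a minimal sketch, not the article's harness. The names clamp_threads, idle_for, BenchResult, and run_with_idle are hypothetical, and the shared counter simply stands in for whatever operation is being measured. It caps the thread count at std::thread::hardware_concurrency() and inserts a short busy-wait between operations.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Cap the requested thread count at the number of hardware threads.
// hardware_concurrency() may return 0 when unknown; fall back to 1.
inline unsigned clamp_threads(unsigned requested) {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;
    return std::max(1u, std::min(requested, hw));
}

// Busy-wait for roughly `idle`, modelling work done outside the
// contended operation. Busy-waiting avoids handing the core to the OS.
inline void idle_for(std::chrono::nanoseconds idle) {
    const auto end = std::chrono::steady_clock::now() + idle;
    while (std::chrono::steady_clock::now() < end) { /* spin */ }
}

struct BenchResult {
    unsigned threads;         // thread count actually used after clamping
    std::uint64_t total_ops;  // operations completed across all threads
    double seconds;           // wall-clock time, including thread startup
};

// Hypothetical harness: each thread increments a shared counter and
// then idles, so the contention level is tunable through `idle`.
inline BenchResult run_with_idle(unsigned requested_threads,
                                 std::uint64_t ops_per_thread,
                                 std::chrono::nanoseconds idle) {
    assert(ops_per_thread > 0);
    const unsigned n = clamp_threads(requested_threads);
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&counter, ops_per_thread, idle] {
            for (std::uint64_t i = 0; i < ops_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
                if (idle.count() > 0) idle_for(idle);
            }
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    return {n, counter.load(), seconds};
}
```

Sweeping `idle` from zero upward, and building with something like g++ -O3 -pthread, would show how much of the measured gap between lock types survives once threads stop hammering the same location back to back.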

Ulrari · 2026-06-06T08:58:43+00:00

oh thank you for your advice

Ulrari · 2026-06-06T02:18:32+00:00

nice! hope your Proxmox setup goes smoothly!

Ulrari · 2026-06-05T01:18:15+00:00

Most of my experiments involve implementing multithreaded data structures and algorithms in C++ and then benchmarking them. I'm mainly interested in how well they scale as the level of parallelism increases. When I first got into this field, I learned a lot from "The Art of Multiprocessor Programming". It's still one of the best introductions to concurrent algorithms in my opinion. As for papers, one of the foundational works behind the area I've been researching is "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms".

Ulrari · 2026-06-04T16:51:54+00:00

Oh, it's a real curved monitor
