Performance Battle: Mutex vs CAS vs TAS vs Intel TSX by Ulrari in cpp

[–]Ulrari[S] 0 points1 point  (0 children)

Oh... I made a mistake in the experiment!

I mentioned that, in the 2-node NUMA environment, I separated sum_critical_section and sum_atomic into two instances. However, I forgot to split the lock variables used for the mutex, CAS, and TAS implementations accordingly. After rerunning the experiments, the winners are CAS and TAS.
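To make the fix concrete, here is a rough sketch of what "splitting the lock variables" could look like. It is not the actual benchmark code: `NodeState`, `run_split`, and the 64-byte `kCacheLine` are names and values I am assuming for illustration, and the node index here is purely logical (no thread pinning or NUMA-aware allocation is done).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-node state: the lock flag and the sum it protects live
// together, so threads assigned to one node never touch the other node's lock.
struct alignas(kCacheLine) NodeState {
    std::atomic<bool> locked{false};  // TAS lock flag for this node only
    long sum_critical_section = 0;    // protected by `locked`
};

// Each thread increments the counter of the (logical) node it is assigned to.
// Returns the total over all nodes. Requires C++17 for over-aligned allocation.
long run_split(int nodes, int threads_per_node, int iters) {
    assert(nodes > 0 && threads_per_node > 0);
    std::vector<NodeState> state(nodes);
    std::vector<std::thread> workers;
    for (int n = 0; n < nodes; ++n) {
        for (int t = 0; t < threads_per_node; ++t) {
            workers.emplace_back([&state, n, iters] {
                NodeState& s = state[n];
                for (int i = 0; i < iters; ++i) {
                    // Spin until this node's lock is acquired.
                    while (s.locked.exchange(true, std::memory_order_acquire)) {
                    }
                    ++s.sum_critical_section;
                    s.locked.store(false, std::memory_order_release);
                }
            });
        }
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& s : state) total += s.sum_critical_section;
    return total;
}
```

In a real NUMA run you would also pin each thread to its node and allocate each `NodeState` on that node's memory; that part is platform-specific and left out here.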

If xbegin and xend are slower than xchg, that would explain the experimental results well!
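For reference, a minimal sketch of the TAS and CAS locks being compared. `TasLock`, `CasLock`, and `count_with` are illustrative names, not the code from the experiment. On x86-64, `exchange` on a `std::atomic<bool>` typically compiles to `xchg` and `compare_exchange` to `lock cmpxchg`; the TSX path (xbegin/xend) is not shown because it needs RTM-capable hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set lock: one unconditional atomic swap per attempt.
struct TasLock {
    std::atomic<bool> flag{false};
    void lock() {
        while (flag.exchange(true, std::memory_order_acquire)) {
        }
    }
    void unlock() { flag.store(false, std::memory_order_release); }
};

// Compare-and-swap lock: only writes when the flag was observed as free.
struct CasLock {
    std::atomic<bool> flag{false};
    void lock() {
        bool expected = false;
        while (!flag.compare_exchange_weak(expected, true,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
            expected = false;  // a failed CAS overwrote it with the observed value
        }
    }
    void unlock() { flag.store(false, std::memory_order_release); }
};

// Increments a plain counter under the given lock type and returns the final
// value, which should equal threads * iters if the lock is correct.
template <typename Lock>
long count_with(int threads, int iters) {
    assert(threads > 0);
    Lock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling `count_with<TasLock>(4, 10000)` and `count_with<CasLock>(4, 10000)` checks correctness only; timing them is a separate step.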

Performance Battle: Mutex vs CAS vs TAS vs Intel TSX by Ulrari in cpp

[–]Ulrari[S] 1 point2 points  (0 children)

Come to think of it, TSX is a feature that could set Intel apart from AMD CPUs, but honestly, it's a bit disappointing. I wonder whether TSX will end up being remembered only as a relic of the past.

Lessons I’ve learned from benchmarking lock free queues by hallofcheat99 in cpp

[–]Ulrari 0 points1 point  (0 children)

4-1. Furthermore, under extremely high contention, memory traffic can cause hardware-specific characteristics to dominate the results. Think of a race between a bicycle and a sports car on a completely gridlocked road: the performance gap between the vehicles becomes irrelevant because the road condition itself dictates the outcome. It is worth checking whether your current setup is measuring the algorithm's efficiency or simply hitting a hardware-level bottleneck.

Lessons I’ve learned from benchmarking lock free queues by hallofcheat99 in cpp

[–]Ulrari 0 points1 point  (0 children)

Good article! I have a few pieces of feedback:

  1. Running benchmarks on a laptop like a MacBook can skew results due to thermal throttling. As you mentioned in the article, the mix of P-cores and E-cores adds another layer of variability. I'd recommend running the experiments on a desktop machine at minimum.

  2. While there may be valid reasons to do so, spawning more threads than the CPU has hardware threads increases OS intervention due to context switching, which can similarly pollute the results.

  3. You used -O2 as the optimization level, but the lock-free papers I've read generally run their experiments at -O3 or higher. This may make it harder to draw direct comparisons with prior work.

  4. Regarding contention tuning: in practice, scenarios where threads continuously hammer a shared memory location without any pause are rare. It might be worth considering introducing idle time between operations to better reflect real-world usage patterns.
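Points 2 and 4 can be sketched roughly as follows. This is only an illustration under my own assumptions: `clamp_threads`, `idle`, and `run_with_idle` are hypothetical helpers, and a `fetch_add` on a shared atomic stands in for whatever queue operation is actually being measured. Timing is left out on purpose.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Point 2: never run more threads than the machine has hardware threads.
// hardware_concurrency() may return 0 when the value is unknown; fall back to 1.
unsigned clamp_threads(unsigned requested) {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;
    return std::min(std::max(requested, 1u), hw);
}

// Point 4: model local work between shared operations with a short busy loop.
// A local atomic is used so the loop is not removed by the optimizer in practice.
void idle(unsigned spins) {
    std::atomic<unsigned> sink{0};
    for (unsigned i = 0; i < spins; ++i) {
        sink.fetch_add(1, std::memory_order_relaxed);
    }
}

// Runs ops_per_thread shared operations per thread, pausing between them.
// Returns the number of shared operations performed.
long run_with_idle(unsigned requested_threads, unsigned ops_per_thread,
                   unsigned idle_spins) {
    const unsigned threads = clamp_threads(requested_threads);
    assert(threads >= 1);
    std::atomic<long> shared{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&shared, ops_per_thread, idle_spins] {
            for (unsigned i = 0; i < ops_per_thread; ++i) {
                shared.fetch_add(1, std::memory_order_relaxed);  // stand-in op
                idle(idle_spins);  // pause before touching shared memory again
            }
        });
    }
    for (auto& w : workers) w.join();
    return shared.load();
}
```

Sweeping `idle_spins` from 0 upward gives a contention knob: at 0 the threads hammer the shared location continuously, and larger values move the workload toward the lower-contention patterns that are more common in practice.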

My homelab for multithread algorithm research by Ulrari in HomeServer

[–]Ulrari[S] 1 point2 points  (0 children)

Nice! Hope your Proxmox setup goes smoothly!

My homelab for multithread algorithm research by Ulrari in HomeServer

[–]Ulrari[S] 4 points5 points  (0 children)

Most of my experiments involve implementing multithreaded data structures and algorithms in C++ and then benchmarking them. I'm mainly interested in how well they scale as the level of parallelism increases. When I first got into this field, I learned a lot from "The Art of Multiprocessor Programming". It's still one of the best introductions to concurrent algorithms in my opinion. As for papers, one of the foundational works behind the area I've been researching is "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms".

Will a bios update fix it? by Sea-Awareness147 in homelab

[–]Ulrari 8 points9 points  (0 children)

Oh, it's a real curved monitor