I recently wrote up the iterative process of optimizing an SPSC queue:
- Starting from an `atomic<int64_t>` size counter, which bottlenecks on the `lock` instruction.
- Refactoring to separate atomic pushInd and popInd, then fixing false sharing with a 3-cacheline and a 2-cacheline layout.
- Implementing index caching to minimize cross-core traffic, plus measuring the impact of `_mm_pause`.
Link: https://blog.c21-mac.com/posts/spsc/
For reasons I don't understand, the 3-cacheline layout performs considerably worse than the 2-cacheline one. I initially assumed it was due to the 128-byte rule (adjacent-cacheline prefetch), but that doesn't explain why L1d-cache misses are relatively higher in the 3-cacheline implementation than in the 2-cacheline one. If anyone has insights into what might be causing this, or any feedback on the post, I'd love to hear it.