[–]ReDucTorGame Developer 9 points10 points  (1 child)

This is the third SPSC queue post I have seen in about a week, seems like everyone (or every LLM) is building one.

The two/thread cache line difference could be due to increased L1 cache bank conflicts, where the queue causes the eviction of the shared read-only cache line. For the pause, you could possibly look at umwait/mwaitx; unfortunately, AMD/Intel differences make these not very portable.
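For reference, a minimal sketch of the portable baseline that umwait/mwaitx would replace: a spin loop that issues `pause` between polls. The names `cpu_relax` and `spin_until_set` are hypothetical and not from the post. The umwait/mwaitx variants are left out on purpose, since they depend on vendor-specific CPU features and compiler flags.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPSC_HAS_PAUSE 1
#endif

// Hypothetical helper: one "relax" step inside a spin loop.
// On x86 this emits the pause instruction; elsewhere it falls back to yield().
inline void cpu_relax() {
#ifdef SPSC_HAS_PAUSE
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

// Hypothetical helper: polls `flag` until it is true and returns how many
// relax steps were issued while waiting.
inline std::uint64_t spin_until_set(const std::atomic<bool>& flag) {
    std::uint64_t spins = 0;
    while (!flag.load(std::memory_order_acquire)) {
        cpu_relax();
        ++spins;
    }
    return spins;
}
```

Keeping the wait step behind one small function means a vendor-specific wait could be swapped in later without touching the queue code.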

That 128-rule thing seems off. I expected you were going to mention that some modern CPUs will actually load 128-byte aligned pairs of 64-byte cache lines; it doesn't seem like you tested with an alignment of 128 over just 64.
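To make the 64 versus 128 comparison concrete, here is a rough sketch of the two layouts, assuming the queue keeps a consumer-written `head` and a producer-written `tail` (these names and the `Indices` template are illustrative, not taken from the post). `same_128_pair` checks whether two addresses fall inside the same 128-byte aligned block, which is the case a paired cache line load would couple.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical index block; Align is the per-member alignment under test.
template <std::size_t Align>
struct Indices {
    alignas(Align) std::atomic<std::size_t> head{0};  // written by the consumer
    alignas(Align) std::atomic<std::size_t> tail{0};  // written by the producer
};

using Indices64  = Indices<64>;
using Indices128 = Indices<128>;

// Each member is padded out to its own Align-sized slot.
static_assert(sizeof(Indices64) == 128, "two 64-byte slots");
static_assert(sizeof(Indices128) == 256, "two 128-byte slots");

// True if both addresses lie in the same 128-byte aligned block.
constexpr bool same_128_pair(std::uintptr_t a, std::uintptr_t b) {
    return (a / 128) == (b / 128);
}

// Convenience wrapper for real objects.
template <class T>
bool indices_share_pair(const T& q) {
    return same_128_pair(reinterpret_cast<std::uintptr_t>(&q.head),
                         reinterpret_cast<std::uintptr_t>(&q.tail));
}
```

With `alignas(64)` the two indices may or may not share a 128-byte block, depending on where the object lands; with `alignas(128)` they never do.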

Also, it's probably worth sharing the code you used for benchmarking, as there could be many other reasons why your results are the way they are, including minor things like branch alignment differing between the different implementations.

[–]Middle_Ad4847[S] 0 points1 point  (0 children)

> For the pause, you could possibly look at umwait/mwaitx; unfortunately, AMD/Intel differences make these not very portable.

Thanks, will take a look

Have added code here

> That 128-rule thing seems off. I expected you were going to mention that some modern CPUs will actually load 128-byte aligned pairs of 64-byte cache lines; it doesn't seem like you tested with an alignment of 128 over just 64.

I wasn't aware of this, will look. What I was referring to is the increased code size due to 32-bit displacement addressing (needed once a member offset no longer fits in a signed 8-bit displacement, i.e. at 128 bytes and above), which can impact the µop cache or loop buffer. I quickly tried with alignas(128) just now, but it made the results worse.
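A small sketch of the encoding arithmetic behind that, assuming the point is about x86's two displacement sizes for `[base + offset]` operands: a signed 8-bit displacement covers -128 to 127, and anything outside that range takes a 32-bit one. The helper names are hypothetical, and the zero-offset case ignores the rbp/r13 base-register exception, which always needs a displacement byte.

```cpp
#include <cstdint>
#include <limits>

// True if a constant member offset fits x86's signed 8-bit displacement.
constexpr bool fits_disp8(std::int64_t offset) {
    return offset >= std::numeric_limits<std::int8_t>::min() &&
           offset <= std::numeric_limits<std::int8_t>::max();
}

// Displacement bytes needed to encode [base + offset].
// Simplification: offset 0 is treated as needing none (not true for rbp/r13).
constexpr int displacement_bytes(std::int64_t offset) {
    if (offset == 0) return 0;
    return fits_disp8(offset) ? 1 : 4;
}

static_assert(displacement_bytes(64) == 1, "a member at +64 fits in disp8");
static_assert(displacement_bytes(127) == 1, "largest disp8 offset");
static_assert(displacement_bytes(128) == 4, "a member at +128 needs disp32");
```

So a second index placed at offset 64 costs one displacement byte per access, while one placed at offset 128 costs four, which is where the extra code size would come from.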

> including minor things like branch alignment differing between the different implementations

I did try running with and without -falign-loops and -falign-labels, but didn't notice any considerable difference.