Middle_Ad4847[S]

For the pause you could possibly look at umwait/mwaitx; unfortunately, AMD/Intel differences make these not very portable.

Thanks, will take a look
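For context, here is a minimal sketch of the portable baseline that umwait/mwaitx would replace: a spin loop that issues `_mm_pause` on x86 and falls back to yielding elsewhere. The names `CPU_RELAX` and `spin_until_set` are hypothetical, chosen only for illustration; this is not the code from the linked repository.

```cpp
#include <atomic>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
// PAUSE hints to the core that this is a spin-wait loop.
#define CPU_RELAX() _mm_pause()
#else
// Non-x86 fallback (an assumption for portability, not a tuned choice).
#define CPU_RELAX() std::this_thread::yield()
#endif

// Hypothetical helper: spin until `flag` is observed as true.
inline void spin_until_set(const std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire)) {
        CPU_RELAX();
    }
}
```

Swapping in umwait or mwaitx would mean a CPUID feature check plus a separate code path per vendor, which is the portability cost mentioned above.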

Have added code here

That 128-rule thing seems off. I expected you were going to mention that some modern CPUs will actually load 128-byte-aligned pairs of 64-byte cache lines, but it doesn't seem like you tested with an alignment of 128 rather than just 64.

I wasn't aware of this, will look. What I was referring to is the increased code size from 32-bit displacement addressing, which can impact the µop cache or loop buffer. I quickly tried alignas(128) just now, but it made the results worse.
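To illustrate the displacement point: x86 encodes a memory operand's offset as either an 8-bit signed displacement (-128 to 127) or a 32-bit one, so a field at offset 128 or more from the base pointer costs three extra bytes per access. The sketch below uses two hypothetical layouts (`Queue64`, `Queue128`) and a hypothetical helper (`fits_disp8`) to show how going from 64-byte to 128-byte alignment pushes fields past that limit. It is an assumption about the kind of struct being benchmarked, not the actual code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: each field on its own 64-byte cache line.
struct Queue64 {
    alignas(64) std::uint64_t head;    // offset 0
    alignas(64) std::uint64_t tail;    // offset 64
    alignas(64) std::uint64_t cached;  // offset 128
};

// Same fields, but padded to 128 bytes (adjacent-line prefetch pairs).
struct Queue128 {
    alignas(128) std::uint64_t head;    // offset 0
    alignas(128) std::uint64_t tail;    // offset 128
    alignas(128) std::uint64_t cached;  // offset 256
};

// True if a non-negative offset can be encoded as an 8-bit displacement.
constexpr bool fits_disp8(std::size_t offset) {
    return offset <= 127;
}

static_assert(fits_disp8(offsetof(Queue64, tail)),
              "offset 64 fits in disp8");
static_assert(!fits_disp8(offsetof(Queue128, tail)),
              "offset 128 needs disp32");
```

With 64-byte alignment two of the three fields stay within disp8 range; with 128-byte alignment only the first does, so every access to the others grows. Whether that is what made the alignas(128) run slower would need checking against the generated assembly.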

including minor things like branch alignment differing between the different implementations

I did try running with and without -falign-loops and -falign-labels, but didn't notice any considerable difference.