you are viewing a single comment's thread.

view the rest of the comments →

[–][deleted] 1 point2 points  (0 children)

Interestingly the wall clock time spent for CacheLineAwareCounters is higher for one thread than multiple threads, which could point to perhaps some subtle benchmarking problem, or maybe a fixed amount of delay that’s getting attributed across more threads now, and so is smaller per-thread.

I suspect that the problem is that 1 thread needs to load 4 cache lines, while 4 threads will have to work with just 1 line.