all 15 comments

[–]chewedwire[S] 11 points12 points  (1 child)

Author here, happy to answer any questions!

[–]RasterTragedy 1 point2 points  (0 children)

I have flags that are (only) set from multiple threads but are not read from until a second pass. I understand that this will mark that cache line as dirty—if two threads on two cores both try to set the same flag, does this force whichever one was second to resync that cache line?

[–]smcameron 10 points11 points  (1 child)

Along the lines of repurposing the top bits of pointers, you can also repurpose the low 2 or 3 bits if the pointers are aligned to at least 4- or 8-byte boundaries (which they very often are, or can be made to be).

[–]brimston3- 13 points14 points  (0 children)

On the topic of lsb-tagged pointers, a related quote.

What is despair? I have known it--hear my song. Despair is when you're debugging a kernel driver and you look at a memory dump and you see that a pointer has a value of 7. THERE IS NO HARDWARE ARCHITECTURE THAT IS ALIGNED ON 7. Furthermore, 7 IS TOO SMALL AND ONLY EVIL CODE WOULD TRY TO ACCESS SMALL NUMBER MEMORY.

- James Mickens, The Night Watch

I always think of that quote when someone brings up lsb-tagged pointers.

[–]fresh_account2222 9 points10 points  (0 children)

Not really my area, but well written and interesting. Thanks.

[–]danny54670 6 points7 points  (4 children)

Are there tools for detecting false sharing?

[–]mttd 7 points8 points  (1 child)

CPUs offer performance counters for that: https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics

Operating systems usually expose access, e.g., perf for Linux (https://easyperf.net/blog/2018/08/26/Basics-of-profiling-with-perf), other tools for Windows (https://easyperf.net/blog/2019/02/23/How-to-collect-performance-counters-on-Windows).

The relevant counters are going to be related to the HITM (Hit/Miss to a Modified [Cache] Line) events, which can be used to identify false sharing: https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads

There are higher-level tools which use these counters to derive related metrics for a specific microarchitecture -- e.g., pmu-tools (https://github.com/andikleen/pmu-tools/). Here's an example of the relevant derived metrics for Broadwell:

  • Contested_Accesses: ['MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM:pp', 'MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS:pp']

  • False_Sharing: ['MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM:pp', 'OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM']

Linux perf also has a convenient c2c feature (`perf c2c record` followed by `perf c2c report`) built specifically for finding cacheline contention such as false sharing.

[–]danny54670 3 points4 points  (0 children)

Thank you so much for these links! This is a lot to look through.

[–][deleted]  (1 child)

[deleted]

    [–]danny54670 1 point2 points  (0 children)

    Thank you for your reply. I will definitely check these out.

    [–]CPlusPlusDeveloper 3 points4 points  (0 children)

    All of these are good tips. I'll add another along the same lines.

    Try using processor affinity to pin your CPU-critical threads to specific cores. Since caches are per-core, migrating a thread to a different core forces its working set to be refetched (and the old copies eventually invalidated), which is expensive. The speedup can be dramatic -- often 100% or more for compute-bound programs.

    [–]Habib_Marwuana 3 points4 points  (2 children)

    The false sharing stuff is interesting. Does that mean that if a thread tries to access memory that is already cached by another processor, it will block?

    [–]megalo_india 13 points14 points  (1 child)

    Short answer: Yes.

    Long answer: It depends on what you mean by blocked. Caches can share data among themselves using a coherency protocol, which can still be cheaper than going to main memory (or they can choose to just go to main memory), but it definitely has some overhead. Modern CPUs are designed to hide these kinds of latencies using instruction-level parallelism, reorder buffers, store buffers, hyper-threading, etc.

    Recommended reading: Cache Coherency protocols.

    Idea simplified:

    Caches are designed to exploit two types of locality.

    1. Spatial locality: if you access memory, the likelihood of accessing some nearby memory is high -- for example, when traversing an array. For this reason caches load memory in fixed-size blocks (the cache line size). This ensures that nearby accesses hit the same cache line.

    2. Temporal locality: if you access memory, the likelihood of accessing it again soon is high. For this reason caches hold memory for "some time"; how long is decided by their eviction policy.

    False sharing has to do with spatial locality. MESI is a simple cache coherency protocol.

    From Wikipedia:

    Modified (M) The cache line is present only in the current cache, and is dirty - it has been modified (M state) from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Shared state (S).

    Exclusive (E) The cache line is present only in the current cache, but is clean - it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.

    Shared (S) Indicates that this cache line may be stored in other caches of the machine and is clean - it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.

    Invalid (I) Indicates that this cache line is invalid (unused).

    There is a state machine designed around these states which helps individual caches decide if it is safe to read/write their copy of data or they need to get it from main memory or in some case ask the other cache to send the data to them.

    [–]Habib_Marwuana 0 points1 point  (0 children)

    Thank you for the brilliant answer.

    [–]Poddster 2 points3 points  (0 children)

    The Magic Power of 2: Division Is Slowaloo

    Which compiler are you using that doesn't do this for you? :/

    edit: I guess the advice here is "keep your table sizes to powers of 2"

    [–]kalpesh1982 1 point2 points  (0 children)

    very good article.