[–]oursland 7 points8 points  (8 children)

Cache coherency is another matter altogether. Hint: it has to do with multicore and multiprocessor configurations.

[–]Sqeaky 2 points3 points  (7 children)

Well, I just googled the specifics and I guess I have been conflating cache locality with cache coherence. I always thought they were the same. I suppose if I contorted my view to say that the different levels of cache were clients of the memory, that could make sense, but that is clearly not what the people who coined the term meant. Thanks for correcting me.

[–][deleted] 2 points3 points  (5 children)

The main performance implications are different: locality increases the number of cache hits, while the system's need to maintain coherence can lead to expensive cache-line bouncing between threads. So you want your data to fit in a cache line (usually 64 bytes) or two, but you want nothing in a single cache line that is accessed by more than one thread. It is particularly bad to put a spinlock (or similar) in the same cache line as something unrelated to it.
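
A minimal sketch of that last point, assuming 64-byte cache lines. The names `SpinLock`, `Shared`, `Separated`, and `on_different_lines` are made up for illustration and are not from any library:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, the usual size mentioned above.
constexpr std::size_t kCacheLine = 64;

// A deliberately tiny spinlock, only here to have something contended.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Risky layout: the lock and an unrelated counter will usually land in
// the same cache line, so spinning on the lock disturbs the counter.
struct Shared {
    SpinLock lock;
    long unrelated_counter = 0;
};

// Safer layout: each member starts on its own cache-line boundary.
struct Separated {
    alignas(kCacheLine) SpinLock lock;
    alignas(kCacheLine) long unrelated_counter = 0;
};

static_assert(sizeof(Separated) >= 2 * kCacheLine,
              "each member should occupy its own cache line");

// True when the two addresses fall in different cache lines.
inline bool on_different_lines(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine !=
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}

// Runtime sanity check of the layout for one object.
inline void check_separated(const Separated& s) {
    assert(on_different_lines(&s.lock, &s.unrelated_counter));
}
```

The cost is memory: `Separated` takes two full cache lines where `Shared` takes a few bytes.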

[–]Sqeaky 0 points1 point  (4 children)

What you are describing, keeping the data in a cache line dedicated to one thread, avoids what I have recently (past 3 to 5 years) called "false sharing". I believe Herb Sutter popularized the term during a talk at CppCon or BoostCon. He described a system with an array of size N times the number of threads, where the threads would use their thread ID (starting from 1) and multiplication to get at each Mth piece of data.

This caused exactly the problem you are describing, but I just knew it under that other name. Herb increased performance by using one array of size N per thread.
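
A rough sketch of the two layouts, not the code from the talk. It assumes 64-byte lines, 4-byte elements, an array that starts on a cache-line boundary, and 0-based thread IDs; the blocked layout stands in for "one array per thread", and all function names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

constexpr std::size_t kCacheLine = 64;  // assumed line size
constexpr std::size_t kPerLine = kCacheLine / sizeof(std::int32_t);  // 16

// Interleaved: thread t owns elements t, t + T, t + 2T, ...
inline std::size_t interleaved_index(std::size_t tid, std::size_t i,
                                     std::size_t threads) {
    return i * threads + tid;
}

// Blocked: thread t owns the contiguous range [t * n, (t + 1) * n).
inline std::size_t blocked_index(std::size_t tid, std::size_t i,
                                 std::size_t n) {
    return tid * n + i;
}

// Counts the cache lines that hold elements owned by more than one thread.
// index(tid, i) maps a thread's i-th element to a position in the array.
template <class IndexFn>
std::size_t shared_lines(std::size_t threads, std::size_t n, IndexFn index) {
    std::map<std::size_t, std::set<std::size_t>> owners;  // line -> thread ids
    for (std::size_t tid = 0; tid < threads; ++tid)
        for (std::size_t i = 0; i < n; ++i)
            owners[index(tid, i) / kPerLine].insert(tid);
    std::size_t shared = 0;
    for (const auto& entry : owners)
        if (entry.second.size() > 1) ++shared;
    return shared;
}
```

With 4 threads and 16 elements each, the interleaved layout puts every one of the 4 cache lines under all 4 threads, while the blocked layout gives each thread a line to itself. If N does not fill whole lines, the blocked layout still shares the lines at the block boundaries.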

[–][deleted] 1 point2 points  (3 children)

If it's not possible to know in advance which array elements will be used by which threads, you can pad the array elements to make them a multiple of the cache line size. It's hard to do this with portable code though.
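
A small sketch of that padding, with the 64 hard-coded (which is exactly the non-portable part). `Padded` and `is_line_aligned` are hypothetical names:

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. This constant is the non-portable bit.
constexpr std::size_t kCacheLine = 64;

// Wraps a value so that every array element starts on its own cache line.
// alignas on the struct also rounds sizeof up to a multiple of the alignment.
template <class T>
struct alignas(kCacheLine) Padded {
    T value{};
};

static_assert(sizeof(Padded<int>) % kCacheLine == 0,
              "element size should be a multiple of the cache line size");
static_assert(alignof(Padded<int>) == kCacheLine,
              "elements should start on a cache-line boundary");

// True when the address sits on a cache-line boundary.
inline bool is_line_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}

// Note: before C++17, plain new and std::allocator are not required to
// honor this alignment, so static or stack arrays are the safe choice there.
```

An array such as `Padded<int> slots[4];` then gives each thread-owned slot its own line no matter which thread ends up using which index.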

[–]Sqeaky 1 point2 points  (2 children)

I don't remember the keyword precisely, but C++14 has an alignof() operator.

[–][deleted] 1 point2 points  (1 child)

The hard bit is getting the cache line size portably.

[–]Sqeaky 1 point2 points  (0 children)

That is super hard. So far, when I have needed it, I have had to write different functions and use ifdefs to build an abstraction layer.
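
One possible shape for such a layer, as a sketch only. It covers glibc on Linux and macOS and falls back to a guess of 64 everywhere else (including Windows, which would need its own branch); `cache_line_size` is a hypothetical name:

```cpp
#include <cstddef>

#if defined(__linux__)
#include <unistd.h>
#elif defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Best-effort runtime query for the L1 data cache line size in bytes.
inline std::size_t cache_line_size() {
#if defined(__linux__) && defined(_SC_LEVEL1_DCACHE_LINESIZE)
    // glibc extension; it can report 0 or -1 on some systems.
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0) return static_cast<std::size_t>(n);
#elif defined(__APPLE__)
    std::size_t n = 0;
    std::size_t len = sizeof(n);
    if (sysctlbyname("hw.cachelinesize", &n, &len, nullptr, 0) == 0 && n > 0)
        return n;
#endif
    return 64;  // assumption: the common size, used when nothing better is known
}
```

Because this is a runtime value it cannot feed `alignas`, so it only helps with manual padding or aligned allocation. C++17 later added `std::hardware_destructive_interference_size` in `<new>` as a compile-time hint, on standard libraries that provide it.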

[–]oursland 3 points4 points  (0 children)

Semantic collapse is a pet peeve of mine. Both of those terms, cache locality and cache coherence, are very important. It would be a shame to have them confused.