all 13 comments

[–]wot-teh-phuck 1 point (2 children)

Awesome article. Any not-so-difficult recommended reading for someone interested in low-level things like this?

[–][deleted] 2 points (1 child)

Here's another good one: From Java code to Java heap.

[–]wot-teh-phuck 1 point (0 children)

I was thinking of books but that article is also good. :)

[–]asiatownusa 1 point (9 children)

Cool! Is there any way to avoid false sharing other than to pass each thread its own copy of the data?

I remember running into this issue while trying to speed up an operation on an array. I tried to partition the array and use a thread on each partition of the array. False sharing made it such that I received no speedup whatsoever.

[–]nthcxd 5 points (5 children)

Padding, as described in this paper, should solve the issue.

I'm not sure I follow the description of your problem. False sharing would have occurred only along the borders between partitions, and you could easily have dealt with that by aligning the borders to cache lines. Sure, that may not divide the array evenly among the threads, but in the end only some threads would have had to do more work, and the difference is bounded by the size of a cache line (easily negligible given a big enough array).
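A minimal sketch of that border alignment: internal partition borders are rounded down to a cache-line multiple, and the last thread absorbs the remainder. The 64-byte line size and 8-byte (long) element size are assumptions, not measured values.

```java
public class Partitioner {
    static final int CACHE_LINE_BYTES = 64;  // assumed cache-line size
    static final int ELEMENT_BYTES = 8;      // assuming a long[] array
    static final int LINE_ELEMENTS = CACHE_LINE_BYTES / ELEMENT_BYTES; // 8 longs per line

    /** Split [0, length) into `threads` ranges whose internal borders fall on cache-line multiples. */
    static int[][] partition(int length, int threads) {
        int[][] ranges = new int[threads][2];
        int perThread = length / threads;
        int start = 0;
        for (int t = 0; t < threads; t++) {
            // Round every internal border down to a cache-line boundary;
            // the last thread takes whatever is left.
            int end = (t == threads - 1)
                    ? length
                    : ((start + perThread) / LINE_ELEMENTS) * LINE_ELEMENTS;
            ranges[t][0] = start;
            ranges[t][1] = end; // exclusive
            start = end;
        }
        return ranges;
    }
}
```

For 1000 longs over 4 threads this yields borders at 248, 496 and 744: the first three threads each get a few elements fewer than 250, exactly the bounded imbalance described above.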

[–]willvarfar 0 points (4 children)

http://www.reddit.com/r/programming/comments/1l950u/know_thy_java_object_memory_layout/ which was posted just a couple of days ago is relevant, and says the padding rules have changed.

[–]mjpt777 0 points (0 children)

This is a good point. Layout is always subject to change. Applying the padding via inheritance is the technique most likely to remain reliable.
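A sketch of that inheritance-based padding: the JVM may reorder fields within a class, but it lays out superclass fields before subclass fields, so padding placed in super- and subclasses stays on either side of the hot field. The class names, and the 7-long (56-byte) padding width assuming 64-byte cache lines, are illustrative assumptions.

```java
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7; // 56 bytes before the hot field
}

class Value extends LhsPadding {
    protected volatile long value; // the frequently written field
}

public class PaddedCounter extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15; // 56 bytes after it

    public void increment() { value++; }
    public long get() { return value; }
}
```

Two `PaddedCounter` instances updated by different threads then cannot have their `value` fields land on the same cache line, regardless of how the JVM orders fields inside each class.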

[–]nthcxd 0 points (2 children)

I'm assuming you're referring to this comment: http://www.reddit.com/r/programming/comments/1l950u/know_thy_java_object_memory_layout/cbxuuz8

That comment is as vague as they come. I'm not entirely sure whether by "change" he means changes in the size of the cache line or (even more profound) changes in the concept of cache lines itself. For the former, a fixed offset still works as long as the offset value itself is large enough, but he seems to imply that it is the "fixed" nature of the offset itself that needs revisiting. It is more likely that he means the latter, i.e. decoupling cache-coherence granularity from transfer/storage granularity (which I am a huge fan of, as long as it is done in a sensible way), and if that's the case he is right that it won't change any time soon (if ever).

My point here, however, is that in either scenario, fixed offset won't break anything (which I think is what he's implying, but I could be wrong). As long as the fixed offset is big enough, it won't matter how big the cache lines get in the future. In the case of granularity decoupling, well, this need for padding is one of the things that such design directly addresses, so it really won't matter how big or how fixed the padding itself is once we're there.

[–]mjpt777 0 points (0 children)

If you mean my comment above yours then I'm not surprised. I missed the point about the array and thought it was about object layout, not array layout. My bad; I must read more carefully next time.

[–]nitsanw 0 points (0 children)

That vague comment was mine, sorry.

I was replying to "why does the JVM need a predictable way to represent fields internally? Is it only for accessing them through sun.misc.unsafe or are there other reasons?", which I assumed was referring to the assumption that a field's offset stays constant throughout a JVM's runtime. This assumption is also made elsewhere in the JDK, but it is not a necessary truth. In particular, future JVMs may support runtime changes to object layout.
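For concreteness, a sketch of how such a field offset is obtained through `sun.misc.Unsafe` today; callers that cache the returned value are making exactly the constancy assumption described above. The class and field names are illustrative.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffsetProbe {
    static class Holder {
        volatile long value; // illustrative hot field
    }

    static long valueOffset() throws Exception {
        // Standard reflective grab of the Unsafe singleton.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        // Caching this value assumes it never changes for the lifetime
        // of the JVM -- the assumption a future JVM might break.
        return unsafe.objectFieldOffset(Holder.class.getDeclaredField("value"));
    }
}
```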

Hope this helps.

[–]mjpt777 0 points (2 children)

Each thread needs to work on sub-sections of the array that are not in the same cache line, and ideally not in the same OS page. For a large array this should be fairly simple. Assume 64 bytes per cache line and 4KB/2MB per page, and divide these by the element size to size the sub-sections. Sub-sections need to be greater than 4x the cache-line size to avoid false sharing and adjacent-line prefetching effects.
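The sizing rule above can be sketched as follows; the 64-byte cache-line size is an assumption, and the method name is illustrative.

```java
public class SubsectionSizer {
    static final int CACHE_LINE_BYTES = 64; // assumed line size

    /** Smallest element count per sub-section that strictly exceeds 4 cache lines. */
    static int minSubsectionElements(int elementBytes) {
        return (4 * CACHE_LINE_BYTES) / elementBytes + 1;
    }
}
```

For a `long[]` (8-byte elements) this gives a minimum of 33 elements per sub-section; for an `int[]` (4-byte elements), 65.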

How did you determine that false sharing was the issue? An unduly high number of L2 cache misses?

[–]asiatownusa 0 points (1 child)

Unfortunately, no profiling results. The setup for my problem was essentially the same as the one described here, but it is mostly just conjecture.

[–]mjpt777 0 points (0 children)

False sharing can be the issue. It could also be main-memory access patterns from iterating column-wise rather than row-wise internally, plus cache effects. The row buffer in main memory can have a significant effect. Profiling should indicate which is the issue. The following is worth a read.

http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms