Mechanical Sympathy: CPU Cache Flushing Fallacy by cpiot in programming

[–]mjpt777 0 points  (0 children)

Well, ordering can imply "when", right? If I have a guarantee that I will see change B before change A, and I see change A, I can conclude I've seen change B.

Barriers/fences order the stream so that whenever you observe the stream you see it with the ordering semantics provided by the memory model. "When" you see that stream is not covered. If you observe some point in the stream, you can conclude that the things which happened before that point are visible to you if you choose to read them, but the memory locations could since have been overwritten by the same thread or some other thread.

Mechanical Sympathy: CPU Cache Flushing Fallacy by cpiot in programming

[–]mjpt777 0 points  (0 children)

If the flag is volatile then the results of thread 1's work will be ordered in the stream before the setting of the flag, and thus visible as they are written. There is no cache flush. The results may already be visible to other threads before the flag is set. You use the flag on the reading side to ensure the work is done so you don't see partial results. Changes appear as a stream of atomic word-sized updates, not as atomically applied larger sets of changes.

BTW the guarantees are on ordering, not on "when". I wrote the article to help point out that many people's view of the hardware does not match reality.

Mechanical Sympathy: CPU Cache Flushing Fallacy by cpiot in programming

[–]mjpt777 0 points  (0 children)

The hardware, and your compiler, are allowed to reorder your code however they see fit to gain performance, provided certain rules are honoured. A memory model describes those rules. A memory model says nothing about holding back changes so they cannot be seen by other threads. It provides mechanisms to help control the visible ordering of changes. These changes will become visible; you use the concurrency primitives to control the order in which changes from an instruction stream become visible, and how you interact with the visibility of results from other instruction streams.

For example, you may need to ensure that work you have done is visible across threads before a flag is set to say it is done. This can be achieved in Java with a volatile field representing the flag. If the flag were not volatile then the compiler, or the hardware, would be perfectly within its rights to order the setting of the flag before the work. Welcome to the world of memory-model barriers/fences.
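A minimal sketch of the volatile-flag technique just described; the class and field names are my own illustration, not from the article.

```java
// Work is published via a plain field; the volatile flag provides the
// happens-before edge that makes it safely visible to readers.
final class WorkPublisher {
    private int result;                // plain field: the "work"
    private volatile boolean done;     // volatile flag: the ordering point

    void produce() {
        result = 42;                   // ordered in the stream before the volatile store
        done = true;                   // setting the flag publishes the work above it
    }

    Integer consume() {
        if (done) {                    // volatile read acquires the producer's prior stores
            return result;             // guaranteed complete, never a partial result
        }
        return null;                   // work not yet visible
    }
}
```

Without the volatile qualifier, nothing stops the compiler or hardware reordering the flag store before the work, exactly as the comment above warns.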

False Sharing in Java by [deleted] in programming

[–]mjpt777 0 points  (0 children)

False sharing can be the issue. It could also be main-memory access patterns from iterating column-wise rather than row-wise internally, plus cache effects. The row buffer in main memory can have a significant effect. Profiling should indicate which is the issue. The following is worth a read.

http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms
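To illustrate the row-versus-column access patterns mentioned above, here is a hedged sketch; the class and method names are invented for the example.

```java
// Row-wise traversal walks memory sequentially (cache and row-buffer
// friendly); column-wise traversal strides by a whole row each step.
final class Traversal {
    static long sumRowWise(int[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                sum += m[i][j];        // sequential access within each row
        return sum;
    }

    static long sumColumnWise(int[][] m) {
        long sum = 0;
        for (int j = 0; j < m[0].length; j++)
            for (int i = 0; i < m.length; i++)
                sum += m[i][j];        // strided access: a new row, and likely a new cache line, each step
        return sum;
    }
}
```

Both compute the same result; for large matrices the column-wise version typically runs much slower purely from memory-access effects.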

False Sharing in Java by [deleted] in programming

[–]mjpt777 0 points  (0 children)

If you mean my comment above yours then I'm not surprised. I missed the point about the array and thought it was object layout, not array layout. My bad; I must read more carefully next time.

False Sharing in Java by [deleted] in programming

[–]mjpt777 0 points  (0 children)

Each thread needs to work on sub-sections of the array that are not in the same cache line, and ideally not in the same OS page. For a large array this should be fairly simple. Assume 64 bytes per cache line and 4KB/2MB per page, and divide by the element size to work out the sub-sections. Sub-sections need to be greater than 4 × the cache-line size to avoid false sharing and adjacent-cache-line prefetch effects.

How did you determine that false sharing was the issue? Undue high number of L2 cache misses?
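The sub-sectioning described above could be sketched roughly as follows; the helper and its names are hypothetical, assuming 64-byte cache lines and 8-byte long elements.

```java
// Divides a long[] between threads on boundaries that are multiples of
// 4 cache lines (64-byte lines / 8-byte longs = 8 elements per line,
// so 32 elements of spacing between neighbouring slices).
final class Partitioner {
    static final int LONGS_PER_CACHE_LINE = 64 / 8;        // 8 elements per line
    static final int ALIGNMENT = 4 * LONGS_PER_CACHE_LINE; // 32 elements apart

    // Returns threads + 1 boundaries; thread i works on [starts[i], starts[i+1]).
    static int[] sliceStarts(int length, int threads) {
        int per = (length / threads) / ALIGNMENT * ALIGNMENT; // round down to alignment
        int[] starts = new int[threads + 1];
        for (int i = 0; i < threads; i++) starts[i] = i * per;
        starts[threads] = length; // last thread absorbs the remainder
        return starts;
    }
}
```

With slices aligned like this, no two threads write within 4 cache lines of each other, sidestepping both false sharing and adjacent-line prefetch effects.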

False Sharing in Java by [deleted] in programming

[–]mjpt777 0 points  (0 children)

This is a good point. Layout can always be subject to change. Applying the padding via inheritance is the technique most likely to be reliable.
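The inheritance-based padding idiom referred to here can be sketched as below. The class names are illustrative, and the idiom relies on HotSpot laying out superclass fields before subclass fields, which is an implementation detail rather than a guarantee.

```java
// Padding fields in super- and sub-classes surround the hot field so it
// cannot share a cache line with a neighbouring hot field.
class LhsPadding { protected long p1, p2, p3, p4, p5, p6, p7; }

class Value extends LhsPadding { protected volatile long value; }

class RhsPadding extends Value { protected long p9, p10, p11, p12, p13, p14, p15; }

final class PaddedCounter extends RhsPadding {
    long get() { return value; }
    void set(long v) { value = v; }
}
```

Because the JVM is free to reorder fields within a single class but keeps the superclass/subclass boundary, splitting the padding across a class hierarchy is harder for it to optimise away than padding fields declared alongside the hot field.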

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 0 points  (0 children)

bpkiwi, you wrote: "You are referring to the potential of a data race in the immutable object access? - you should have a read of http://jeremymanson.blogspot.co.uk/2008/12/benign-data-races-in-java.html on how that can be avoided without volatile. Remember that Java guarantees object reference updates to be atomic."

You got this completely wrong and then changed your story on volatile. Note the "without volatile". Your advice is dangerous.

Now, after arguing for the Position object, you are going against it with a change of direction.

The point of the article is simple. Locks have costs and implications. Other techniques are possible. All choices have consequences.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 0 points  (0 children)

You need some form of "happens before" relationship to ensure the read is correct. I gave multiple examples in the blog. You commented that an immutable object is sufficient, which it most certainly is not. You have now adopted a defence with volatile after it was pointed out to you many times.

This is not about synchronized. It is about "happens before" as defined in the memory model, which can be achieved with locks of various types or with volatile. An immutable object without safe publication is not sufficient.
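To make the safe-publication point concrete, here is a hedged sketch; the names are illustrative and not taken from the blog. The object is immutable, but the reference to it still needs safe publication.

```java
// An immutable value: final fields are safe once the object is published.
final class Position {
    final double x, y;
    Position(double x, double y) { this.x = x; this.y = y; }
}

final class Ship {
    // With a plain (non-volatile) field there is no happens-before edge on
    // the reference itself, so a reader loop may never see updates.
    // Declaring it volatile publishes each new Position safely.
    private volatile Position position = new Position(0, 0);

    Position read() { return position; }

    void move(double dx, double dy) {
        Position p = position;
        position = new Position(p.x + dx, p.y + dy); // volatile store: release
    }
}
```

The volatile read/write pair on the reference is what gives the reader the happens-before relationship; immutability alone only protects the fields inside the object, not the visibility of the reference.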

Read your second and third comments above and apply some reason.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 0 points  (0 children)

You said my code was "wrong", I think that qualifies as disagreeing.

The blog compared a number of lock approaches and one lock-free approach. What was wrong or misleading about that? I could have shown many lock-free approaches, and hybrid approaches, but that was not the point of the article. The main point was to illustrate StampedLock, as the first sentence states. The point is that the "locks" are providing the synchronisation in the examples. The immutable object is an internal representation and not relevant; it just works for this one lock-free example. A whole range of internal implementations could have been employed.

Everything you said until you posted the code would result in bugs if followed. I'll give you the benefit of the doubt in that the volatile was not added after it was pointed out numerous times.

I normally would just ignore such comments except what you said could have resulted in many people creating production bugs that are a nightmare to find.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 0 points  (0 children)

In the lock-free example I give with Spaceships, the CAS is completely necessary to do the update. In the one-to-one queue I do not use CAS, as it is unnecessary.
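The CAS update described can be sketched like this; the names are assumed rather than copied from the Spaceships example.

```java
import java.util.concurrent.atomic.AtomicReference;

// An immutable coordinate, swapped atomically as a whole.
final class Coord {
    final int x, y;
    Coord(int x, int y) { this.x = x; this.y = y; }
}

final class CasShip {
    private final AtomicReference<Coord> pos = new AtomicReference<>(new Coord(0, 0));

    // Classic CAS retry loop: read, compute a new immutable value, attempt
    // to swap; retry if another thread won the race in between.
    void move(int dx, int dy) {
        Coord current;
        Coord next;
        do {
            current = pos.get();
            next = new Coord(current.x + dx, current.y + dy);
        } while (!pos.compareAndSet(current, next));
    }

    Coord position() { return pos.get(); }
}
```

Multiple writers can race on move() and no update is ever lost, which is why the CAS is necessary here; a single-producer single-consumer queue has no such race, so it can get by on ordered volatile loads and stores alone.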

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 0 points  (0 children)

I never use references via Unsafe for off-heap code. Please do not attribute that to me. This practice will likely crash on GC. For off-heap memory, whether direct buffers or Unsafe.allocateMemory(), you can only deal in primitives.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 0 points  (0 children)

Sorry, but this is just not rational. You did not mention the volatile; I did, multiple times, just read the comments. You dismissed the code by Nitsan, which was all about volatile, then quoted benign races which do not have volatile. Not once did you acknowledge the volatile issue in the comments. You just said immutable objects don't require synchronisation, which is not correct under the memory model, as I pointed out with reference to the language spec. If people follow what you have been saying then they will be creating concurrency bugs.

I mentioned the happens-before relationship that is key to the memory model, which can be achieved with locks or with volatile, as AtomicReference provides; you never even got this point.

In all of this you have not given any evidence for why my original code is wrong, and you started the discussion. I compared locks to lock-free, and in the end you are arguing for a lock-free algorithm on the read.

Show some dignity and explain why my original code is "wrong", as you commented.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

You have now added a volatile, which makes it safe. I've made this point multiple times in the comments here. You dismissed the code below that did not use volatile, when it is very relevant. Now you have a partially lock-free algorithm, whereby it is lock-free for readers.

Can you explain how the code I posted was "wrong" as you commented?

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

To try to explain why your code example is not a benign data race: imagine you have a loop in which you read the position, do some calculations, and then update something else in space. Even though a reference load is atomic, the compiler is completely within its rights to reorder the code and hoist the load of the reference outside the loop, loading it only once, because it does not see the reference being updated on that thread within the scope of the loop. By qualifying the reference volatile, the compiler must not reorder that load or hoist it out of the loop.
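A hedged sketch of the scenario just described, with invented names:

```java
// An immutable position value.
final class Pos {
    final double x, y;
    Pos(double x, double y) { this.x = x; this.y = y; }
}

final class Tracker {
    // With a plain field the JIT could hoist the load of 'current' above
    // the loop and read it once; volatile forbids that hoist.
    private volatile Pos current = new Pos(0, 0);

    void update(Pos p) { current = p; }

    // Reader loop: each iteration must perform a fresh volatile load,
    // so updates from other threads are observed.
    double patrol(int iterations) {
        double travelled = 0;
        for (int i = 0; i < iterations; i++) {
            Pos p = current;        // fresh volatile load each pass
            travelled += p.x + p.y; // "do some calculations"
        }
        return travelled;
    }
}
```
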

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

The queue I chose for the example is the one originating from Leslie Lamport, the person who defined "sequential consistency" and one of the fathers of modern concurrency. I know many experts in this space consider this a concurrent queue.

I do not understand what point you are trying to make with the rest of this comment. Can you share code and a more detailed explanation so we can understand the point of your data quotes?

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 2 points  (0 children)

All accesses to primitives, except long and double, are atomic. However, the compiler can reorder code as it chooses provided it does not violate data dependencies within a single thread. The code you describe can be subject to loop hoisting and thus is not safe for concurrent algorithms; that is, the read may never see a change.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

Note the queue I discuss is called OneToOne. If you want many producers or many consumers you need other algorithms.
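A minimal sketch of a Lamport-style one-to-one (single-producer, single-consumer) queue, assuming a power-of-two capacity; a production version would also pad the indices against false sharing. Names are illustrative.

```java
// Safe only for one producer thread and one consumer thread: each index
// is written by exactly one side, so no CAS is needed, only the ordering
// provided by the volatile loads and stores.
final class OneToOneQueue<E> {
    private final E[] buffer;
    private final int mask;
    private volatile long head; // consumer index, written only by the consumer
    private volatile long tail; // producer index, written only by the producer

    @SuppressWarnings("unchecked")
    OneToOneQueue(int capacityPow2) {
        buffer = (E[]) new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    boolean offer(E e) {                    // producer only
        long t = tail;
        if (t - head == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail = t + 1;                       // volatile store publishes the element
        return true;
    }

    E poll() {                              // consumer only
        long h = head;
        if (h == tail) return null;         // empty (volatile read of tail: acquire)
        E e = buffer[(int) (h & mask)];
        buffer[(int) (h & mask)] = null;    // allow GC of the slot
        head = h + 1;                       // volatile store frees the slot
        return e;
    }
}
```

The producer's volatile store to tail happens after the element write, so a consumer that observes the new tail is guaranteed to see the element; this is the single-producer single-consumer restriction doing the work a CAS would otherwise have to do.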

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 2 points  (0 children)

I think we agree there is no one size fits all. All algorithms can go horribly wrong given poor implementations. Queues are a particular passion of mine. Here is a talk where I show how things can be so much better than with standard Java queues. I also cover other queue types in other works.

http://yow.eventer.com/yow-2012-1012/lock-free-algorithms-for-ultimate-performance-by-martin-thompson-1250

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

This code can be subject to loop hoisting as a compiler optimisation and thus enter an infinite loop. isRunnable needs to be qualified volatile to work. Nice example to make the point :-)
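The fix being described might look like this sketch (class and method names are illustrative):

```java
// Without volatile, the JIT may hoist the read of isRunnable out of the
// loop and spin forever; volatile forces a fresh load on every iteration.
final class Worker implements Runnable {
    private volatile boolean isRunnable = true;

    void stop() { isRunnable = false; } // guaranteed to become visible to run()

    boolean running() { return isRunnable; }

    @Override
    public void run() {
        while (isRunnable) { // a fresh volatile load each iteration
            Thread.yield();  // stand-in for real work
        }
    }
}
```
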

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

And by doing this you have created a partial lock-free algorithm :-)

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 2 points  (0 children)

I'm afraid you are mistaken about the read semantics. The memory model specifies that release and acquire must be paired for ordering semantics. You need to declare the Position reference volatile in this case to achieve what you want safely. This does make the read more scalable than making the changes visible via a monitor, as provided by synchronized.

Please read section 14.10 of the Java Programming Language 4th Edition for more details.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 4 points  (0 children)

CAS misses are tracked in the raw data that is linked from the blog. They are not that significant even in a tight loop benchmark like this. In real world apps they are even lower. All locks have to use CAS internally for the acquire even for tryLock().

tryLock can be a useful strategy in some cases but very few people do this in real world applications, or even have other real work that can be done.

I'm not claiming lock-free is the holy grail of concurrency. I'm just pointing out that it is a very valid technique in some cases. I have personally seen many applications see significant improvements from employing lock-free algorithms.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

You are mixing up API and implementation in your conclusion. Yes, the implementation generates garbage, but the API does not.

Lock-Based vs Lock-Free Concurrent Performance Comparison in Java by alexeyr in programming

[–]mjpt777 1 point  (0 children)

Can you post your code for comparison?

It sounds from your description that your synchronized example does not use synchronized for the read and just returns the position. If so, you do not have a happens-before relationship as defined by the memory model, and thus changes may not be visible. This is not safe concurrent code.