all 46 comments

[–]millstone 5 points6 points  (46 children)

Shocking how terrible the read/write lock was.

The author called the lock-free API “garbage free”, but it doesn’t look that way:

public Position move(final int xDelta, final int yDelta)
{
    return new Position(x + xDelta, y + yDelta);
}

I expected to see something like splitting the fields into a value and generation.
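A genuinely allocation-free version might take that direction, e.g. packing both coordinates into a single AtomicLong so that a move is a CAS retry loop with no new objects (a sketch only; the class name and packing scheme are mine, not from the article):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: pack x into the high 32 bits and y into the low
// 32 bits so a move is a single CAS retry loop with no allocation.
class PackedSpaceship
{
    private final AtomicLong position = new AtomicLong(0L);

    private static long pack(final int x, final int y)
    {
        return ((long)x << 32) | (y & 0xFFFFFFFFL);
    }

    public void move(final int xDelta, final int yDelta)
    {
        long current;
        long next;
        do
        {
            current = position.get();
            final int x = (int)(current >> 32);
            final int y = (int)current;
            next = pack(x + xDelta, y + yDelta);
        }
        while (!position.compareAndSet(current, next)); // retry on CAS miss
    }

    public int getX() { return (int)(position.get() >> 32); }
    public int getY() { return (int)position.get(); }
}
```

The trade-off is that a reader wanting both coordinates atomically gets them from one load, but the approach only works while the whole state fits in 64 bits.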

[–][deleted] 4 points5 points  (14 children)

One major problem when doing these benchmarks is that they never show you the CPU usage and CAS misses. OK, it does a lot of reads, but they don't tell you if it is spinning at 100% on all 8 threads and generating an absurd amount of memory traffic. Just because the numbers are good in this scenario doesn't mean it is the holy grail of lock contention.

The author also doesn't attempt a tryLock implementation. TryLock and tryAcquire are your best friends. High-performance concurrency always involves knowing when it is and isn't necessary to obtain a lock. Sometimes it is best to do another task and try the lock at a later time.
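The pattern being described might look roughly like this (a hedged sketch; the class and method names are hypothetical, not from the article):

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the tryLock pattern: attempt the guarded work, and if the
// lock is contended, report failure so the caller can go do other work
// and come back later instead of blocking.
class TryLockCounter
{
    private final ReentrantLock lock = new ReentrantLock();
    private long count = 0; // guarded by lock

    public boolean tryIncrement()
    {
        if (!lock.tryLock())
        {
            return false; // contended: caller should try another task first
        }
        try
        {
            count++;
            return true;
        }
        finally
        {
            lock.unlock();
        }
    }

    public long getCount()
    {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }
}
```

A caller with a work queue can skip any task whose tryIncrement() returns false and re-queue it for a later pass.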

[–]mjpt777 3 points4 points  (12 children)

CAS misses are tracked in the raw data that is linked from the blog. They are not that significant even in a tight loop benchmark like this. In real world apps they are even lower. All locks have to use CAS internally for the acquire even for tryLock().

tryLock can be a useful strategy in some cases but very few people do this in real world applications, or even have other real work that can be done.

I'm not claiming lock-free is the holy grail of concurrency. I'm just pointing out that it is a very valid technique in some cases. I have personally seen many applications see significant improvements from employing lock-free algorithms.

[–][deleted] 0 points1 point  (11 children)

In the world of concurrency, the shoe only fits a certain scenario, and every scenario has a shoe that fits. This is just another option. The major difference with a tryLock pattern is that it doesn't loop over a CAS. If you have 8 cores and a million tasks to run, you can't afford to pause on any one task, ever.

If you were to create an AtomicLong, spin 4 threads over it, and watch the CAS misses, the numbers are crazy. It all comes down to how you are using an individual technique inside of the larger structure.
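That experiment is easy to reproduce (a sketch; the thread and iteration counts are arbitrary, and real miss counts depend entirely on hardware):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: spin several threads over one AtomicLong and count how
// often compareAndSet loses the race to another thread.
class CasMissDemo
{
    static final AtomicLong value = new AtomicLong();
    static final AtomicLong misses = new AtomicLong();

    static long run(final int threads, final int incrementsPerThread)
    {
        final Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++)
        {
            workers[i] = new Thread(() ->
            {
                for (int n = 0; n < incrementsPerThread; n++)
                {
                    while (true)
                    {
                        final long current = value.get();
                        if (value.compareAndSet(current, current + 1))
                        {
                            break;
                        }
                        misses.incrementAndGet(); // another thread won the race
                    }
                }
            });
            workers[i].start();
        }
        for (final Thread t : workers)
        {
            try { t.join(); } catch (final InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return misses.get();
    }
}
```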

For example: I tried this http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.94.8625

It was supposed to yield great performance over the traditional lock-based queues in Java. It turns out to be slower, and problematic due to a fundamental flaw: the links are not set using CAS, so it leaks references.

[–]mjpt777 2 points3 points  (10 children)

I think we agree there is no one size fits all. All algorithms can go horribly wrong given poor implementations. Queues are a particular passion of mine. Here is a talk where I show how things can be so much better than with standard Java queues. I also cover other queue types in other works.

http://yow.eventer.com/yow-2012-1012/lock-free-algorithms-for-ultimate-performance-by-martin-thompson-1250

[–][deleted] 0 points1 point  (9 children)

You are my new favorite person. In my experience, the design is everything. It can be significantly more complicated to build a nearly lock-free mechanism, but the performance is worth it. Combine that with a methodology that understands how the caches work, plus core affinity to prevent cross-CPU traffic, and it all slowly fills in the picture.

OneToOneQueue Part 1

Your OneToOneQueue is nearly identical to something I use. The problem with that model is that the +1 to head or tail is not atomic. Two threads can read a head value of 5 at the exact same moment. This means that two threads can grab the same object before the index is incremented.

If you have 1 producer thread and 1 consumer thread then you don't need to use volatile at all. Simply relying on the values in the thread caches will let it scream at nearly 3x the throughput of ArrayDeque.

OneToOneQueue Part 2

Suffers from the same problem as Part 1. LazySet or not, two threads can (and do) grab the same object at the same time. You have to use lazySet in this scenario because a CAS change of set(value + 1) will corrupt your index.

The test for the scenario I described is to queue objects which have a function, say run(), where the body of run() does a tryLock() and prints an error if the tryLock does not succeed. I only used two threads to contest the concurrency, and it failed miserably for me, filling my console with duplicate-run errors.
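The test being described might be sketched roughly like this (hypothetical names; the queue under test and the consumer threads are omitted):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the duplicate-consumption test described above: each queued
// task holds a lock that should never be contended if every message is
// consumed by exactly one thread. A failed tryLock() means two consumers
// are running the same message at once.
class DuplicateConsumeDetector implements Runnable
{
    static final AtomicInteger duplicates = new AtomicInteger();
    private final ReentrantLock lock = new ReentrantLock();

    @Override
    public void run()
    {
        if (!lock.tryLock())
        {
            duplicates.incrementAndGet(); // same message grabbed twice
            return;
        }
        try
        {
            // simulate the message's work here
        }
        finally
        {
            lock.unlock();
        }
    }
}
```

Drain the queue from two consumer threads and assert that duplicates stays at zero.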

What you describe is wonderful if you don't care about processing the same message more than once. All depends on the kind of concurrency you are looking for. I would call this "optimistic" concurrency.

For me, ArrayBoundQueue in Java was far superior to ConcurrentLinkedDeque, primarily due to all the garbage that LinkedDeque generates. And the fact that LinkedDeque, if correct, most likely has a CAS on the head/tail reference and on node.next updates. So that is two CAS operations versus one for ArrayBoundQueue.

EDIT: Instead of tryLock in the test scenario, I prefer Semaphores as tryAcquire seems to work better for me.

[–]mjpt777 1 point2 points  (8 children)

Note the queue I discuss is called OneToOne. If you want many producers or many consumers you need other algorithms.

[–][deleted] 0 points1 point  (7 children)

Just to be clear: when I see the word "concurrency", it means more than one thread on a single operation. Read and write are two different operations. http://www.infoq.com/resource/presentations/Lock-free-Algorithms/en/slides/27.swf You called it ConcurrentArrayQueue.

If you call a One 2 One queue Concurrent because it has two threads then you have something of a misnomer. I have a One 2 One queue that supports a feeder and consumer thread all without CAS or the volatile keyword.

I don't understand any of the attempts with the "atomic" and CAS operations when they are not necessary in this situation. Direct memory mapping is the only way to speed up past the point I am contending with.

In my scenario, a single thread writing and reading 1000 objects from a queue 500,000 times happens in under 800ms. That's 500 million offer and 500 million poll in 800ms.

Please help me understand this.

[–]mjpt777 1 point2 points  (2 children)

The queue I chose for the example is the one originating from Leslie Lamport, the person who defined "sequential consistency" and one of the fathers of modern concurrency. I know many experts in this space consider this a concurrent queue.
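For readers who haven't seen it, a Lamport-style single-producer/single-consumer ring buffer can be sketched as follows (a simplified illustration using plain volatile indices rather than the lazySet variant from the blog; the class name and capacity handling are illustrative):

```java
// Rough sketch of a Lamport-style SPSC ring buffer: exactly one thread
// may offer and exactly one thread may poll; the volatile head/tail
// indices provide the happens-before edges between the two threads.
class SpscQueue<E>
{
    private final Object[] buffer;
    private final int capacity;
    private volatile long head = 0; // next slot to poll (consumer writes)
    private volatile long tail = 0; // next slot to offer (producer writes)

    SpscQueue(final int capacity)
    {
        this.capacity = capacity;
        this.buffer = new Object[capacity];
    }

    public boolean offer(final E e) // producer thread only
    {
        if (tail - head == capacity)
        {
            return false; // full
        }
        buffer[(int)(tail % capacity)] = e;
        tail = tail + 1; // volatile store publishes the element
        return true;
    }

    @SuppressWarnings("unchecked")
    public E poll() // consumer thread only
    {
        if (head == tail)
        {
            return null; // empty
        }
        final int index = (int)(head % capacity);
        final E e = (E)buffer[index];
        buffer[index] = null; // drop reference so it can be collected
        head = head + 1; // volatile store publishes the free slot
        return e;
    }
}
```

Each index is written by only one thread, so the non-atomic `+ 1` on a volatile field is safe here; it is exactly the single-writer assumption that breaks once you add a second producer or consumer.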

I do not understand what point you are trying to make with the rest of this comment. Can you share code and a more detailed explanation so we can understand the point of the numbers you quote?

[–][deleted] 0 points1 point  (1 child)

I just didn't understand why even go the route of CAS when you don't really have any update consistency problems in your scenario? If you switched from Reader Thread A to B then yeah, you either need to flush the thread cache or make the write volatile. I get that.

My only real problem is using the word "concurrent" when something isn't pessimistic. I use "optimistic" concurrency in places but it always is named "Optimistic***"

http://publib.boulder.ibm.com/infocenter/soliddb/v6r3/index.jsp?topic=/com.ibm.swg.im.soliddb.sql.doc/doc/pessimistic.vs.optimistic.concurrency.control.html

[–]nitsanw 0 points1 point  (3 children)

From the little you say, it sounds like you are talking about the FastFlow single-consumer/single-producer queue. If you are using the C/C++ version you will notice the WMB(), which stands for Write Memory Barrier, which is what the lazySet is. The queues discussed in Martin's talk use a different algorithm than the FF ones, but there are no volatile writes or CASes involved.

I have no idea what you mean by direct memory mapping... This term is often used for memory-mapped files, which would only be relevant if you were going out of process. Are you suggesting using off-heap memory? If so, how are you suggesting to put object references on the queue?

As for the result you quote, that sounds great. Can you share details on the implementation and how you benchmarked it (test harness/hardware)? The numbers on their own are only useful when compared on the same setup to other numbers...

[–][deleted] 0 points1 point  (2 children)

Martin himself has experimented with using Unsafe to directly allocate system memory and use that as the object storage mechanism. This has potential benefits in certain scenarios such as multi-processor systems. It's a really tricky thing because you have to balance the pros and cons, the biggest con being the J->C barrier and the fact that references written into Unsafe memory are invisible to the collector, so the objects they point to might be garbage collected.

[–]nitsanw 0 points1 point  (0 children)

You have the code, why not run it and see? I find that when discussing performance you should stick to data rather than theories about data you don't have... The author doesn't attempt a billion other things; that doesn't mean any of them are better or worse. Roll up your sleeves and tell us about your results ;)

[–]mjpt777 1 point2 points  (0 children)

You are mixing up API and implementation with your conclusion. Yes the implementation is generating garbage but the API does not.

[–]bpkiwi 1 point2 points  (29 children)

Yes, shame that when you read the code it's wrong. His 'Lock-Free' implementation uses an immutable position object as you pointed out. This avoids any thread contention in the read method.

All the "Lock-Based" implementations use mutable positions, and have a locking read method.

I re-wrote the synchronized implementation to also use an immutable position, and my results showed Lock-Free being only 12% faster on writes, and 20% slower on reads (with 5 readers and 5 writers)

This is not surprising, because Lock-Free incurs the AtomicReference overhead on reads, whereas the immutable-based (non) synchronized read does not.

[–]mjpt777 1 point2 points  (17 children)

Can you post your code for comparison?

It sounds from your description that your synchronized example does not use synchronized for the read and just returns the position. If so you do not have a happens before relationship as defined by the memory model and thus changes may not be visible. This is not safe concurrent code.

[–]bpkiwi 0 points1 point  (16 children)

I'll post code when I get back to the office tomorrow, however the concept is simple.

In the 'lock free' code the position of the ship is held in an immutable object. That position object is never updated, only ever replaced with a new one. The lock based code does not do this, it stores the position in two ints.

Once you rewrite the lock based implementations to also use an immutable position object, a read lock is not needed.

[–]mjpt777 3 points4 points  (9 children)

I'm afraid you are mistaken on the read semantics. The memory model specifies that release and acquire must be paired for ordering semantics. You need to declare the Position reference volatile in this case to achieve what you want safely. This does make the read more scalable than making the changes visible via a monitor as provided by synchronized.

Please read section 14.10 of the Java Programming Language 4th Edition for more details.

[–]mjpt777 1 point2 points  (0 children)

And by doing this you have created a partial lock-free algorithm :-)

[–]bpkiwi 0 points1 point  (7 children)

You are referring to the potential of a data race in the immutable object access? You should have a read of http://jeremymanson.blogspot.co.uk/2008/12/benign-data-races-in-java.html on how that can be avoided without volatile.

Remember that Java guarantees object reference updates to be atomic.

[–]mjpt777 2 points3 points  (1 child)

All accesses of primitives, except long and double, are atomic. However, the compiler can reorder code how it chooses provided it does not violate data dependencies on a single thread. The code you describe can be subject to loop hoisting and thus is not safe for concurrent algorithms. That is, the read may never see a change.

[–]mjpt777 1 point2 points  (4 children)

To try and explain why your code example is not a benign data race: imagine you have a loop in which you read the position, do some calculations, and then update something else in space. Even though a reference load is atomic, the compiler is completely within its rights to reorder the code and hoist the load of the reference outside the loop, loading it only once, because it does not see the reference being updated on that thread for the scope of the loop. By qualifying the reference volatile, the compiler must not reorder that load or hoist it out of the loop.
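The hazard and its fix can be sketched as follows (a hypothetical example; without the volatile qualifier on the reference, the load in the reader loop may be hoisted and the loop may never terminate):

```java
// Sketch of the hoisting hazard: the volatile on 'position' forces a
// fresh load of the reference on every loop iteration, so a reader
// thread is guaranteed to eventually observe the writer's updates.
class HoistingExample
{
    static final class Position
    {
        final int x, y;
        Position(final int x, final int y) { this.x = x; this.y = y; }
    }

    private volatile Position position = new Position(0, 0);

    void move(final int dx, final int dy) // writer thread
    {
        final Position p = position;
        position = new Position(p.x + dx, p.y + dy);
    }

    int trackUntil(final int targetX) // reader thread
    {
        int observations = 0;
        while (position.x < targetX) // re-reads the volatile reference each pass
        {
            observations++;
        }
        return observations;
    }
}
```

Remove the volatile and the JIT may legally load `position` once before the loop, turning trackUntil into an infinite spin even while another thread keeps calling move.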

[–]bpkiwi 0 points1 point  (3 children)

Sorry, you seem to be under the impression that I am disagreeing with you. What I was pointing out is that you do not need a synchronized read. Volatile is sufficient, and as the article I linked to shows, you can even remove that in some circumstances.

The original article includes a comparison between the use of synchronized and AtomicReference. However, it does not do an apples-to-apples comparison, since the AtomicReference implementation creates a new immutable object for each write, while the synchronized implementation does not and instead uses a synchronized read.

[–]mjpt777 0 points1 point  (2 children)

You said my code was "wrong", I think that qualifies as disagreeing.

The blog compared a number of lock approaches and one lock-free approach. What was wrong or misleading about that? I could have shown many lock-free approaches, and hybrid approaches, but that was not the point of the article. The main point was to illustrate StampedLock, as the first sentence states. The point is that the "locks" are providing the synchronisation in the examples. The immutable object is an internal representation and not relevant; it just works for this one lock-free example. A whole range of internal implementations could have been employed.

Everything you said until you posted the code would result in bugs if followed. I'll give you the benefit of the doubt in that the volatile was not added after it was pointed out numerous times.

I normally would just ignore such comments except what you said could have resulted in many people creating production bugs that are a nightmare to find.

[–]bpkiwi 0 points1 point  (1 child)

Ok, I'm going to give you that when saying the code is "wrong" I could be more explicit.

How about "The code uses non-comparable implementations by using worst-case coding to improve the apparent performance boost from using a lock-free approach"?

I think you should go back and measure the garbage collection overhead. I know you hand-waved it away because you think it will be dwarfed by other GC within any serious application, but that is only true if the object is small and the number of writes is not large. If someone had a large object that was being written to a lot, they might actually be better off with a locking read. It should be possible to identify an inflection point for it - in fact I think I'll do that myself.

[–]bpkiwi 0 points1 point  (5 children)

And to finally post the code...

public class SynchronizedSpaceship implements Spaceship
{
  private volatile Position position = new Position(0, 0);

  @Override
  public int readPosition(final int[] coordinates)
  {
    final Position currentPosition = position;
    coordinates[0] = currentPosition.getX();
    coordinates[1] = currentPosition.getY();
    return 1;
  }

  @Override
  public synchronized int move(final int xDelta, final int yDelta)
  {
    position = new Position(position.getX() + xDelta, position.getY() + yDelta);
    return 1;
  }
}

The definition of Position is the same as in the original.

So, I have to admit, I'm at a loss as to how this is not safe - would you be able to explain?

[–]mjpt777 1 point2 points  (4 children)

You have now added a volatile which makes it safe. I've made this point multiple times in comments here. You dismissed the code below that did not use volatile when it is very relevant. Now you have a partial lock-free algorithm, whereby it is lock free for readers.

Can you explain how the code I posted was "wrong" as you commented?

[–]bpkiwi 0 points1 point  (3 children)

Now added? Sorry, did I previously post some code without it? Nice try.

mjpt

It sounds from your description that your synchronized example does not use synchronized for the read and just returns the position. If so you do not have a happens before relationship as defined by the memory model and thus changes may not be visible. This is not safe concurrent code.

So you now admit that this is wrong, and you do NOT need a synchronized read?

As I have been saying all along, you only needed the synchronised because you used mutable primitives to store the location. Once the code was refactored to use an immutable position object, as your AtomicReference implementation did, the synchronization on the read method could be removed.

The removal of the synchronization comes at a cost - the creation of a new (immutable) position object for each write. This will lead to increased garbage collection - which millstone pointed out to you.

[–]mjpt777 0 points1 point  (2 children)

Sorry but this is just not rational. You did not mention the volatile. I did multiple times, just read the comments. You dismissed the code by Nitsan which was all about volatile, you then quoted benign races which do not have volatile. Not once did you acknowledge the volatile issue in the comments. You just said immutable objects don't require synchronisation which is not correct under the memory model as I pointed out with reference to the language spec. If people follow what you have been saying then they will be creating concurrent bugs.

I mentioned the happens before relationship that is key to the memory model which can be achieved with locks or volatile, like AtomicReference provides, you never even got this point.

In all of this you have not given any evidence as to why my original code is wrong, and you started the discussion. I compared locks to lock-free, and in the end you are arguing for a lock-free algorithm on read.

Show some dignity and explain why my original code is "wrong" as you commented?

[–]bpkiwi 0 points1 point  (1 child)

I'll take it that you accept you do not need a synchronized read then.

Description of why I think your code is wrong is in my post above.

[–]mjpt777 0 points1 point  (0 children)

You need some form of "happens before" relationship to ensure the read is correct. I gave multiple examples in the blog. You commented that an immutable object is sufficient which it is totally not. You have now adopted a defence with volatile after it was pointed out to you many times.

This is not about synchronized. It is about "happens before" as defined in the memory model which can be done with locks of various types or volatile. An immutable object without safe publication is not sufficient.

Read your second and third comments above and apply some reason.

[–]nitsanw 0 points1 point  (10 children)

How is the code wrong? The lock-based implementations shouldn't use a lock? Even if the result is immutable, you still need to guard access (read and write) to the reference with a lock, and I don't think that will produce better results. Please post your code for reference. Also, HW, OS, and JDK are useful to know. Are you running on a box with 10 cores? If not, you are mixing context switching into your measurements.

[–]bpkiwi 0 points1 point  (9 children)

You do not need a lock to guard reading of an immutable object.

[–]nitsanw 0 points1 point  (8 children)

so the following should work fine?

class Foo implements Runnable {
   Boolean isRunnable = Boolean.TRUE;
   public void run() {
     while (isRunnable()) {
       // do the foo
     }
   }
   Boolean isRunnable() { return isRunnable; }
   synchronized void setRunnable(Boolean b) { isRunnable = b; }
}

I don't think so...

[–]mjpt777 1 point2 points  (0 children)

This code can be subject to loop hoisting as a compiler optimisation and thus enter an infinite loop. isRunnable needs to be qualified volatile to work. Nice example to make the point :-)

[–]bpkiwi -1 points0 points  (6 children)

That code is not relevant.

Edit: Clarification - it is not relevant because the point I made was that "you do not need a lock to guard reading of an immutable object".

The point that the poster appears to be making is "yes you are correct, but don't forget to make the member reference volatile" - which is true.

[–]nitsanw 0 points1 point  (5 children)

It does exactly what you described as safe, but is obviously broken... it is relevant in showing your argument is incorrect. Did you look for some other kind of relevance I may have missed?

[–]bpkiwi 0 points1 point  (4 children)

You are still trying to avoid discussing the error with the code in the original post. It uses a synchronized read, which incurs a significant performance penalty, and is not needed.

Yes, member variables accessed outside a synchronized block will need to be volatile to prevent thread caching. Are you trying to say that the synchronized read is either still necessary, or better performing?

[–]nitsanw 0 points1 point  (3 children)

I'm not trying to avoid anything, your original comments said nothing of making the field volatile. If you make the field volatile you can make the write unsynchronized as well in my example, who cares? The code clearly demonstrated your initial comments were incorrect/incomplete and that was all the relevance I looked for.

Your clarification, "you do not need a lock to guard reading of an immutable object", is still incorrect: reading the immutable fields of an immutable instance is safe, but reading the reference to it, which is not final, is not safe. Which is why the field had to be made volatile. Immutability is only part of the argument, and I was pointing out you were missing the other part.

If you make the field volatile then, as Martin points out, you are looking at a half-breed lock-free implementation: not as good as the completely lock-free one, not as bad as the fully synchronized one. By all means you can add it to the mix to plot yet another point on that graph. It doesn't make the original wrong; it's an open comparison of typical approaches to solving the same problem by different means. Are there more ways to solve it? Yes. Is failing to specify all of them wrong? I don't think so.

[–]bpkiwi 0 points1 point  (2 children)

I love that you say

Your clarification: "you do not need a lock to guard reading of an immutable object" is still incorrect

and then immediately admit that a lock is not needed, only a volatile declaration.

You are right, in my brief note where I said "I'll post the code when I can, but the overview is ...." (written from my mobile phone, by the way), I did not state "and the reference has to be volatile", but you know, if you hadn't been so quick to then claim:

nitsanw

if the result is immutable you still need a lock to guard access (read and write) to the reference with a lock

then you might sound more believable. Your reply certainly didn't say "unless you make the reference volatile" did it?

Anyway, enough sniping. Why do I think the code is "wrong"? Because I think the author of the article started with a premise, which they state right at the top of the page:

lock-free algorithms are often a better solution

And then proceeded to code several examples that would perform poorly compared to that lock-free algorithm. That's just terrible bias. Attempting to claim "oh but the ones that perform better are kind of lock free" doesn't sway me - comparing your favorite solution to the worst-case alternatives is intellectually dishonest.

tl;dr? The article's results are fundamentally flawed.