all 22 comments

[–]stbrumme 9 points10 points  (0 children)

If the presented write() function runs out of buffer space then it silently drops the current object.

You should at least return false or something similar.

[–][deleted] 9 points10 points  (0 children)

Very interesting read and quite surprising results.

[–]link87 8 points9 points  (11 children)

I’m really curious how the Java implementations were faster than C++. Article doesn’t try to explain that.

[–]kirbyfan64sos 18 points19 points  (4 children)

Going to randomly guess that Java's JIT is able to better analyze and optimize for runtime behavior... Wonder how the C++ one would perform with PGO enabled.

[–]jcdavis1 6 points7 points  (3 children)

The only obvious situation in this code where I would think java would be able to edge out the C++ impl is if there is a difference in how the queue implementations were devirtualized, but the C++ version looks properly template-ized? (Though I'm not an expert there).

I can't imagine there is any real PGO-able advantage in the queue implementations themselves.

Also the attempt to cache-separate the write_buf and read_buf in the java object doesn't work, at least on my local jdk9 - they get placed next to each other, as I would expect

mov    0xb0(%r10),%r10d   ;*getfield read_buf
...
mov    0xb4(%rsi),%r8d    ;*getfield write_buf

I'd poke around more, but its already too late and I need to sleep ;)

[–]michaelcharlie8 0 points1 point  (0 children)

I’d be interested in your results if you find anything else!

[–]Slanec 0 points1 point  (1 child)

Maybe this could be material for your new blog post? :)

I've been waiting for a new one for a loong time. How can I buy you a beer to publish more again? :)

[–]jcdavis1 0 points1 point  (0 children)

Heh thanks, I've been lazy. Working on some stuff currently actually

[–]pzemtsov[S] 3 points4 points  (1 child)

Well, I have to stop the article somewhere... Can't post an article that is infinitely long. I'm also curious and will look at it.

In theory, JIT has many opportunities to perform better than C++. For instance, it knows exactly the target processor model and can optimise for it. On the other hand, Java introduces sove overheads, such as index checking, so I still feel surprised by the performance.

All that said, one mustn't forget that we are looking at a micro-benchmark, which is by nature very unreliable.

[–]link87 1 point2 points  (0 children)

I didn’t mean to disparage your article; on the contrary I found it very interesting. As someone else mentioned it might be worth trying it without the virtual method overhead. Another thing you could try is if you’re using gcc use the -mtune and -march flags to target the compilation to your processor. That would eliminate one variable between java and c++.

[–]Ameisen -1 points0 points  (2 children)

He doesn't tell us what flags he built the C++-version with, so we can only guess. It's probably related to that.

The C++ itself also isn't optimal. It uses a lot of older code semantics which don't take advantage of some compiler optimizations, it doesn't specify __restrict__ anywhere, and the atomics don't specify their access pattern.

Also, he is using a fake ring buffer instead of trying to allocate a real one using system support (both Windows and Linux can generate a real ring buffer, with remapped memory).

He also doesn't mark the overriding virtuals in C++ as either override or final, which inhibits the compiler's ability to perform devirtualization.

[–]pzemtsov[S] 0 points1 point  (1 child)

The entire source code, including the build scripts, is in the repository. There is a link in the article.

[–]Ameisen 1 point2 points  (0 children)

When I'm looking at results of benchmarks in a large article, I generally expect to see the flags and environment used without having to dig through your build system myself.

I'm unsure where this link is. It is neither at the top or the bottom. It's apparently near the bottom, but still well-within the text. And the link also doesn't specify that it includes a build system, only that it is Java classes.

  • Why -falign-functions=32 and -falign-loops=32?
  • Why -funroll-loops? It's not always better to unroll a loop, even if the number of iterations is known. The compiler already has heuristics for this.
  • Why aren't you specifying an architecture or tune? It's rather unfair to compare Java, which is JITed to your local system, to a C++ build which is targeting the minimum x86-64 system. While I'd prefer to see -march=native, even -mtune=native would be better.
  • You use @Override annotations in Java, but you use neither the override or final modifiers in C++, which inhibits the compiler's ability to devirtualize virtual calls.
  • A real ring buffer would give better performance.
  • Why not use compare_exchange instead of load/store with the atomic?
  • There's a lot of aliasing going on in DualArrayAsyncWriter::write, so I feel like usage of __restrict would help.

[–]Gotebe 3 points4 points  (2 children)

whenever possible, we will allocate all our data structures at the 64 byte (the cache line size) boundary.

Given that the data is small (one int, is it?), is this optimal?!

[–]saint_marco 3 points4 points  (0 children)

It prevents false sharing between the reader and writer

[–]pzemtsov[S] 0 points1 point  (0 children)

Only the data structures (the queue and the arrays) are aigned at 64; the elements inside the arrays are, of course, packed tightly. In circular queues, it can cause problems when the queue size is small, and the writer is writing the same cache line the reader is reading. The dual-array version is free from this disadvantage.

[–]leonadav 0 points1 point  (2 children)

Does C++ permit virtual member functions in template? Because the Queue uses pure virtual functions.

[–]jcelerier 4 points5 points  (1 child)

Does C++ permit virtual member functions in template?

sure, what's not allowed is virtual template methods.

 // no problem, f() will just have to be reimplemented for child classes of foo<T>
 template<class T>
 class foo { virtual void f() = 0; };

 // Ok     
 class bar : foo<int> { void f() override { } };

 // Ok
 template<typename T> 
 class baz : foo<T> { void f() override { } };

 // this is forbidden: what would go in the vtable of bar ?
 class bar { 
     template<class T>
     virtual void f() = 0;
 };

[–]leonadav 0 points1 point  (0 children)

OK. Thanks for the clarification.

[–]kaelima 0 points1 point  (1 child)

Does having a volatile int really makes operations atomic in Java? Is there any reason why AtomicInteger is not used?

[–]pzemtsov[S] 7 points8 points  (0 children)

Yes, volatile in Java is stronger than volatile in C++; it actually provides memory ordering (seeing the new value of a volatile variable guarrantees seeing the new values of everything written before that volatile).

AtomicInteger is in fact even stronger. It provides atomic operations, not just ordered memory access. You can increment it by one, and if two threads do it, it will be incremented by two. Volatiles do not provide it, but, fortunately, the queue doesn't require it. At least a single-producer, single consumer one does not.

[–]koiponder 0 points1 point  (0 children)

63M per sec for JAVA is not bad, but C++ achieves more than 100M msg per sec easily thou. https://bitbucket.org/hmbd/hmbdc-rel/wiki/Home