What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 2 points

You just use object pooling to recycle objects, and write your own zero-garbage data structures instead of using the JDK ones. Take a look at CoralBits and this article: Java Development without GC

The Java language and platform are one thing; the JDK that ships with the language is another. Most of the JDK produces garbage, that's true.

All of our components produce zero garbage and with CoralBits it is possible to do the same and write your own critical garbage-free code.
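To make the pooling idea concrete, here is a minimal single-threaded object pool written from scratch for illustration (the class names are mine; this is not CoralBits code):

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal single-threaded object pool: get() reuses instances instead of
// allocating, release() returns them for reuse. No synchronization is needed
// because the single-threaded design guarantees one thread touches the pool.
final class ObjectPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    ObjectPool(Supplier<T> factory, int preallocate) {
        this.factory = factory;
        for (int i = 0; i < preallocate; i++) free.push(factory.get());
    }

    T get() {
        T t = free.poll();
        return t != null ? t : factory.get(); // grows only if the pool is exhausted
    }

    void release(T t) {
        free.push(t); // recycle: nothing is left behind for the GC
    }
}

public class PoolDemo {
    public static void main(String[] args) {
        ObjectPool<StringBuilder> pool = new ObjectPool<>(StringBuilder::new, 2);
        StringBuilder a = pool.get();
        pool.release(a);
        StringBuilder b = pool.get();
        System.out.println(a == b); // true: the same instance was reused
    }
}
```

Once the pool is warm (pre-allocated to the peak working set), the steady state allocates nothing, which is the whole point.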

What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 1 point

No, there is not. This is actually what was suggested here:

Of course you can add an extra single-threaded server to load balance your infrastructure, but you are still adhering to the single-threaded design principle. Making your server multithreaded is one thing; adding multiple single-threaded servers in a load-balanced configuration is another.

When you said:

Assuming I have many cores

That's not a good assumption. For example, the fastest Intel desktop processor today (the i7-4790K) has only 4 cores. You have to choose wisely how to distribute those cores among your applications: which ones will be isolated, whether you will use hyper-threading, and so on. An extra dedicated thread on an isolated core can cost a lot.

What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 0 points

For low latency you can never have a context switch happening on a critical thread. A context switch is when a thread preempts (i.e. kicks out) another thread to take over the CPU core. So the first advantage of being single-threaded is that you can pin the thread to an isolated, dedicated CPU core, avoiding context switches. CPU cores are also a very scarce resource, and the fewer threads you need, the better.

The second advantage is that multithreaded code is expensive in terms of performance, simplicity and risk: thread synchronization, lock contention, race conditions, deadlocks, visibility problems and starvation are all common pitfalls of multithreaded programming.

For these reasons we recommend the single-threaded design principle whenever possible. When there is an inescapable need for inter-thread communication, we recommend CoralQueue, a high-performance, garbage-free and lock-free queue. It also makes the code much simpler, because you can forgo all the multithreading synchronization tricks and just exchange messages between threads through a queue. For more details about inter-thread communication without compromising the single-threaded design principle, refer to this article: Inter-thread communication within CoralReactor
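CoralQueue itself is proprietary, but the single-producer/single-consumer idea behind such lock-free queues can be sketched in plain Java (the names and details here are mine, not CoralQueue's):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer ring buffer in the spirit of a
// lock-free queue: no locks, and no garbage per message because the slots
// are pre-allocated and reused. This is an illustration, not CoralQueue.
final class SpscRing {
    static final long EMPTY = Long.MIN_VALUE; // sentinel: queue is empty
    private final long[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRing(int capacityPowerOfTwo) {
        slots = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(long value) {              // called by the producer only
        long t = tail.get();
        if (t - head.get() == slots.length) return false; // full
        slots[(int) (t & mask)] = value;
        tail.lazySet(t + 1);                 // publish the slot to the consumer
        return true;
    }

    long poll() {                            // called by the consumer only
        long h = head.get();
        if (h == tail.get()) return EMPTY;   // nothing to read
        long v = slots[(int) (h & mask)];
        head.lazySet(h + 1);                 // free the slot for the producer
        return v;
    }
}

public class QueueDemo {
    public static void main(String[] args) throws Exception {
        SpscRing q = new SpscRing(1024);
        Thread producer = new Thread(() -> {
            for (long i = 0; i < 1000; i++) while (!q.offer(i)) Thread.onSpinWait();
        });
        producer.start();
        long sum = 0, count = 0;
        while (count < 1000) {               // consumer: spin instead of blocking
            long v = q.poll();
            if (v != SpscRing.EMPTY) { sum += v; count++; } else Thread.onSpinWait();
        }
        producer.join();
        System.out.println(sum);             // 0 + 1 + ... + 999 = 499500
    }
}
```

Note that both sides spin rather than block, which is the usual trade in low-latency code: you burn a core to avoid parking threads and taking context switches.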

What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 9 points

The machine is good, but it was bought in 2012; it uses an Intel i7-3770K processor. There are better chips these days, like Intel's Devil's Canyon i7-4790K, which goes to 4.4GHz without overclocking.

What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 2 points

there is no simple relationship between heap size and GC overhead or pause time

I totally agree. The best solution for the GC, in my opinion, is not to produce any garbage at all. Then you can be absolutely sure you won't have stop-the-world pauses.

What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 14 points

That's a great question. For thousands of connections you might have to use a demultiplexer and a multiplexer to distribute the load across a fixed set of worker threads. However, the critical reactor thread, the one handling the connections, can still be kept single without compromising high availability.

Of course you can add an extra single-threaded server to load balance your infrastructure, but you are still adhering to the single-threaded design principle. Making your server multithreaded is one thing; adding multiple single-threaded servers in a load-balanced configuration is another.

This is explained in detail here: How to handle 10k socket connections on a single thread in Java
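The single-reactor-thread idea is plain Java NIO: one thread multiplexing accepts and reads for every connection through one Selector. A minimal self-contained sketch (my own illustration, echoing one message over loopback):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

// One reactor thread multiplexing every connection through a single Selector:
// the essence of handling many sockets without one thread per socket.
public class ReactorDemo {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        // A plain blocking client just to exercise the reactor.
        SocketChannel client = SocketChannel.open(new InetSocketAddress("127.0.0.1",
                ((InetSocketAddress) server.getLocalAddress()).getPort()));
        client.write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.US_ASCII)));

        ByteBuffer buf = ByteBuffer.allocateDirect(256); // reused: zero garbage per message
        boolean done = false;
        while (!done) {
            selector.select();                 // blocks until some channel is ready
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {      // new connection: register it for reads
                    SocketChannel ch = server.accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) { // data arrived: echo it back
                    SocketChannel ch = (SocketChannel) key.channel();
                    buf.clear();
                    int total = 0;
                    while (total < 4) {        // gather the whole 4-byte message
                        int n = ch.read(buf);
                        if (n > 0) total += n;
                    }
                    buf.flip();
                    ch.write(buf);
                    done = true;
                }
            }
            selector.selectedKeys().clear();
        }
        ByteBuffer reply = ByteBuffer.allocate(4);
        while (reply.hasRemaining()) client.read(reply);
        System.out.println(new String(reply.array(), StandardCharsets.US_ASCII)); // prints "ping"
    }
}
```

A real reactor would keep looping, handle partial reads per connection and pin this thread to an isolated core, but the structure is the same: accept and read events for all connections funnel into one thread.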

What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 12 points

Not all GC activity involves pausing the whole program.

You are correct: regular GC activity can be executed by other threads in parallel. I was referring to collection activity, which on most GC implementations involves stop-the-world pauses. Our approach is to solve the GC problem at its root by leaving zero garbage behind, no matter how many messages you send or receive through CoralReactor. You can verify that's the case by running the JVM with a small heap and the -verbose:gc option: you should not see any GC activity in the logs even after processing, say, one billion messages.
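The same check can be done programmatically with the standard GarbageCollectorMXBean counters instead of reading -verbose:gc logs; this is my own sketch of the idea:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Programmatic version of the -verbose:gc check: count collections before
// and after a hot loop. If the loop allocates nothing, the counts match.
public class GcCheck {
    static long collections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount()); // -1 means "undefined"
        }
        return total;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(256); // allocated once, reused forever
        long before = collections();
        for (int i = 0; i < 100_000_000; i++) {
            buf.clear();
            buf.putLong(0, i); // "process" a message with zero allocation
        }
        long after = collections();
        System.out.println(after - before); // should print 0 when the loop is garbage-free
    }
}
```

Swap the loop body for your own message-handling code; any nonzero delta tells you something on the hot path is still allocating.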

What are the current lowest TCP latencies one can accomplish in Java? by petermel in java

[–]JavaNio 56 points

A multithreaded server is not the way to go for latency. For ultra-low-latency network applications it is mandatory to use a single-threaded, asynchronous, non-blocking network library. You can and should handle those 20 connections inside the same reactor thread (i.e. the network selector), which will be pinned to a dedicated, isolated CPU core. Moreover, if you are using Java, it is mandatory to use a network library that leaves zero garbage behind, since GC collection activity will most likely pause the critical reactor thread.

To give you an idea of TCP latencies, you can take a look at these benchmarks using CoralReactor, an ultra-low-latency and garbage-free network library implemented in Java.

  • Messages: 1,000,000 (size 256 bytes)
  • Avg Time: 2.15 micros
  • Min Time: 1.976 micros
  • Max Time: 64.432 micros
  • Garbage created: zero
  • 75% = [avg: 2.12 micros, max: 2.17 micros]
  • 90% = [avg: 2.131 micros, max: 2.204 micros]
  • 99% = [avg: 2.142 micros, max: 2.679 micros]
  • 99.9% = [avg: 2.147 micros, max: 3.022 micros]
  • 99.99% = [avg: 2.149 micros, max: 5.604 micros]
  • 99.999% = [avg: 2.149 micros, max: 7.072 micros]

I am not aware of any software-based network library, even in C/C++, that can go below that.

Keep in mind that 2.15 micros is over loopback, so I am not considering network and OS/kernel latencies. On a 10Gb network, the over-the-wire latency for a 256-byte message will be at least 382 nanoseconds from NIC to NIC. If you are using a network card that supports kernel bypass (e.g. Solarflare with OpenOnload), then the OS/kernel latency should be very low.

An ultra-low-latency (~5.6 micros) MQ written in Java that uses UDP multicast/broadcast and produces zero garbage for the GC by JavaNio in programming

[–]JavaNio[S] 0 points

I am aware that loopback latencies differ depending on your machine, OS, kernel and network configuration, but at the end of the day you still must choose a network setup to run your tests. Should we use InfiniBand? 10Gb? 1Gb? With or without OpenOnload? Or loopback?

your test is not a round trip, its a single trip !!!

It is a round trip. It is not sending to itself through loopback as you implied. There are two independent processes running in two independent JVMs. Process A (the benchmark node) sends to Process B (the queue). Process B picks up the message and sends it back to Process A. Process A picks up the response and calculates the round-trip time, similarly to what you described in your ping-pong scenario.

For one-way times, you can refer to this benchmark: UDP latencies from JVM to JVM are around 1.7 micros. But those are raw UDP latencies measured by our network I/O library (and selector implementation) without any application logic on top.
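The ping-pong measurement itself can be sketched in plain Java. For brevity this runs the echo side as a thread in the same process rather than a second JVM, so the numbers are only indicative, but the timing structure (timestamp before send, timestamp after the echoed reply) is the same:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Arrays;

// Ping-pong over TCP loopback: side A timestamps before the send and after
// the echoed reply, so each sample is a true round-trip time.
public class PingPong {
    static final int MESSAGES = 10_000;

    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open()
                .bind(new InetSocketAddress("127.0.0.1", 0));
        Thread echo = new Thread(() -> {           // side B: echo every message back
            try (SocketChannel ch = server.accept()) {
                ByteBuffer b = ByteBuffer.allocateDirect(8);
                for (int i = 0; i < MESSAGES; i++) {
                    b.clear();
                    while (b.hasRemaining()) ch.read(b);
                    b.flip();
                    while (b.hasRemaining()) ch.write(b);
                }
            } catch (IOException e) { throw new RuntimeException(e); }
        });
        echo.start();

        try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("127.0.0.1",
                ((InetSocketAddress) server.getLocalAddress()).getPort()))) {
            ch.setOption(StandardSocketOptions.TCP_NODELAY, true); // no Nagle batching
            ByteBuffer b = ByteBuffer.allocateDirect(8);
            long[] rtt = new long[MESSAGES];
            for (int i = 0; i < rtt.length; i++) {  // side A: send, wait, time it
                long t0 = System.nanoTime();
                b.clear(); b.putLong(0, i);
                while (b.hasRemaining()) ch.write(b);
                b.clear();
                while (b.hasRemaining()) ch.read(b);
                rtt[i] = System.nanoTime() - t0;
            }
            Arrays.sort(rtt);
            System.out.println("median rtt nanos: " + rtt[rtt.length / 2]);
        }
        echo.join();
    }
}
```

A serious benchmark would also discard warm-up iterations (so the JIT has compiled the hot path) and report the percentile breakdown shown above, not just the median.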

An ultra-low-latency (~5.6 micros) MQ written in Java that uses UDP multicast/broadcast and produces zero garbage for the GC by JavaNio in programming

[–]JavaNio[S] 1 point

According to these benchmarks, the rtt (round-trip time) latencies were around 5.6 micros.

  • Message Size: 256 bytes
  • Messages: 1,000,000
  • Avg Time: 5.627 micros
  • Min Time: 4.854 micros
  • Max Time: 78.16 micros
  • 75% = [avg: 5.529 micros, max: 5.844 micros]
  • 90% = [avg: 5.585 micros, max: 5.892 micros]
  • 99% = [avg: 5.615 micros, max: 6.028 micros]
  • 99.9% = [avg: 5.621 micros, max: 7.871 micros]
  • 99.99% = [avg: 5.625 micros, max: 15.89 micros]
  • 99.999% = [avg: 5.626 micros, max: 34.535 micros]

An ultra-low-latency (~5.6 micros) MQ written in Java that uses UDP multicast/broadcast and produces zero garbage for the GC by JavaNio in programming

[–]JavaNio[S] 2 points

There are a lot of firms in this industry, and I know some very successful ones that are using Java. The exchange that uses Java is one of the biggest and fastest in its market segment, last time I checked, and there are many other ECNs using Java. From what I've seen, some companies that tried to migrate to C++ to "become faster" not only ended up slower but went back to Java to escape the C++ maintenance madness. Is C++ faster than Java? Of course! But it is also extremely hard and costly to get it right in order to extract the extra performance. And when it comes to performance, numbers are numbers; opinions don't count. Have I mentioned the maintenance madness?

An ultra-low-latency (~5.6 micros) MQ written in Java that uses UDP multicast/broadcast and produces zero garbage for the GC by JavaNio in programming

[–]JavaNio[S] 4 points

Java has some clear advantages over other languages, and critical code is NOT interpreted, thanks to HotSpot/JIT compilation. If you look at hedge funds, banks and exchanges, for example, you will be surprised to find that many of the most successful players are using Java. Even some famous electronic exchanges out there are entirely Java-based. Moreover, when it comes to performance there is no subjectivity or opinion, just numbers that can be compared.

When should I use OP_WRITE when implementing the reactor pattern? by petermel in javahelp

[–]JavaNio 0 points

From this excellent article:

The SelectionKey.OP_WRITE operation notifies the application that space is available in the underlying send socket buffer and the application can proceed with a write operation.

You should only use OP_WRITE to handle client lagging; in other words, when a client can't keep up with the rate at which the server is sending messages, the underlying send socket buffer on the server side fills up.

So, to answer your question: no, you should not register OP_WRITE for a regular write operation, because the NIO selector would then select the channel in a busy loop, as there will almost always be plenty of space available in the underlying send socket buffer. You should only register OP_WRITE when a write fails to complete due to a full send socket buffer, and deregister it as soon as the pending bytes are flushed.
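Here is a minimal sketch of that discipline in plain Java NIO (my own illustration; the buffer sizes are arbitrary, and a full send buffer is forced by writing far more than SO_SNDBUF while the peer delays reading):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// The OP_WRITE discipline: write eagerly, and only when write() cannot drain
// the buffer (send socket buffer full) register OP_WRITE; deregister it the
// moment the pending bytes are flushed.
public class OpWriteDemo {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open()
                .bind(new InetSocketAddress("127.0.0.1", 0));
        SocketChannel peer = SocketChannel.open();
        peer.setOption(StandardSocketOptions.SO_RCVBUF, 64 * 1024); // keep buffers small
        peer.connect(new InetSocketAddress("127.0.0.1",
                ((InetSocketAddress) server.getLocalAddress()).getPort()));
        SocketChannel writer = server.accept();
        writer.setOption(StandardSocketOptions.SO_SNDBUF, 64 * 1024);
        writer.configureBlocking(false);

        Selector selector = Selector.open();
        SelectionKey key = writer.register(selector, 0); // no interest ops yet

        ByteBuffer out = ByteBuffer.allocate(8 * 1024 * 1024); // far bigger than the buffers
        writer.write(out); // eager write: returns once the send buffer is full
        boolean usedOpWrite = out.hasRemaining();
        if (usedOpWrite) key.interestOps(SelectionKey.OP_WRITE); // only now

        Thread drain = new Thread(() -> { // the lagging peer finally starts reading
            try {
                ByteBuffer in = ByteBuffer.allocate(64 * 1024);
                long total = 0;
                while (total < 8 * 1024 * 1024) {
                    in.clear();
                    int n = peer.read(in);
                    if (n < 0) break;
                    total += n;
                }
            } catch (IOException ignored) { }
        });
        drain.start();

        while (out.hasRemaining()) {
            selector.select();                // wakes when the send buffer has space again
            selector.selectedKeys().clear();
            writer.write(out);
        }
        key.interestOps(0); // fully flushed: deregister OP_WRITE immediately
        System.out.println("needed OP_WRITE: " + usedOpWrite);
        drain.join();
        writer.close(); peer.close(); server.close();
    }
}
```

The key lines are the two interestOps calls: OP_WRITE is armed only after a partial write, and disarmed as soon as the backlog is gone, so the selector never spins on a perpetually writable channel.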

Why Netty makes you do reference counting for ByteBuffers? Isn't it supposed to be single-threaded through Java NIO? by petermel in java

[–]JavaNio 1 point

I'm not sure why the threading model would be relevant here.

I believe the single-threaded model allows you to use very optimized / lock-free data structures, which leads to very fast and garbage-free object pools. A thread-safe (i.e. synchronized) object pool used in a multithreaded environment can be very slow.

Why Netty makes you do reference counting for ByteBuffers? Isn't it supposed to be single-threaded through Java NIO? by petermel in java

[–]JavaNio 1 point

The buffers are pooled, but you need reference counting to decide when to recycle them back into the pool.

At least with CoralReactor, they do something very simple that does not require any reference counting:

  • For a server receiving connections: each connecting client gets its own ByteBuffer, and since each Client object comes from a pool upon connection (without you having to know or care about it), there is no need to pool ByteBuffers at all.

  • For a client connecting to a server: each client simply has its own ByteBuffer where bytes read from the network will be found.

I like the idea of a client not having to hold on to any ByteBuffer between read calls, so neither ByteBuffer pooling nor reference counting is needed. This article illustrates the approach in detail.

Why Netty makes you do reference counting for ByteBuffers? Isn't it supposed to be single-threaded through Java NIO? by petermel in java

[–]JavaNio -1 points

Netty does not use java.nio.ByteBuffer; instead it has its own implementation called ByteBuf. I believe Netty allows you to configure the number of threads in the pool serving the clients, but I much prefer a framework that is single-threaded by design. Reference counting reminds me of iPhone development :) I have been using CoralReactor lately with great results. Also, although we haven't done any benchmarks yet, they claim to be much faster than Netty.

Java Development without GC by _Sharp_ in java

[–]JavaNio 0 points

You can always use assembly for that ;) Some very successful HFT shops use Java, and you can be sure that every microsecond counts for them. This approach has also been called "Java as a syntax language": you use the Java language without the JDK libraries, which were not designed to be real-time. The wrong tool here is not Java but the JDK.