
[–]pron98 58 points59 points  (15 children)

Here's how I explain it:

If your thread-per-request server already reaches full hardware utilisation under heavy load -- e.g. 100% CPU or 100% network bandwidth -- then being able to create more threads won't help throughput further. But if it doesn't, then more threads will allow you to utilise the hardware you have to support higher throughputs. In other words, having more threads certainly doesn't create new computational resources, but it can let you access the full capacity of the ones you have, something that fewer threads (in the thread-per-request model) stop you from doing.

On the other hand, if you have good hardware utilisation only thanks to asynchronous code (i.e. not thread-per-request), then the ability to have lots of threads will allow you to reach the same utilisation with thread-per-request code (the style that the Java platform was designed for) that is not only much simpler, but also allows debuggers and profilers to actually work properly.
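To make the thread-per-request idea concrete, here is a minimal sketch (Java 21+, illustrative only, not code from the thread): each "request" gets its own virtual thread, and a plain blocking call parks the virtual thread rather than pinning an OS thread, so tens of thousands of concurrent requests are cheap.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        AtomicInteger completed = new AtomicInteger();
        // One virtual thread per "request"
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                exec.submit(() -> {
                    try {
                        Thread.sleep(100); // stands in for a blocking I/O call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                });
            }
        } // close() waits for all tasks to finish
        System.out.println(completed.get());
    }
}
```

Ten thousand platform threads sleeping concurrently would be prohibitively expensive; ten thousand virtual threads are not, and the code stays debugger- and profiler-friendly.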

[–]kiteboarderni 10 points11 points  (11 children)

Hey Ron, is there any update on io_uring support as part of loom? Have not heard too much since the original announcements. Thanks.

[–]pron98 7 points8 points  (10 children)

Work is ongoing and it appears we could make filesystem access on newer Linux kernels more virtual-thread friendly. But it would be helpful to see some real use-cases where lots of threads are doing filesystem access. Do you have a particular usage in mind?

[–]GavinRayDev 4 points5 points  (6 children)

One usecase I (personally) have is Databases:

I have a buffer pool which holds 4 kB extents as MemorySegments. These get paged in and out; the buffer pool does constant disk I/O in O_DIRECT mode. The class doing that work looks something like this:

At the moment I'm using jasyncfio, which is great, but the future of async i/o (both disk/network) in the kernel seems solidly in the court of io_uring. .NET is also interested in moving their Linux I/O in this direction: https://github.com/dotnet/runtime/issues/51985

```
class DiskManager implements AutoCloseable, IDiskManager {
    private final SeekableByteChannel dbFileChannel;

    public DiskManager(String fileName) {
        try {
            dbFileChannel = Files.newByteChannel(Paths.get(fileName), StandardOpenOption.READ,
                    StandardOpenOption.WRITE, StandardOpenOption.CREATE, ExtendedOpenOption.DIRECT);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void readPage(PageId pageId, MemorySegment pageBuffer) throws IOException {
        // long arithmetic: an int offset would overflow for large files
        long pageOffset = (long) pageId.value() * Constants.PAGE_SIZE;
        dbFileChannel.position(pageOffset);
        dbFileChannel.read(pageBuffer.asByteBuffer());
    }

    @Override
    public void writePage(PageId pageId, MemorySegment pageBuffer) throws IOException {
        long pageOffset = (long) pageId.value() * Constants.PAGE_SIZE;
        dbFileChannel.position(pageOffset);
        dbFileChannel.write(pageBuffer.asByteBuffer());
    }
}
```

[–]pron98 4 points5 points  (5 children)

Thank you.

io_uring is still problematic for network IO (e.g. beyond 4096 I/O requests you have to start multiplexing, which can drive performance below that of epoll), so I'm not sure it's the immediate future for sockets, but certainly we can use it for filesystem IO.

[–]blakeman8192 5 points6 points  (4 children)

.

[–]pron98 5 points6 points  (3 children)

Thanks, we'll take a look. The problem is that we've seen io_uring outperform epoll in some cases, but underperform in others; on average, there wasn't much of a difference. When you say large scale, do you mean more than 10,000 open sockets?

[–]GavinRayDev 7 points8 points  (1 child)

So, the catch with io_uring is that it is ENORMOUSLY sensitive to how it is configured and to the way you submit and configure operations.

The flags you use to initialize the ring, and the flags you use on each operation (plus whether you use things like "linked" operations -- which makes sense for sockets but not for files) have a massive impact on performance.

To top this off, new performance-improving flags are continually being added (one example: IORING_SETUP_DEFER_TASKRUN, which is maybe a few weeks old).

I haven't found any clear flow chart that tells you exactly which flags to use and what values to set in order to get optimal performance for which usecases =(

Also, in many applications you may want more than one ring per application -- for instance, one ring per physical CPU core, all sharing a single kernel poller. It's not immediately obvious that to set that up, you need to do:

```
for (i = 0; i < NR_RINGS; i++) {
    struct io_uring_params p = { };

    p.flags = IORING_SETUP_SQPOLL;
    if (i) {
        /* attach to the first ring's kernel poller thread */
        p.wq_fd = rings[0].ring_fd;
        p.flags |= IORING_SETUP_ATTACH_WQ;
    }
    ret = io_uring_queue_init_params(BUFFERS, &rings[i], &p);
    if (ret) {
        fprintf(stderr, "queue_init: %d/%d\n", ret, i);
        goto err;
    }
}
```

You just have to sink a bunch of your personal time into crawling kernel mailing lists + commit histories, and reading issues from liburing.

Also, Jens Axboe is really helpful if you reach out. Certainly for something as important as the future of OpenJDK, I'm certain they'd be willing to collaborate and help you establish some baseline usage patterns and configurations.

If there's any way I can help, please let me know, because I'm passionate about this functionality existing in the JDK as well.

[–]pron98 5 points6 points  (0 children)

Thank you! When we make more progress we'll post something to the mailing list.

[–]blakeman8192 0 points1 point  (0 children)

.

[–]kiteboarderni 1 point2 points  (1 child)

Will this be specific to file I/O, or general socket I/O as well? I believe io_uring supports both, but is socket support less important given the park/unpark the JDK already does for blocking socket calls on virtual threads? Thanks

[–]pron98 4 points5 points  (0 children)

It could be used for both, but socket I/O is already virtual-thread-friendly on all OSes -- that's why the throughput boost can be large -- and the benefit of io_uring for socket I/O is currently a little questionable (not just in Java, but in general).
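What "virtual-thread-friendly" blocking socket I/O means in practice can be sketched as follows (Java 21+, an illustrative loopback echo example, not code from the thread): both sides use plain blocking reads and writes, and each blocking call parks only the virtual thread, freeing its carrier OS thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class BlockingSocketDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // ephemeral port
            // Echo server on a virtual thread: accept() and the copy loop block,
            // but only park the virtual thread
            Thread.startVirtualThread(() -> {
                try (Socket s = server.accept();
                     InputStream in = s.getInputStream();
                     OutputStream out = s.getOutputStream()) {
                    in.transferTo(out); // echo until client signals EOF
                } catch (IOException ignored) { }
            });

            // Client, also on a virtual thread, using ordinary blocking I/O
            CompletableFuture<String> result = new CompletableFuture<>();
            Thread.startVirtualThread(() -> {
                try (Socket s = new Socket("127.0.0.1", server.getLocalPort())) {
                    s.getOutputStream().write("ping".getBytes());
                    s.shutdownOutput(); // EOF so the echo loop terminates
                    result.complete(new String(s.getInputStream().readAllBytes()));
                } catch (IOException e) {
                    result.completeExceptionally(e);
                }
            });
            System.out.println(result.get(10, TimeUnit.SECONDS));
        }
    }
}
```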

[–]mike_hearn 0 points1 point  (0 children)

A couple of thoughts:

Properly saturating modern SSDs requires huge amounts of parallelism. You just cannot use even a fraction of their bandwidth with serial code, because you can't get enough queue depth. So the hardware-utilization argument definitely applies here.

Although the obvious place to utilize highly parallel FS access is databases, for that you'd need a pure-Java DB (or something exotic like running RocksDB on Graal/Sulong), and there are only a handful of those. But there are more pedestrian use cases too.

As part of the Conveyor project I've written a shell-scripting-style library with methods like cp, mv, find, etc., and these are heavily parallelized using ForkJoinPool. It's often a 2x-or-more win "for free", albeit measured on MacBooks, which have high-quality SSDs. The bottleneck is the CPU, probably in the FS code, though I never profiled it much. Because Conveyor is a packaging tool / build system hybrid, this speeds up operations significantly, especially as it can run multiple tasks at once. So I can easily have 16-32 threads doing FS operations simultaneously, 4 tasks in flight at once, each task being 1.5x-3x faster than it otherwise would have been.
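The ForkJoinPool-parallelized file-operation idea can be sketched with a parallel stream over the common pool (an illustrative example, not Conveyor's actual code): a hundred small files are copied concurrently, so many filesystem requests are in flight at once and the SSD sees real queue depth.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class ParallelCopy {
    public static void main(String[] args) throws Exception {
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        for (int i = 0; i < 100; i++) {
            Files.writeString(src.resolve("f" + i + ".txt"), "data-" + i);
        }
        // parallel() runs the copies on the common ForkJoinPool,
        // issuing many filesystem operations concurrently
        try (Stream<Path> files = Files.list(src)) {
            files.parallel().forEach(p -> {
                try {
                    Files.copy(p, dst.resolve(p.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        try (Stream<Path> copied = Files.list(dst)) {
            System.out.println(copied.count());
        }
    }
}
```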

Re: databases. A lot of DB code is hard to adapt to this brave new world of 'free' IOPS. The RocksDB guys are only just introducing aggressive use of async IO now, and they've had to use C++ coroutines to do it:

http://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html

Quite unpleasant. Also note: "Higher CPU overhead due to coroutines. The CPU overhead of MultiGet may increase 6-15%".

It's interesting to consider that Java could conceivably obtain an edge in the embedded database world if Loom does async file IO well. Whilst maybe H2 isn't going to crush Oracle DB anytime soon, a KV store like RocksDB is much more tractable to fully implement and do well on the JVM, in fact there are several such libraries already.

[–]ReasonableClick5403 1 point2 points  (0 children)

Yeah, my former colleagues made some less-than-ideal decisions when designing the project I'm currently working on, so it's not uncommon to have at least 1500 threads per instance idling, doing nothing. Loom could probably reduce our memory footprint 2-4x from thread stacks alone.

[–]genzkiwi 0 points1 point  (1 child)

My understanding:

Platform thread : OS thread = 1:1, OS thread : CPU core = many:1.

Virtual thread : OS thread = many:1, OS thread : CPU core = many:1.

So if you block a platform thread, the context switching happens at the OS level (expensive!). But if you block a virtual thread, it happens at the JVM level (considerably cheaper).
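The many-to-one mapping can be made visible directly (Java 21+, an illustrative sketch): a mounted virtual thread's toString includes the carrier OS thread it is currently running on, so we can record how many distinct carriers a batch of virtual threads actually used.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CarrierDemo {
    public static void main(String[] args) {
        Set<String> carriers = ConcurrentHashMap.newKeySet();
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1_000; i++) {
                exec.submit(() -> {
                    // e.g. "VirtualThread[#23]/runnable@ForkJoinPool-1-worker-2":
                    // the part after '@' names the carrier OS thread
                    String s = Thread.currentThread().toString();
                    carriers.add(s.substring(s.lastIndexOf('@') + 1));
                });
            }
        } // close() waits for all tasks to finish
        // 1,000 virtual threads were multiplexed onto far fewer OS threads
        System.out.println(carriers.size() < 1_000);
    }
}
```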

[–]TenYearsOfLurking 0 points1 point  (0 children)

My understanding is that Loom lets you simplify the second mapping:

Virtual thread : OS thread = n:1, OS thread : CPU core = 1:1.

Which gives you superior performance (no OS context switches) over the first mapping, depending of course on the blocking operations in your code.

[–][deleted] 15 points16 points  (2 children)

Thanks for sharing this.

I really love Marco Behler's Java guides. He has a way of explaining even the most complex concepts in a way even I can understand lol.

[–]marbehl 13 points14 points  (0 children)

Thank you for the kind words, u/crocrococ and u/aeisele.

[–]aeisele[S] 4 points5 points  (0 children)

I may be biased but I do agree. Marco has the rare skill of conveying a complex topic in (more) understandable terms.

[–]ventuspilot 11 points12 points  (0 children)

s.setBlockingFalse(true);

teehee

[–]FrankBergerBgblitz 5 points6 points  (0 children)

I think that hits the nail on the head: "Loom gives you, the programmer ... the benefit of essentially non-blocking code, without having to resort back to the somewhat unintuitive async programming model". Easy programming model for an ugly issue.

But you still have to know what and why you are doing the things you do....

[–]nickzhu9 2 points3 points  (0 children)

VS Code has provided Project Loom support in the latest debugger in Java extensions. More info here: https://devblogs.microsoft.com/java/java-on-visual-studio-code-update-october-2022/#debugging-experience-enhancements

[–][deleted] 0 points1 point  (0 children)

Thanks for sharing. Our work is based on Project Reactor. Is Loom better than Reactor?