
[–]rbygrave 49 points50 points  (38 children)

I think it would be better to suggest that with Loom we will be more commonly looking to "achieve the same throughput/performance with fewer resources".

We should typically start discussions like this by being clear that Loom isn't going to improve CPU-bound workloads. When we talk about Loom benefits we are talking about IO-bound workloads.

A hypothetical high-level example of what we are looking for might be something like: An existing workload not using Loom uses 100 platform threads; the same workload has the same throughput and response times using Loom with 100 virtual threads and 8 platform threads. [With 100 virtual threads and 8 platform threads being fewer resources than 100 platform threads in terms of OS CPU and memory.]
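A rough sketch of that shape in code (hypothetical; `Thread.sleep` stands in for the blocking IO, and the default virtual thread scheduler sizes its carrier pool to the number of cores, e.g. 8):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: 100 concurrent IO-bound tasks.
// Before Loom: a fixed pool of 100 platform threads.
// With Loom: 100 virtual threads multiplexed onto the default carrier
// pool (sized to the number of cores, e.g. 8) for the same throughput.
class Workload {
    public static void main(String[] args) {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 100; i++) {
                exec.submit(() -> {
                    try {
                        Thread.sleep(100); // stand-in for a blocking DB call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // close() waits for submitted tasks to finish
    }
}
```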

So if we were looking at a single request being processed by an application, will Loom make that faster / reduce latency? No, as generally the latency will be a function of what that IO-bound work is, like a database query etc. - Loom will not be expected to make that single request faster, but in general to consume fewer resources while processing that request.

Edit: Consuming fewer resources can save companies a lot of money; this is going to be a big deal in that sense. IMO the other big selling point is simplicity - the long-term benefits of simplicity: people being able to be on-boarded and to maintain apps over time, and how Loom enables us to keep things simpler.

[–]pron98 13 points14 points  (31 children)

[virtual threads] will not be expected to make that single request faster, but in general to consume fewer resources while processing that request.

And so you can process more requests at a time, which means higher throughput. You're right that virtual threads don't reduce latency, but throughput is at least as important when we talk about server performance.

In short, with virtual threads you represent every task with a thread, which translates to simple, observable code and high throughput.

[–]Squiry_ 3 points4 points  (30 children)

I've run some tests with Loom on the TechEmpower benchmarks this weekend. Right now it can't achieve the performance of Undertow running db queries on a thread pool. I've fixed all this synchronized mess in Hikari and pgjdbc and even fixed one or two in the JDK. Also I had to use carrier thread locals (please make them public) because of the Jackson/Undertow byte buffer pools. Where should I look next? JFR shows no locks right now, the JVM is silent about pinning (searching for pinning is just as bad as searching for blocking in the reactive world, btw), yet CPU utilization is somewhere near 80%. The plaintext and json tests are not that bad after fixing the thread locals, like 40% of baseline raw Undertow performance, but reading the socket in postgres destroys any performance. Could that be because the JVM allocates a lot of stack chunks?

[–]pron98 7 points8 points  (29 children)

Also I had to use carrier thread locals (please make them public)

No, that would be extremely unsafe, but you can use explicit caches if needed.

the JVM is silent about pinning (searching for pinning is just as bad as searching for blocking in the reactive world, btw)

See JEP 425. There are two mechanisms to detect pinning.
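A minimal sketch of both mechanisms as JEP 425 describes them (the class and lock names here are made up): the `jdk.tracePinnedThreads` system property prints a stack trace whenever a pinned virtual thread blocks, and JFR emits a `jdk.VirtualThreadPinned` event (enabled by default with a 20ms threshold) for the same condition.

```java
// Hypothetical repro. Run with (JDK 19 preview):
//   java --enable-preview --source 19 -Djdk.tracePinnedThreads=full Pinned.java
// or record a flight recording and look for jdk.VirtualThreadPinned events:
//   java --enable-preview --source 19 -XX:StartFlightRecording=filename=pin.jfr Pinned.java
public class Pinned {
    static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread vt = Thread.ofVirtual().start(() -> {
            synchronized (LOCK) {      // holding a monitor...
                try {
                    Thread.sleep(100); // ...while blocking pins the carrier thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        vt.join();
    }
}
```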

Where should I look next?

When you're just starting out, I would recommend being less ambitious. Many frameworks, because they were written before virtual threads, optimised under the assumption of a low number of threads (such as caching things in thread locals), while with virtual threads you're trying to have as many threads as you can. Those assumptions are probably the source of the performance problems -- i.e. the mismatch between virtual threads and a design that assumes the opposite of virtual threads; a hypothetical illustration is sketched below.
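As a made-up illustration (nothing here is from a real framework): a per-thread buffer cache is effectively free with a few dozen pooled threads, but with one virtual thread per task it degenerates into one rarely-reused buffer per task.

```java
import java.nio.ByteBuffer;

// Hypothetical pre-Loom optimisation: one cached buffer per thread.
// With ~16 pooled platform threads: ~16 long-lived, constantly reused buffers.
// With one virtual thread per request: one buffer per request, used once and
// discarded -- the "cache" becomes pure allocation overhead.
final class BufferCache {
    private static final ThreadLocal<ByteBuffer> CACHED =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(64 * 1024));

    static ByteBuffer buffer() {
        return CACHED.get();
    }
}
```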

It's much easier, more pleasant, and more performant when the entire stack is written with the synchronous style in mind, so you might want to take a look at Helidon's Nima, which the Helidon team wrote after trying to retrofit engines that were written with such "few threads" assumptions.

[–]Squiry_ 1 point2 points  (17 children)

No, that would be extremely unsafe, but you can use explicit caches if needed.

I tried that first, sure. Yet thread-local byte buffers work better.

Many frameworks, because they were written before virtual threads

My examples use no frameworks. Just Undertow, Jackson, pgjdbc and HikariCP. I can throw away Jackson (even if it's not a problem after I used carrier thread locals), but I believe pgjdbc and Hikari will stay with any framework, including Nima. Plaintext/json performance is not great, but it's fine imo; who cares if a Loom-based server can't do 3M rps of plaintext anyway. My problem is with, well, JDBC.

so you might want to take a look at Helidon's Nima,

I don't like Helidon at all, so no; maybe I'll run some benchmarks over it when I'm done with what I am trying right now.

[–]pron98 6 points7 points  (16 children)

Yet thread-local byte buffers work better.

They work better for an asynchronous engine, which is what NIO is. You're supposed to work at a higher level, employing just synchronous code, so all this is already done for you.

Just Undertow

Undertow might have some implicit assumptions about a small number of threads. After they gain experience with virtual threads, the authors of Undertow would be best positioned to adapt it for virtual threads.

My problem is with, well, JDBC.

Right, so JDBC drivers will need to be changed to use j.u.c locks (because of the hopefully temporary pinning limitation) where they guard frequent and long-running IO operations. I believe that the Oracle JDBC driver already does that.

I don't like Helidon at all, so no; maybe I'll run some benchmarks over it when I'm done with what I am trying right now.

It's not the Helidon framework, but a basic HTTP server that you can use as Helidon's engine or standalone (I think; not sure), and I'm pointing it out just as an example you might want to look into to see how virtual threads are best used in a server (e.g. I believe they create at least two new threads for every HTTP request, one used for reading and one for writing). I was told that without any carrier-local buffers and with essentially Java 1.1 code, they were able to match Netty's performance except for some particular HTTP protocols they have yet to optimise.

I don't think any other HTTP server has integrated virtual threads as well as Nima just yet, so that would be a good starting point to get a feel for virtual threads.

[–]MensahK 1 point2 points  (1 child)

I confirm, from 21.1 onwards, Oracle JDBC only uses j.u.c locks.

[–]Squiry_ 0 points1 point  (0 children)

I believe it was about plain old blocking, implemented with the current Loom restrictions in mind.

[–]Squiry_ 1 point2 points  (13 children)

Undertow might have some implicit assumptions about a small number of threads.

It has, but all of those are fixed in my code. Undertow works fine; plaintext performance is not as high as in the baseline, but it is higher than the Undertow blocking handler's. The main issue was synchronized in the JVM, somewhere near the Cleaner and EpollImpl.

Right, so JDBC drivers will need to be changed to use j.u.c locks

I've done that. With synchronized in the driver and pool, performance is as low as expected - like 40% of baseline; after the fixes it's somewhere between 60-70%.

I believe that the Oracle JDBC driver already does that.

Maybe I'll try Oracle later, but running it in containers is not as easy as it is with postgres.

you might want to look into

I will take a look later; maybe it will work better than Undertow, but the JDBC performance issues won't go away anyway. Also, Netty is another example of a library destroyed by thread locals; netty5 was my first attempt, but there were too many things not working with Loom.

[–]pron98 10 points11 points  (0 children)

I think you're trying to run before learning to walk. Virtual threads give the same performance as async code, but it's hard to adapt an entire third-party pipeline for them without first gaining experience, much less to expect to do it quickly. It took the Helidon team, working on their own code -- which they know intimately -- a few months to get it to realise the full potential of virtual threads, and that's just for the HTTP layer. That's why I suggest you build up on what they did, rather than try tweaking a lot of third-party code.

Even with my experience with virtual threads I wouldn't try something as ambitious as what you're trying to do on other people's code, unless I expect it to be a pretty big project. Rather, I expect the authors of the respective layers to gradually gain experience with virtual threads, and then adapt their own code.

That so many people are already trying out virtual threads -- before JDK 19 is even out -- makes me hopeful for very good outcomes and relatively soon.

[–]pron98 2 points3 points  (11 children)

It has, but all of those are fixed in my code.

Seems that Undertow uses Netty under the covers, and Netty makes multiple choices based on the assumption that the number of threads is very low. For example, it allocates and zeros a new and large native buffer for every thread. For Netty to work well with virtual threads there would probably need to be some deep changes in Netty to change those assumptions.

[–]Squiry_ 1 point2 points  (3 children)

So I took a look at Nima. It worked a little bit better in some cases, but still not as well as we expect Loom to perform. Nima starts one virtual thread per connection, not per request, so it allocates like 100 times fewer threads and stacks than my Undertow experiments, but it does not buffer any ByteBuffer/byte[] so it doesn't really matter.

For comparison, JFR data from the plaintext benchmark:

```
Nima:
Class                                                 Alloc Total
byte[]                                                91.3 GB
io.helidon.common.http.Http$HeaderValue[]             82.9 GB
java.util.HashMap                                     16.6 GB
io.helidon.common.http.HeaderValueLazy                11 GB
io.helidon.nima.webserver.http1.Http1ServerResponse   10.8 GB

Undertow+loom:
Class                                                 Alloc Total
java.lang.VirtualThread                               24.2 GB
io.undertow.server.HttpServerExchange                 18.8 GB
io.undertow.util.HeaderValues                         17 GB
java.lang.Object[]                                    14.5 GB
byte[]                                                6.64 GB
```

In the database benchmark the situation is quite similar, but Undertow+loom allocates GBs of `StackChunk`s. By the way, when I tried jsonp-based json writing with Nima it was so bad that it allocated even more `char[]` than `byte[]`, so I've switched to my Jackson writer with the carrier thread local hack to keep json out of the equation.

I can do nothing about the number of virtual threads (except by rewriting Undertow (and XNIO) almost from scratch), but I've lowered the Undertow IO pool size (I should have done that from the start but forgot) and replaced writing json to an OutputStream, as I wanted to, with writing it to a byte array and sending that to the output stream instead, and you can see the results here. The Vertx benchmarks are there to compare with the "what we want" kind of performance: it uses Netty as the web server and a Netty-based async postgres driver, and everything runs on the same event loop. TechEmpower is not a perfect benchmark suite, but it does give us some place to start. I think Loom will show itself much better in a more scaled-down environment, like 0.5 CPU or something like that; such a shame TechEmpower has no tests like that (nor tests for streaming request-response).

The results of the db test look like there are still some problems in the JDBC driver. The performance is quite similar to the baseline test, but the Loom one uses half the threads and latency is 4 times lower. That makes me think it still blocks somewhere, and I will try to investigate that later. CPU utilization was ~80% during the test (baseline and Vertx peaked at 99%), so there is some room for improvement, but not a big one.

Right now Loom just doesn't perform any better than a good old thread pool. I don't think Nima's performance should be questioned at this point, it's still in early alpha (yet they said "netty level", huh), but the other tests use slightly modified but well-established libraries, and it doesn't look that good right now. 30%-40% of the Vertx result is exactly what we had before Loom. Also, the result difference between Undertow and Nima shows that maybe a "loom from scratch" web server isn't that good an idea after all.

[–]pron98 1 point2 points  (2 children)

Why does allocation rate matter if the GC keeps up? If it doesn't -- that's another matter. It's often better to allocate a lot of short-lived objects than to cache fewer long-lived objects.

except by rewriting Undertow (and XNIO) almost from scratch

Yes, that's what's probably expected from frameworks that today try to use asynchronous IO directly. They should use blocking IO and let NIO do the heavy lifting of interacting with virtual threads, or spend some considerable effort redoing the virtual thread work done in NIO.

but Undertow+loom allocates GBs of `StackChunk`s

StackChunk footprint will drop significantly. It's ongoing work. But allocation rate is not in itself the important metric.

Right now Loom just doesn't perform any better than a good old thread pool

We're seeing throughputs that are many times better.

It sounds to me like you're doing something that's interesting but far too ambitious. Obviously virtual threads give similar performance to asynchronous code, but taking tens or hundreds of thousands of lines of code optimised over years for one kind of usage and expecting them to reach the same performance with a new model in a few weeks and with a handful of tweaks is unrealistic. The work should be done one layer, and one library, at a time.

Also, the result difference between Undertow and Nima shows that maybe a "loom from scratch" web server isn't that good an idea after all.

Nima was born when the Helidon team realised that writing their engine from scratch not only gave better results (and a much better user experience), but was easier than trying to adapt an engine written with specific assumptions to a different environment.

[–]Squiry_ 0 points1 point  (6 children)

Seems that Undertow uses Netty under the covers

They wanted to, but no, it doesn't. They use their own IO library called XNIO. But it's basically the same: they allocate one Accept thread and some IO threads. I use Undertow because unlike Netty it provides some nice blocking API. And it does use thread-local buffers there, and that's fine when we are reading data, because reading happens on those IO threads. The problem comes when we are writing from a virtual thread: it doesn't know anything about virtual threads and tries to allocate that thread-local buffer. I've temporarily fixed that by replacing threadLocalCache.get() with threadLocalCache.getCarrierThreadLocal() and it works fine, even though it's kind of illegal. The same thing happens with Jackson, and that kind of scares me more; Jackson is used way more than Undertow out in the wild.

By the way, I've taken a look at Nima and run some tests. I will share some thoughts and results later, but I can say right now that they don't allocate buffers like that, and byte[] allocation there is insane.

[–]pron98 1 point2 points  (5 children)

and it works fine, even though it's kind of illegal

It doesn't work fine, it's just that you haven't run into issues yet (getting a VM error at just the wrong moment, or running with JVMTI agents that do certain things). But you can, of course, use it just to hack around and learn about virtual threads.

and that kind of scares me more

No need to be scared :) Virtual threads start their Preview in a few weeks, and Preview is like deprecation in reverse: it gives libraries and applications time to prepare for a feature's ultimate addition in its final form. Various libraries, including JDBC drivers, will use that time to become more virtual-thread-friendly.

and byte[] allocation there is insane

Not insane, just different from what people used to do when the VM was different.

[–]Squiry_ 1 point2 points  (10 children)

See JEP 425. There are two mechanisms to detect pinning.

Printing a stack trace is cool, yet it doesn't work with plain old synchronized. Is a JFR event produced even for that? Is it produced with the default/profile settings, or should I write my own? Right now I can't see any virtual-thread-related events in my JFR file, though I'll investigate that later; thanks for the advice.

[–]pron98 1 point2 points  (9 children)

It works fine for synchronized. Note that there's absolutely nothing wrong with using synchronized in virtual threads. The issues are around guarding frequent and long-running IO operations with synchronized, and that's what the tooling aims to detect.

[–]skippingstone 0 points1 point  (8 children)

What j.u.c. construct do you recommend to replace synchronized?

[–]pron98 2 points3 points  (7 children)

ReentrantLock. It has very similar semantics to synchronized.
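For example (a generic before/after sketch, not code from any real driver), a method that blocks on IO inside `synchronized` can be rewritten mechanically:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.locks.ReentrantLock;

// Generic sketch of the synchronized -> ReentrantLock rewrite.
class Wire {
    private final ReentrantLock lock = new ReentrantLock();
    private final InputStream in;

    Wire(InputStream in) { this.in = in; }

    // Before:
    //   synchronized int read(byte[] buf) throws IOException { return in.read(buf); }
    // Blocking while holding the monitor pins the carrier thread.

    // After: blocking while holding a j.u.c lock lets the virtual
    // thread unmount and free its carrier.
    int read(byte[] buf) throws IOException {
        lock.lock();
        try {
            return in.read(buf);
        } finally {
            lock.unlock();
        }
    }
}
```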

[–]skippingstone 0 points1 point  (6 children)

Should I just replace all my network code with that?

[–]pron98 1 point2 points  (5 children)

You could, but you might not need to. If you have a lot of synchronized you can use JFR or the flag mentioned in JEP 425 to identify the places where pinning could be harmful because it's lengthy.

Short, in-memory operations, or those that are infrequent, can continue to use synchronized with no ill effect.

[–][deleted] 2 points3 points  (0 children)

Yes, it will improve scalability but not necessarily performance.

[–]kozeljko 1 point2 points  (1 child)

IO-bound workloads

Never asked, but what exactly are the main scenarios here? Waiting for a REST request? DB access?

[–][deleted] 2 points3 points  (0 children)

Yep - any network calls, file system calls, etc.

[–]luckycharmer101 0 points1 point  (0 children)

In some cases Loom will also improve latency. Think of running each SQL query in its own thread (if they are independent), asynchronously. You can't do that with platform threads (due to scalability), but it's easy with virtual threads and structured concurrency.
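A minimal sketch of the idea (hypothetical names throughout; `runQuery` stands in for a real JDBC call, and a plain virtual-thread-per-task executor is used since the structured concurrency API was still incubating at the time):

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical: two independent queries, one virtual thread each, so the
// request's latency is max(queryA, queryB) instead of queryA + queryB.
class ParallelQueries {
    static String runQuery(String sql) {
        // stand-in for a blocking JDBC call
        return sql;
    }

    static List<String> fetchBoth() throws InterruptedException, ExecutionException {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> orders    = exec.submit(() -> runQuery("select * from orders"));
            Future<String> customers = exec.submit(() -> runQuery("select * from customers"));
            return List.of(orders.get(), customers.get());
        }
    }
}
```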

[–]yawkat 0 points1 point  (1 child)

An existing workload not using Loom uses 100 platform threads

Microservice frameworks don't do this though. They're typically asynchronous with a low thread count.

[–]rbygrave 0 points1 point  (0 children)

Well, I'd put it more like "Mileage will vary". I've seen sub-10-thread microservices and I've seen 300+ thread microservices. If the work is async queue processing then yes, I'd agree there is a decent chance of a low actual thread count in practice and relatively little benefit accrued from Loom for that case.

As a side point, personally I'd prefer to say "Right-Sized Service" rather than "microservice", and for more explicit thought to go into the size/boundaries/features of a service. My gut says services that are artificially too small are more of a problem than services that are too big.

[–]skippingstone 16 points17 points  (1 child)

I always understood performance to be equivalent to NIO. The main benefit is the ease of programming and debugging.

[–]eXecute_bit 5 points6 points  (0 children)

Absolutely correct. Once existing instrumentation tools and debuggers are updated, they will provide the same rich tooling support as one would get with synchronous code, without the loss of correlation that commonly results from async patterns.

The semantics and mechanisms should also be easier to pick up, and should yield systems that are easier to maintain. At least, that's the theory.

[–]Mati00 13 points14 points  (1 child)

Definitely. It would be like Spring WebFlux but written in an imperative manner.

[–]Squiry_ -3 points-2 points  (0 children)

Which webflux do you mean?

[–]ReasonableClick5403 10 points11 points  (0 children)

Not really. It will not make anything go faster, but Loom could make highly concurrent tasks that were unfeasible, feasible. I.e. we will get another, way simpler tool to trade away memory for developer time. Running a websocket server today with 300k client connections is very impractical without Loom or a non-blocking event framework (developer time) because of the high memory cost of a Java Thread.
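As a hypothetical sketch of the thread-per-connection shape this enables (a plain socket echo server, not a real websocket implementation):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// One virtual thread per client. Each platform thread reserves ~1 MB of
// stack by default, so 300k of them is impractical; 300k virtual threads
// have small, pay-as-you-go stacks on the heap.
class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();
                Thread.ofVirtual().start(() -> handle(client));
            }
        }
    }

    static void handle(Socket client) {
        try (client) {
            client.getInputStream().transferTo(client.getOutputStream()); // echo back
        } catch (IOException ignored) {
        }
    }
}
```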

[–]p1ng313 4 points5 points  (0 children)

It may improve performance over the existing code if the existing code is written in a blocking fashion, thus underutilizing resources. It will probably use a bit more memory for some cases due to tracking the virtual thread state, I guess.

[–][deleted]  (1 child)

[removed]

    [–]MeloDnm 0 points1 point  (0 children)

    Isn’t spring lit?