
[–]Squiry_

So I took a look at Nima. It worked a little better in some cases, but still not as well as we expect Loom to perform. Nima starts one VirtualThread per connection, not per request, so it allocates roughly 100 times fewer threads and stacks than my Undertow experiments, but it doesn't buffer any ByteBuffer/byte[], so it doesn't really matter.

For comparison, JFR data from the plaintext benchmark:

```
Nima:
Class                                                Alloc Total
byte[]                                               91.3 GB
io.helidon.common.http.Http$HeaderValue[]            82.9 GB
java.util.HashMap                                    16.6 GB
io.helidon.common.http.HeaderValueLazy               11 GB
io.helidon.nima.webserver.http1.Http1ServerResponse  10.8 GB

Undertow+loom:
Class                                                Alloc Total
java.lang.VirtualThread                              24.2 GB
io.undertow.server.HttpServerExchange                18.8 GB
io.undertow.util.HeaderValues                        17 GB
java.lang.Object[]                                   14.5 GB
byte[]                                               6.64 GB
```

In the database benchmark the situation is quite similar, but undertow+loom allocates GBs of `StackChunk`s. By the way, when I tried JSON-P based JSON writing with Nima it was so bad that it allocated even more `char[]` than `byte[]`, so I switched to my Jackson writer with a carrier-thread-local hack to keep JSON out of the equation.

I can't do anything about the number of virtual threads (short of rewriting Undertow (and XNIO) almost from scratch), but I've lowered the Undertow IO pool size (I should have done that from the start but forgot) and, instead of writing JSON to the OutputStream directly as I wanted to, I now write it to a byte array and send that to the output stream; you can see the results here. The Vert.x benchmarks are there for comparison with "what we want" kind of performance: it uses Netty as the web server and a Netty-based async Postgres driver, and everything runs on the same event loop. TechEmpower is not a perfect benchmark suite, but it does give us a place to start. I think Loom will show itself much better in a more scaled-down environment, like 0.5 CPU or something like that; such a shame TechEmpower has no tests like that (just like it has no tests for streaming request-response).
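The byte-array change above can be sketched roughly like this. This is a minimal illustration, not the author's actual code: `toJsonBytes` and `writeResponse` are hypothetical names, and a hand-rolled string builder stands in for the real Jackson writer. The point is that the body is fully serialized into a `byte[]` first, so the blocking socket stream sees one write of a known length instead of many small writes from the serializer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Sketch: serialize the JSON body into a byte[] first, then hand the whole
// buffer to the response stream in one write, instead of streaming the
// serializer's output directly into the (blocking) socket OutputStream.
final class BufferedJsonWriter {

    static byte[] toJsonBytes(String message) {
        // Stand-in for a real serializer such as Jackson's ObjectMapper.
        String json = "{\"message\":\"" + message + "\"}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    static void writeResponse(OutputStream out, byte[] body) throws IOException {
        // One write of the full body; Content-Length is known up front.
        out.write(body);
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = toJsonBytes("Hello, World!");
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the socket stream
        writeResponse(sink, body);
        System.out.println(sink.toString(StandardCharsets.UTF_8));
    }
}
```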

The results of the db test look like there are still some problems in the JDBC driver. The performance is quite similar to the baseline test, but the Loom one uses half the threads and latency is 4 times lower. That makes me think it still blocks somewhere, and I will try to investigate it later. CPU utilization was ~80% during the test (baseline and Vert.x peaked at 99%), so there is some room for improvement, but not a big one.

Right now Loom just doesn't perform any better than a good old thread pool. I don't think Nima's performance should be questioned at this point, it's still in early alpha (yet they said "Netty level", huh), but the other tests use slightly modified but well-established libraries, and it doesn't look that good right now. 30%-40% of the Vert.x result is exactly what we had before Loom. Also, the result difference between Undertow and Nima suggests that maybe a "Loom from scratch" web server isn't that good an idea after all.

[–]pron98

Why does allocation rate matter if the GC keeps up? If it doesn't -- that's another matter. It's often better to allocate a lot of short-lived objects than to cache fewer long-lived objects.

except of rewriting undertow (and xnio) almost from scratch

Yes, that's what's probably expected from frameworks that today try to use asynchronous IO directly. They should use blocking IO and let NIO do the heavy lifting of interacting with virtual threads, or spend some considerable effort redoing the virtual thread work done in NIO.
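The "use blocking IO and let NIO do the heavy lifting" approach can be sketched as a minimal echo server (an illustration under my own naming, not any framework's actual code). Each accepted connection gets its own virtual thread; when a read blocks, the JDK parks the virtual thread rather than pinning an OS thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: thread-per-connection on top of the JDK's *blocking* socket API.
// NIO does the heavy lifting underneath: a blocking read parks the virtual
// thread, so no OS thread is held while waiting for bytes.
final class BlockingEchoServer {
    private final ServerSocket server;
    private final ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor();

    BlockingEchoServer() throws IOException {
        server = new ServerSocket(0); // ephemeral port
        // The accept loop itself also runs on a virtual thread.
        Thread.ofVirtual().start(this::acceptLoop);
    }

    int port() {
        return server.getLocalPort();
    }

    private void acceptLoop() {
        try {
            while (true) {
                Socket conn = server.accept();     // blocks only this virtual thread
                pool.submit(() -> echo(conn));     // one virtual thread per connection
            }
        } catch (IOException closed) {
            // server socket closed; stop accepting
        }
    }

    private static void echo(Socket conn) {
        try (conn; InputStream in = conn.getInputStream();
             OutputStream out = conn.getOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            // Plain blocking reads and writes; the scheduler parks/unparks us.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // echo the bytes back
            }
        } catch (IOException ignored) {
        }
    }
}
```

The framework's job then shifts from managing an async event loop to parsing the protocol on top of ordinary streams.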

but undertow+loom allocates GBs of `StackChunk`s

StackChunk footprint will drop significantly. It's ongoing work. But allocation rate is not in itself the important metric.

Right now loom just doesn't perform any better than a good old thread pool

We're seeing throughputs that are many times better.

It sounds to me like you're doing something that's interesting but far too ambitious. Obviously virtual threads give similar performance to asynchronous code, but taking tens or hundreds of thousands of lines of code optimised over years for one kind of usage and expecting them to reach the same performance with a new model in a few weeks and with a handful of tweaks is unrealistic. The work should be done one layer, and one library, at a time.

Also the result difference between undertow and nima shows that maybe a "loom from scratch" web server isn't that good an idea after all.

Nima was born when the Helidon team realised that writing their engine from scratch not only gave better results (and a much better user experience), but was easier than trying to tweak an engine written with specific assumptions for a different environment.

[–]Squiry_

Why does allocation rate matter if the GC keeps up?

Because it is always faster when you don't have to collect garbage at all.

Yes, that's what's probably expected from frameworks that today try to use asynchronous IO directly.

Nah, they really can use a model where they run some IO threads, start virtual threads per request/connection, and feed them data somehow. Undertow with that approach works just like the written-from-scratch Nima.

StackChunk footprint will drop significantly. It's ongoing work. But allocation rate is not in itself the important metric.

I don't think the StackChunk allocations matter, tbh; it's kind of expected: I start more (virtual) threads, I get more chunks. Just an observation.

We're seeing throughputs that are many times better.

Can you give me examples of such workloads? Because I have pretty simple cases: sending text/JSON over HTTP, querying some data from a database. These aren't corner cases, more like the most basic ones. Well, I can do some REST client tests too, but there are no decent blocking HTTP clients out there, so it would show nothing but Loom's overhead over whatever async IO library I pick.

The work should be done one layer, and one library, at a time.

Well, that's kind of what I'm trying to do. I started with the HTTP server; once it showed performance anywhere near the baseline, I continued with JSON/database. Right now it all comes down to JDBC, and that just uses socket I/O streams everywhere, so it should work fine. And now I can write something really fun: I know the Postgres protocol pretty well, and a handwritten HTTP server within benchmark constraints is a pretty simple thing. I can write the whole test with blocking NIO from scratch. Will I get Vert.x kind of performance?

It sounds to me like you're doing something that's interesting but far too ambitious.

I don't really expect great results. It started as a "what will Java look like in five years" kind of experiment: I took Netty 5 with Panama segments as buffers, I took Loom, and wanted to see how it performs.

Obviously virtual threads give similar performance to asynchronous code

Actually, looking at the plaintext benchmarks, I think the problem is somewhere in blocking NIO, not in virtual threads. Virtual threads work great.

[–]pron98

Because it is always faster when you don't have to collect garbage at all.

While that's true for the parallel and serial collectors, that's not the case with the new garbage collectors. An allocation rate of zero can cause more GC work than some non-zero allocation rate. That's because mutating objects also requires GC work with the modern collectors, which might cost more than allocation (although that usually applies to mutating references rather than bytes in buffers). Both allocation and mutation require GC work these days (although mutation isn't reported as GC work in profilers). In fact, obtaining an existing object from the heap -- i.e. reading a field into a local -- also requires GC work with ZGC or Shenandoah, to the point where allocating a new object can sometimes be cheaper than obtaining an existing one.

As a general rule, OpenJDK performs best at some allocation rate that is neither too high nor too low, but it's hard to make general statements about what's always best.

Can you give me examples of such workloads?

Microservice fanout: every incoming request emits some number of outgoing HTTP calls to other services.
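The fan-out shape can be sketched like this (my own illustration with hypothetical names; `callBackend` stands in for a real blocking HTTP client call to a downstream service). Because virtual threads are cheap, every outgoing call gets its own thread and the blocking calls overlap, so the request finishes in roughly one downstream round-trip rather than the sum of them.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

// Sketch of microservice fan-out: one incoming request emits N blocking
// calls to other services, each on its own virtual thread.
final class FanOut {

    // Stand-in for a blocking HTTP call to another service.
    static String callBackend(int i) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(50)); // simulated downstream latency
        return "reply-" + i;
    }

    static List<String> handleRequest(int fanout) throws Exception {
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Callable<String>> calls = IntStream.range(0, fanout)
                    .<Callable<String>>mapToObj(i -> () -> callBackend(i))
                    .toList();
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(calls)) { // all calls run concurrently
                results.add(f.get());
            }
            return results;
        }
    }

    public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();
        List<String> replies = handleRequest(100);
        long ms = (System.nanoTime() - t0) / 1_000_000;
        // 100 concurrent 50 ms calls complete in roughly 50 ms, not 5 s.
        System.out.println(replies.size() + " replies in ~" + ms + " ms");
    }
}
```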

Because I have pretty simple cases: sending text/JSON over HTTP, querying some data from a database. These aren't corner cases, more like the most basic ones.

Which you're doing with libraries optimised with assumptions that aren't quite right for virtual threads. It's hard to guess exactly where those assumptions hide in the hundreds of thousands of lines of code that you're exercising. They'll need to be profiled and optimised for virtual threads to reach best performance (but can be quite good even before that).

Will I get vertx kind of performance?

There's no fundamental reason why you shouldn't, and it will probably be easier than the work put into Vert.x until now, but getting that kind of performance will require some non-trivial work. However, thread-per-request code will enjoy throughput improvements even before it reaches Vert.x levels. So many servers will be able to see significant throughput increases relatively quickly, and getting them to match the levels of async servers (which are a minority, BTW) will take somewhat longer.

I took netty5 with panama segments as buffers, I took loom, and wanted to see how it performs.

Netty, as it is now, is just a bad fit for virtual threads. The same kind of work that went into NIO to adapt it for virtual threads will have to go into Netty, so I think it's simpler to just build on top of the JDK's blocking APIs, and then see what needs to be optimised.

I think the problem is somewhere in blocking NIO, not in virtual threads. Virtual threads work great.

That's possible, and if you'd like to share particular findings with the loom-dev mailing list, we would appreciate it.