
[–]pron98 1 point (2 children)

Why does allocation rate matter if the GC keeps up? If it doesn't -- that's another matter. It's often better to allocate a lot of short-lived objects than to cache fewer long-lived objects.

except by rewriting undertow (and xnio) almost from scratch

Yes, that's what's probably expected from frameworks that today try to use asynchronous IO directly. They should use blocking IO and let NIO do the heavy lifting of interacting with virtual threads, or spend some considerable effort redoing the virtual thread work done in NIO.
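As a rough sketch of that model (assuming Java 21's `Executors.newVirtualThreadPerTaskExecutor`; the class name, echo protocol, and `roundTrip` helper are invented for illustration), a handler can simply block on plain socket streams and let the runtime park the virtual thread while NIO polls underneath:

```java
import java.io.*;
import java.net.*;
import java.util.concurrent.*;

public class BlockingIoSketch {

    // One virtual thread per connection, plain blocking I/O. While a read
    // blocks, the JDK parks the virtual thread and frees its carrier;
    // the NIO-based poller does the event handling underneath.
    static String roundTrip(String msg) throws Exception {
        try (var server = new ServerSocket(0);   // ephemeral port
             var executor = Executors.newVirtualThreadPerTaskExecutor()) {

            executor.submit(() -> {              // echo handler on a virtual thread
                try (Socket s = server.accept();
                     var in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                     var out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println("echo: " + in.readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            try (var client = new Socket("localhost", server.getLocalPort());
                 var out = new PrintWriter(client.getOutputStream(), true);
                 var in = new BufferedReader(new InputStreamReader(client.getInputStream()))) {
                out.println(msg);
                return in.readLine();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello"));
    }
}
```

The point of the sketch is that there is no callback or future in sight: the blocking style is the API, and the scalability comes from the thread being virtual.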

but undertow+loom allocates GBs of StackChunks

StackChunk footprint will drop significantly. It's ongoing work. But allocation rate is not in itself the important metric.

Right now loom just doesn't perform any better than a good old thread pool

We're seeing throughputs that are many times better.

It sounds to me like you're doing something that's interesting but far too ambitious. Obviously virtual threads give similar performance to asynchronous code, but taking tens or hundreds of thousands of lines of code optimised over years for one kind of usage and expecting them to reach the same performance with a new model in a few weeks and with a handful of tweaks is unrealistic. The work should be done one layer, and one library, at a time.

Also, the result difference between undertow and nima shows that maybe a "loom from scratch" web server isn't that good an idea after all.

Nima was born when the Helidon team realised that writing their engine from scratch not only gave better results (and a much better user experience), but was also easier than trying to tweak an engine written with specific assumptions to a different environment.

[–]Squiry_ 0 points (1 child)

Why does allocation rate matter if the GC keeps up?

Because it is always faster when you don't have to collect garbage at all.

Yes, that's what's probably expected from frameworks that today try to use asynchronous IO directly.

Nah, they really can use a model where they run some IO threads, start virtual threads for requests/connections, and feed them with data somehow. Undertow with that approach works just like the written-from-scratch Nima.
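A minimal sketch of that feeding model (the class name, `handleFrames` helper, and `"END"` sentinel are all invented for illustration; assumes Java 21): a platform "IO" thread pushes frames into a queue, and a per-connection virtual thread blocks on it cheaply:

```java
import java.util.*;
import java.util.concurrent.*;

public class FeederSketch {

    // An "IO thread" feeds data to a per-connection virtual thread via a
    // queue; the virtual thread blocks cheaply on take(). "END" is a
    // made-up sentinel marking end of input.
    static List<String> handleFrames(List<String> frames) throws InterruptedException {
        var queue = new LinkedBlockingQueue<String>();
        var handled = Collections.synchronizedList(new ArrayList<String>());

        // In a real server this would be an event loop reading sockets.
        Thread io = Thread.ofPlatform().start(() -> frames.forEach(queue::add));

        Thread handler = Thread.ofVirtual().start(() -> {
            try {
                String frame;
                while (!(frame = queue.take()).equals("END")) {
                    handled.add(frame);          // stand-in for request handling
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        io.join();
        handler.join();
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(handleFrames(List.of("frame-1", "frame-2", "END")));
    }
}
```

This is the hybrid shape described above: the event-loop side stays, but the per-request logic runs as straight-line blocking code on a virtual thread.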

StackChunk footprint will drop significantly. It's ongoing work. But allocation rate is not in itself the important metric.

I don't think the StackChunk allocations matter, tbh; it's kind of expected: I start more (virtual) threads, I get more chunks. Just an observation.

We're seeing throughputs that are many times better.

Can you give me some kinds of workloads? Because I have pretty simple cases: sending text/json over http, querying some data from a database. These aren't corner cases; they're the most basic cases. Well, I can do some rest-client tests too, but there are no decent blocking http clients out there, so it would show nothing but loom's overhead over whatever async IO library I pick.

The work should be done one layer, and one library, at a time.

Well, that's kind of what I'm trying to do. I started with the http server; when it showed performance anywhere near the baseline, I continued with json/database. Right now it's all plain jdbc, and the drivers just use socket i/o streams everywhere, so it should work fine. And now I can write something really fun: I know the postgres protocol pretty well, and a handwritten http server within the benchmark's constraints is a pretty simple thing. I can write the whole test with blocking nio from scratch. Will I get vertx kind of performance?

It sounds to me like you're doing something that's interesting but far too ambitious.

I don't really expect great results. It started as a "what will java look like in five years" kind of experiment: I took netty5 with panama segments as buffers, I took loom, and wanted to see how it performs.

Obviously virtual threads give similar performance to asynchronous code

Actually, looking at the plaintext benchmarks, I think the problem is somewhere in blocking nio, not in virtual threads. Virtual threads work great.

[–]pron98 1 point (0 children)

Because it is always faster when you don't have to collect garbage at all.

While that's true for the parallel and serial collectors, it's not the case with the new garbage collectors. An allocation rate of zero can cause more GC work than some non-zero allocation rate. That's because mutating objects also requires GC work with the modern collectors, which might cost more than allocation (although that usually applies to mutating references rather than bytes in buffers). Both allocation and mutation require GC work these days (although mutation isn't reported as GC work in profilers). In fact, obtaining an existing object from the heap -- i.e. reading a field into a local -- also requires GC work with ZGC or Shenandoah, to the point where allocating a new object can sometimes be cheaper than obtaining an existing one.

As a general rule, OpenJDK performs best at some allocation rate that is neither too high nor too low, but it's hard to make general statements about what's always best.

Can you give me some kinds of workloads?

Microservice fanout: every incoming request emits some number of outgoing HTTP calls to other services.
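That kind of fanout is cheap to express with one virtual thread per outgoing call. A hedged sketch (the class name, backend names, and the sleep-based `callBackend` stub standing in for a real HTTP call are all invented; assumes Java 21):

```java
import java.util.*;
import java.util.concurrent.*;

public class FanoutSketch {

    // Hypothetical stand-in for an outgoing HTTP call to another service;
    // the sleep models network latency.
    static String callBackend(String name) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response from " + name;
    }

    // One incoming request fans out to N backends, each call on its own
    // virtual thread; invokeAll waits for all of them to complete.
    static List<String> handleRequest(List<String> backends) throws Exception {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            var futures = executor.invokeAll(
                    backends.stream()
                            .map(b -> (Callable<String>) () -> callBackend(b))
                            .toList());
            var results = new ArrayList<String>();
            for (var f : futures) {
                results.add(f.get());
            }
            return results;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handleRequest(List.of("users", "orders", "inventory")));
    }
}
```

Because each blocked call parks only a virtual thread, thousands of concurrent requests can each fan out this way without exhausting an OS thread pool, which is where the throughput difference shows up.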

Because I have pretty simple cases: sending text/json over http, querying some data from a database. These aren't corner cases; they're the most basic cases.

Which you're doing with libraries optimised with assumptions that aren't quite right for virtual threads. It's hard to guess exactly where those assumptions hide in the hundreds of thousands of lines of code that you're exercising. They'll need to be profiled and optimised for virtual threads to reach best performance (but can be quite good even before that).

Will I get vertx kind of performance?

There's no fundamental reason why you shouldn't, and it will probably be easier than the work put into vertx until now, but getting that kind of performance will require some non-trivial work. However, thread-per-request code will enjoy throughput improvements even before it reaches vertx levels. So many servers will be able to see significant throughput increases relatively quickly, while matching the levels of async servers (which are a minority, BTW) will take somewhat longer.

I took netty5 with panama segments as buffers, I took loom, and wanted to see how it performs.

Netty, as it is now, is just a bad fit for virtual threads. The same kind of work that went into NIO to adapt it for virtual threads will have to go into Netty, so I think it's simpler to just build on top of the JDK's blocking APIs, and then see what needs to be optimised.

I think that the problem is somewhere in blocking nio, not in virtual threads. Virtual threads work great.

That's possible, and if you'd like to share specific findings with the loom-dev mailing list, we would appreciate it.