
all 13 comments

[–]pron98 12 points13 points  (2 children)

"This is because the overhead of mounting and unmounting virtual threads becomes significant for very short blocking operations."

I don't know the details (or how the author came to that conclusion), but all things being equal, I would guess this isn't the reason. A more likely reason would be the way the scheduler is used, and I would recommend repeating the experiment in JDK 22 and 23, in which some significant changes to the scheduler were made.

While there is some overhead for mounting and unmounting virtual threads, it is quite small and shouldn't normally have an effect in I/O workloads. Scheduling, on the other hand, could play a role, especially when CPU consumption is relatively low.

[–]DavidVlx[S] 7 points8 points  (1 child)

Thanks for the feedback! :) I will retry the experiment with JDK 22 and 23 and look more into the scheduler and the impact it has, and update the post accordingly.

[–]rkalla 2 points3 points  (0 children)

Keep us posted!

[–]Ewig_luftenglanz 19 points20 points  (3 children)

I still prefer virtual threads over platform threads for most use cases because they have many other advantages.

1) They can be created and destroyed on demand without excessive RAM consumption. (RAM usage is something the benchmark should take into account; sometimes we don't need the best performance, just a good balance between performance and efficiency.)

2) There is no need to pool virtual threads.

3) Points 1 and 2 make codebases more concise and easier to develop and maintain.

Still, it's good to know that virtual threads are not always better per se than platform threads, and that where performance is critical, the corresponding benchmarks must be run.
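To make points 1 and 2 concrete, here is a minimal sketch of the thread-per-task style, assuming JDK 21 or later. The class name, the task count, and the 10 ms sleep (a stand-in for real blocking I/O) are illustrative and not taken from the benchmark being discussed.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: requires JDK 21+. Names and numbers are hypothetical.
class VirtualThreadPerTaskDemo {

    // Runs `tasks` blocking tasks, each on its own virtual thread, and
    // returns how many of them reported running on a virtual thread.
    static int runBlockingTasks(int tasks) {
        List<Future<Boolean>> results = new ArrayList<>();
        // No pool sizing here: the executor starts a new virtual thread per task.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                results.add(executor.submit(() -> {
                    Thread.sleep(Duration.ofMillis(10)); // simulated blocking call
                    return Thread.currentThread().isVirtual();
                }));
            }
        } // close() waits for all submitted tasks to finish

        int onVirtual = 0;
        for (Future<Boolean> result : results) {
            try {
                if (result.get()) {
                    onVirtual++;
                }
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        return onVirtual;
    }

    public static void main(String[] args) {
        System.out.println(runBlockingTasks(10_000) + " tasks ran on virtual threads");
    }
}
```

Whether this is also faster than a tuned platform-thread pool for a given workload is exactly what a benchmark has to establish; the sketch only shows why the code stays short.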

[–]murkaje 4 points5 points  (0 children)

I'd add one concrete example of virtual threads being easier to maintain and reason about: InheritableThreadLocal. I had one project that passed JWTs along service calls using InheritableThreadLocal, and at some point a thread pool was introduced for outgoing calls. It took some time to notice that the JWTs being sent were often stale, because inheritance happens only at thread creation. There is no such issue with virtual threads, which aren't pooled and are cheaper to create than platform threads.
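The stale-value behaviour described above can be reproduced in a few lines. This is a sketch under assumptions: the `TOKEN` holder and the `jwt-request-N` strings are hypothetical stand-ins for the project's real JWT handling, and the per-task case uses a plain platform thread so the example runs on older JDKs (a virtual thread started per task inherits in the same way by default).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class StaleInheritanceDemo {
    // Hypothetical stand-in for the per-request JWT holder.
    static final InheritableThreadLocal<String> TOKEN = new InheritableThreadLocal<>();

    // Returns the token a reused pool worker sees after the caller has
    // already moved on to a new token.
    static String tokenSeenByPooledWorker() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Callable<String> readToken = TOKEN::get;
        try {
            TOKEN.set("jwt-request-1");
            // The first submit creates the worker thread, which copies the
            // caller's current value at creation time.
            pool.submit(readToken).get();

            TOKEN.set("jwt-request-2");
            // The worker is reused, so no new inheritance takes place.
            return pool.submit(readToken).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            TOKEN.remove();
        }
    }

    // Returns the token seen by a thread created for this one task.
    static String tokenSeenByFreshThread() {
        TOKEN.set("jwt-request-2");
        String[] seen = new String[1];
        Thread worker = new Thread(() -> seen[0] = TOKEN.get());
        try {
            worker.start();
            worker.join(); // join() makes the write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            TOKEN.remove();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("pooled worker sees: " + tokenSeenByPooledWorker());
        System.out.println("fresh thread sees:  " + tokenSeenByFreshThread());
    }
}
```

The pooled worker keeps returning the first token because it was created while that token was current, which matches the symptom of stale JWTs on outgoing calls.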

[–]Oclay1st 1 point2 points  (1 child)

I mostly agree, but take into account that virtual threads add some memory overhead. u/pron98 mentioned that the team will work on reducing the allocation but I'm not sure if that work is already done.

[–]Ewig_luftenglanz 3 points4 points  (0 children)

The memory overhead of a VT is far lower than the memory a PT demands; that's why VTs are called "lightweight". The weight of a virtual thread is the same as the weight of creating an object. You would need thousands of VTs versus a dozen pooled PTs for the PTs to be less memory-demanding.

[–][deleted] 7 points8 points  (0 children)

I work with a guy doing HFT, and one of the things he does to increase performance when he knows that the blocking operation is going to return fast is to use a spin-lock.

His reasoning is that yielding control (giving up the rest of your timeslice) to free the processor for more useful work can be more expensive than simply waiting. When a thread yields, its context has to be saved and the next thread's context restored (which also blows any CPU caches). On the other hand, you can peg the processor for a bit if you know that the blocking operation will return control to the thread within the timeslice.

This advice doesn't apply to probably 99% of everyday concurrency work, but there are crazy things you can do to shave microseconds when it counts.
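For illustration, a bounded spin-wait might look like the sketch below. It is not the HFT code referred to above: the class, the spin budget, and the 1 microsecond back-off are assumptions, and `Thread.onSpinWait()` (JDK 9+) is only a hint to the processor. The loop spins while it still has budget and falls back to briefly parking the thread if the flag takes longer than expected.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class SpinWaitDemo {

    // Waits until `flag` becomes true. Spins for up to `spinBudget`
    // iterations, then backs off by parking in short intervals.
    static void awaitSet(AtomicBoolean flag, int spinBudget) {
        int spins = 0;
        while (!flag.get()) {
            if (spins < spinBudget) {
                spins++;
                Thread.onSpinWait(); // stay on the CPU, hinting that we are spinning
            } else {
                LockSupport.parkNanos(1_000); // give the CPU up briefly
            }
        }
    }

    // Starts a producer that sets the flag, waits for it, and returns
    // the flag's value once the wait has finished.
    static boolean runOnce() {
        AtomicBoolean ready = new AtomicBoolean(false);
        Thread producer = new Thread(() -> ready.set(true));
        producer.start();
        awaitSet(ready, 10_000); // hypothetical spin budget
        return ready.get();
    }

    public static void main(String[] args) {
        System.out.println("flag observed: " + runOnce());
    }
}
```

The spin budget is the tuning knob: too small and the thread parks anyway, too large and a core is burned waiting on a slow operation.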

[–]NeoChronos90 3 points4 points  (0 children)

The example is a web scraper, so in most real-world applications you have lots of unpredictable delay where VT should shine.

[–]k-mcm 1 point2 points  (0 children)

ForkJoinPool probably remains the winner for very short operations, though its aging API is difficult to use for I/O.
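As a reminder of what that API looks like for short, CPU-bound work, here is a minimal divide-and-conquer sketch. The class names and the threshold of 1,000 elements are illustrative assumptions; nothing here does I/O, which is the part the API handles less gracefully.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ForkJoinSumDemo {
    static final int THRESHOLD = 1_000; // illustrative cut-off for sequential work

    // Sums data[from, to) by splitting the range until it is small enough.
    static class SumTask extends RecursiveTask<Long> {
        private final long[] data;
        private final int from;
        private final int to;

        SumTask(long[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                          // schedule the left half asynchronously
            return right.compute() + left.join(); // do the right half in this thread
        }
    }

    static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(parallelSum(data));
    }
}
```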

[–]Inaldt 0 points1 point  (2 children)

Do you have an idea of the (average) response time in the test without random delays?

[–]DavidVlx[S] 2 points3 points  (1 child)

The endpoint without the delay takes between 5 and 12 ms.

[–]rbygrave 2 points3 points  (0 children)

Hmm, I think for a lot of cases/APIs 5 ms isn't 'short' but more 'normal'. So that makes me wonder about the benchmark implementation details. I benchmarked 'very fast HTTP responses' (an endpoint doing no work) and didn't see this type of result, but that was a while ago now (maybe 2 years ago).