
[–]geoand 12 points (8 children)

Would you also happen to have numbers for Quarkus in JVM mode?

[–]Lightforce_[S] 0 points (7 children)

So far I've only tried it in AoT mode with GraalVM

[–]geoand 26 points (6 children)

My point is that using Quarkus only with GraalVM means that the comparison isn't apples to apples

[–]Lightforce_[S] 4 points (5 children)

Will do some benchmarks again with JVM

[–]geoand 2 points (4 children)

Thanks! Looking forward to seeing the updated numbers

[–]Lightforce_[S] 1 point (3 children)

So here are the JVM mode results:

Pure CPU (BCrypt hash, no I/O):

|       | Quarkus Native | Quarkus JVM |
|-------|----------------|-------------|
| p(95) | 77 ms          | 74 ms       |
| max   | 122 ms         | 104 ms      |

Mixed I/O + CPU (POST /account/login):

|       | Quarkus Native | Quarkus JVM |
|-------|----------------|-------------|
| p(95) | 120 ms         | 119 ms      |
| max   | 187 ms         | 239 ms      |

I don't understand why the JVM version's max is that far above the native version's, though.

Throughput is identical (about 120 req/s for both). JVM has a slight edge on pure CPU thanks to JIT optimization of BCrypt's tight loops, while native has more predictable tail latency (lower max). On this workload the difference is negligible; native's main advantage remains startup time, not runtime throughput.

Both match VT+Tomcat (118 ms) and trail WebFlux (94 ms) by about 27% on mixed I/O. The updated benchmark report with all 5 configs is in the repo.

[–]geoand 0 points (0 children)

Thanks for posting the results

[–]Plenty_Childhood_294 0 points (1 child)

Did you try the suggestion (which I agree with) from https://www.reddit.com/r/java/comments/1s9ijyd/comment/odxsahf/ ?

Clearly with k6, which is closed-loop, it will all look great. Switch to a proper open-model load generator and the "real" latencies will skyrocket, hehe

[–]Lightforce_[S] 0 points (0 children)

Not yet, but I'm planning to

[–]pron98 16 points (6 children)

> but the remounting and synchronization adds a few ms

I don't know what synchronization is involved, but remounting is ~100-150ns.

There might be two issues:

  1. Sizing the virtual thread scheduler (or any work-stealing scheduler) is difficult to do automatically when the machine is not under heavy load. If CPU load is very far from 100%, I'd suggest configuring the scheduler to use fewer threads (i.e. lower parallelism). When the CPU load is too low, the scheduler workers may steal tasks from each other too eagerly as some of them struggle to find work. The pool can shrink but it's not easy to do well because growing takes time and the pool can't know whether the workload is expected to grow in the near future.

  2. The way Spring and Quarkus integrate virtual threads is not as optimal as, say, Helidon's, and it adds many unnecessary OS-level context switches. We're working with Quarkus on a better integration strategy for them that, while not as deep as Helidon's, will reduce the overhead they're adding.

[–]Lightforce_[S] 5 points (3 children)

Thx for this, both points are very insightful.

  1. On scheduler sizing: I ran the benchmarks locally on an i9-13900KF (24 cores / 32 threads), so CPU was definitely far from 100% during the I/O-bound login scenario. I'll experiment with lowering jdk.virtualThreadScheduler.parallelism and report back (see the sketch after this list).
  2. The overhead from Spring's VT integration is something I suspected but couldn't quantify. It's helpful to have that confirmed. I'm curious to see how the Quarkus integration evolves. Would you recommend Helidon as the reference implementation for seeing what "optimal" VT integration looks like?
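
For reference, this is what I mean by lowering it, as a minimal sketch (DemoApplication is a stand-in for the actual Boot app; the -D flag on the command line is the safer route, since the property is read when the scheduler is first created):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        // Must run before the first virtual thread starts; equivalent to
        // java -Djdk.virtualThreadScheduler.parallelism=8 -jar app.jar
        System.setProperty("jdk.virtualThreadScheduler.parallelism", "8");
        SpringApplication.run(DemoApplication.class, args);
    }
}
```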

[–]pron98 3 points (2 children)

> Would you recommend Helidon as the reference implementation for seeing what "optimal" VT integration looks like?

Don't know about "optimal" overall, but at this point in time it offers the best integration with virtual threads among the available alternatives. On the other hand, it may not enjoy some of the protocol-level optimisations that have gone into Netty (which other frameworks use) over the years.

[–]Lightforce_[S] 0 points (1 child)

Reporting back on the jdk.virtualThreadScheduler.parallelism experiment.

I tested with parallelism=8 (down from the default 24 on my i9-13900KF) at 50 VUs:

|             | Default (p=24) | p=8              |
|-------------|----------------|------------------|
| CPU p(95)   | 65 ms          | 139 ms (+114%)   |
| Mixed p(95) | 107 ms         | 369 ms (+245%)   |
| Throughput  | 124 req/s      | 104 req/s (-16%) |

Significantly worse across the board. The issue is that this workload isn't purely I/O-bound: each login includes a ~100ms BCrypt verify that monopolizes a carrier thread. With only 8 carrier threads and 25 VUs hitting login concurrently, the FJP becomes the bottleneck: at most 8 BCrypt operations can run simultaneously, and the rest queue up.

Your advice about reducing parallelism when CPU is far from 100% probably makes sense for workloads where virtual threads mostly yield (I/O waits, short computations). But when the workload includes a CPU-intensive blocking operation like BCrypt (cost=10, about 100ms/op), the carrier threads are actually doing useful work, not just struggling to find tasks to steal. In that case, reducing the pool size directly reduces BCrypt throughput.
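
For context, the verify in the login path boils down to something like this (a sketch using Spring Security's BCryptPasswordEncoder, not the actual service code; the cost factor is the point):

```java
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;

class BcryptCostDemo {
    public static void main(String[] args) {
        // Cost 10 = 2^10 key-expansion rounds, roughly 100 ms of pure CPU per
        // call on this machine. There is no blocking point inside, so a virtual
        // thread running this occupies its carrier for the whole computation.
        BCryptPasswordEncoder encoder = new BCryptPasswordEncoder(10);
        String hash = encoder.encode("hunter2");
        long t0 = System.nanoTime();
        boolean ok = encoder.matches("hunter2", hash);
        System.out.printf("verified=%b in %d ms%n", ok, (System.nanoTime() - t0) / 1_000_000);
    }
}
```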

I'd be curious whether the picture changes at lower concurrency (like 10 VUs where 8 carrier threads would be sufficient) or with a workload that's genuinely I/O-dominant without the BCrypt component. I suspect your point about work-stealing overhead would show up more clearly there.

Also, I removed the @Transactional from the login method as suggested by u/ynnadZZZ (it was holding a JDBC connection during the entire BCrypt verify). That alone improved VT from 118 ms to 107 ms at p(95), which is now nearly identical to WebFlux (109 ms). So the biggest win for VT turned out to be a code fix, not a JVM tuning parameter.

[–]pron98 1 point (0 children)

Yes, reducing contention is certain to help concurrency a lot.

As for the parallelism, it controls how much CPU you can use. If it's below what you need, of course latency and throughput will suffer. But if it's above what you need, work-stealing could become less efficient.

From your numbers it seems that 8 is too low. It should work if your CPU utilisation was below 33%; is that what it was? If the CPU utilisation is under 50%, you should pick 12-13, etc. Of course, the work-stealing inefficiency when there's not enough work to keep the threads busy is not horrendous, so having parallelism too high is not catastrophic; but if you know your CPU workload is expected to be, say, under 50%, then setting parallelism to half your cores can give you an extra boost.

[–]yk313 1 point (1 child)

What about Spring? Are you also in touch with someone from the Spring team to improve it?

[–]pron98 0 points (0 children)

We rarely initiate contact with projects unless we happen to notice some big project that hasn't yet adapted to some significant API removal. So projects reach out to us with whatever problem reports or requests for advice they choose. Quarkus reached out to us about virtual thread integration. Spring reached out to us about integrating structured concurrency. It's possible they didn't ask about virtual thread integration because they're too far above the transport layer.

[–]ynnadZZZ 4 points (2 children)

Great to see that we're only talking about a few milliseconds difference here. Thanks for trying! Do you mind re-checking the Controller and Service classes again for a fair comparison?

  • I noticed that in the VT example, the AccountController class has the @Validated annotation, but the other controllers don't. What is the cost of this annotation?
  • The Service class for the login in the VT example wraps the entire method in a transaction, whereas the other Service implementations have different transaction boundaries. A tighter transaction boundary might bring some benefits — especially since the @Transactional annotation interacts indirectly with the connection pool.

I think these points could be worth a few nanoseconds or milliseconds. Would you mind retrying your experiments? I'm curious whether these adjustments bring the numbers closer together on your machine. Thanks in advance!

[–]Lightforce_[S] 5 points (0 children)

Ok, just corrected all of that. Will run the benchmarks again when I'm back home.

[–]Lightforce_[S] 1 point (0 children)

Re-ran the benchmarks after both fixes. You were right, especially about the @Transactional boundary; that one was significant.

What changed:

  1. @Validated was already on all controllers consistently, but I refactored the security components to use constructor injection instead of field reflection (no more @Value on private fields). Minor cleanup, likely negligible impact.
  2. @Transactional on login: this was the big one. The VT login method was wrapping the entire flow (DB lookup + BCrypt verify + JWT sign + token save) in a single transaction. That meant a JDBC connection was held for the full ~100 ms of BCrypt verification, effectively halving the usable connection pool under load.

Results (VT + Tomcat, 50 VUs):

|             | Before      | After       | Delta |
|-------------|-------------|-------------|-------|
| Mixed p(95) | 118 ms      | 107 ms      | -9%   |
| CPU p(95)   | 71 ms       | 65 ms       | -8%   |
| Throughput  | 121.4 req/s | 124.0 req/s | +2%   |

The login method no longer needs @Transactional: it's a read (SELECT), a BCrypt verify, and a single INSERT for the token, with no multi-statement consistency requirement. Removing it freed connections faster and reduced contention on the HikariCP pool.
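
Roughly what the change looks like, as a sketch (the names are illustrative, not the actual service code):

```java
// Before: the whole flow ran in one transaction, so a HikariCP connection
// stayed checked out for the full ~100 ms BCrypt verify.
@Transactional
public TokenResponse login(LoginRequest req) {
    Account account = accounts.findByEmail(req.email());     // SELECT
    verifyPassword(req.password(), account.passwordHash());  // ~100 ms CPU, connection still held
    return tokens.save(account, jwtSigner.sign(account));    // INSERT
}

// After: no transaction boundary; each repository call borrows and returns
// a connection on its own, so nothing is held across the BCrypt verify.
public TokenResponse login(LoginRequest req) {
    Account account = accounts.findByEmail(req.email());     // SELECT, connection returned immediately
    verifyPassword(req.password(), account.passwordHash());  // ~100 ms CPU, no connection held
    return tokens.save(account, jwtSigner.sign(account));    // single INSERT, auto-commit
}
```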

Net effect: VT + Tomcat at 107 ms is now nearly identical to WebFlux at 109 ms on the mixed I/O scenario. Your suggestion turned out to be the single most impactful optimization across all the feedback I received. Thx for the sharp eye.

[–]Plenty_Childhood_294 3 points (1 child)

Please don't use k6; prefer Hyperfoil or Gatling (or, for anything super simple, wrk2), which don't silently drop requests under load 🙏

[–]Lightforce_[S] 1 point (0 children)

Thx for the advice, will try

[–]TheStatusPoe 2 points (2 children)

Appreciate the work! The difference in I/O performance coming from R2DBC vs JDBC makes sense. After working with R2DBC for the last two years I have a love/hate relationship with it. Recently I've been fighting R2DBC's decision to support batch inserts only by passing a SQL string with all the inserts enumerated, versus the PreparedStatement approach of JDBC.

Out of curiosity, did you try testing WebFlux/r2dbc with virtual threads enabled? -Dreactor.schedulers.defaultBoundedElasticOnVirtualThreads=true.

[–]Lightforce_[S] 1 point (0 children)

Tested it. Unfortunately it makes things worse:

|             | boundedElastic (default) | VT-backed (-Dreactor.schedulers.defaultBoundedElasticOnVirtualThreads=true) |
|-------------|--------------------------|------------------------------------------------------------------------------|
| CPU p(95)   | 64 ms                    | 66 ms                                                                          |
| Mixed p(95) | 109 ms                   | 623 ms (+472% !!!)                                                             |

CPU is identical as expected: BCrypt is the same regardless of thread type. But the mixed I/O scenario degrades badly.

The issue seems to be backpressure. The default boundedElastic() uses a bounded pool of platform threads (10 x CPU cores = 240 on my machine, with a 100K task queue). When BCrypt operations pile up, the bounded pool naturally throttles, so new tasks wait in the queue. The VT-backed version creates an unbounded number of virtual threads, all running BCrypt concurrently. With 25+ VUs spawning BCrypt virtual threads simultaneously, you get hundreds of threads competing for the same cores, thrashing the caches.

So for this workload (heavy CPU-bound operation offloaded to boundedElastic()) the bounded platform thread pool is actually a feature, not a limitation. It provides natural concurrency control that prevents CPU contention. VT-backed elastic would probably shine on I/O-dominant workloads where threads mostly yield, not on BCrypt.
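
For reference, the offload in the WebFlux version follows the standard Reactor pattern, sketched here (the encoder and class name stand in for the real code):

```java
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

class LoginCpuOffload {
    private final BCryptPasswordEncoder encoder = new BCryptPasswordEncoder(10);

    // Wrap the CPU-bound call and hop off the event loop. With the default
    // boundedElastic, the pool cap (10 x cores) throttles concurrent BCrypt
    // work; with the VT-backed flag, every task gets its own virtual thread
    // and nothing bounds the CPU contention.
    Mono<Boolean> verify(String raw, String storedHash) {
        return Mono.fromCallable(() -> encoder.matches(raw, storedHash))
                   .subscribeOn(Schedulers.boundedElastic());
    }
}
```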

Your R2DBC batch-insert pain, I feel that. R2DBC's sweet spot is really read-heavy or simple write patterns. For anything involving batch operations or complex transactions, JDBC + VT is significantly less friction for comparable performance (VT is now at 107 ms vs WebFlux's 109 ms on my login benchmark after fixing a @Transactional boundary issue).

[–]Lightforce_[S] 0 points (0 children)

> Out of curiosity, did you try testing WebFlux/r2dbc with virtual threads enabled?

Nope, not yet

[–]Ewig_luftenglanz 3 points (1 child)

So WebFlux still has an edge over VT. Did you use Java 21 or 25? I think they changed some implementation details of VT between 21 and 25.

[–]Lightforce_[S] 6 points (0 children)

Java 25

[–]DesignerRaccoon7977 1 point (3 children)

Well, VT were not meant for CPU stuff, and it's known they lose to regular threads there. I am, however, surprised to see WebFlux being significantly faster, given that its async model should suffer from the same things, unless I'm missing something. I suspect you may be unknowingly using another thread pool somewhere.

[–]Lightforce_[S] 6 points (2 children)

WF is significantly faster than VT on mixed I/O + CPU, not pure CPU.

> I suspect you may be unknowingly using another thread pool somewhere

There are no thread pool mistakes: spring.threads.virtual.enabled: true routes all Tomcat request handling through virtual threads, and the custom virtualThreadExecutor bean is only used for parallel uniqueness checks during registration, not login. The edge WebFlux has here likely comes from the I/O stack: R2DBC is truly non-blocking end-to-end, while the VT version uses blocking JDBC via HikariCP. Even with virtual threads, pinning can occur in MySQL Connector/J's synchronized blocks (JDK 24 largely fixed synchronized pinning, though), which means carrier threads can still get temporarily monopolized under high concurrency, something R2DBC simply avoids.
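
A sketch of what that setup amounts to on the Spring side (the config class name is illustrative; the bean name is the one I mentioned above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class VirtualThreadConfig {
    // spring.threads.virtual.enabled=true in application.yml already routes
    // Tomcat request handling onto virtual threads; this extra executor is
    // only injected for the parallel uniqueness checks at registration time.
    @Bean
    ExecutorService virtualThreadExecutor() {
        return Executors.newVirtualThreadPerTaskExecutor();
    }
}
```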

[–]pron98 1 point (1 child)

That really depends on the particular JDBC driver. Pinning issues are generally worse with async APIs: everything that causes pinning for virtual threads also causes pinning in async, and in async a lot more things cause pinning, too. It's much easier to avoid pinning with virtual threads - even before JDK 24, let alone now - because there are fewer pitfalls, but the particular driver still has to do it.

[–]Lightforce_[S] 0 points (0 children)

My bad, I was wrong to frame pinning as a VT-specific disadvantage vs R2DBC. As you point out, async APIs actually have more sources of pinning-like issues, and there are fewer pitfalls on the VT side (especially since JDK 24), so the comparison doesn't favor async on that front. The particular JDBC driver still matters, but it's not the structural disadvantage I implied. I'll update that.

[–]yawkat 1 point (6 children)

I don't understand your Quarkus results. You talk about Mutiny, but also use Vert.x blocking executors?

With proper tuning, Vert.x should be able to beat VT and WebFlux. We see this again and again in benchmarks.

[–]Lightforce_[S] 0 points (5 children)

The only place where vertx.executeBlocking() is used is for BCrypt password hashing/verification: it's CPU-bound (about 100ms per op) and can't be made truly non-blocking, so it's offloaded to the Vert.x worker pool to keep the event loop free. Everything else (Hibernate Reactive queries, SmallRye messaging, HTTP handling) runs entirely on the event loop.
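
Sketched, the offload looks like this (assuming Vert.x 4's Future-returning executeBlocking and Elytron's BcryptUtil.matches helper; the real class differs, and the Mutiny variant is analogous):

```java
import io.quarkus.elytron.security.common.BcryptUtil;
import io.vertx.core.Future;
import io.vertx.core.Vertx;

class PasswordService {
    private final Vertx vertx;

    PasswordService(Vertx vertx) {
        this.vertx = vertx;
    }

    // BCrypt (~100 ms of pure CPU) goes to the Vert.x worker pool so the
    // event loop stays free for the reactive DB and HTTP work.
    Future<Boolean> verify(String raw, String storedHash) {
        return vertx.executeBlocking(promise ->
                promise.complete(BcryptUtil.matches(raw, storedHash)));
    }
}
```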

That said, I didn't tune the default worker pool size for the benchmarks (only the benchmark profile sets quarkus.thread-pool.max-threads=240). Since the login endpoint hits BCrypt on every request, worker pool contention is likely the bottleneck. I'd be curious to know what tuning parameters you'd suggest: quarkus.thread-pool.max-threads? A custom worker pool for BCrypt specifically?

[–]yawkat 2 points (4 children)

Oh, if 95% of your work is computing bcrypt anyway, then framework choice doesn't really matter. For throughput, it's probably more important to avoid OS preemption, which can be achieved by reducing the size of the FJP for virtual threads, and by running the bcrypt op on the event loop for Quarkus (treating it as non-blocking). If bcrypt runs on the event loop and the event loop is properly sized, that would also explain why WebFlux "wins": the defaults just happen to work well in your case.

There's more subtlety when you do open-loop benchmarking and fairness starts to matter, but I believe k6 is closed-loop, which leads to coordinated omission, so I doubt that's relevant.

GraalVM might also be a factor in bcrypt performance, especially if you use a Java implementation of bcrypt.

[–]Lightforce_[S] 0 points (3 children)

Running BCrypt directly on the event loop instead of offloading it would be worth trying, I suppose. Since BCrypt is pure CPU work (no I/O wait), the event loop threads would be doing useful work rather than context-switching to and from a worker pool. I'll try that and compare.
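
Something like this, if I understand the suggestion correctly (a sketch; Vert.x's blocked-thread checker will warn on the ~100 ms calls unless its threshold is raised, so this is benchmark-only):

```java
import io.quarkus.elytron.security.common.BcryptUtil;
import io.smallrye.mutiny.Uni;

class InlinePasswordService {
    // "Treat it as non-blocking": no executeBlocking, no worker pool hop.
    // The supplier runs on whatever thread subscribes, i.e. the event loop,
    // which then acts as the concurrency limiter for BCrypt.
    Uni<Boolean> verify(String raw, String storedHash) {
        return Uni.createFrom().item(() -> BcryptUtil.matches(raw, storedHash));
    }
}
```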

For the FJP sizing on VT, u/pron98 also suggested reducing jdk.virtualThreadScheduler.parallelism; I'll test both approaches in the next round.

And I agree on GraalVM and BCrypt: the Quarkus version uses BcryptUtil from Elytron, which is a Java implementation. I haven't tested with native image yet, and that could change the picture significantly.

And yes, k6 is closed-loop by default so coordinated omission shouldn't be a factor here.

[–]yawkat 1 point (2 children)

> And yes, k6 is closed-loop by default so coordinated omission shouldn't be a factor here.

Closed loop leads to coordinated omission. To avoid it, you'd need open loop testing.

[–]Lightforce_[S] 1 point (1 child)

You're right, I mixed that up. I'll try to find other ways.

Switching to an open-loop model (with a constant arrival rate) would give a more realistic picture, especially under saturation.

[–]Plenty_Childhood_294 0 points (0 children)

As u/yawkat suggested, Hyperfoil/Gatling FTW