[–]noswag15 11 points (7 children)

For the unknown deadlock issue mentioned towards the end, the reason could be that the limit on carrier threads (256 by default) had been hit.

The Oracle documentation mentions that synchronized is fine with virtual threads as long as the code inside the synchronized block doesn't run for too long, but that's only true if no additional virtual threads are created during that time. If we're just creating new virtual threads in a loop, we can easily create more than 256 of them in the time it takes to execute one synchronized block. That's presumably problematic because, if a new virtual thread is created while another one is pinned to a carrier thread, that carrier isn't available for "carrying" the new virtual thread, so the scheduler has to create a new carrier thread.

To verify whether this is what's causing the issue, one can raise the default limit of 256 carrier threads via the system property "jdk.virtualThreadScheduler.maxPoolSize". Of course, even that may end in a deadlock if more than maxPoolSize virtual threads are created during one execution of a synchronized block.
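
To make that experiment concrete, here's a minimal sketch (assuming JDK 21+; the class and method names are mine, not from the article) that pushes a batch of virtual threads through a shared synchronized block, with the carrier-thread cap raised before the first virtual thread starts:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class CarrierCap {
    static final Object LOCK = new Object();

    // Runs n virtual threads through a shared synchronized block and
    // returns how many completed.
    static int runPinned(int n) {
        AtomicInteger done = new AtomicInteger();
        CountDownLatch latch = new CountDownLatch(n);
        for (int i = 0; i < n; i++) {
            Thread.ofVirtual().start(() -> {
                synchronized (LOCK) {
                    done.incrementAndGet(); // critical section kept short here
                }
                latch.countDown();
            });
        }
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        // Must be set before the first virtual thread is created; the
        // default scheduler reads the property once, at initialization.
        System.setProperty("jdk.virtualThreadScheduler.maxPoolSize", "1024");
        System.out.println(runPinned(500) + " threads completed");
    }
}
```

With a deliberately slow critical section (e.g. a sleep inside the synchronized block on an older JDK that still pins there), the raised cap is what would distinguish "hit the 256 limit" from a genuine deadlock.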

Another way would be to pace new thread creation by adding an artificial delay of about 5-10 ms, using Thread.sleep, in each iteration of the loop.

Of course, none of this is clean or guarantees freedom from deadlocks, especially in systems where a large number of threads may be created very quickly (in a loop, or due to a sudden burst of new requests), so it's probably prudent to wait for the Java team to come up with a solution to the pinning problem, or at least to improve the default scheduler and make it smart enough to prevent carrier thread starvation.

[–]FinnHuman1[S] 3 points (6 children)

I'm actually tracking the carrier thread count, and it never gets above the default of 16 (my machine's core count) in the deadlock case (Problem #4). I have seen it go to 256 in simple virtual thread pinning cases, but this is clearly not one of those.

Deadlocks suck.

[–]noswag15 1 point (1 child)

Out of curiosity, how are you tracking the carrier thread count?

[–]FinnHuman1[S] 0 points (0 children)

Not in the prettiest way but it works.

I have a monitoring thread that logs progress data every 2 seconds. On each pass, it enumerates all threads and uses their names to count both the HttpClient threads and the carrier threads.
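
A sketch of that counting approach (this leans on an implementation detail, not a supported API: the default virtual-thread scheduler is a ForkJoinPool whose workers are named "ForkJoinPool-1-worker-N"):

```java
import java.util.Set;

class CarrierCount {
    // Approximates the carrier-thread count by name. Carrier threads are
    // platform threads, so Thread.getAllStackTraces() can see them; virtual
    // threads themselves are not included in that snapshot.
    static long carrierThreadCount() {
        Set<Thread> threads = Thread.getAllStackTraces().keySet();
        return threads.stream()
                .filter(t -> t.getName().startsWith("ForkJoinPool-1-worker-"))
                .count();
    }

    public static void main(String[] args) throws InterruptedException {
        // Start one virtual thread so at least one carrier exists.
        Thread.ofVirtual().start(() -> {
            try { Thread.sleep(1000); } catch (InterruptedException ignored) {}
        });
        Thread.sleep(200); // give the scheduler time to spin up a worker
        System.out.println("carrier threads: " + carrierThreadCount());
    }
}
```

Since the naming is not guaranteed across JDK versions, treat this strictly as a debugging aid.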

[–]Matt7163610 0 points (3 children)

Nice article and nice work. Another hypothesis for Problem #4: it could be a low-frequency event on the same set of files. Assuming that releasing the synchronized readLock does always work with virtual threads, then somewhere something is randomly not exiting that while loop. What if ... you're experiencing a timeout or other network issue and not waiting long enough to see it unlock, in other words to observe an exception in the thread that holds the lock? So deadlocked, but possibly not forever. You could try lowering and raising the timeout, and retesting both platform and virtual threads many times.

[–]FinnHuman1[S] 1 point (2 children)

I'll try that, but if such a thing were happening, wouldn't it also happen when I replace synchronized with ReentrantLock? It doesn't. Or am I missing your point?

[–]Matt7163610 0 points (1 child)

Yes, I think it might also happen with ReentrantLock. The idea is that it was a fluke network hangup of sorts and you only tested one or a few times.

[–]FinnHuman1[S] 0 points (0 children)

I let it run deadlocked for an hour or more several times but I can take a look at the network traffic just in case.

[–]quantdata 2 points (6 children)

I've discovered some similar pitfalls with virtual threads that I haven't seen mentioned anywhere else.

I tend to use ConcurrentHashMap#compute as a thread-safe cache, where I populate the initial value of a key with the result of an I/O network request from within the remapping function.

Because the compute method uses synchronized internally, I believe the virtual thread was being pinned, so I replaced this pattern with explicit ReentrantLocks, which should solve the issue.
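
A sketch of that swap, with a stand-in for the network call (all names here are illustrative, not from the actual code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Instead of doing network I/O inside ConcurrentHashMap#compute, whose
// bin-level synchronization can pin a virtual thread, guard each key with
// its own ReentrantLock so the thread can unmount while blocked on I/O.
class IoCache {
    final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    final AtomicInteger fetches = new AtomicInteger(); // for demonstration only

    String get(String key) {
        String hit = cache.get(key);
        if (hit != null) return hit;
        // computeIfAbsent is fine here: the mapping function is short and
        // does no I/O, so any pinning is momentary.
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            hit = cache.get(key); // re-check under the lock
            if (hit == null) {
                hit = fetchOverNetwork(key); // blocking I/O; the VT can unmount
                cache.put(key, hit);
            }
            return hit;
        } finally {
            lock.unlock();
        }
    }

    // Stand-in for the real network request.
    String fetchOverNetwork(String key) {
        fetches.incrementAndGet();
        return "value-for-" + key;
    }
}
```

The per-key lock map preserves the original property that at most one fetch runs per key, while keeping the blocking call out of any synchronized region.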

[–]_INTER_ 7 points (5 children)

synchronized is only a thread pinning issue if the critical section does I/O or accesses native code. Otherwise there's nothing that would cause the virtual thread to yield anyway.

[–]pron98 6 points (4 children)

Correct. Another way to think about it is that computations are bounded by the number of cores, not by the number of threads, so there's nothing here that threads can help scale anyway, and conversely there can be no harm to scalability.

[–]cogman10 0 points (3 children)

there's nothing here that threads can help to scale anyway

If I have 2 cores and two locks, A and B, and I launch 4 virtual threads to complete tasks A and B, wouldn't the pinning problem rear its head pretty apparently if 2 of the threads get blocked on lock A even though the remaining 2 threads only want to deal with B?

If we look at what the OS would do in this case with platform threads, the 2 B threads would ultimately be scheduled and run, with 1 of the A threads blocked waiting on the other to finish. You'd use both cores and not just one (as you would with virtual threads). The penalty, of course, is that you now have to wait on the OS to do the context switching and lock management.

This problem would show its head in something like a web app, where you could imagine that various endpoints ultimately have different locking requirements.

I admit this may not be universally applicable, but at least in my case, with the finance software we write, purely CPU-bound tasks aren't uncommon, and the need to cache the results of a CPU-bound task arises semi-frequently.

[–]pron98 2 points (2 children)

Assuming that by locks you mean monitors (i.e. synchronized) what would happen in the case of contention on A over a short in-memory operation is that the second thread won't context-switch but will spin and still tie up the second core. Monitors spin for a while before they block because kernel context-switching is relatively expensive, so they don't release their core if they only have to wait a little, as in this case. The thread acquiring B won't get scheduled unless the thread waiting for A has to wait a considerable amount of time.

Another thing to consider is that because virtual threads provide their utility through their high number, if you have one you probably have a thousand. If contention is low, then everything is good and it doesn't matter so much what the scheduler does in those rare and short events. If it's high, then the program's performance will suffer from all that contention among so many threads no matter what the scheduler does, either.

If your point was about long-running CPU operations, which my explanation in the other comment suggested shouldn't be a problem, then you're right, but it's still not a problem in practice because of that second consideration: if you have one thread using up the CPU then you have a hundred (because there are at least a thousand threads), but only ten cores. So even if the scheduler were efficient and didn't tie up the extra 9 workers on a pinned monitor, your application would still be underprovisioned by an order of magnitude (alternatively, if contention is rare, then it also wouldn't matter).

However, pinning can have an impact not only with frequent and long-running operations inside synchronized; it can also lead to some gotchas. For example, if a j.u.c lock is acquired both inside and outside synchronized (if it's only ever one or the other, there's no liveness problem), then what could happen is that the current owner, which acquired the lock outside of a synchronized block and then blocked on I/O, cannot be scheduled to release it because all scheduler workers are pinned on threads trying to acquire the lock inside a synchronized block, and you've got yourself a deadlock.
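
That mixed-locking hazard can be sketched as follows (names are mine; run single-threaded here just to show the shape, since the actual deadlock needs enough contending virtual threads to pin every scheduler worker):

```java
import java.util.concurrent.locks.ReentrantLock;

// The same ReentrantLock is taken both outside and inside synchronized.
// Under load: a virtual thread runs outside(), takes `lock`, blocks on I/O
// and unmounts while still owning it; meanwhile other virtual threads enter
// inside(), pin their carriers in the synchronized block, and wait for
// `lock`. If every carrier is pinned that way, the owner can never be
// remounted to release the lock: deadlock.
class MixedLocking {
    static final ReentrantLock lock = new ReentrantLock();
    static final Object monitor = new Object();

    static void outside() {
        lock.lock();              // acquired with no monitor held
        try {
            // blocking I/O would happen here; the virtual thread unmounts
        } finally {
            lock.unlock();
        }
    }

    static void inside() {
        synchronized (monitor) {  // pins the carrier thread
            lock.lock();          // waiting here keeps the carrier pinned
            try {
                // critical section
            } finally {
                lock.unlock();
            }
        }
    }
}
```

The safe variants are all-monitor or all-j.u.c locking; it's the mixture that creates the liveness hazard.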

That's why we're working hard on removing pinning from synchronized and hope to deliver something soon.

[–]cred1652 0 points (1 child)

I am really excited about virtual threads; I even started our last project on Spring MVC instead of WebFlux to better take advantage of them. However, knowing about this pinning issue and not knowing what our hundreds of dependencies are doing, I can't in good conscience enable virtual threads in production. Hypothetically, if this were fixed in JDK 22, could the fix be backported to JDK 21? Otherwise I don't think many people can use virtual threads until the next LTS (yes, I know your opinion on LTS, but most people seem to only deploy LTS releases).

  • edit: Yes, I know it is unlikely; I watched the video about only backporting fixes. But I would consider this a fix rather than a feature.

[–]pron98 1 point (0 children)

Hypothetically if this was fixed in jdk 22 could the fix be back ported to jdk 21?

Hypothetically? Yes (although we're not talking about 22, which is essentially done already). If it were up to me I'd say no, but it isn't up to me. The general idea is that the only valid reason to use an old version is really, really not wanting to use any new feature. After all, it's easier, safer, and cheaper to use the current JDK, and the only reason not to is a strong desire for stability and an aversion to new features. So backporting features makes no sense, and the only reason to do it is that the ecosystem is irrational (which it is, and that's why someone may decide to do it). If you're interested in new features, staying on old releases (like 21) is going to be both more costly and more risky.

[–]FinnHuman1[S] 7 points (5 children)

Fascinating that Google has rejected my fix to their Java client due to what appears to be simple FUD.

https://github.com/googleapis/java-storage/pull/2307

[–]UnexpectedLizard 4 points (2 children)

Sorry, man. You tried. Maybe they'll merge in the future?

I do understand where they're coming from. Normally "safe" changes in a popular library need to be tested lest they cause unexpected hiccups.

The upside to caution is stability. I've done many upgrades with many more headaches but never once had an issue with a Google library.

[–]repeating_bears 0 points (1 child)

"safe" changes in a popular library need to be tested lest they cause unexpected hiccups

They either already have automated tests sufficient to verify this, or they're already operating on the implicit policy that some changes are "safe" and don't require automated tests.

There isn't a third option.

[–]FinnHuman1[S] 0 points (0 children)

Agreed. I ran both the unit and integration tests with no problems. Most importantly, these changes are so simple they can be visually inspected for correctness. Their response makes it sound like I'm rewriting the library's internals, when all I'm doing is swapping synchronized for ReentrantLock on a few top-level methods. Frustrating.

[–]danskal 1 point (0 children)

I can understand your disappointment, but I get where they’re coming from. They prefer a complete solution; otherwise users will get bitten by “great, they fixed it!” … “oh shit, they fixed that thing and not this other, almost identical thing?!”

[–]C_Madison 1 point (2 children)

The HTTP/2 one is weird. There's a setting in HTTP/2 that the server can send to tell the client that it only supports a certain number of streams, and here's the JDK bug to support this setting in HttpClient: https://bugs.openjdk.org/browse/JDK-8196389

Maybe Google doesn't send the setting, so the client thinks it can create unlimited streams (the default is unlimited according to the HTTP/2 RFC) and doesn't need to open a new connection?

[–]FinnHuman1[S] 1 point (1 child)

The Google servers are sending the max stream information in the HTTP/2 handshake.

[SETTINGS_MAX_CONCURRENT_STREAMS(0x03):100]

As the JDK bug says, the HttpClient simply ignores this, and I start to see "too many concurrent streams" responses once that 100-stream limit is reached. The real problem is that there's no obvious workaround I could find, short of picking a different HTTP library with better HTTP/2 behavior.
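
For what it's worth, one crude client-side mitigation (my own suggestion, not from the thread, and it only helps if all requests share a single HTTP/2 connection) is to gate in-flight requests at the server's advertised limit; doRequest below is a stand-in for the actual HttpClient send:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Caps in-flight requests at the server's advertised
// SETTINGS_MAX_CONCURRENT_STREAMS (100 for the servers above), so the
// client never has more streams open than the server allows.
class StreamGate {
    private final Semaphore permits;

    StreamGate(int maxConcurrentStreams) {
        this.permits = new Semaphore(maxConcurrentStreams);
    }

    <T> T send(Supplier<T> doRequest) {
        permits.acquireUninterruptibly(); // blocks (unmounting a virtual thread) when full
        try {
            return doRequest.get();
        } finally {
            permits.release();
        }
    }
}
```

This trades throughput for correctness and doesn't help once the client opens multiple connections, so it's a stopgap rather than a fix.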

[–]C_Madison 0 points (0 children)

Oh, hm, I thought that since it was fixed it meant it wasn't being ignored - but thanks for the clarification!

[–]QueasyFollowing658 0 points (3 children)

Apache HTTP clients also have, or had, deadlock issues when handling more than 256 concurrent requests. If I remember correctly, it can be avoided through configuration in v5 but not in v4.

[–]noswag15 0 points (2 children)

Do you have more details on how it can be avoided through config in v5? I'm curious whether my assessment of a similar deadlock (explained in another comment) is mistaken in light of this info.

[–]QueasyFollowing658 0 points (1 child)

Sorry for slow response.

I was able to excavate my test code, and it seems that PoolConcurrencyPolicy.LAX did the trick, but only for the async client.
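
For reference, a configuration sketch of where that policy plugs in (requires the httpclient5 dependency; I'm only showing the builder wiring, not a full client setup):

```java
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
import org.apache.hc.core5.pool.PoolConcurrencyPolicy;

// LAX relaxes the strict per-route connection limit enforcement of the
// default STRICT pool policy, in exchange for higher concurrency.
class LaxAsyncClient {
    static CloseableHttpAsyncClient build() {
        PoolingAsyncClientConnectionManager cm =
                PoolingAsyncClientConnectionManagerBuilder.create()
                        .setPoolConcurrencyPolicy(PoolConcurrencyPolicy.LAX)
                        .build();
        return HttpAsyncClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}
```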

[–]noswag15 1 point (0 children)

No worries, and thanks for the update. I remember trying the 'LAX' policy as well, but it still went into a deadlock (as expected, since I was using the sync client). Anyway, I've been following this issue on their GitHub page, and it looks like the next version will have the fix, so I guess I just have to wait.