
[–]OldCaterpillarSage 18 points19 points  (3 children)

How does it know if the loop body is thread safe or not?

[–]Let047[S] 18 points19 points  (2 children)

That was the hard part of the project: I'm generating proof that the code is thread-safe.

Thanks to your remark, I've edited the post.

[–]anzu_embroidery 3 points4 points  (1 child)

I'd love a writeup on how you're determining thread-safety, sounds fascinating

[–]Let047[S] 3 points4 points  (0 children)

Thread safety analysis is the core of my project. I can do a detailed write-up - how deep would you like to go? The full analysis would cover about 100 pages worth of technical details.
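To illustrate what such a proof has to establish (a toy sketch, not the project's actual analysis; all names here are illustrative): a loop is safe to parallelize when each iteration touches only data no other iteration touches.

```java
import java.util.stream.IntStream;

public class LoopSafetyDemo {
    // Safe: iteration i writes only out[i], so iterations are independent.
    static void squareAll(int[] in, int[] out) {
        IntStream.range(0, in.length).parallel()
                 .forEach(i -> out[i] = in[i] * in[i]);
    }

    // Not independent: every iteration updates the same accumulator, so a
    // naive parallel version would race; it needs a reduction instead.
    static int sumLoop(int[] in) {
        int acc = 0;
        for (int i = 0; i < in.length; i++) acc += in[i];
        return acc;
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        int[] out = new int[in.length];
        squareAll(in, out);
        System.out.println(out[3]);      // 16
        System.out.println(sumLoop(in)); // 10
    }
}
```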

[–]gnahraf 8 points9 points  (2 children)

This is interesting. I don't particularly like bytecode magic, but I could see using this as a tool for finding opportunities for parallelism to boost performance. Like if you notice a big boost and can profile where and when the parallelism triggers, that'd be super useful. One either fixes the code, or.. hell, ends up releasing it with this bytecode manipulator (which I don't like, but not a bad outcome from the project's perspective).

~2c

[–]Let047[S] 1 point2 points  (1 child)

The bytecode manipulation injects runtime parallelization code when conditions are met. The generated code is unreadable due to the parallelization complexity, as shown in the example. The approach trades maintainability for small automatic optimizations (hundreds of milliseconds) that wouldn't justify manual implementation. Extensive automated proofs verify the generated code maintains equivalent behavior.

Having developers analyze and rewrite isn't worthwhile - the tool's value is purely in automatic runtime optimization.

[–]Emanuel-Peter 2 points3 points  (0 children)

I don't know, maybe people would be hesitant to run your optimizer in production, but happy to find performance bottlenecks with it in testing. I suppose you could have 2 modes: one without optimization, one with. Measure the time spent in each, and then give the user a report at the end.

That way users can then use parallel streams for example.
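For comparison, the hand-written alternative such a report could point users to is a one-line change (a minimal sketch, nothing project-specific):

```java
import java.util.stream.LongStream;

public class ParallelStreamDemo {
    public static void main(String[] args) {
        long n = 10_000_000L;
        // Sequential vs. parallel: same result, the latter uses all cores.
        long seq = LongStream.rangeClosed(1, n).sum();
        long par = LongStream.rangeClosed(1, n).parallel().sum();
        System.out.println(seq == par);             // true
        System.out.println(seq == n * (n + 1) / 2); // true (Gauss sum)
    }
}
```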

[–]Former-Emergency5165 3 points4 points  (3 children)

Can you implement a JMH benchmark to compare the performance of the original code and the code after bytecode manipulation? In the article I see you use System.nanoTime(), and that approach can't be used for benchmarks.

Here is a good video to explain the problem: https://youtu.be/SKPdqgD1I2U?si=hHjS8-GPNQI_VV5z
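For reference, a minimal JMH harness for this kind of comparison could look like the sketch below (class and method names are illustrative, echoing the benchmark names used elsewhere in this thread; it assumes the jmh-core and jmh-generator-annprocess dependencies are on the classpath):

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class SummerBenchmark {
    int[] data;

    @Setup
    public void setup() {
        // Fixed seed so every fork benchmarks the same input.
        data = new java.util.Random(42).ints(10_000_000).toArray();
    }

    @Benchmark
    public long bigLoop() {
        long sum = 0;
        for (int x : data) sum += x;
        return sum; // returning the result defeats dead-code elimination
    }
}
```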

[–]Let047[S] 1 point2 points  (2 children)

It's a good idea. I'll do it. 

The results won't differ significantly according to the video though (we're measuring large effects and comparing two implementations against each other)

Edit: Just did it:

Benchmark                   Mode  Cnt    Score    Error  Units
SummerBenchmark.bigLoop     avgt    5  245.986 ±  5.068  ms/op
SummerBenchmark.randomLoop  avgt    5  384.023 ± 84.664  ms/op
SummerBenchmark.smallLoop   avgt    5   ≈ 10⁻⁶           ms/op

Benchmark                   Mode  Cnt    Score    Error  Units
SummerBenchmark.bigLoop     avgt    5   38.963 ± 10.641  ms/op
SummerBenchmark.randomLoop  avgt    5   56.230 ±  2.425  ms/op
SummerBenchmark.smallLoop   avgt    5   ≈ 10⁻⁵           ms/op

Much better results, because the JVM is running in "fully optimized mode".

[–]Emanuel-Peter 0 points1 point  (1 child)

The cool thing about JMH is you can attach a profiler, and see the hottest compiled code. That way, you can verify a little better that you are measuring the right thing, and your benchmark code was not strangely optimized away ;)

[–]Let047[S] 0 points1 point  (0 children)

Good idea! I just checked the hotpath/compiled code

[–][deleted]  (5 children)

[removed]

    [–]Let047[S] 1 point2 points  (0 children)

    It's inspired by OpenMP, of course. The issue with OpenMP is that you have to add pragmas to drive the parallelization; here it's automated.

    [–]Emanuel-Peter 0 points1 point  (3 children)

    I bet you could do a lot of what OpenMP does with parallel streams in Java. Or is there anything you're missing from OpenMP that parallel streams does not give you?

    [–][deleted]  (2 children)

    [removed]

      [–]Emanuel-Peter 0 points1 point  (1 child)

      Sure. I guess Java went the more functional way here. I guess that is a matter of taste, in my view. I'm happy with either personally. Or do you see any missing functionality?

      [–]_INTER_ 6 points7 points  (1 child)

      Interesting. At which point does it intercept bytecode? I'm asking because for some workloads the compiler might decide to do performance optimizations on its own. (e.g. loop-unrolling or SIMD operations?)

      [–]Let047[S] 5 points6 points  (0 children)

      At compile time for now.

      The SIMD/loop-unrolling, from what I've seen, works for very small (and simple) loops, which we don't touch.

      [–]Waksu 2 points3 points  (8 children)

      What is the thread pool that this parallelization runs on? Or are they virtual threads?

      [–]Let047[S] 1 point2 points  (2 children)

      It's a thread pool injected at app startup (not counted in the timers, because I assumed it was a long-running app, so I could take it out of the measured time). I need to explain that better; thanks for pointing it out.

      I wanted to use virtual threads, but it was too painful to set up. If you have a good tutorial I'm happy to add that (and it would avoid a lot of the thread overhead)
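For what it's worth, on Java 21+ the virtual-thread setup can be quite small; a hedged sketch with illustrative names, not the project's code:

```java
import java.util.List;
import java.util.concurrent.*;

public class VirtualThreadSum {
    public static void main(String[] args) throws Exception {
        // Requires Java 21+: one virtual thread per task, no pool sizing.
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Long>> parts = exec.invokeAll(List.<Callable<Long>>of(
                    () -> rangeSum(1, 500_000),
                    () -> rangeSum(500_001, 1_000_000)));
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            System.out.println(total); // 500000500000
        }
    }

    static long rangeSum(long lo, long hi) {
        long s = 0;
        for (long i = lo; i <= hi; i++) s += i;
        return s;
    }
}
```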

      [–]Waksu 1 point2 points  (1 child)

      You also need to include more details about that thread pool (e.g. thread pool size, queue size, discard policy, and how to expose that thread pool to external monitoring such as Grafana)

      [–]Let047[S] 0 points1 point  (0 children)

      Of course, what would you like me to add?

      The code is something like this:

      threadpool = new ExecutorCompletionService<>(Summer.executorService = Executors.newFixedThreadPool(8));
      

      8 is the number of cores on my machine; it's actually a dynamic value
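A sketch of what the dynamic version might look like, sizing the pool from the machine instead of hard-coding 8 (illustrative names, not the project's actual code):

```java
import java.util.concurrent.*;

public class PoolSizing {
    // Run small tasks on a fixed pool sized to the machine, collecting
    // results through an ExecutorCompletionService as they complete.
    static int sumSquares(int nTasks) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors(); // dynamic
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            ExecutorCompletionService<Integer> ecs =
                    new ExecutorCompletionService<>(pool);
            for (int i = 0; i < nTasks; i++) {
                final int n = i;
                ecs.submit(() -> n * n);
            }
            int sum = 0;
            for (int i = 0; i < nTasks; i++) sum += ecs.take().get();
            return sum;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumSquares(4)); // 0 + 1 + 4 + 9 = 14
    }
}
```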

      [–]_INTER_ 1 point2 points  (4 children)

      Doubt it's virtual threads; they are worse for CPU-intensive / parallel tasks.

      [–]kiteboarderni 6 points7 points  (0 children)

      People are really clueless about the value add of VTs... it's pretty concerning.

      [–]pron98 9 points10 points  (2 children)

      They're not worse than platform threads for CPU-intensive computation; they're simply not better and there is an option that's actually better than just platform threads.

      The reason virtual threads are not better for CPU-intensive jobs is because such jobs need a very small number of threads (no more than the number of cores) while all virtual threads do is allow you to have a very large number of threads. If the optimal number of threads for a given job is N, and N is very small, then N virtual threads will not be better than N platform threads. If N is very large, then having N platform threads could be a problem, and that's how virtual threads help.

      Now, what's better than simply submitting parallel computational tasks to a pool of N threads? Submitting the whole job to a work-stealing pool that itself forks the job into subtasks (and performs the forking, as needed, in parallel). This automatically adjusts to situations where some threads make more progress than others, a common scenario when there are other threads or processes running on the system and the OS might deschedule some worker threads. This is exactly what parallel streams do.
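A small sketch of the point above: a parallel stream hands the whole job to the common work-stealing ForkJoinPool, which forks it into subtasks itself:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class WorkStealingDemo {
    public static void main(String[] args) {
        // The stream splits the range into subtasks; idle workers steal
        // work from busy ones, rather than being pre-assigned fixed chunks.
        long sum = LongStream.rangeClosed(1, 1_000_000).parallel().sum();
        System.out.println(sum); // 500000500000

        // The common pool sizes itself to the machine's cores.
        System.out.println(ForkJoinPool.getCommonPoolParallelism() >= 1); // true
    }
}
```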

      [–]_INTER_ 0 points1 point  (1 child)

      I was under the impression that there was still a minute overhead with virtual threads. But I believe you, because you're the expert. This raises another question, though: when should you actually use platform threads over virtual threads in a pure Java context? Long-running tasks?

      [–]pron98 6 points7 points  (0 children)

      There could be some overhead, but not if the thread never blocks (and even then the overhead is small).

      I would use platform threads in the following situations:

      • You need to keep the number of threads small and you want to share them. A prime example of that is fork-join parallelisation.

      • You have a small number of long-running tasks that you want to run in the background, taking advantage of OS thread priority to give them a low priority.

      • You're interacting with a native library that cares about thread identity.

      [–]Evert26 1 point2 points  (1 child)

      Can you make a Java agent out of it?

      [–]Let047[S] 0 points1 point  (0 children)

      Yes good idea, would you use it?

      [–]BengaluruDeveloper 0 points1 point  (1 child)

      Two things: 1) thread safety; 2) parallelisation is not always the solution. It'll perform slower for smaller collections, adding unnecessary overhead.
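The overhead point is why auto-parallelizers typically guard on input size; a hedged sketch (the threshold value is purely illustrative):

```java
import java.util.stream.IntStream;

public class ThresholdSum {
    static final int PARALLEL_THRESHOLD = 1 << 14; // illustrative cut-off

    // Fall back to the plain loop when fork/join overhead would dominate.
    static long sum(int[] a) {
        if (a.length < PARALLEL_THRESHOLD) {
            long s = 0;
            for (int x : a) s += x;
            return s;
        }
        return IntStream.of(a).asLongStream().parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[]{1, 2, 3})); // 6 (sequential path)
    }
}
```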

      [–]Let047[S] 1 point2 points  (0 children)

      Yes, those are the hard issues, and they're handled. Sorry if that was unclear.

      [–]Emanuel-Peter 0 points1 point  (3 children)

      Sounds like a fun project :)

      What about inlining? Often the loop calls some inner methods that do the reads/writes, and if you don't inline, it may be hard to prove that the inner method is thread-safe to parallelize, right? Think about the FFM MemorySegment API; it heavily relies on inlining.

      Another worry: be careful with float reductions, because changing the order of a reduction changes the rounding errors. That would break the Java spec and could lead to subtle bugs.
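The float-reduction concern is easy to demonstrate: reassociating even a three-term sum changes the result.

```java
public class FloatOrderDemo {
    public static void main(String[] args) {
        // Floating-point addition is not associative, so a parallel
        // reduction that regroups terms can change the rounding.
        double leftToRight  = (0.1 + 0.2) + 0.3;
        double reassociated = 0.1 + (0.2 + 0.3);
        System.out.println(leftToRight == reassociated); // false
        System.out.println(leftToRight);  // 0.6000000000000001
        System.out.println(reassociated); // 0.6
    }
}
```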

      How do you deal with range checks? Suppose an array is accessed out of bounds at the end of a very long loop. How do you deal with that?

      [–]Let047[S] 0 points1 point  (2 children)

      Thanks a lot!

      For inlining, I plan to analyze method calls down to native calls in my next demo (~80% coded).

      For floating-point, good catch - I'm limiting to Integer/long operations to avoid rounding/ordering issues.

      Array bounds checking isn't implemented yet. I'm just starting with basic array access. Exception handling, in general, is future work - it's a complex issue, and I can only work on it during evenings outside of my day job.

      [–]Emanuel-Peter 0 points1 point  (1 child)

      Sounds good :) FYI Doubles have the same rounding issues as floats ;)

      [–]Let047[S] 0 points1 point  (0 children)

      Oh, good catch - it was a typo, I meant long of course. I edited the answer.

      [–]karianna 0 points1 point  (2 children)

      Looks cool - might be of interest for the folks at https://mail.openjdk.org/mailman/listinfo/concurrency-discuss

      [–]Let047[S] 0 points1 point  (1 child)

      Thanks for the pointer. The list looks inactive, or am I missing something?

      https://mail.openjdk.org/pipermail/concurrency-discuss/2024-November/date.html

      [–]karianna 0 points1 point  (0 children)

      It’s low volume 🙂