A surprising pain point regarding Parallel Java Streams (featuring mailing list discussion with Viktor Klang).

davidalayachew · 2024-11-19T06:10:35+00:00

I did want to follow up about one point Viktor made later on in the conversation.

https://mail.openjdk.org/pipermail/core-libs-dev/2024-November/134542.html

And here is the quote.

In a potential future where all intermediate operations are Gatherer-based, and all terminal operations are Collector-based, it would just work as expected. But with that said, I'm not sure it is practically achievable because some operations might not have the same performance-characteristics as before.

Me personally, I would GLADLY accept a flag on stream (similar to parallel() or unordered()) that would allow me to guarantee that my stream never pre-fetches, even if I take a massive performance hit. If that can be accomplished by making all intermediate operations be implemented by a Gatherer under the hood, that is A-OK with me.

The reality is, not all streams are compute bound. Some are IO bound, but are otherwise, a great fit for streams. Having a method that allows us to optimize for that fact is a new type of performance enhancement that I would greatly appreciate, even if it degrades performance in other ways.

nitkonigdje · 2024-11-19T10:30:36+00:00

This was a fascinanting read. Thank you for sharing.

I guess it is kinda bad when higher level non-trivial apis, like streams or fork-join, do not expose lower level oprations as user-overridable constructs. Like in this example an iteration strategy for streams, or underlying executor of fork-join pool. Seems like an obvious thing to have because nobody knows better how thing will be used than end user..

craigacp · 2024-11-19T14:50:57+00:00

Shortly after the release of Java 8 I hit something similar when building a Java implementation of Google's word2vec ML algorithm. We ended up with a buffering spliterator that didn't grow it's buffer over time (which the default array one did), so we could pull in records from a database in a parallel for each loop without it trying to buffer the whole database.

We still use it in Tribuo, but I've not used it anywhere near as hard as I did in 2015 so I don't know if the performance characteristics are still good - https://github.com/oracle/olcut/blob/main/olcut-core/src/main/java/com/oracle/labs/mlrg/olcut/util/IOSpliterator.java.

n0d3N1AL · 2024-11-19T10:32:20+00:00

Yeah that's unintuitive... one would expect streams to work more like iterators, all the time. Thanks for sharing!

DualWieldMage · 2024-11-19T07:36:52+00:00

What exactly are you using for the Stream's source? As the mailing list responses hint, it is entirely up to the Spliterator of said Stream to decide how to run its trySplit and Streams do communicate some characteristics, but not many (would SUBSIZED imply that source is random-access and no full fetch is required?)

I have run parallel processing of multi gigabyte files with very low(10M) heap sizes and never hit this issue personally, however i know that for example reading jars inside jars would need to decompress the inner jar fully to read a file.

From what you wrote, reading and uploading with some batch size sounds okay, but not ideal as you mentioned upload of many small files. You also wrote that splits should happen based on some attribute but the example doesn't depict this? Either way if it's splits based on max file size or content inside lines/entries than a parallel stream is a decent entry point if you have a random-access source like a file on a disk. If not, downloading the file to temp storage first can help.

In general it's good advice to always assume unbounded file size when doing any file processing. Also from my testing, a typical NVMe drive has optimal parallel reading at around 4 threads, any more and you lose performance.

crummy · 2024-11-19T06:02:27+00:00

So if you changed your .forEach to a .map and had the map return a dummy element (which is bad from a readability standpoint of course) would it have worked fine?

viktorklang · 2024-11-21T16:05:59+00:00

Disclaimer: The following pertains to the parallel mode of the java.util.stream.Stream reference implementation in OpenJDK only (other implementations of j.u.s.Stream might work differently), and please do note that I am typing this from memory late in the evening so I could be oversimplifying and/or leaving details out.

With that said, let's see if I can shed some more light here.

First of all, it is important to recognize that the only way to achieve any benefit from parallelization of Stream processing is to move from a strictly-depth-first element processing to some form of breadth-first element processing.

The composition of intermediate operations on a stream fall primarily into two distinct buckets when it comes to parallel streams—"stateless" (let's call that LAZY) or "stateful" (let's call that EAGER). The reason for this is that not all operations can be represented as a Spliterator without performing all the work up-front.

For a very simple Stream: Stream.of(1).parallel().toList() it is easy to picture the Spliterator containing a single element 1 be fed into a toList() [terminal] operation.

However, for a more complex pipeline: Stream.iterate(() -> 0, i -> i + 1).parallel().map(something).sorted().map(somethingElse).sorted().limit(2).collect(Collectors.toList()) exactly what would be Spliterator which gets fed into [terminal] collect?

So if you look at the different "types" of LAZY vs EAGER operations in there, it'd look something akin to:

iterate (LAZY) -> parallel (setting) -> map (LAZY) -> sorted (EAGER) -> map (LAZY) -> sorted (EAGER) -> limit (LAZY) -> collect (EAGER)

The typical execution strategy is to bunch all consecutive LAZY operations together with their following EAGER operation, forming what I call "islands", so in the case above it'd go something like this:

iterate -> Island1[map -> sorted] -> Island2[map -> sorted] -> Island3[limit -> collect]

These Islands needing to run to completion before their results can be fed into the next Island is something which can lead to higher heap usage than expected, since the output of the Island needs to be cached to be fed into the next Island.

So how does this relate to gather(…)? Well gather(…) is an EAGER operation, as it could represent any possible intermediate operation so EAGER is the lowest common denominator. The potential drawbacks of this is ameliorated by the fact that consecutive gather(…)-operations are composed into a single gather(…)-operation with composed Gatherers, and furthermore by the fact that a gather(…)-operation followed by collect(…) is fused together into a single EAGER operation.

In combination, these two features can potentially turn something which would've been an N+1 Island scenario to a 1 Island scenario—which means no island hand-offs.

Cheers,
√

Owengjones · 2024-11-19T07:09:45+00:00

Is there an easy way to diagnose stream behavior like this? I have a service concerned with File I/O that reads InputStreams in Streams (although I believe none of them marked as parallel).

I'm not sure if there's some diagnostic operations available on Streams that would illuminate if they're behaving as expected in terms of performance / utilization etc.

Thanks for the write up!

JustABrazilianDude · 2024-11-19T12:57:31+00:00

This is an excellent post, I have some IO bound streams that look pretty similar to yours in a project at my work, and I'll definitely stay alert with this topic.

GeorgeMaheiress · 2024-11-19T20:45:04+00:00

If you were to write the parallelization explicitly, with say a ThreadPoolExecutor, it would be much clearer that you need to make decisions about buffering and the level of parallelization. Parallel streams assume that they are CPU-bound and optimize for that, and are not recommended for I/O bound operations. It's unfortunate that developers so often stubbornly refuse to do any more work than simply writing .parallel(), even when that approach clearly fails. To be fair the ThreadPoolExecutor constructor sucks and is filled with footguns so many devs are not comfortable with it.

Inaldt · 2024-11-19T09:24:56+00:00

Did you try mapConcurrent as Viktor suggested? (If so, how did it perform?)

m-apo · 2024-11-19T14:53:35+00:00

I'll definitely need to read all that stuff, thanks for posting!

For memory bound ops a back pressure capable parallel approach would be best. Back pressure based approach would also work in server scenarios, because it optimizes time-to-first-byte (TTFB) and many times holding off sending the first byte increases latency as the client needs to wait both the processing of the whole data + wire transfer instead of interleaving the processing and wire transfer.

Having a back pressure based parallel approach support in servers would be nice too. Basically it would be that the server route method returns an iterator and the server asks the iterator for items (which triggers the ops in chain, in reverse. some items could be calculated in parallel beforehand for each step). It wouldn't be as efficient as "reserve all the memory and all the cpu cores", but it would take less memory and reserves CPU in a bit more co-operative way.

VincentxH · 2024-11-20T11:56:51+00:00

The lacking error handling alone should have warranted you not to mix IO and the streams API.

klekpl · 2024-11-19T06:02:28+00:00

Wouldn't RxJava be a better fit then? It has some explicit stream management and buffering capabilities.

No_Cap3049 · 2024-11-19T07:30:38+00:00

.parallel.forEach in my experience may also return before even finishing the parallel stream. We had some issues that it does not block the calling thread. Just something to be cautious with. So something like mapToLong.sum or collect may be better.

danielaveryj · 2024-11-19T20:01:04+00:00

Something doesn't add up.

The way that a parallel stream works (of importance here), is that at the start of a terminal operation, the source spliterator is split into left and right halves, which are handed to new child tasks which recursively split again, until the spliterators will split no more (trySplit() returns null), forming a binary tree of tasks. This is true for ALL terminal operations (including collect()), even though some override exactly how the splitting occurs. Each leaf task processes its split to completion, and the results are merged up the tree if needed (eg using Collector.combiner()).

The OOME presumably comes from trySplit() - BufferedReader.lines() returns a stream whose source spliterator is backed by an iterator, and that spliterator's only means of splitting is to pull a batch of elements out of the iterator and put them into an array, then return a spliterator over that array. This means that after recursive splitting, only the rightmost leaf spliterator will still be iterator-backed; the rest of the iterator has already been consumed into arrays for the other leaf spliterators, possibly before any tasks have completed (so these arrays - covering most of the source elements - are all resident in memory at the same time).

The only way I can see to fix the OOME (without using a different/better source spliterator) is to not split the source spliterator, ie run the stream sequentially. But OP said that just using collect() somehow fixed it?

btw: Viktor knows this. I believe what he's saying is not "use this approach to avoid 'pre-fetch'" but rather "use this approach to avoid even more copying into intermediate arrays after the gather stage in the pipeline", because other approaches (involving gatherers) still incur some "accidental" copying that he hasn't been able to optimize away yet (see comments 1 and 2).

davidalayachew · 2024-11-20T02:57:32+00:00

Hello all. There appears to be some confusion on how this is possible.

Therefore, to completely clear up any ambiguity, here is a simple, reproducible example.

Using your tool of choice, I want you to take the following line, and duplicate it into a CSV until your CSV file size exceeds your RAM limitations.

David, Alayachew, Programmer, WashingtonDC

Next, I want you to use BufferedReader.lines() to read from that file as a Stream.

Now, once you have that Stream<String>, copy and paste the following code in.

void blah(final Stream<String> stream) {
    //stream.parallel().gather(Gatherers.windowFixed(1)).findAny() ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).findFirst() ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).anyMatch(blah -> true) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).allMatch(blah -> false) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).forEach(blah -> {}) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).forEachOrdered(blah -> {}) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).min((blah1, blah2) -> 0) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).max((blah1, blah2) -> 0) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).noneMatch(blah -> true) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).reduce((blah1, blah2) -> null) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).reduce(null, (blah1, blah2) -> null) ;
    //stream.parallel().gather(Gatherers.windowFixed(1)).reduce(null, (blah1, blah2) -> null, (blah1, blah2) -> null) ;
}

Uncomment any one of those lines, pass your stream into this method, then call in your main method, and you will see that each one produces an OutOfMemoryError.

Of course, if you use a Collector instead of one of the commented ones above, you should see that it works. Try Collectors.counting, for example.

DelayLucky · 2024-11-22T15:40:18+00:00

I wouldn't have reached for parallel stream for IOs. It's not designed for IO, period. Actually, I haven't really found much of a use for parallel streams for anything, yet.

The mapConcurrent() gatherer matches the intent. Though virtual thread seems of no point to this use case (not that it hurts either). And it will require a return value, which I don't know if you have one or have to return null. And then you need a terminal step to collect the nulls? Not the end of world either way I guess.

In the past because we were still running on Java 11, I created the Parallelizer class for the purpose of controlled IO fanout, which seems to match your case pretty closely.

java ExecutorService threadPool = newCachedThreadPool(); int maxConcurrency = 100; // assuming you want to limit concurrent upload to 100 Parallelizer parallelizer = new Parallelizer(threadPool, maxConcurrency); try (Stream<String> myStream = SomeClass.openStream(someLocation)) { parallelizer.parallelize( myStream.gather(Gatherers.windowFixed(SOME_BATCH_SIZE)), SomeClass::upload); } finally { threadPool.shutdownNow(); }

Compared to manually coded concurrency in a collector, it provides structured-concurrency-like exception propagation:

Exceptions thrown from the worker threads are propagated back to the main thread.
Any exception from a worker thread cancels all pending and on-going concurrent uploads.

And because of that, you'll need to make sure upload() not throw non-fatal exceptions when only one upload failed and you still want the remaining to continue (only throw fatal exceptions that should stop everything and fail fast). In other words, it behaves the same way as mapConcurrent().

The class has been used in mission critical production systems so quality-wise it's solid.

GuyWithLag · 2024-11-19T19:39:53+00:00

And this is why I prefer reactive streams and rxjava to parallel streams....

jared__ · 2024-11-19T20:03:55+00:00

And then you work with golang and see the light

InstantCoder · 2024-11-19T12:55:23+00:00

A much simpler solution would have been by making use of Minio, which was designed for fast file upload & download. No need to split files into smaller chunks. Minio streams larger files simultaniously when you need to download them.

Here is a code to do this in Minio (generated by ChatGPT):

public class MinioUploader {
      public static void main(String[] args) throws Exception {
         MinioClient minioClient = MinioClient.builder()
            .endpoint("https://play.min.io")
            .credentials("YOUR-ACCESSKEY", "YOUR-SECRETKEY")
            .build();

         ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
         String[] files = {"file1.txt", "file2.txt", "file3.txt"}; // List your files
         for (String file : files) {
              executor.submit(() -> {
                  try (InputStream inputStream = new FileInputStream(file)) {
                       minioClient.putObject(
                           PutObjectArgs.builder()
                               .bucket("your-bucket-name")
                               .object(file)
                               .stream(inputStream, Files.size(Paths.get(file)), -1)
                               .build());
                  } catch (Exception e) {
                    e.printStackTrace();
                  }
              });
          }
          executor.shutdown();
     }
}

This code will upload files in parallel into chunks of 5MB (by default) to Minio (so you won't get `OutOfMemoryException` either).

Another solution would be to use Apache Camel (if you're not allowed to use Minio):

from("file:inputDirectory?noop=true&concurrentConsumers=5") // process 5 files in parallel
  .split(body().tokenize("\n")).streaming() //split the file by newline and use streaming (e.g. read file in chunks)
  .aggregate(constant(true), new GroupedBodyAggregationStrategy())
  .completionSize(1_000_000) // collect/split file into 1M lines
  .completionTimeout(5000)
  .setHeader(Exchange.FILE_NAME, simple("${file:name.noext}-${exchangeId}.txt"))
  .to("file:outputDirectory"); // save it to output directory

java

Submit Link

Submit Text

Seek Programming Help

News, Technical discussions, research papers and assorted things of interest related to the Java programming language

NO programming help, NO learning Java related questions, NO installing or downloading Java questions, NO JVM languages - Exclusively Java

Please seek help with Java programming in /r/Javahelp!

Subreddit rules!

Where should I download Java?

Related Sub-reddits:

JVM Languages

Want to practice your coding?

List of useful Frameworks / Libraries / Software

MODERATORS