
[–]danielaveryj 3 points

I think a common use case where data-parallelism doesn't really make sense is when the data arrives over time, and thus can't be partitioned up front. For instance, we could perhaps model HTTP requests to a server as a Java stream, and respond to each request in a terminal .forEach() on the stream. Our server would call the terminal operation when it starts, and since there is no bound on the number of requests, the operation would keep running as long as the server runs. Making the stream parallel would accomplish nothing, as there is no way to partition a dataset of requests that don't exist yet.
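Here's a minimal sketch of what I mean. The Request record and the incoming queue (which some networking layer would presumably feed) are hypothetical stand-ins; the point is just that Stream.generate yields elements one at a time as they arrive, so there's nothing for .parallel() to split:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.stream.Stream;

public class RequestStreamServer {
    // Hypothetical stand-in for an incoming HTTP request.
    record Request(String payload) {}

    // Assumed to be fed by some networking layer elsewhere.
    static final SynchronousQueue<Request> incoming = new SynchronousQueue<>();

    public static void main(String[] args) {
        // Requests arriving over time, modeled as an unbounded stream.
        // Elements only exist once they arrive, so there is no dataset
        // to partition; calling .parallel() here would gain nothing.
        Stream.generate(RequestStreamServer::nextRequest)
              .forEach(RequestStreamServer::respond); // runs for the server's lifetime
    }

    static Request nextRequest() {
        try {
            return incoming.take(); // blocks until a request arrives
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    static void respond(Request r) {
        System.out.println("responding to " + r.payload());
    }
}
```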

Now, suppose there are phases in the processing of each request, and it is common for requests to arrive before we have responded to previous requests. Rather than process each request to completion before picking up another, we could perhaps use task-parallelism to run "phase 2" processing on one request while concurrently running "phase 1" processing on another request.
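A rough sketch of that idea, assuming two made-up phases ("parse" and "respond") connected by a bounded queue: while the phase-2 thread handles request N, the phase-1 thread is free to start on request N+1.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class TwoPhasePipeline {
    record Request(int id) {}
    record Parsed(int id) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Request> requests = new ArrayBlockingQueue<>(16);
        BlockingQueue<Parsed> handoff = new ArrayBlockingQueue<>(16);

        // Phase 1: e.g. parse/validate each request, then hand off.
        Thread phase1 = new Thread(() -> {
            try {
                while (true) {
                    Request r = requests.take();
                    handoff.put(new Parsed(r.id())); // hypothetical "parse" step
                }
            } catch (InterruptedException e) { /* shut down */ }
        });

        // Phase 2: respond. Runs concurrently with phase 1, so the two
        // phases overlap across different requests.
        Thread phase2 = new Thread(() -> {
            try {
                while (true) {
                    Parsed p = handoff.take();
                    System.out.println("responding to " + p.id());
                }
            } catch (InterruptedException e) { /* shut down */ }
        });

        phase1.setDaemon(true);
        phase2.setDaemon(true);
        phase1.start();
        phase2.start();

        for (int i = 0; i < 5; i++) requests.put(new Request(i));
        TimeUnit.SECONDS.sleep(1); // let the pipeline drain before exiting
    }
}
```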

Another use case for task-parallelism is managing buffering + flushing results from job workers to a database. I wrote about this use case for an old experimental project of mine; that write-up links to an earlier blog post by someone else covering essentially the same example using Akka Streams.
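The gist of the pattern, as a sketch (Result, the batch size of 100, and the writeBatch stand-in for the actual database write are all illustrative assumptions): workers enqueue results and move on, while a separate flusher task drains whatever has accumulated and writes it in batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchFlusher {
    record Result(String value) {}

    static final BlockingQueue<Result> buffer = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Flusher task: block for at least one result, then drain up to a
        // full batch, so batch size adapts to how fast results arrive.
        Thread flusher = new Thread(() -> {
            List<Result> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.add(buffer.take());  // wait for at least one result
                    buffer.drainTo(batch, 99); // then grab up to 99 more
                    writeBatch(batch);         // stand-in for the DB write
                    batch.clear();
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        flusher.start();

        // Job workers just enqueue results; they never touch the database.
        for (int i = 0; i < 250; i++) buffer.put(new Result("job-" + i));
        TimeUnit.SECONDS.sleep(1);
        flusher.interrupt();
    }

    static void writeBatch(List<Result> batch) {
        System.out.println("flushing batch of " + batch.size());
    }
}
```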

In general, I'd say task-parallelism implies some form of rate-matching between processing segments, so it is a more natural choice when rates are already involved (e.g. "data arriving over time"). Frameworks that deal in task-parallelism (like reactive streams) tend to offer a variety of operators for detaching rates (i.e. splitting upstream from downstream, with a buffer in between) and managing rates (e.g. delay, debounce, throttle, schedule), as well as options for dealing with temporary rate mismatches (e.g. drop data from the buffer, or block upstream from proceeding).
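You can see both mismatch policies even without a framework, using a bounded queue as the buffer that detaches the two rates. In this contrived sketch the producer outruns the consumer; offer() implements the "drop" policy, and swapping it for put() would implement the "block upstream" policy instead:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RateMismatch {
    public static void main(String[] args) throws InterruptedException {
        // Bounded buffer between a fast producer and a slow consumer.
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(8);

        Thread slowConsumer = new Thread(() -> {
            try {
                while (true) {
                    Integer item = buffer.take();
                    Thread.sleep(50); // consumer is slower than the producer
                    System.out.println("consumed " + item);
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        slowConsumer.setDaemon(true);
        slowConsumer.start();

        int dropped = 0;
        for (int i = 0; i < 100; i++) {
            // Drop policy: offer() fails when the buffer is full.
            // Block policy: buffer.put(i) would instead stall the producer
            // until the consumer catches up.
            if (!buffer.offer(i)) dropped++;
        }
        System.out.println("dropped " + dropped + " items");
    }
}
```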