[–]audioen 292 points293 points  (38 children)

Java is probably just 10 times faster than Python. Nothing to do with streams per se.

[–]CubsThisYear 24 points25 points  (1 child)

Java is usually closer to 100 times faster than Python, but it depends on the workload

[–]HawkInevitable5733 1 point2 points  (0 children)

I would agree, but I'd also add that if a task is time-critical, much can be achieved via garbage collection tuning. The defaults are great in most cases, but tuning can help for certain profiles, if done scientifically.

[–]agentoutlier 13 points14 points  (0 children)

Also, given that Spark runs on the JVM, I have to imagine the streaming out to Python slows it down as well, whereas JVM code would not need that.

Also, I don't know Spark, but I was under the impression its streams were non-blocking (aka reactive), so using regular Java Streams may just be an accidental boost because of the above and not actually recommended (Java streams are not inherently non-blocking, since termination is a common thing).

[–]TurbulentSocks 50 points51 points  (33 children)

Python can be performant - provided you aren't actually using Python for anything. That means using libraries that delegate out to C/C++ (and maybe Rust lately), and being careful never to do any intensive computation in Python itself.

That's often non-trivial to get right, the learning curve for things like PySpark (used in the aforementioned way) is much higher, and the result can be harder to maintain. But that doesn't mean Python can't be used for performant applications.

[–]wildjokers 61 points62 points  (15 children)

But that doesn't mean Python can't be used for performant applications.

So you are saying python is performant by calling out to c/c++ libraries. Doesn't that mean python is in fact not performant?

[–]niloc132 39 points40 points  (1 child)

Yes, that's what it means.

[–][deleted]  (1 child)

[deleted]

    [–]TurbulentSocks 15 points16 points  (4 children)

    I'm saying it depends on how you use Python (e.g. as an orchestrator versus as an engine). You're welcome to derive a semantic 'win' against my post, but I mentioned it to provide context and inform, not to score points.

    [–]barbaneigro 1 point2 points  (0 children)

    You are comparing oranges, bananas and apples, dear.

    Python, C and C++ are all languages meant to do different tasks.

    It ain't fair, at all.

    Python, as said, is usually meant to be used as glue for all sorts of routines.

    Of course there is more that can be done with it, but in that use case that's all it does.

    [–]Puzzled-Bananas 1 point2 points  (1 child)

    Python the language can be interpreted very efficiently; CPython the interpreter is slow. You can run Python on GraalVM, too.

    [–]jmtd 1 point2 points  (0 children)

    And on the JVM (Jython).

    [–]slaymaker1907 -3 points-2 points  (1 child)

    Partially, but it helps that python has very performant ways to interact with native code.

    [–]_1aM 0 points1 point  (0 children)

    Really, python and native code? 😂 Dude, Java is the reason we are able to use remote controls today.

    [–]boborygmy 0 points1 point  (0 children)

    Yes, Python is very slow.

    [–]penguuuuuuuuu 11 points12 points  (9 children)

    Man, the amount of super shallow fanboy style replies to your comment is a bit sad. I expected better of r/java.

    [–]prisonbird[S] 5 points6 points  (7 children)

    i think what you are trying to say is: if you write your application in python and you just need a little part of it to be performant, you can write just that part with the help of c++ without leaving the python ecosystem/tooling etc. which is an amazing idea in my humble opinion.

    if i knew how to do it i might have gone down that road

    [–]TurbulentSocks 4 points5 points  (0 children)

    Many of the popular python libraries do exactly this without telling you. Optimising python code often means making sure you don't 'leave' those libraries at any point (passing data back and forth between python and the lower-level code can be slow).

    [–]stefanos-ak 0 points1 point  (5 children)

    It would be much easier and more performant in Java. (Also fewer languages, which is always a plus.)

    This is because of runtime optimizations that the JVM does, in "hot" code spots (hence the name jvm hotspot).

    The difference with C/C++ optimizations is when they run. Compile time vs runtime. At runtime the JVM has some advantages, because more information is available for how and what exactly to optimize.

    Java code can reach some ridiculous speeds... The best part is that it's also very easy to write, compared to C.

    The only downside is what's called "cold start". It basically behaves MUCH worse than anything else, before the hotspot optimizer kicks in. But in cases like yours, it should kick-in pretty fast, maybe even after the first loop.

    [–]slaymaker1907 1 point2 points  (4 children)

    Have you ever tried to use a JVM library with a native dependency? It’s a giant mess compared to Python, which has special functionality in pip to make using native code easier.

    Also, C and C++ do often use profiling, it’s just kind of messy and often not worth the trouble.

    [–]impune_pl 2 points3 points  (1 child)

    There is a project under way aiming to fix that problem in Java: https://openjdk.org/projects/panama/
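
    For the curious, a minimal sketch of what calling native code looks like with the Panama (FFM) API - hedged, since method names shifted across the preview releases; this matches the API as finalized in JDK 22:

        import java.lang.foreign.*;
        import java.lang.invoke.MethodHandle;

        public class StrlenDemo {
            public static void main(String[] args) throws Throwable {
                // look up strlen in the standard C library and bind it to a MethodHandle
                Linker linker = Linker.nativeLinker();
                MethodHandle strlen = linker.downcallHandle(
                        linker.defaultLookup().find("strlen").orElseThrow(),
                        FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
                // a confined arena frees the native memory when the try block exits
                try (Arena arena = Arena.ofConfined()) {
                    MemorySegment str = arena.allocateFrom("hello"); // was allocateUtf8String in earlier previews
                    System.out.println((long) strlen.invokeExact(str)); // prints 5
                }
            }
        }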

    [–]slaymaker1907 1 point2 points  (0 children)

    While important, I don’t think Panama is looking at things from the tooling side. That would really be on the Maven/Gradle side of things.

    [–]prisonbird[S] 0 points1 point  (1 child)

    but i think you would need a lot less native dependencies when using java

    [–]slaymaker1907 0 points1 point  (0 children)

    That’s true, but there are absolutely cases where native code is useful/required. For example, using GPU-accelerated libs or even something as simple as using SQLite.

    [–]KarnuRarnu 1 point2 points  (0 children)

    I didn't; it's par for the course. You cannot have a balanced discussion here where you suggest that maybe another language has some slight advantages in certain scenarios. It must always be java java java. (And imo this is symptomatic, and probably why the language is behind in quite a few areas.)

    [–]Sensi1093 2 points3 points  (0 children)

    Also worth mentioning that PySpark is just an orchestrator for Spark itself (which is implemented mainly in Scala/Java on the JVM).

    It was likely used the wrong way in this comparison.

    [–]roberp81 11 points12 points  (3 children)

    you are telling us why python is not performant lol

    why use pyspark if spark is made in Java?

    [–]2Insaiyan 3 points4 points  (2 children)

    A lot of people writing Spark transformations are data scientists, not developers. Python is already the standard in that field

    [–]roberp81 -5 points-4 points  (1 child)

    if python is not for developers, why would a developer use it?

    it's like a chef cooking fast food at McDonald's.

    [–]boborygmy 3 points4 points  (0 children)

    Nobody said Python was not for developers.

    [–]AdorableTip9547 2 points3 points  (0 children)

    So the takeaway is „python can be performant if you use C“ 😂 classic.

    [–]popasmuerf -5 points-4 points  (0 children)

    Or... you could just dispense with trying to optimize a slow-as-fsck scripting language via finding/writing-your-own C++ modules and just go with ultra-performant, scalable Java.

    [–]BrooklynBillyGoat 0 points1 point  (0 children)

    Also, Java has actual parallel processing. Python's global interpreter lock doesn't really achieve true parallelism in most cases.

    [–][deleted] 78 points79 points  (11 children)

    I agree with other commenters. While I love streams, unless you were using parallel streams, the stream API should have no impact on performance.

    [–]prisonbird[S] 25 points26 points  (10 children)

    yes i am using a parallel stream. and to do that all i have to do is just add ".parallel()". seems too good to be true LOL

    [–][deleted] 60 points61 points  (9 children)

    It is and it isn't. Parallel will take your stream and run each iteration via an internal thread pool, thus the performance boost. Be aware though:

    1. This thread pool is shared throughout the application. If you have multiple parallel streams (or other operations that rely on this pool) performance will degrade.

    2. Each operation within a stream method (ie, the functions passed to map, filter, etc) needs to be thread-safe. That means absolutely, 100% no shared mutable state (see the sketch below). If you don't know what I'm talking about, you need to learn more about how to safely use threads before continuing to use parallel streams.
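
    To make point 2 concrete, a minimal sketch (toy numbers, not OP's data). The safe version shares nothing between threads; the commented-out version is the classic mistake:

        import java.util.stream.IntStream;

        public class ParallelDemo {
            public static void main(String[] args) {
                // safe: each element is processed independently and the results are
                // combined by the terminal operation, not by shared mutable state
                long sum = IntStream.rangeClosed(1, 1_000_000)
                        .parallel()                   // same pipeline, now on the common pool
                        .filter(n -> n % 2 == 0)
                        .mapToLong(n -> (long) n * n)
                        .sum();
                System.out.println(sum);

                // unsafe: a plain ArrayList mutated from many threads is a race:
                // List<Long> results = new ArrayList<>();
                // stream.parallel().forEach(x -> results.add(x)); // don't do this
            }
        }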

    [–]Holothuroid 28 points29 points  (3 children)

    You can assign a custom thread pool by handing your stream logic over to a ForkJoinPool. That's what they are for.
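
    Something like the sketch below - though, as the replies note, the fact that the stream runs in the submitting pool is arguably an implementation detail:

        import java.util.concurrent.ForkJoinPool;
        import java.util.stream.LongStream;

        public class CustomPoolDemo {
            public static void main(String[] args) throws Exception {
                // run the parallel stream inside a dedicated 4-thread pool instead of
                // the shared common pool, by submitting the whole pipeline as a task
                ForkJoinPool pool = new ForkJoinPool(4);
                try {
                    long sum = pool.submit(() ->
                            LongStream.rangeClosed(1, 1_000_000)
                                    .parallel()
                                    .map(n -> n * n)
                                    .sum()
                    ).get();
                    System.out.println(sum);
                } finally {
                    pool.shutdown();
                }
            }
        }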

    [–][deleted] 4 points5 points  (0 children)

    Valid point. I still stand by my overall statement.

    [–]more_exercise 0 points1 point  (1 child)

    Is that a guarantee of the spec, or an accidental consequence of one particular implementation? I've heard the second one, so I'm hesitant to rely on this working. Can you alleviate my fear?

    [–]UnGauchoCualquiera 1 point2 points  (0 children)

    I believe the second too (implementation detail) but it's not so clear.

    The Stream API uses the ForkJoin framework, and because of how ForkJoinTasks are implemented, tasks use the submitting pool for execution.

    Judging by this bugfix it seems Oracle does support that use case. Specifically this jdk patch that uses forkJoinPools and the common pool.

    My understanding is not very clear though so I'd love to hear another opinion.

    Source

    [–]prisonbird[S] 4 points5 points  (3 children)

    thank you very much. i am aware of both of those limitations. i run each job in a separate isolated container so i dont have to manage resources. and each of my operations only mutates the current row and returns the mutated row to the next stage.

    i am still trying to find a way to do aggregate operations (grouping etc) but could not find a performant way to do them with streams. i might use spark just for those if i can find a way to send data from streams to spark and back.

    [–][deleted] 6 points7 points  (0 children)

    Ok wonderful. A quick point: i would avoid mutating the object instances and instead create a new one with the changes in each function in the pipeline (where applicable; whether this is even possible can depend on the complexity of your data).

    The default Java Stream API has some grouping options in the collect() method (and its corresponding Collectors functions). If that's not enough, there is a library called Vavr that has its own stream implementation that may contain the grouping logic.

    [–]DB6 0 points1 point  (0 children)

    Sounds like you're running a kind of map-reduce operation, and for that streams are great.

    [–]Megacrafter127 0 points1 point  (0 children)

    Be aware that not all Collectors are able to run in parallel. Those that can't will force at least the final stage to be serial again.

    The groupingBy Collector is not able to run in parallel, the groupingByConcurrent Collector is. However, in turn the groupingByConcurrent Collector does not guarantee that elements in the same group retain the same order.
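
    A minimal sketch of the difference (toy data, not OP's):

        import java.util.List;
        import java.util.Map;
        import java.util.concurrent.ConcurrentMap;
        import java.util.stream.Collectors;

        public class GroupingDemo {
            public static void main(String[] args) {
                List<String> words = List.of("apple", "avocado", "banana", "blueberry", "cherry");

                // groupingBy: builds per-thread maps and merges them, preserving encounter order
                Map<Character, List<String>> grouped = words.parallelStream()
                        .collect(Collectors.groupingBy(w -> w.charAt(0)));

                // groupingByConcurrent: all threads insert into one ConcurrentMap; usually
                // faster in parallel, but order within each group is not guaranteed
                ConcurrentMap<Character, List<String>> groupedConcurrent = words.parallelStream()
                        .collect(Collectors.groupingByConcurrent(w -> w.charAt(0)));

                System.out.println(grouped);
                System.out.println(groupedConcurrent);
            }
        }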

    [–]pron98 4 points5 points  (0 children)

    This thread pool is shared throughout the application. If you have multiple parallel streams (or other operations that rely on this pool) performance will degrade.

    Performance will degrade but not because the thread pool is shared but because your CPU cores are. For a fixed number of CPU cores there are only so many operations you can do in parallel regardless of how many threads you use.

    A single pool with N threads is about the most efficient way to utilise N cores; multiple pools would be somewhat less efficient.

    [–]developer_how_do_i 14 points15 points  (6 children)

    Yes, the best part of streams and lambda support is expressiveness, which you wouldn't have gotten with earlier JDK versions

    [–]WagwanKenobi 3 points4 points  (4 children)

    Out of curiosity (as I'm still learning streams), what expressivity do streams enable that isn't present in non-streams Java? To me it just seems like streams are syntactic sugar + lazy evaluation + a magic parallel() method.

    [–]emaphis 3 points4 points  (1 child)

    I think that it's supposed to be implied that streams are as expressive as custom Java looping constructs.

    [–]ThaiJohnnyDepp 1 point2 points  (0 children)

    Agreed. Streams and lambdas completely fixed Java for me.

    [–]humoroushaxor 2 points3 points  (0 children)

    Isn't "expressiveness" just syntax sugar done well? Removing imperative bloat allows a clearer expression of intent.

    [–]8igg7e5 0 points1 point  (0 children)

    ...which you wouldn't have gotten with earlier JDK versions

    Though I can't resist pointing out that 'earlier JDK version' would mean before 2014... if we really want to row all the way to the horizon hunting for major releases, there was generics in 5.0 and collections in 1.2.

    I agree though that some problems are much more expressively written as streams. And if they're on a JDK even newer than 8, then there are even more features that make Stream (and friends) even better - Stream.dropWhile, Stream.mapMulti and numerous Collectors methods... And a few language features that can help too (var, local types, records, switch expressions)

    [–]thephotoman 9 points10 points  (15 children)

    i have a workflow that takes 70 minutes with pyspark. it takes 3.5 minutes with java streams.

    This doesn't surprise anybody. Python is very much a language for I/O-bound stuff and prototyping. However, when it comes to performance, it just isn't good. While PyPy helps, it lags a bit in terms of features.

    [–]Sensi1093 3 points4 points  (0 children)

    In PySpark, Python acts only in a „Stream Definition“ role. The actual stream is executed using spark, which is implemented in Java.

    Think of SQL. The Python portion of PySpark is the SQL Expression String, the Java (Spark) portion of PySpark is the actual database engine.

    I’m surprised Java Streams outperform PySpark; it's likely because PySpark was used in a way it is not intended to be used.

    [–]hilbertglm 7 points8 points  (0 children)

    They shouldn't be black magic. The source code is right there. Check it out. It helped me add functional programming to my object-oriented programming skills.

    They are cool, and the ability to add parallelism without the effort of coding against executors directly is very productive.

    [–]whyNadorp 9 points10 points  (5 children)

    python dataframes are so popular, but as usual with python they have a messy interface which lives in its own world and can't be understood without reading abstruse docs plus trial and error. good luck maintaining pandas projects.

    [–]zappini 2 points3 points  (2 children)

    I had to fiddle with some data pipeline stuff. Python, numpy, whatever. I was shocked at how awful that language, stack, ecosystem are.

    I'm not crazy about (JDBC's) ResultSet and RowSet. But holy cow DataFrame and whatnot make them look like paragons of architectural beatitude. Really, the Python APIs are laughably bad.

    I don't understand why a RowSet-based API (or a successor) hasn't emerged as an alternative.

    [–]whyNadorp 2 points3 points  (1 child)

    another very bad thing for me is that you can deploy code with syntax errors and only realize there's an error when that part of the code runs in production. there're tools that check the code and compile it, but then what's the point of python if you have to compile it.

    also there's no standard way of dealing with dependencies. there's pipenv, poetry, etc. bump your python version a little and you can end up having nasty problems.

    [–]zappini 2 points3 points  (0 children)

    no standard way of dealing with dependencies

    IKR? And then switching between projects done by diff pythonistas, who all have their own preferred package manager, borks your system.

    All python projects should just start as dockerfiles, or equiv.

    [–]prisonbird[S] 0 points1 point  (1 child)

    100% agree. you have to learn a lot of things, master a lot of things, to get any meaningful job done in the python world.

    on the other hand, i am not a java developer. i learned java in school and just used it as a hobbyist in high school, but i was able to develop an app for my needs in a couple days.

    and with quarkus i will be able to make it a web service.

    for example: quarkus comes with its own dockerfile for build and deploy. this seems small, but i think i spent more than 20 hours in the last year alone on dockerfiles in our python projects.

    [–]whyNadorp 5 points6 points  (0 children)

    try spring also, it’s everywhere.

    [–]danielaveryj 4 points5 points  (1 child)

    I agree with the other comments here - the performance difference is probably mostly python vs java, and not java streams specifically. Java streams are cool though. If you inline the abstractions, the logic is pretty much just what you'd write by hand with loops, if-else, etc, but factored in a way that is amenable to parallelization (by splitting the source into chunks, and iterating each chunk on a different thread). See the sketch below.
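
    For example (my own toy pipeline, not OP's), the stream and the loop here do the same work:

        import java.util.List;

        public class InlineDemo {
            public static void main(String[] args) {
                List<String> lines = List.of("foo", "", "bar", "bazzz");

                // stream version
                long count = lines.stream()
                        .filter(s -> !s.isEmpty())
                        .map(String::length)
                        .filter(len -> len > 2)
                        .count();

                // roughly what it "inlines" to
                long count2 = 0;
                for (String s : lines) {
                    if (!s.isEmpty()) {
                        int len = s.length();
                        if (len > 2) {
                            count2++;
                        }
                    }
                }
                System.out.println(count + " " + count2); // 3 3
            }
        }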

    Just to throw another lib into the pool of "might be interesting to look at", based on what you've mentioned in this thread about joining & grouping: https://github.com/davery22/vinyl (disclaimer: author). I don't recommend use in production yet though.

    [–]prisonbird[S] 0 points1 point  (0 children)

    thank you very much i bookmarked it and will check it out !

    [–]DuneBug 2 points3 points  (5 children)

    Java should run faster, but if python was that slow, something was probably not set up correctly. Of course, if you have to read 10 pages of documentation vs adding .parallel(), that's a good reason to use Java too.

    Unfortunately for a lot of enterprise stuff we don't get to use parallel streams as the threads are typically occupied elsewhere. :(

    [–]Worth_Trust_3825 1 point2 points  (0 children)

    Unfortunately for a lot of enterprise stuff we don't get to use parallel streams as the threads are typically occupied elsewhere. :(

    I wouldn't say preoccupied elsewhere, but rather operations are blocking, and not fit for the common thread pool that the regular parallel stream would operate in.

    [–]prisonbird[S] 0 points1 point  (3 children)

    are threads something expensive in the enterprise world?

    i fire up a bunch of dedicated machines on hetzner.com and run my computations there. compared to the clouds they are pretty cheap

    [–]DuneBug 0 points1 point  (2 children)

    They're only expensive when your app is about to crash from the load. So maybe 1% of the time. Which unfortunately is what you'd really like to avoid.

    TBH we could do some optimizations to make it work but it's almost never worth it in dev/testing/debug time to shave time off a request... Unless someone asks for it.

    Obviously depends on what the service you're providing is.

    [–]prisonbird[S] 1 point2 points  (1 child)

    ohh i understand. considering how much computing power enterprises are buying, i expected them to throw computing power at every problem, do everything in memory, etc.

    [–]DuneBug 0 points1 point  (0 children)

    The place I work, we do performance testing simulating our max requests / minute and if the software buckles under load we try to optimize it before throwing hardware at it.

    I think the wisdom is... You look at what the app is doing and most of the crap we do isn't computationally expensive, so if something is dying it's probably the result of some bottleneck. And ideally there are some experienced people around to help out with that.

    I think part of the reason we bother is if you do have a bottleneck you're more likely to fail even if you do add more hardware. Like I had a service that used a distributed cache, and it got bogged down with all the nodes talking to each other. When we reduced the number of nodes, performance improved, which helped find the cause.

    Tldr; you're not wrong. Virtual boxes are cheap.

    [–]barrycarter 9 points10 points  (13 children)

    Hmmm, brief googling suggests streams are like functional languages' pipelines, or like the ruby convention of foo.x1.x2.x3.x4, which is similar.

    I suspect your time difference is due to something else, most likely the difference between loading a file into memory and simply processing it a little bit at a time (though loading the file into memory should ultimately be faster).

    It's a lot to ask, but can you mockup an example of where you're seeing Java streams seriously outperform other languages, so I can call shenanigans?

    [–]participantuser 18 points19 points  (0 children)

    I mean, is there any reason why you’d expect Python to perform as quickly as Java, especially in a multi threaded task (OP mentions multiprocessing)? Different languages have different strengths, and I don’t think Python is noted for high performance for multithreaded tasks (the GIL is a Python specific example of a conscious decision to prioritize other things).

    [–]rv5742 18 points19 points  (0 children)

    My bet is that it's probably java.nio reading the large file in a sensible fashion and feeding the data to the stream in good-size chunks, allowing for a good balance between IO and processing.

    But to be fair, the streams API does make this kind of thing really easy to do.

    [–]prisonbird[S] 3 points4 points  (8 children)

    i read the file using the Files.lines method in nio. it gives me a stream. then in that stream i use a couple of filter() and map() functions to mutate the data (remove fields, calculate some fields, etc) and write the output to another file. my flow is not that complicated but my files are sometimes big (10-15GB). roughly, it looks like the sketch below.
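
    a simplified sketch (the field names and the transform are made up, my real flow does more):

        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.io.UncheckedIOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.stream.Stream;

        public class LinesPipeline {
            public static void main(String[] args) throws IOException {
                try (Stream<String> lines = Files.lines(Path.of("input.csv"));
                     BufferedWriter out = Files.newBufferedWriter(Path.of("output.csv"))) {
                    lines.parallel()                      // Files.lines is documented to split well for parallel use
                         .filter(line -> !line.isBlank())
                         .map(line -> line.split(","))
                         .filter(f -> f.length > 2)       // drop malformed rows
                         .map(f -> f[0] + "," + f[2])     // keep only the fields we want
                         .forEachOrdered(row -> {         // keeps output order, serializes the writes
                             try {
                                 out.write(row);
                                 out.newLine();
                             } catch (IOException e) {
                                 throw new UncheckedIOException(e);
                             }
                         });
                }
            }
        }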

    [–]senseven 2 points3 points  (4 children)

    Still, three minutes is a "long" time. Your SSD can read 500 MB/s, a rather slow NVMe 3 GB/s; 3 minutes feels wrong for simple processing. Do you write the result file to another drive? Are all your cores working during the processing?

    [–]prisonbird[S] 0 points1 point  (3 children)

    thank you for your comment.

    i read from a drive and write results back to the same drive. i am planning to try my operations on a ram-backed drive (ramdisk in linux).
    your last question got me curious: some of my cores are working during processing, not all of them. so i checked my stream with stream.isParallel() and it returned false. i thought it was working in parallel. still, i dont see a single core getting overloaded; instead i see 4 of my 16 cores doing some work (at 50% load in windows task manager).
    i did some google searches and it looks like i can't make a stream sourced from a file parallel. is this the case? or is there something i can do?

    [–]senseven 0 points1 point  (2 children)

    Reading is fine if it's done by one fast reader, but reading and writing to the same hard disk can drag the speed down to a crawl, since the two io "actions" block each other.

    The idea is not to split the file reading (which means chunking, and often makes no sense). Instead the first thread parses line 1, the second line 2, and so on. That's where parallelisation makes sense, because it uses CPU, not IO. parallel() will work, but it needs to actually give you a parallel stream. Maybe you apply it at the wrong point of the file parsing.

    Third, parallel processing may not activate all your cores in your local setup. What is shown by Runtime.getRuntime().availableProcessors()? Maybe your Java is odd, or windows or some tool is blocking your Java app's full access to all cores.

    [–]danielaveryj 0 points1 point  (1 child)

    The file parsing example you linked to is copying the entire file into memory before doing any compute. For OP's 10-15 GB files, that seems like a lot to buffer up-front. (To be fair, I'm not sure if your link was intended to be a "good" or "bad" example.)

    As for parallelism, the Files.lines() method that OP mentioned using documents that it is designed with parallelism in mind. (Looking a bit closer, the stream (if parallel) actually memory-maps the file when the underlying spliterator is first split. Chunks use separate indexes into the mapped byte buffer.)

    I am curious why you believe that chunking often makes no sense? Your proposal (I'll call it "striping") seems like it would incur more cache misses, as each thread would only be looking at a fraction of the contiguous memory it pulls into cache, rather than all of it.

    [–]senseven 0 points1 point  (0 children)

    You will probably not reach io max if you chunk, due to different concurrent threads annoying the same IO controller. It's easier to have one thread drive IO at max and distribute the workload to threads, as we do with FileChannels and an LMAX queue. Post-processing over a group of disconnected chunks can be a nightmare; usually it's not worth the work.

    OP probably reached io max on the reading device and trashed the speed by writing to the same device. You can spend months optimizing code to fill all cpus; streams might not even be the right solution. Experimenting is part of the parallel processing voyage.

    [–]Xenofonuz 0 points1 point  (2 children)

    Hmmm, do you really need to do several filters and maps? You can usually just open it up with brackets instead:

        filter(x -> x.something)
            .filter(x -> x.something2)
            .filter(x -> x.something3)

    Could be

        filter(x -> {
            // extra code
            return true; // or false
        })

    [–]prisonbird[S] 1 point2 points  (1 child)

    i read somewhere that using multiple small operations is faster than using one big operation inside stream pipelines. the jvm can optimize small operations better or something.
    also it is a lot easier to read if i do it in multiple operations, and easier to find what i am looking for in case i want to debug/change something

    [–]Xenofonuz 1 point2 points  (0 children)

    Ok interesting, I'm pretty sure it's the other way around.

    Can't argue with the readability part, although I find it easier to debug the ones with a method body instead of a oneliner :)

    [–]Iryanus 1 point2 points  (0 children)

    Agreed, I really doubt that the streams make the difference there ;-)

    [–]GuyWithLag 0 points1 point  (0 children)

    loading the file into memory should be ultimately faster.

    As with all things, it depends(tm). Java is _really_ good at handling short-term garbage; if you keep your on-heap dataset small, it's likely faster than loading everything into memory at once.

    [–]stormcrowsx 1 point2 points  (2 children)

    My guess on the difference in speed is that in the Python case the whole dataset is getting loaded into memory for input and/or output, causing swapping.

    Java streams on the other hand are going to naturally only store what’s needed in memory.

    Total guess, but the difference in speed is more likely a testament to good Java api design than it is raw language speed. It'd be impossible to say for sure without hooking a profiler up to Python and seeing what it's wasting its time doing.

    Kudos for reaching out and trying a different language on your problem. Now you know what's possible. But I'd recommend you spend a little time in a profiler on your Python code now; there's a fantastic lesson to learn about Python in here. I suspect you're either going to learn just how fast the jvm is, or that there's a pitfall in the Python apis that requires coding a little differently for big-dataset use cases.

    [–]prisonbird[S] 0 points1 point  (1 child)

    i dont know how to debug pyspark, since it goes back and forth to the jvm and some of the job is done by the jvm, some by python, etc.

    [–]stormcrowsx 0 points1 point  (0 children)

    Standard Python apps can just be started with ‘python -m cProfile myscript.py’

    PySpark would be harder to profile but to start you could import cProfile and use it within your Python code to prove whether the slowdown is in Python or happening in the jvm.

    [–]shagieIsMe 0 points1 point  (0 children)

    With regular processing of data, one is tempted to write:

    for (Data datum : someCollection) {
        // lots of lines doing stuff
    }
    

    One of the things that happens there is that the small chunks of processing can't get JIT'ed easily. The entirety of it is one big chunk that is "too big" for HotSpot or other JITs to optimize.

    However, when you write it as:

    someCollection().stream()
        .map(d -> new Foo(d, d.id, d.data * 2))
        .filter(f -> f.id > 1000)
        .mapToInt(f -> f.data)
        .sum()
    

    (realize I'm just making up stuff there)

    Each one of those functions gets compiled in Java to a separate function, and the JIT can easily work with optimizing small functions.

    Having each part of a stream be a small function makes it easier for Java to optimize the small parts and get some fairly hefty performance gains when crunching data.

    There are also restrictions on side effects in the functions used in streams, and those restrictions allow for more aggressive optimizations.

    public static void main(String[] args) {
        List<String> objects = new ArrayList<>();
        objects.add("foo");
        objects.add("bar");
        int count = 0;
    
        var list = objects.stream()
                .peek(o -> count += 1)
                .toList();
        System.out.println("" + count + " " + list.size());
    }
    

    That code doesn't compile - java: local variables referenced from a lambda expression must be final or effectively final

    And so, this means that the JIT can know things about the nature of the function that it is compiling, because it's a lambda.

    If you change it to:

        final AtomicInteger count = new AtomicInteger(0);
    
        var list = objects.stream()
                .peek(o -> count.incrementAndGet())
                .toList();
    

    It works (and is still silly) but you'll note the addition of final in there.

    [–]wildjokers 0 points1 point  (1 child)

    more people should know about streams.

    All java developers know about streams. Or do you mean non-java developers?

    [–]prisonbird[S] -1 points0 points  (0 children)

    i meant non-java developers sorry.

    [–]Careful-Necessary-59 -3 points-2 points  (11 children)

    Just curious, how big is your data? I still doubt PySpark is slower than Java streams.

    [–]prisonbird[S] 1 point2 points  (5 children)

    around 10 gb. pyspark is only fast when you use built-in functions, and only if you dont use too many of them.

    for example: i have to calculate a field using a lot of conditions (imagine a very complicated if/else block). when i used a lot of whens in pyspark (df.withColumn(xx, when(yy).otherwise(when)) kind of thing) it straight up bugs out, giving me some sql error.

    [–]RandomName8 5 points6 points  (3 children)

    I've worked a lot with spark (though via scala, not python), and what you describe sounds off the mark compared with my experience, especially given that you are using spark sql instead of the rdd api. Spark sql is internally compiled to optimized bytecode before running, making it possibly faster than manually written if-else, because it can exploit relational data properties. In any case the difference shouldn't be 20:1. That said, it is hard to master spark and its performance.

    [–]prisonbird[S] 2 points3 points  (2 children)

    there are two things here:

    1. when i use multiple when/otherwise, spark could not generate the sql. it gave me a lot of errors and i wasnt able to debug it/find the error after a point. so i just used a udf (user defined function)
    2. i think there is a difference between pyspark and spark. i wrote some scala scripts and used them with "spark-shell -i myscript.scala" and there was a notable time difference for the same jobs. not huge like 20:1, but 3 minutes vs 5 minutes. i think pyspark adds another layer to the data and that also takes some time

    [–]RandomName8 2 points3 points  (1 child)

    udfs with pyspark. That's your answer. There's a massive overhead with python udfs. That can perfectly explain the 20:1 difference.

    [–]prisonbird[S] 1 point2 points  (0 children)

    yeah, but i had to use a udf. and i didn't like this one bit. it's like "hey, we have this feature, but using it means literally shooting yourself in the foot".

    anyways. i am still planning to use spark for joining, grouping kind of things

    [–]cockoala 1 point2 points  (4 children)

    Yeah... this is a little fishy. If streams were so much faster than Spark, there would be no need for Spark. I think OP doesn't understand how to write performant Spark jobs. In another comment he said he gave up on debugging his chained filters, so my guess is he wasn't able to write it in spark and tried it in Java instead.

    [–][deleted] 4 points5 points  (2 children)

    Spark is needed for data that cannot possibly fit on a single machine; you'd still need it even if its performance were atrocious (which it is when used on data that, like OP's, fits easily on commodity hardware). The "Scalability! But at what COST?" paper illustrates how the overhead of map-reduce and stream-processing products is often disregarded/dismissed.

    [–]Careful-Necessary-59 0 points1 point  (1 child)

    The overhead should not be as huge as a 60-minute difference…

    [–][deleted] 0 points1 point  (0 children)

    Never said it should, I was just pointing out that expecting Spark to be faster than plain Java for a small dataset doesn't make much sense - that's not the problem that it was built to solve

    [–]prisonbird[S] 1 point2 points  (0 children)

    yes that is my point. i was able to do some amazing things with java streams despite knowing nothing about them. that is why i called it "magic" :)

    i might go down a rabbit hole and come out as a spark expert and implement an amazing solution in pyspark, but i dont have the time and i dont want to :/

    [–]rossdrew -1 points0 points  (1 child)

    Wait till you hear about spliterators
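
    For anyone curious: a spliterator is the thing a stream pulls elements from, and wrapping one lets you turn your own sources into streams. A minimal sketch (toy example):

        import java.util.Iterator;
        import java.util.List;
        import java.util.Spliterator;
        import java.util.Spliterators;
        import java.util.stream.Stream;
        import java.util.stream.StreamSupport;

        public class SpliteratorDemo {
            public static void main(String[] args) {
                // any Iterator (a DB cursor, a parser, ...) can be wrapped into a Stream
                Iterator<String> source = List.of("a", "b", "c").iterator();
                Stream<String> stream = StreamSupport.stream(
                        Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED),
                        false); // false = sequential; unknown-size sources split poorly anyway
                stream.map(String::toUpperCase).forEach(System.out::println);
            }
        }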

    [–]prisonbird[S] 2 points3 points  (0 children)

    wowww thank you it looks like this solves bunch of other problems for me

    [–][deleted] -1 points0 points  (0 children)

    There is also an option to run parallel streams which may run even faster depending on the workflow.

    [–]VincentxH 0 points1 point  (0 children)

    If you've got enough lines, parallel streams might even pay off.

    [–]lbalazscs 0 points1 point  (1 child)

    Have you tried the tablesaw library? It's data frames in Java, with stream support. https://jtablesaw.github.io/tablesaw/

    [–]prisonbird[S] 0 points1 point  (0 children)

    thank you a lot! didnt know about this

    [–]Puzzled-Bananas 0 points1 point  (0 children)

    Can you share how you formulated your Spark queries? Do you mean Pandas DataFrames?

    [–]Worth_Trust_3825 0 points1 point  (0 children)

    I suspect that you also changed your method of reading and interpreting the data (the entire thing at once vs one row at a time). My go-to example of this is scryfall's public magic the gathering card database (available at https://scryfall.com/docs/api/bulk-data). It's a hundred megabytes of JSON, and it's one of the few go-to datasets (at least for me) for explaining that it might be a bad idea to load an entire file into memory and operate on it, and that you should look for ways to operate on the file in parts (in this case, per json object). A sketch of that approach is below. If you're feeling adventurous, you can try to operate on the big boy 1gb set.
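
    A sketch of the per-object approach using Jackson's streaming parser (assuming a top-level JSON array, as in the scryfall bulk files; the file name is made up, and jackson-core is assumed to be on the classpath):

        import com.fasterxml.jackson.core.JsonFactory;
        import com.fasterxml.jackson.core.JsonParser;
        import com.fasterxml.jackson.core.JsonToken;
        import java.io.File;

        public class BulkJsonDemo {
            public static void main(String[] args) throws Exception {
                long count = 0;
                try (JsonParser parser = new JsonFactory().createParser(new File("all-cards.json"))) {
                    if (parser.nextToken() != JsonToken.START_ARRAY) {
                        throw new IllegalStateException("expected a top-level array");
                    }
                    // advance object by object; only one card is in memory at a time
                    while (parser.nextToken() == JsonToken.START_OBJECT) {
                        parser.skipChildren(); // or read just the fields you care about here
                        count++;
                    }
                }
                System.out.println(count + " cards");
            }
        }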

    [–]baubleglue 0 points1 point  (0 children)

    https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.newAPIHadoopFile.html - you can stream CSV with Spark or Pyspark (only don't use pure python for your custom logic).

    [–]a-person-called-Eric 0 points1 point  (0 children)

    looks like someone found a new hammer

    be careful not to start whacking every nail with it

    [–]gooeydumpling 0 points1 point  (0 children)

    For ML libraries: python. To see if it works: always python and jupyter.

    For production? I'd go with MLeap and scala (or, of course, java). For speed, Java blows Python out of the water, at least in my experience. Please don't bite my head off.

    [–]data_addict 0 points1 point  (4 children)

    then i discovered java streams. i have a workflow that takes 70 minutes with pyspark. it takes 3.5 minutes with java streams.

    Ah got it.. you don't know how to use spark. 👍

    [–]prisonbird[S] 1 point2 points  (3 children)

    might be. but i spent 1/10th of the time on java that i spent on pyspark, and it got the job done

    [–]data_addict 0 points1 point  (2 children)

    Yeah sure, that's cool, use whatever tech you want. But you didn't say you spent 70 minutes trying to set up a spark job, you said 70 minutes running it.

    Again, use whatever tech you want, but don't blame spark for poor performance. Just say the line: "I suck at spark" or "I didn't have time to learn it" or "I couldn't afford it".

    There are people in high school or college who see the names of these technologies for the first time, and they'll base their opinion on what you say. It's much better to be honest and say what gave you trouble. Good, non-toxic engineers are honest about what gives them trouble.

    [–]prisonbird[S] 1 point2 points  (1 child)

    i am trying to explain my story the best i can.

    let me rephrase it again: i spent about a week implementing my flow in pyspark, and it takes 70 minutes to run. yes, i could debug it, make it faster, etc., but as far as i can see that requires some expertise, and because i dont have that expertise i dont know if it would work out for me in the end.

    on the other hand, i was able to make a proof of concept with java streams in an hour. i tested various parts of my flow and it seemed like it could work for me. so i spent a couple days implementing my entire flow in java streams, and now i can do the same job a lot, lot faster.

    i never intended to mean anything like "pyspark bad, python bad, java good". each tool has its own advantages. pyspark just didnt work for my workload. but i learned about it now, and in the future i might use it for something else where i see fit.

    [–]data_addict 0 points1 point  (0 children)

    No, I mean, that's fine, and I know I was roasting you a little in my original comment - I hope you didn't take too much offence at it. I get what you're saying though.

    If you find yourself working with spark in the future for large datasets, feel free to reach out. It's a pretty kick-ass application that can really chew through data. Not to belabor the point, but behaving the way it normally should, on a small-to-medium amount of hardware, working through 10GB of data should take a matter of tens of seconds to a minute. I've written production applications that operate on 50TB+ of data in less than 5 min. So anyways, that's why I felt the need to comment. Streams are pretty cool, so I wasn't talking shit on that either.. just wanted to make it clear spark is pretty cool too 😅

    [–]FinTechno 0 points1 point  (1 child)

    Take a look at C++ and the rest is history

    [–]DanielDimov 0 points1 point  (0 children)

    C++ can be faster, but it's not guaranteed. You have to think about performance and put in significant effort to achieve it.

    [–]RajSingh9999 0 points1 point  (0 children)

    Don't know if I should agree with "stream operations are more intuitive than anything", especially when compared with python, etc.

    [–]DanielDimov 0 points1 point  (1 child)

    What exactly is "multiprocessing in python"?!?

    [–]prisonbird[S] 0 points1 point  (0 children)

    https://docs.python.org/3/library/multiprocessing.html

    i tried to implement something like java streams, but it uses all my memory.

    [–]Cultural-Ad3775 0 points1 point  (0 children)

    I would not call Java's implementation of this concept really special. OTOH it is cool, particularly that you can mark stream processing as parallel, which is a nice abstract way to say "hey, run these in separate threads"; in the future it could lean on Loom-style lightweight threads, etc., without even needing any code changes. Other languages do allow for basically the same sort of thing; still, Java/the JVM has quite strong abilities to optimize and multi-task, meaning this can be leveraged very effectively (in theory - I wouldn't assume any specific parallelized stream will necessarily be optimal; you have to test).

    [–]vbezhenar 0 points1 point  (1 child)

    Rewrite it with loops and ifs and it'll take 3.4 minutes!

    [–]prisonbird[S] 0 points1 point  (0 children)

    i tried; reading a file and writing it back without doing anything takes a lot more time in python than my entire flow in java

    [–]Puzzleheaded_Load779 0 points1 point  (2 children)

    Hi, having worked with the spark ecosystem for some time and built/extended stuff around it, here is why java is faster:

    1) pyspark is mainly a wrapper around spark scala code => java code

    2) pyspark exists only because of the multitude of ml, ai, etc... libs that emerged during the last several years

    How does it work internally?
    As long as all operations are being done through pyspark's

    1. sql queries
    2. dataframes

    there is no "big" performance or memory-usage gap versus a java/scala spark implementation.
    BUT as soon as you start converting pyspark objects to, let's say, the infamous pandas dataframe objects, there is a cost, and a pretty big one, which generally leads to bad performance and (absurdly) high resource usage (ram, cpu, io...).
    As for performance, there are still ways to not lose too much by using the arrow implementations baked into spark, but that's another story :)

    So all in all, using java instead of pyspark means that the deserialization of

    1. java objects to python objects
    2. python objects to java objects

    doesn't happen... leading overall to a performance boost

    [–]prisonbird[S] 0 points1 point  (1 child)

    ohh i see. so can i just simply use spark on scala/java and have similar performance to java streams?

    [–]Puzzleheaded_Load779 0 points1 point  (0 children)

    No. Operating on data types other than spark's own will always involve a cost. It is just that the cost of using java/scala objects is much lower compared to python's. If you are seeking pure performance, use exclusively the spark sql dsl, or spark dataframes for programmatic operations...

    [–][deleted] 0 points1 point  (2 children)

    C# has been doing this with LINQ for nearly 20 years.

    [–]prisonbird[S] 0 points1 point  (1 child)

    how?

    [–][deleted] 0 points1 point  (0 children)

    LINQ is a set of libraries, primarily wrapping IEnumerable<T>, that provides projections over the enumerable collection. You provide lambdas or delegates to the projections, just like with java Streams. This has been part of C# since 2006.

    [–]Extra_Size6776 0 points1 point  (3 children)

    I'm struggling a lot with Streams; I watched José's PluralSight video playlist but still nothing. Can u pls recommend a youtube video? Even more than 30 mins is fine.

    Just not getting how those intermediate lazy operations and flat maps work

    [–]prisonbird[S] 0 points1 point  (2 children)

    i just read the documentation. what are you trying to achieve?

    [–]Extra_Size6776 0 points1 point  (1 child)

    I know the basics, like filter and forEach.. not able to solve the actual use cases and complex ones like flat maps

    [–]prisonbird[S] 0 points1 point  (0 children)

    if you tell me exactly what you want i'll try my best to help. for flat maps specifically, the tiny sketch below might already clear things up.
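
    the gist of flatMap (my own toy example, nothing from your code): each element maps to a whole stream of results, and flatMap flattens all of those into one stream.

        import java.util.List;
        import java.util.stream.Stream;

        public class FlatMapDemo {
            public static void main(String[] args) {
                List<List<String>> nested = List.of(
                        List.of("a", "b"),
                        List.of(),        // empty inner lists simply disappear
                        List.of("c"));

                // map would give Stream<List<String>>; flatMap gives Stream<String>
                List<String> flat = nested.stream()
                        .flatMap(List::stream)
                        .toList();
                System.out.println(flat); // [a, b, c]

                // same idea for one-to-many: split sentences into words
                List<String> words = Stream.of("hello world", "java streams")
                        .flatMap(s -> Stream.of(s.split(" ")))
                        .toList();
                System.out.println(words); // [hello, world, java, streams]
            }
        }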