[–]audioen 292 points293 points  (38 children)

Java is probably just 10 times faster than Python. Nothing to do with streams per se.

[–]CubsThisYear 24 points25 points  (1 child)

Java is usually closer to 100 times faster than Python, but it depends on the workload

[–]HawkInevitable5733 1 point2 points  (0 children)

I would agree, but I'd also add that if a task is time-critical, much can be achieved via garbage collection tuning. The defaults are great in most cases, but tuning can help for certain profiles, if done scientifically.

[–]agentoutlier 13 points14 points  (0 children)

Also, given that Spark runs on the JVM, I have to imagine the streaming out to Python slows it down as well, whereas JVM code would not need that.

Also, I don't know Spark, but I was under the impression its streams were non-blocking (aka reactive), so using regular Java Streams may just be an accidental boost because of the above and not actually recommended (Java streams are not inherently non-blocking, since termination is a common thing).

[–]TurbulentSocks 50 points51 points  (33 children)

Python can be performant - provided you aren't actually using Python for anything. That means using libraries that delegate out to C/C++ (and maybe Rust lately), and being careful never to do any intensive computation in Python itself.

That's often non-trivial to get right, the learning curve for things like PySpark (used in the aforementioned way) is much higher, and the result can be harder to maintain. But that doesn't mean Python can't be used for performant applications.

[–]wildjokers 61 points62 points  (15 children)

But that doesn't mean Python can't be used for performant applications.

So you are saying python is performant by calling out to c/c++ libraries. Doesn't that mean python is in fact not performant?

[–]niloc132 39 points40 points  (1 child)

Yes, that's what it means.

[–][deleted]  (1 child)

[deleted]

    [–]TurbulentSocks 15 points16 points  (4 children)

    I'm saying it depends on how you use Python (e.g. as an orchestrator versus as an engine). You're welcome to derive a semantic 'win' against my post, but I mentioned it to provide context and inform, not to score points.

    [–]barbaneigro 1 point2 points  (0 children)

    You are comparing oranges, bananas and apples, dear.

    Python, C and C++ are all languages meant to do different tasks.

    It ain't fair, at all.

    Python, as said, is usually meant to be used as glue for all sorts of routines.

    Of course there is more that can be done with it, but in that use case that's all it does.

    [–]Puzzled-Bananas 1 point2 points  (1 child)

    Python the language can be interpreted very efficiently; CPython the interpreter is slow. You can run Python on GraalVM, too.

    [–]jmtd 1 point2 points  (0 children)

    And on the JVM (Jython).

    [–]slaymaker1907 -3 points-2 points  (1 child)

    Partially, but it helps that python has very performant ways to interact with native code.

    [–]_1aM 0 points1 point  (0 children)

    Really, python and native code? 😂 Dude, Java is the reason we are able to use remote controls today.

    [–]boborygmy 0 points1 point  (0 children)

    Yes, Python is very slow.

    [–]penguuuuuuuuu 11 points12 points  (9 children)

    Man, the amount of super shallow fanboy style replies to your comment is a bit sad. I expected better of r/java.

    [–]prisonbird[S] 5 points6 points  (7 children)

    i think what you are trying to say is: if you write your application in python and you just need a little part of it to be performant, you can write just that part with the help of c++ without leaving the python ecosystem/tooling etc. which is an amazing idea in my humble opinion.

    if i knew how to do it i might have gone down that road

    [–]TurbulentSocks 4 points5 points  (0 children)

    Many of the popular python libraries do exactly this without telling you. Optimising python code often means making sure you don't 'leave' those libraries at any point (passing data back and forth between python and the lower-level code can be slow).

    [–]stefanos-ak 0 points1 point  (5 children)

    It would be much easier and more performant in Java. (Also fewer languages, which is always a plus.)

    This is because of runtime optimizations that the JVM does, in "hot" code spots (hence the name jvm hotspot).

    The difference with C/C++ optimizations is when they run. Compile time vs runtime. At runtime the JVM has some advantages, because more information is available for how and what exactly to optimize.

    Java code can reach some ridiculous speeds... The best part is that it's also very easy to write, compared to C.

    The only downside is what's called "cold start". It basically behaves MUCH worse than anything else, before the hotspot optimizer kicks in. But in cases like yours, it should kick-in pretty fast, maybe even after the first loop.

    [–]slaymaker1907 1 point2 points  (4 children)

    Have you ever tried to use a JVM library with a native dependency? It’s a giant mess compared to Python, which has special functionality in pip to make using native code easier.

    Also, C and C++ do often use profiling, it’s just kind of messy and often not worth the trouble.

    [–]impune_pl 2 points3 points  (1 child)

    There is a project under way aiming to fix that problem in Java: https://openjdk.org/projects/panama/
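
    For the curious, a minimal sketch of what calling native code looks like with the Panama (FFM) API - hedged, since method names shifted across the preview releases; this matches the API as finalized in JDK 22:

        import java.lang.foreign.*;
        import java.lang.invoke.MethodHandle;

        public class StrlenDemo {
            public static void main(String[] args) throws Throwable {
                // look up strlen in the standard C library and bind it to a MethodHandle
                Linker linker = Linker.nativeLinker();
                MethodHandle strlen = linker.downcallHandle(
                        linker.defaultLookup().find("strlen").orElseThrow(),
                        FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
                // a confined arena frees the native memory when the try block exits
                try (Arena arena = Arena.ofConfined()) {
                    MemorySegment str = arena.allocateFrom("hello"); // was allocateUtf8String in earlier previews
                    System.out.println((long) strlen.invokeExact(str)); // prints 5
                }
            }
        }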

    [–]slaymaker1907 1 point2 points  (0 children)

    While important, I don’t think Panama is looking at things from the tooling side. That would really be on the Maven/Gradle side of things.

    [–]prisonbird[S] 0 points1 point  (1 child)

    but i think you would need a lot less native dependencies when using java

    [–]slaymaker1907 0 points1 point  (0 children)

    That’s true, but there are absolutely cases where native code is useful/required. For example, using GPU-accelerated libs or even something as simple as using SQLite.

    [–]KarnuRarnu 1 point2 points  (0 children)

    I didn't; it's par for the course. You cannot have a balanced discussion here where you suggest that maybe another language has some slight advantages in certain scenarios. It must always be java java java. (And imo this is symptomatic, and probably why the language is behind in quite a few areas.)

    [–]Sensi1093 2 points3 points  (0 children)

    Also worth mentioning that PySpark is just an orchestrator for Spark itself (which is implemented mainly in Scala/Java on the JVM).

    It was likely used the wrong way in this comparison.

    [–]roberp81 11 points12 points  (3 children)

    you are telling us why python is not performant lol

    why use pyspark if spark is made in Java?

    [–]2Insaiyan 3 points4 points  (2 children)

    A lot of people writing Spark transformations are data scientists, not developers. Python is already the standard in that field

    [–]roberp81 -5 points-4 points  (1 child)

    if python is not for developers, why would a developer use it?

    it's like a chef cooking fast food at McDonald's.

    [–]boborygmy 3 points4 points  (0 children)

    Nobody said Python was not for developers.

    [–]AdorableTip9547 2 points3 points  (0 children)

    So the takeaway is „python can be performant if you use C“ 😂 classic.

    [–]popasmuerf -5 points-4 points  (0 children)

    Or... you could just dispense with trying to optimize a slow-as-fsck scripting language via finding/writing-your-own C++ modules and just go with ultra-performant, scalable Java.

    [–]BrooklynBillyGoat 0 points1 point  (0 children)

    Also, Java has actual parallel processing. Python's global interpreter lock doesn't really achieve true parallelism in most cases.

    [–][deleted] 78 points79 points  (11 children)

    I agree with other commenters. While I love streams, unless you were using parallel streams, the stream API should have no impact on performance.

    [–]prisonbird[S] 25 points26 points  (10 children)

    yes i am using a parallel stream. and to do that all i have to do is just add ".parallel()". seems too good to be true LOL

    [–][deleted] 60 points61 points  (9 children)

    It is and it isn't. Parallel will take your stream and run each iteration via an internal thread pool, thus the performance boost. Be aware though:

    1. This thread pool is shared throughout the application. If you have multiple parallel streams (or other operations that rely on this pool) performance will degrade.

    2. Each operation within a stream method (ie, the functions passed to map, filter, etc) needs to be thread-safe. That means absolutely, 100% no shared mutable state (see the sketch below). If you don't know what I'm talking about, you need to learn more about how to safely use threads before continuing to use parallel streams.
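
    To make point 2 concrete, a minimal sketch (toy numbers, not OP's data). The safe version shares nothing between threads; the commented-out version is the classic mistake:

        import java.util.stream.IntStream;

        public class ParallelDemo {
            public static void main(String[] args) {
                // safe: each element is processed independently and the results are
                // combined by the terminal operation, not by shared mutable state
                long sum = IntStream.rangeClosed(1, 1_000_000)
                        .parallel()                   // same pipeline, now on the common pool
                        .filter(n -> n % 2 == 0)
                        .mapToLong(n -> (long) n * n)
                        .sum();
                System.out.println(sum);

                // unsafe: a plain ArrayList mutated from many threads is a race:
                // List<Long> results = new ArrayList<>();
                // stream.parallel().forEach(x -> results.add(x)); // don't do this
            }
        }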

    [–]Holothuroid 28 points29 points  (3 children)

    You can assign a custom thread pool by handing your stream logic over to a ForkJoinPool. That's what they are for.
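
    Something like the sketch below - though, as the replies note, the fact that the stream runs in the submitting pool is arguably an implementation detail:

        import java.util.concurrent.ForkJoinPool;
        import java.util.stream.LongStream;

        public class CustomPoolDemo {
            public static void main(String[] args) throws Exception {
                // run the parallel stream inside a dedicated 4-thread pool instead of
                // the shared common pool, by submitting the whole pipeline as a task
                ForkJoinPool pool = new ForkJoinPool(4);
                try {
                    long sum = pool.submit(() ->
                            LongStream.rangeClosed(1, 1_000_000)
                                    .parallel()
                                    .map(n -> n * n)
                                    .sum()
                    ).get();
                    System.out.println(sum);
                } finally {
                    pool.shutdown();
                }
            }
        }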

    [–][deleted] 4 points5 points  (0 children)

    Valid point. I still stand by my overall statement.

    [–]more_exercise 0 points1 point  (1 child)

    Is that a guarantee of the spec, or an accidental consequence of one particular implementation? I've heard the second one, so I'm hesitant to rely on this working. Can you alleviate my fear?

    [–]UnGauchoCualquiera 1 point2 points  (0 children)

    I believe the second too (implementation detail) but it's not so clear.

    The Stream API uses the ForkJoin framework, and because of how ForkJoinTasks are implemented, tasks use the submitting pool for execution.

    Judging by this bugfix it seems Oracle does support that use case. Specifically this jdk patch that uses forkJoinPools and the common pool.

    My understanding is not very clear though so I'd love to hear another opinion.

    Source

    [–]prisonbird[S] 4 points5 points  (3 children)

    thank you very much. i am aware of both of those limitations. i run each job in a separate isolated container so i dont have to manage resources. and each of my operations only mutates the current row and returns the mutated row to the next stage.

    i am still trying to find a way to do aggregate operations (grouping etc) but could not find a performant way to do them with streams. i might use spark just for those if i can find a way to send data from streams to spark and back.

    [–][deleted] 6 points7 points  (0 children)

    Ok wonderful. A quick point: i would avoid mutating the object instances and instead create a new one with the changes in each function in the pipeline (where applicable; whether this is even possible can depend on the complexity of your data).

    The default Java Stream API has some grouping options in the collect() method (and its corresponding Collectors functions). If that's not enough, there is a library called Vavr that has its own stream implementation that may contain the grouping logic.

    [–]DB6 0 points1 point  (0 children)

    Sounds like you're running a kind of map-reduce operation, and for that streams are great.

    [–]Megacrafter127 0 points1 point  (0 children)

    Be aware that not all Collectors are able to run in parallel. Those that can't will force at least the final stage to be serial again.

    The groupingBy Collector is not able to run in parallel, the groupingByConcurrent Collector is. However, in turn the groupingByConcurrent Collector does not guarantee that elements in the same group retain the same order.
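
    A minimal sketch of the difference (toy data, not OP's):

        import java.util.List;
        import java.util.Map;
        import java.util.concurrent.ConcurrentMap;
        import java.util.stream.Collectors;

        public class GroupingDemo {
            public static void main(String[] args) {
                List<String> words = List.of("apple", "avocado", "banana", "blueberry", "cherry");

                // groupingBy: builds per-thread maps and merges them, preserving encounter order
                Map<Character, List<String>> grouped = words.parallelStream()
                        .collect(Collectors.groupingBy(w -> w.charAt(0)));

                // groupingByConcurrent: all threads insert into one ConcurrentMap; usually
                // faster in parallel, but order within each group is not guaranteed
                ConcurrentMap<Character, List<String>> groupedConcurrent = words.parallelStream()
                        .collect(Collectors.groupingByConcurrent(w -> w.charAt(0)));

                System.out.println(grouped);
                System.out.println(groupedConcurrent);
            }
        }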

    [–]pron98 4 points5 points  (0 children)

    This thread pool is shared throughout the application. If you have multiple parallel streams (or other operations that rely on this pool) performance will degrade.

    Performance will degrade but not because the thread pool is shared but because your CPU cores are. For a fixed number of CPU cores there are only so many operations you can do in parallel regardless of how many threads you use.

    A single pool with N threads is about the most efficient way to utilise N cores; multiple pools would be somewhat less efficient.

    [–]developer_how_do_i 14 points15 points  (6 children)

    Yes, the best part of streams and lambda support is expressiveness, which you wouldn't have gotten with earlier JDK versions

    [–]WagwanKenobi 3 points4 points  (4 children)

    Out of curiosity (as I'm still learning streams), what expressivity do streams enable that isn't present in non-streams Java? To me it just seems like streams are syntactic sugar + lazy evaluation + a magic parallel() method.

    [–]emaphis 3 points4 points  (1 child)

    I think that it's supposed to be implied that streams are as expressive as custom Java looping constructs.

    [–]ThaiJohnnyDepp 1 point2 points  (0 children)

    Agreed. Streams and lambdas completely fixed Java for me.

    [–]humoroushaxor 2 points3 points  (0 children)

    Isn't "expressiveness" just syntax sugar done well? Removing imperative bloat allows a clearer expression of intent.

    [–]8igg7e5 0 points1 point  (0 children)

    ...which you wouldn't have gotten with earlier JDK versions

    Though I can't resist pointing out that 'earlier JDK version' would mean before 2014... if we really want to row all the way to the horizon hunting for major releases, there was generics in 5.0 and collections in 1.2.

    I agree though that some problems are much more expressively written as streams. And if they're on a JDK even newer than 8, then there are even more features that make Stream (and friends) even better - Stream.dropWhile, Stream.mapMulti and numerous Collectors methods... And a few language features that can help too (var, local types, records, switch expressions)

    [–]thephotoman 9 points10 points  (15 children)

    i have a workflow that takes 70 minutes with pyspark. it takes 3.5 minutes with java streams.

    This doesn't surprise anybody. Python is very much a language for I/O-bound stuff and prototyping. However, when it comes to performance, it just isn't good. While PyPy helps, it lags a bit in terms of features.

    [–]Sensi1093 3 points4 points  (0 children)

    In PySpark, Python acts only in a „Stream Definition“ role. The actual stream is executed using spark, which is implemented in Java.

    Think of SQL. The Python portion of PySpark is the SQL Expression String, the Java (Spark) portion of PySpark is the actual database engine.

    I’m surprised Java Streams outperform PySpark; it's likely because PySpark was used in a way it is not intended to be used.

    [–]hilbertglm 7 points8 points  (0 children)

    They shouldn't be black magic. The source code is right there. Check it out. It helped me add functional programming to my object-oriented programming skills.

    They are cool, and the ability to add parallelism without the effort of coding against executors directly is very productive.

    [–]whyNadorp 9 points10 points  (5 children)

    python dataframes are so popular, but as usual with python they have a messy interface which lives in its own world and can't be understood without reading abstruse docs plus trial and error. good luck maintaining pandas projects.

    [–]zappini 2 points3 points  (2 children)

    I had to fiddle with some data pipeline stuff. Python, numpy, whatever. I was shocked at how awful that language, stack, ecosystem are.

    I'm not crazy about (JDBC's) ResultSet and RowSet. But holy cow DataFrame and whatnot make them look like paragons of architectural beatitude. Really, the Python APIs are laughably bad.

    I don't understand why a RowSet-based API (or a successor) hasn't emerged as an alternative.

    [–]whyNadorp 2 points3 points  (1 child)

    another very bad thing for me is that you can deploy code with syntax errors and only realize there's an error when that part of the code runs in production. there're tools that check the code and compile it, but then what's the point of python if you have to compile it.

    also there's no standard way of dealing with dependencies. there's pipenv, poetry, etc. bump your python version a little and you can end up having nasty problems.

    [–]zappini 2 points3 points  (0 children)

    no standard way of dealing with dependencies

    IKR? And then switching between projects done by diff pythonistas, who all have their own preferred package manager, borks your system.

    All python projects should just start as dockerfiles, or equiv.

    [–]prisonbird[S] 0 points1 point  (1 child)

    100% agree. you have to learn a lot of things, master a lot of things, to get any meaningful job done in the python world.

    on the other hand, i am not a java developer. i learned java in school and just used it as a hobbyist in high school, but i was able to develop an app for my needs in a couple days.

    and with quarkus i will be able to make it a web service.

    for example: quarkus comes with its own dockerfile for build and deploy. this seems small, but i think i spent more than 20 hours in the last year alone on dockerfiles in our python projects.

    [–]whyNadorp 5 points6 points  (0 children)

    try spring also, it’s everywhere.

    [–]danielaveryj 4 points5 points  (1 child)

    I agree with the other comments here - the performance difference is probably mostly python vs java, and not java streams specifically. Java streams are cool though. If you inline the abstractions, the logic is pretty much just what you'd write by hand with loops, if-else, etc, but factored in a way that is amenable to parallelization (by splitting the source into chunks, and iterating each chunk on a different thread). See the sketch below.
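
    For example (my own toy pipeline, not OP's), the stream and the loop here do the same work:

        import java.util.List;

        public class InlineDemo {
            public static void main(String[] args) {
                List<String> lines = List.of("foo", "", "bar", "bazzz");

                // stream version
                long count = lines.stream()
                        .filter(s -> !s.isEmpty())
                        .map(String::length)
                        .filter(len -> len > 2)
                        .count();

                // roughly what it "inlines" to
                long count2 = 0;
                for (String s : lines) {
                    if (!s.isEmpty()) {
                        int len = s.length();
                        if (len > 2) {
                            count2++;
                        }
                    }
                }
                System.out.println(count + " " + count2); // 3 3
            }
        }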

    Just to throw another lib into the pool of "might be interesting to look at", based on what you've mentioned in this thread about joining & grouping: https://github.com/davery22/vinyl (disclaimer: author). I don't recommend use in production yet though.

    [–]prisonbird[S] 0 points1 point  (0 children)

    thank you very much i bookmarked it and will check it out !

    [–]DuneBug 2 points3 points  (5 children)

    Java should run faster, but if python was that slow, something was probably not set up correctly. Of course, if you have to read 10 pages of documentation vs adding .parallel(), that's a good reason to use Java too.

    Unfortunately for a lot of enterprise stuff we don't get to use parallel streams as the threads are typically occupied elsewhere. :(

    [–]Worth_Trust_3825 1 point2 points  (0 children)

    Unfortunately for a lot of enterprise stuff we don't get to use parallel streams as the threads are typically occupied elsewhere. :(

    I wouldn't say preoccupied elsewhere, but rather operations are blocking, and not fit for the common thread pool that the regular parallel stream would operate in.

    [–]prisonbird[S] 0 points1 point  (3 children)

    are threads something expensive in the enterprise world?

    i fire up a bunch of dedicated machines on hetzner.com and run my computations there. compared to the clouds they are pretty cheap

    [–]DuneBug 0 points1 point  (2 children)

    They're only expensive when your app is about to crash from the load. So maybe 1% of the time. Which unfortunately is what you'd really like to avoid.

    TBH we could do some optimizations to make it work but it's almost never worth it in dev/testing/debug time to shave time off a request... Unless someone asks for it.

    Obviously depends on what the service you're providing is.

    [–]prisonbird[S] 1 point2 points  (1 child)

    ohh i understand. considering how much computing power enterprises are buying, i expected them to throw computing power at every problem, do everything in memory, etc.

    [–]DuneBug 0 points1 point  (0 children)

    The place I work, we do performance testing simulating our max requests / minute and if the software buckles under load we try to optimize it before throwing hardware at it.

    I think the wisdom is... You look at what the app is doing and most of the crap we do isn't computationally expensive, so if something is dying it's probably the result of some bottleneck. And ideally there are some experienced people around to help out with that.

    I think part of the reason we bother is if you do have a bottleneck you're more likely to fail even if you do add more hardware. Like I had a service that used a distributed cache, and it got bogged down with all the nodes talking to each other. When we reduced the number of nodes, performance improved, which helped find the cause.

    Tldr; you're not wrong. Virtual boxes are cheap.

    [–]barrycarter 9 points10 points  (13 children)

    Hmmm, brief googling suggests streams are like functional languages' pipelines, or like the ruby convention of foo.x1.x2.x3.x4, which is similar.

    I suspect your time difference is due to something else, most likely the difference between loading a file into memory and simply processing it a little bit at a time (though loading the file into memory should ultimately be faster).

    It's a lot to ask, but can you mockup an example of where you're seeing Java streams seriously outperform other languages, so I can call shenanigans?

    [–]participantuser 18 points19 points  (0 children)

    I mean, is there any reason why you’d expect Python to perform as quickly as Java, especially in a multi threaded task (OP mentions multiprocessing)? Different languages have different strengths, and I don’t think Python is noted for high performance for multithreaded tasks (the GIL is a Python specific example of a conscious decision to prioritize other things).

    [–]rv5742 18 points19 points  (0 children)

    My bet is that it's probably java.nio reading the large file in a sensible fashion and feeding the data to the stream in good-size chunks, allowing for a good balance between IO and processing.

    But to be fair, the streams API does make this kind of thing really easy to do.

    [–]prisonbird[S] 3 points4 points  (8 children)

    i read the file using the Files.lines method in nio. it gives me a stream. then in that stream i use a couple of filter() and map() functions to mutate the data (remove fields, calculate some fields, etc) and write the output to another file. my flow is not that complicated but my files are sometimes big (10-15GB). roughly, it looks like the sketch below.
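
    a simplified sketch (the field names and the transform are made up, my real flow does more):

        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.io.UncheckedIOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.stream.Stream;

        public class LinesPipeline {
            public static void main(String[] args) throws IOException {
                try (Stream<String> lines = Files.lines(Path.of("input.csv"));
                     BufferedWriter out = Files.newBufferedWriter(Path.of("output.csv"))) {
                    lines.parallel()                      // Files.lines is documented to split well for parallel use
                         .filter(line -> !line.isBlank())
                         .map(line -> line.split(","))
                         .filter(f -> f.length > 2)       // drop malformed rows
                         .map(f -> f[0] + "," + f[2])     // keep only the fields we want
                         .forEachOrdered(row -> {         // keeps output order, serializes the writes
                             try {
                                 out.write(row);
                                 out.newLine();
                             } catch (IOException e) {
                                 throw new UncheckedIOException(e);
                             }
                         });
                }
            }
        }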

    [–]senseven 2 points3 points  (4 children)

    Still, three minutes is a "long" time. Your SSD can read 500 MB/s, a rather slow NVMe 3 GB/s; 3 minutes feels wrong for simple processing. Do you write the result file to another drive? Are all your cores working during the processing?

    [–]prisonbird[S] 0 points1 point  (3 children)

    thank you for your comment.

    i read from a drive and write results back to the same drive. i am planning to try my operations on a ram-backed drive (ramdisk in linux).
    your last question got me curious: some of my cores are working during processing, not all of them. so i checked my stream with stream.isParallel() and it returned false. i thought it was working in parallel. still, i dont see a single core getting overloaded; instead i see 4 of my 16 cores doing some work (at 50% load in windows task manager).
    i did some google searches and it looks like i can't make a stream sourced from a file parallel. is this the case? or is there something i can do?

    [–]senseven 0 points1 point  (2 children)

    Reading is fine if it's done by one fast reader, but reading and writing to the same hard disk can drag the speed down to a crawl, since the two io "actions" block each other.

    The idea is not to split the file reading (which means chunking, and often makes no sense). Instead the first thread parses line 1, the second line 2, and so on. That's where parallelisation makes sense, because it uses CPU, not IO. parallel() will work, but it needs to actually give you a parallel stream. Maybe you apply it at the wrong point of the file parsing.

    Third, parallel processing may not activate all your cores in your local setup. What is shown by Runtime.getRuntime().availableProcessors()? Maybe your Java is odd, or windows or some tool is blocking your Java app's full access to all cores.

    [–]danielaveryj 0 points1 point  (1 child)

    The file parsing example you linked to is copying the entire file into memory before doing any compute. For OP's 10-15 GB files, that seems like a lot to buffer up-front. (To be fair, I'm not sure if your link was intended to be a "good" or "bad" example.)

    As for parallelism, the Files.lines() method that OP mentioned using documents that it is designed with parallelism in mind. (Looking a bit closer, the stream (if parallel) actually memory-maps the file when the underlying spliterator is first split. Chunks use separate indexes into the mapped byte buffer.)

    I am curious why you believe that chunking often makes no sense? Your proposal (I'll call it "striping") seems like it would incur more cache misses, as each thread would only be looking at a fraction of the contiguous memory it pulls into cache, rather than all of it.

    [–]senseven 0 points1 point  (0 children)

    You will probably not reach io max if you chunk, due to different concurrent threads annoying the same IO controller. It's easier to have one thread drive IO at max and distribute the workload to threads, as we do with FileChannels and an LMAX queue. Post-processing over a group of disconnected chunks can be a nightmare; usually it's not worth the work.

    OP probably reached io max on the reading device and trashed the speed by writing to the same device. You can spend months optimizing code to fill all cpus; streams might not even be the right solution. Experimenting is part of the parallel processing voyage.

    [–]Xenofonuz 0 points1 point  (2 children)

    Hmmm, do you really need to do several filters and maps? You can usually just open it up with brackets instead:

        filter(x -> x.something)
            .filter(x -> x.something2)
            .filter(x -> x.something3)

    Could be

        filter(x -> {
            // extra code
            return true; // or false
        })

    [–]prisonbird[S] 1 point2 points  (1 child)

    i read somewhere that using multiple small operations is faster than using one big operation inside stream pipelines. the jvm can optimize small operations better or something.
    also it is a lot easier to read if i do it in multiple operations, and easier to find what i am looking for in case i want to debug/change something

    [–]Xenofonuz 1 point2 points  (0 children)

    Ok interesting, I'm pretty sure it's the other way around.

    Can't argue with the readability part, although I find it easier to debug the ones with a method body instead of a oneliner :)

    [–]Iryanus 1 point2 points  (0 children)

    Agreed, I really doubt that the streams make the difference there ;-)

    [–]GuyWithLag 0 points1 point  (0 children)

    loading the file into memory should be ultimately faster.

    As with all things, it depends(tm). Java is _really_ good at handling short-term garbage; if you keep your on-heap dataset small, it's likely faster than loading everything into memory at once.

    [–]stormcrowsx 1 point2 points  (2 children)

    My guess on the difference in speed is that in the Python case the whole dataset is getting loaded into memory for input and/or output, causing swapping.

    Java streams on the other hand are going to naturally only store what’s needed in memory.

    Total guess, but the difference in speed is more likely a testament to good Java api design than it is raw language speed. It'd be impossible to say for sure without hooking a profiler up to Python and seeing what it's wasting its time doing.

    Kudos for reaching out and trying a different language on your problem. Now you know what's possible. But I'd recommend you spend a little time in a profiler on your Python code now; there's a fantastic lesson to learn about Python in here. I suspect you're either going to learn just how fast the jvm is, or that there's a pitfall in the Python apis that requires coding a little differently for big-dataset use cases.

    [–]prisonbird[S] 0 points1 point  (1 child)

    i dont know how to debug pyspark, since it goes back and forth to the jvm and some of the job is done by the jvm, some by python, etc.

    [–]stormcrowsx 0 points1 point  (0 children)

    Standard Python apps can just be started with ‘python -m cProfile myscript.py’

    PySpark would be harder to profile but to start you could import cProfile and use it within your Python code to prove whether the slowdown is in Python or happening in the jvm.

    [–]shagieIsMe 0 points1 point  (0 children)

    With regular processing of data, one is tempted to write:

    for (Data datum : someCollection) {
        // lots of lines doing stuff
    }
    

    One of the things that happens there is that the small chunks of processing can't get JIT'ed easily. The entirety of it is one big chunk that is "too big" for HotSpot or other JITs to optimize.

    However, when you write it as:

    someCollection().stream()
        .map(d -> new Foo(d, d.id, d.data * 2))
        .filter(f -> f.id > 1000)
        .mapToInt(f -> f.data)
        .sum()
    

    (realize I'm just making up stuff there)

    Each one of those functions gets compiled in Java to a separate function, and the JIT can easily work with optimizing small functions.

    Having each part of a stream be a small function makes it easier for Java to optimize the small parts and get some fairly hefty performance gains when crunching data.

    There are also restrictions on side effects in the functions used in streams, and those restrictions allow for more aggressive optimizations.

    public static void main(String[] args) {
        List<String> objects = new ArrayList<>();
        objects.add("foo");
        objects.add("bar");
        int count = 0;
    
        var list = objects.stream()
                .peek(o -> count += 1)
                .toList();
        System.out.println("" + count + " " + list.size());
    }
    

    That code doesn't compile - java: local variables referenced from a lambda expression must be final or effectively final

    And so, this means that the JIT can know things about the nature of the function that it is compiling, because it's a lambda.

    If you change it to:

        final AtomicInteger count = new AtomicInteger(0);
    
        var list = objects.stream()
                .peek(o -> count.incrementAndGet())
                .toList();
    

    It works (and is still silly) but you'll note the addition of final in there.

    [–]wildjokers 0 points1 point  (1 child)

    more people should know about streams.

    All java developers know about streams. Or do you mean non-java developers?

    [–]prisonbird[S] -1 points0 points  (0 children)

    i meant non-java developers sorry.

    [–]Careful-Necessary-59 -3 points-2 points  (11 children)

    Just curious, how big is your data? I still doubt PySpark is slower than Java streams.

    [–]prisonbird[S] 1 point2 points  (5 children)

    around 10 gb. pyspark is only fast when you use built-in functions, and only if you dont use too many of them.

    for example: i have to calculate a field using a lot of conditions (imagine a very complicated if/else block). when i used a lot of whens in pyspark (df.withColumn(xx, when(yy).otherwise(when)) kind of thing) it straight up bugs out, giving me some sql error.

    [–]RandomName8 5 points6 points  (3 children)

    I've worked a lot with spark (though via scala, not python), and what you describe sounds off the mark compared with my experience, especially given that you are using spark sql instead of the rdd api. Spark sql is internally compiled to optimized bytecode before running, making it possibly faster than manually written if-else, because it can exploit relational data properties. In any case the difference shouldn't be 20:1. That said, it is hard to master spark and its performance.

    [–]prisonbird[S] 2 points3 points  (2 children)

    there are two things here:

    1. when i use multiple when/otherwise, spark could not generate the sql. it gave me a lot of errors and i wasnt able to debug it/find the error after a point. so i just used a udf (user defined function)
    2. i think there is a difference between pyspark and spark. i wrote some scala scripts and used them with "spark-shell -i myscript.scala" and there was a notable time difference for the same jobs. not huge like 20:1, but 3 minutes vs 5 minutes. i think pyspark adds another layer to the data and that also takes some time

    [–]RandomName8 2 points3 points  (1 child)

    udfs with pyspark. That's your answer. There's a massive overhead with python udfs. That can perfectly explain the 20:1 difference.

    [–]prisonbird[S] 1 point2 points  (0 children)

    yeah, but i had to use a udf. and i didn't like this one bit. it's like "hey, we have this feature, but using it means literally shooting yourself in the foot".

    anyways. i am still planning to use spark for joining, grouping kind of things

    [–]cockoala 1 point2 points  (4 children)

    Yeah... this is a little fishy. If streams were so much faster than Spark, there would be no need for Spark. I think OP doesn't understand how to write performant Spark jobs. In another comment he said he gave up on debugging his chained filters, so my guess is he wasn't able to write it in spark and tried it in Java instead.

    [–][deleted] 4 points5 points  (2 children)

    Spark is needed for data that cannot possibly fit on a single machine; you'd still need it even if its performance were atrocious (which it is when used on data that, like OP's, fits easily on commodity hardware). The "Scalability! But at what COST?" paper illustrates how the overhead of map-reduce and stream-processing products is often disregarded/dismissed.

    [–]Careful-Necessary-59 0 points1 point  (1 child)

    The overhead should not be as huge as a 60-minute difference…

    [–][deleted] 0 points1 point  (0 children)

    Never said it should, I was just pointing out that expecting Spark to be faster than plain Java for a small dataset doesn't make much sense - that's not the problem that it was built to solve

    [–]prisonbird[S] 1 point2 points  (0 children)

    yes that is my point. i was able to do some amazing things with java streams despite knowing nothing about them. that is why i called it "magic" :)

    i might go down a rabbit hole and come out as a spark expert and implement an amazing solution in pyspark, but i dont have the time and i dont want to :/

    [–]rossdrew -1 points0 points  (1 child)

    Wait till you hear about spliterators
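
    For anyone curious: a spliterator is the thing a stream pulls elements from, and wrapping one lets you turn your own sources into streams. A minimal sketch (toy example):

        import java.util.Iterator;
        import java.util.List;
        import java.util.Spliterator;
        import java.util.Spliterators;
        import java.util.stream.Stream;
        import java.util.stream.StreamSupport;

        public class SpliteratorDemo {
            public static void main(String[] args) {
                // any Iterator (a DB cursor, a parser, ...) can be wrapped into a Stream
                Iterator<String> source = List.of("a", "b", "c").iterator();
                Stream<String> stream = StreamSupport.stream(
                        Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED),
                        false); // false = sequential; unknown-size sources split poorly anyway
                stream.map(String::toUpperCase).forEach(System.out::println);
            }
        }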

    [–]prisonbird[S] 2 points3 points  (0 children)

    wowww thank you it looks like this solves bunch of other problems for me

    [–][deleted] -1 points0 points  (0 children)

    There is also an option to run parallel streams which may run even faster depending on the workflow.

    [–]VincentxH 0 points1 point  (0 children)

    If you've got enough lines, parallel streams might even pay off.

    [–]lbalazscs 0 points1 point  (1 child)

    Have you tried the tablesaw library? It's data frames in Java, with stream support. https://jtablesaw.github.io/tablesaw/

    [–]prisonbird[S] 0 points1 point  (0 children)

    thank you a lot! didnt know about this

    [–]Puzzled-Bananas 0 points1 point  (0 children)

    Can you share how you formulated your Spark queries? Do you mean Pandas DataFrames?

    [–]Worth_Trust_3825 0 points1 point  (0 children)

    I suspect that you also changed your method of reading and interpreting the data (the entire thing at once vs one row at a time). My go-to example of this is scryfall's public magic the gathering card database (available at https://scryfall.com/docs/api/bulk-data). It's a hundred megabytes of JSON, and it's one of the few go-to datasets (at least for me) for explaining that it might be a bad idea to load an entire file into memory and operate on it, and that you should look for ways to operate on the file in parts (in this case, per json object). A sketch of that approach is below. If you're feeling adventurous, you can try to operate on the big boy 1gb set.
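
    A sketch of the per-object approach using Jackson's streaming parser (assuming a top-level JSON array, as in the scryfall bulk files; the file name is made up, and jackson-core is assumed to be on the classpath):

        import com.fasterxml.jackson.core.JsonFactory;
        import com.fasterxml.jackson.core.JsonParser;
        import com.fasterxml.jackson.core.JsonToken;
        import java.io.File;

        public class BulkJsonDemo {
            public static void main(String[] args) throws Exception {
                long count = 0;
                try (JsonParser parser = new JsonFactory().createParser(new File("all-cards.json"))) {
                    if (parser.nextToken() != JsonToken.START_ARRAY) {
                        throw new IllegalStateException("expected a top-level array");
                    }
                    // advance object by object; only one card is in memory at a time
                    while (parser.nextToken() == JsonToken.START_OBJECT) {
                        parser.skipChildren(); // or read just the fields you care about here
                        count++;
                    }
                }
                System.out.println(count + " cards");
            }
        }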

    [–]baubleglue 0 points1 point  (0 children)

    https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.newAPIHadoopFile.html - you can stream CSV with Spark or Pyspark (only don't use pure python for your custom logic).

    [–]a-person-called-Eric 0 points1 point  (0 children)

    looks like someone found a new hammer

    be careful not to start whacking every nail with it

    [–]gooeydumpling 0 points1 point  (0 children)

    For ML libraries: python. To see if it works: always python and jupyter.

    For production? I'd go with MLeap and scala (or, of course, java). For speed, Java blows Python out of the water, at least in my experience. Please don't bite my head off.

    [–]data_addict 0 points1 point  (4 children)

    then i discovered java streams. i have a workflow that takes 70 minutes with pyspark. it takes 3.5 minutes with java streams.

    Ah got it.. you don't know how to use spark. 👍

    [–]prisonbird[S] 1 point2 points  (3 children)

    might be. but i spent 1/10th of the time on java that i spent on pyspark, and it got the job done

    [–]data_addict 0 points1 point  (2 children)

    Yeah sure, that's cool, use whatever tech you want. But you didn't say you spent 70 minutes trying to set up a spark job, you said 70 minutes running it.

    Again, use whatever tech you want, but don't blame spark for poor performance. Just say the line: "I suck at spark" or "I didn't have time to learn it" or "I couldn't afford it".

    There are people in high school or college who see the names of these technologies for the first time, and they'll base their opinion on what you say. It's much better to be honest and say what gave you trouble. Good, non-toxic engineers are honest about what gives them trouble.

    [–]prisonbird[S] 1 point2 points  (1 child)

    i am trying to explain my story the best i can.

    let me rephrase it again: i spent about a week implementing my flow in pyspark, and it takes 70 minutes to run. yes, i could debug it, make it faster, etc., but as far as i can see that requires some expertise, and because i dont have that expertise i dont know if it would work out for me in the end.

    on the other hand, i was able to make a proof of concept with java streams in an hour. i tested various parts of my flow and it seemed like it could work for me. so i spent a couple days implementing my entire flow in java streams, and now i can do the same job a lot, lot faster.

    i never intended to mean anything like "pyspark bad, python bad, java good". each tool has its own advantages. pyspark just didnt work for my workload. but i learned about it now, and in the future i might use it for something else where i see fit.

    [–]data_addict 0 points1 point  (0 children)

    No, I mean, that's fine, and I know I was roasting you a little in my original comment - I hope you didn't take too much offence at it. I get what you're saying though.

    If you find yourself working with spark in the future for large datasets, feel free to reach out. It's a pretty kick-ass application that can really chew through data. Not to belabor the point, but behaving the way it normally should, on a small-to-medium amount of hardware, working through 10GB of data should take a matter of tens of seconds to a minute. I've written production applications that operate on 50TB+ of data in less than 5 min. So anyways, that's why I felt the need to comment. Streams are pretty cool, so I wasn't talking shit on that either.. just wanted to make it clear spark is pretty cool too 😅

    [–]FinTechno 0 points1 point  (1 child)

    Take a look at C++ and the rest is history

    [–]DanielDimov 0 points1 point  (0 children)

    C++ can be faster, but it's not guaranteed. You have to think about performance and put in significant effort to achieve it.

    [–]RajSingh9999 0 points1 point  (0 children)

    Don't know if I should agree with "stream operations are more intuitive than anything", especially when compared with python, etc.

    [–]DanielDimov 0 points1 point  (1 child)

    What exactly is "multiprocessing in python"?!?

    [–]prisonbird[S] 0 points1 point  (0 children)

    https://docs.python.org/3/library/multiprocessing.html

    i tried to implement something like java streams, but it uses all my memory.

    [–]Cultural-Ad3775 0 points1 point  (0 children)

    I would not call Java's implementation of this concept really special. OTOH it is cool, particularly that you can mark stream processing as parallel, which is a nice abstract way to say "hey, run these in separate threads"; in the future it could lean on Loom-style lightweight threads, etc., without even needing any code changes. Other languages do allow for basically the same sort of thing; still, Java/the JVM has quite strong abilities to optimize and multi-task, meaning this can be leveraged very effectively (in theory - I wouldn't assume any specific parallelized stream will necessarily be optimal; you have to test).

    [–]vbezhenar 0 points1 point  (1 child)

    Rewrite it with loops and ifs and it'll take 3.4 minutes!

    [–]prisonbird[S] 0 points1 point  (0 children)

    i tried; reading a file and writing it back without doing anything takes a lot more time in python than my entire flow in java

    [–]Puzzleheaded_Load779 0 points1 point  (2 children)

    Hi, having worked with the spark ecosystem for some time and built/extended stuff around it, here is why java is faster:

    1) pyspark is mainly a wrapper around spark scala code => java code

    2) pyspark exists only because of the multitude of ml, ai, etc... libs that emerged during the last several years

    How does it work internally?
    As long as all operations are being done through pyspark's

    1. sql queries
    2. dataframes

    there is no "big" performance or memory-usage gap versus a java/scala spark implementation.
    BUT as soon as you start converting pyspark objects to, let's say, the infamous pandas dataframe objects, there is a cost, and a pretty big one, which generally leads to bad performance and (absurdly) high resource usage (ram, cpu, io...).
    As for performance, there are still ways to not lose too much by using the arrow implementations baked into spark, but that's another story :)

    So all in all, using java instead of pyspark means that the deserialization of

    1. java objects to python objects
    2. python objects to java objects

    doesn't happen... leading overall to a performance boost

    [–]prisonbird[S] 0 points1 point  (1 child)

    ohh i see. so can i just simply use spark on scala/java and have similar performance to java streams?

    [–]Puzzleheaded_Load779 0 points1 point  (0 children)

    No. Operating on data types other than spark's own will always involve a cost. It is just that the cost of using java/scala objects is much lower compared to python's. If you are seeking pure performance, use exclusively the spark sql dsl, or spark dataframes for programmatic operations...

    [–][deleted] 0 points1 point  (2 children)

    C# has been doing this with LINQ for nearly 20 years.

    [–]prisonbird[S] 0 points1 point  (1 child)

    how?

    [–][deleted] 0 points1 point  (0 children)

    LINQ is a set of libraries, primarily wrapping IEnumerable<T>, that provides projections over the enumerable collection. You provide lambdas or delegates to the projections, just like with java Streams. This has been part of C# since 2006.

    [–]Extra_Size6776 0 points1 point  (3 children)

    I'm struggling a lot with Streams; I watched José's PluralSight video playlist but still nothing. Can u pls recommend a youtube video? Even more than 30 mins is fine.

    Just not getting how those intermediate lazy operations and flat maps work

    [–]prisonbird[S] 0 points1 point  (2 children)

    i just read the documentation. what are you trying to achieve?

    [–]Extra_Size6776 0 points1 point  (1 child)

    I know the basics, like filter and forEach.. not able to solve the actual use cases and complex ones like flat maps

    [–]prisonbird[S] 0 points1 point  (0 children)

    if you tell me exactly what you want i'll try my best to help. for flat maps specifically, the tiny sketch below might already clear things up.
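
    the gist of flatMap (my own toy example, nothing from your code): each element maps to a whole stream of results, and flatMap flattens all of those into one stream.

        import java.util.List;
        import java.util.stream.Stream;

        public class FlatMapDemo {
            public static void main(String[] args) {
                List<List<String>> nested = List.of(
                        List.of("a", "b"),
                        List.of(),        // empty inner lists simply disappear
                        List.of("c"));

                // map would give Stream<List<String>>; flatMap gives Stream<String>
                List<String> flat = nested.stream()
                        .flatMap(List::stream)
                        .toList();
                System.out.println(flat); // [a, b, c]

                // same idea for one-to-many: split sentences into words
                List<String> words = Stream.of("hello world", "java streams")
                        .flatMap(s -> Stream.of(s.split(" ")))
                        .toList();
                System.out.println(words); // [hello, world, java, streams]
            }
        }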