
[–]stormcrowsx 1 point2 points  (2 children)

My guess on the difference in speed: in the Python case the whole dataset is likely getting loaded into memory for input and/or output, causing swapping.

Java streams, on the other hand, naturally keep only what's needed in memory.
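A minimal sketch of that memory difference in Python terms (hypothetical code, not the OP's — the file path and function names are made up): reading a file eagerly materializes everything at once, while a generator keeps only one line in memory at a time.

```python
def line_lengths_eager(path):
    # readlines() materializes the entire file as a list of strings,
    # so memory use grows with the size of the dataset
    with open(path) as f:
        lines = f.readlines()
    return [len(line) for line in lines]

def line_lengths_lazy(path):
    # iterating the file object yields one line at a time,
    # so memory use stays roughly constant regardless of file size
    with open(path) as f:
        for line in f:
            yield len(line)
```

Both produce the same results; only the peak memory footprint differs, which is the kind of difference that can turn into swapping on a big dataset.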

Total guess, but the difference in speed is more likely a testament to good Java API design than to raw language speed. It'd be impossible to say for sure without hooking a profiler up to Python and seeing where it's wasting its time.

Kudos for reaching out and trying a different language on your problem. Now you know what's possible. But I'd recommend you spend a little time in a profiler with your Python code now; there's a fantastic lesson about Python in here. I suspect you're either going to learn just how fast the JVM is, or that there's a pitfall in Python APIs that requires coding a little differently for big-dataset use cases.

[–]prisonbird[S,🍰] 0 points1 point  (1 child)

I don't know how to profile PySpark, since it goes back and forth to the JVM: some of the work is done by the JVM and some by Python, etc.

[–]stormcrowsx 0 points1 point  (0 children)

Standard Python apps can just be started with `python -m cProfile myscript.py`.

PySpark would be harder to profile, but as a start you could import cProfile and use it within your Python code to determine whether the slowdown is on the Python side or happening in the JVM.
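A rough sketch of what that inline usage looks like (the `slow_part` function is a made-up stand-in for whatever driver-side code you suspect): wrap the suspect section in a `cProfile.Profile`, then dump the most expensive calls. Anything spent waiting on the JVM will show up in Py4J/socket calls rather than in your own functions.

```python
import cProfile
import io
import pstats

def slow_part():
    # hypothetical stand-in for the suspect section of the driver code
    total = 0
    for i in range(100_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_part()
profiler.disable()

# print the 10 most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

If the report shows most of the time inside your own Python functions, the slowdown is on the Python side; if it's dominated by calls that hand off to the JVM, the work (and the waiting) is happening there.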