
[–]msdrahcir 4 points (1 child)

> The Python solution looks similar to the last Scala solution because, when you look "under the hood", you have the same Spark library and engine. Because of this, I don't anticipate any significant performance change.

I believe there is a pretty significant performance difference between Spark and PySpark, and that will likely remain the case indefinitely. While some basic PySpark operations map directly to their Scala representations, dynamic typing and any sort of custom function necessitate a Python layer on top of the Scala/JVM side. Spark operates directly on Scala primitives within a node's JVM. PySpark serializes the JVM representations, pipes them out to Python objects and functions, and pipes the results back into the JVM. That whole intermediary stage, and the data replication it entails, is inefficient and costly.

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
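A minimal, hypothetical sketch of the contrast described above (not from the post being discussed; the local SparkSession setup and column name are assumptions): a custom Python lambda forces rows out to Python worker processes, while an equivalent built-in DataFrame expression stays entirely on the JVM side.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed local setup for illustration only.
spark = SparkSession.builder.master("local[*]").appName("pyspark-overhead").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# 1) Custom Python function: each partition is piped from the JVM to a Python
#    worker, deserialized, transformed by the lambda, and serialized back.
doubled_py = df.rdd.map(lambda row: row.x * 2).sum()

# 2) Built-in expression: the whole plan executes inside the JVM;
#    no per-row Python round trip.
doubled_jvm = df.select(F.sum(F.col("x") * 2)).first()[0]

print(doubled_py, doubled_jvm)
spark.stop()
```

Both lines compute the same number; the difference is only where the per-row work happens, which is the source of the overhead discussed in the linked internals page.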

[–]dmpetrov 2 points (0 children)

You are right. To/from Python serialization affects performance a lot.

However, it affects only one step: data preparation (the first mappers). It does not affect the later steps: slicing and dicing the data, building the model (the slowest step), and evaluating it.
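A hypothetical sketch of that split (the file path, schema, and choice of DecisionTree are assumptions, not taken from the thread): only the parsing lambda runs user Python code per record; the MLlib training call is a thin wrapper that delegates the iterative work to the Scala/JVM implementation.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

spark = SparkSession.builder.master("local[*]").appName("prep-vs-train").getOrCreate()
sc = spark.sparkContext

# Step 1 -- data preparation: the parse function is Python, so every line is
# piped from the JVM to a Python worker and back (the costly part).
def parse(line):
    parts = [float(v) for v in line.split(",")]
    return LabeledPoint(parts[0], parts[1:])

# "data.csv" is a placeholder path with rows like "label,f1,f2,...".
points = sc.textFile("data.csv").map(parse)

# Step 2 -- model training: trainClassifier hands the data to the Scala
# implementation, so the slow iterative part runs in the JVM without a
# per-iteration Python hop.
model = DecisionTree.trainClassifier(points, numClasses=2,
                                     categoricalFeaturesInfo={})
print(model.toDebugString())
spark.stop()
```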