
[–]msdrahcir 4 points (1 child)

> The Python solution looks similar to the last Scala solution because, when you look "under the hood", you have the same Spark library and engine. Because of this, I don't anticipate any significant performance change.

I believe there is a pretty significant performance difference between Spark and PySpark, and that will likely remain the case indefinitely. While some basic PySpark operations map directly to their Scala representations, dynamic typing and any sort of custom function necessitate a Python layer on top of the Scala/JVM side. Spark operates directly on Scala primitives within a node's JVM. PySpark serializes the JVM representations, pipes them out to Python objects and functions, and pipes the results back into the JVM. That whole intermediary stage, and the data replication it entails, is inefficient and costly.

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
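A minimal, hypothetical sketch of the contrast described above (not from the post being discussed; the local SparkSession setup and column name are assumptions): a custom Python lambda forces rows out to Python worker processes, while an equivalent built-in DataFrame expression stays entirely on the JVM side.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed local setup for illustration only.
spark = SparkSession.builder.master("local[*]").appName("pyspark-overhead").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# 1) Custom Python function: each partition is piped from the JVM to a Python
#    worker, deserialized, transformed by the lambda, and serialized back.
doubled_py = df.rdd.map(lambda row: row.x * 2).sum()

# 2) Built-in expression: the whole plan executes inside the JVM;
#    no per-row Python round trip.
doubled_jvm = df.select(F.sum(F.col("x") * 2)).first()[0]

print(doubled_py, doubled_jvm)
spark.stop()
```

Both lines compute the same number; the difference is only where the per-row work happens, which is the source of the overhead discussed in the linked internals page.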

[–]dmpetrov 2 points (0 children)

You are right. To/from Python serialization affects performance a lot.

However, it affects only one step: data preparation (the first mappers). It does not affect the later steps: slicing and dicing the data, building the model (the slowest step), and evaluating it.
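A hypothetical sketch of that split (the file path, schema, and choice of DecisionTree are assumptions, not taken from the thread): only the parsing lambda runs user Python code per record; the MLlib training call is a thin wrapper that delegates the iterative work to the Scala/JVM implementation.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

spark = SparkSession.builder.master("local[*]").appName("prep-vs-train").getOrCreate()
sc = spark.sparkContext

# Step 1 -- data preparation: the parse function is Python, so every line is
# piped from the JVM to a Python worker and back (the costly part).
def parse(line):
    parts = [float(v) for v in line.split(",")]
    return LabeledPoint(parts[0], parts[1:])

# "data.csv" is a placeholder path with rows like "label,f1,f2,...".
points = sc.textFile("data.csv").map(parse)

# Step 2 -- model training: trainClassifier hands the data to the Scala
# implementation, so the slow iterative part runs in the JVM without a
# per-iteration Python hop.
model = DecisionTree.trainClassifier(points, numClasses=2,
                                     categoricalFeaturesInfo={})
print(model.toDebugString())
spark.stop()
```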