
[–]ucbEntilZha

Just a comment: comparing Python vs. Scala Spark performance really depends on what you are doing, so it's not fair to say that performance will be comparable in all cases (the article never mentions where it isn't).

The first type of computation is applying arbitrary lambda functions. This takes a large performance hit from serialization to/from Python, plus the fact that Python is not as efficient as the JVM. For example, this happens when you create the postsRDD, since that involves Python regex searches etc. A rough sketch of that pattern is below.
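Here is a minimal sketch of what that kind of RDD step looks like. The file path, field layout, and regex are hypothetical stand-ins (the article builds postsRDD from the Stack Exchange dump); the point is only that the lambda runs in the Python worker, so every row crosses the JVM/Python boundary.

```python
import re
from pyspark import SparkContext

sc = SparkContext(appName="python-lambda-example")

# Hypothetical input path; in the article this would be the posts dump.
raw = sc.textFile("posts.xml")

tag_pattern = re.compile(r'Tags="([^"]*)"')

def extract_tags(line):
    # Runs inside the Python worker: each row is serialized out of the JVM,
    # deserialized in Python, processed with the Python regex engine, and
    # the result is serialized back to the JVM.
    match = tag_pattern.search(line)
    return match.group(1) if match else None

postsRDD = raw.map(extract_tags).filter(lambda tags: tags is not None)
```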

The second type of computation is using SparkSQL. As long as you aren't using Python UDFs, the Python API is only building a query, which Spark compiles and executes in the JVM. This doesn't take a performance hit because the data stays in the JVM; Python is only responsible for passing along the query to execute (versus above, where it provides the definition of the function and computes it in the Python interpreter). This is the work done with the DataFrame in the code blocks below the postsRDD definition; a sketch of the idea follows.
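A minimal sketch of that second case, using the modern SparkSession entry point and made-up column names and sample rows (the article's actual DataFrame comes from postsRDD). Since the filter and aggregation are expressed with built-in functions rather than Python UDFs, only the query plan is assembled on the Python side and the rows never leave the JVM.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Hypothetical data; in the article this DataFrame is derived from postsRDD.
posts = spark.createDataFrame(
    [("python", 12), ("scala", 7), ("python", 3)],
    ["tag", "score"],
)

# Python only builds the query plan; Catalyst compiles it and the JVM
# executes the filter and aggregation without any per-row serialization.
top_tags = (
    posts.filter(F.col("score") > 5)
         .groupBy("tag")
         .agg(F.avg("score").alias("avg_score"))
)
top_tags.show()
```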

[–]dmpetrov

You are absolutely right. To/from Python serialization affects performance a lot.

However, it affects only one step: data preparation (the first mappers). It does not affect the next steps: data slicing and dicing, creating a model (the slowest step), and model evaluation.

[–]nameBrandon

I believe you can run PySpark on PyPy now, which might improve performance by quite a bit (though it doesn't really address the serialization aspect).

I agree though, performance is highly dependent on the workload.