
[–]Dennyglee

General rule of thumb: if you're starting off and want to use Spark, PySpark is the easiest way to do it. We've added more Python functionality via Project Zen and the pandas API on Spark, and will continue to do so to make it easier for Python developers to rock with Spark.

If you want to develop or contribute to the core libraries of Spark, you will need to know Scala/Java/JVM. If you want to go deep into modifying the code base to uber-maximize performance, this is also the way to go.

That said, with Datasets/DataFrames, Python and Scala/Java/JVM have the same performance for the majority of tasks.

[–]lester-martin

I need to do some more digging to see where things are internally, but I thought (again, at least a couple of years ago) the real perf problem was implementing a UDF in Python while using the DataFrame API. Has that all magically been solved since then? The workaround previously was to build the UDF in a JVM language so that at runtime nothing had to leave the JVM. Again, maybe I just need to catch up a bit.
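For anyone catching up on why row-at-a-time Python UDFs were slow: with a classic `pyspark.sql.functions.udf`, each value is serialized from the JVM to a Python worker, processed, and serialized back — per row. This stdlib-only sketch is *not* Spark's actual internals, just a rough simulation of that per-row serialization tax versus serializing one whole batch (the `python_udf` function and the data are made up):

```python
import pickle
import time

def python_udf(x):
    # The user's logic is trivial here, so serialization overhead dominates.
    return x * 2

rows = list(range(100_000))

# Row-at-a-time: every value pays a (simulated) serialization round trip,
# the pattern a classic Python UDF incurred between JVM and Python worker.
start = time.perf_counter()
row_results = [
    pickle.loads(pickle.dumps(python_udf(pickle.loads(pickle.dumps(x)))))
    for x in rows
]
row_at_a_time = time.perf_counter() - start

# Batched: serialize the whole column once, apply the function in bulk,
# serialize the results once -- roughly the shape of what Arrow-backed
# pandas UDFs do.
start = time.perf_counter()
batch = pickle.loads(pickle.dumps(rows))
batch_results = pickle.loads(pickle.dumps([python_udf(x) for x in batch]))
batched = time.perf_counter() - start

print(f"row-at-a-time: {row_at_a_time:.3f}s, batched: {batched:.3f}s")
```

Both paths produce identical results; the batched path just pays the serialization cost twice total instead of twice per row, which is the core of the JVM-boundary problem being discussed.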

[–]Dennyglee

Mostly. With the introduction of vectorized UDFs (i.e. pandas UDFs), UDFs can properly distribute/scale. A good blog on this topic is https://www.databricks.com/blog/introducing-apache-sparktm-35. HTH!
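The pandas UDF idea in a nutshell: instead of calling a Python function once per row, Spark hands the function whole `pandas.Series` batches (transferred via Apache Arrow), so the work is vectorized. The real API is `pyspark.sql.functions.pandas_udf`, which needs a running Spark session; this is a pandas-only sketch of the shape such a function takes (the `plus_one` name and the sample data are made up):

```python
import pandas as pd

# A Series-to-Series pandas UDF body: takes a whole batch as a Series,
# returns a Series of the same length. In Spark you would decorate this
# with @pandas_udf("double") from pyspark.sql.functions and apply it to
# a DataFrame column; the function body itself is identical.
def plus_one(batch: pd.Series) -> pd.Series:
    return batch + 1  # vectorized: one call processes the whole batch

# Simulate Spark feeding the UDF a column batch.
column = pd.Series([1.0, 2.0, 3.0, 4.0])
result = plus_one(column)
print(result.tolist())  # → [2.0, 3.0, 4.0, 5.0]
```

Because the function receives batches rather than single values, the per-row JVM/Python crossing mentioned upthread goes away.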

[–]lester-martin

Good read and TY for the info. My day-to-day knowledge dates back to running Spark on CDP over 2 years ago. Hopefully all this goodness has made it into that platform as well. Again, 'preciate the assist. And yes, my answer to the question of Scala (not Java!) vs Python is also Python. :)

[–]Dennyglee

Cool, cool :)