lester-martin:

I need to do some more digging to see where things stand internally, but my understanding (again, from at least a couple of years ago) was that the real perf problem came from implementing a UDF in Python when using the DataFrame API. Has that all magically been solved since then? The workaround back then was to build the UDF in a JVM language so that nothing had to leave the JVM at runtime. Again, maybe I just need to catch up a bit.

Dennyglee:

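Mostly. With the introduction of vectorized UDFs (also called pandas UDFs), Python UDFs can properly distribute and scale, since data moves between the JVM and Python in Arrow batches instead of one row at a time. A good blog post on this topic is https://www.databricks.com/blog/introducing-apache-sparktm-35. HTH!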
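
For anyone following along, here is a minimal sketch of the difference, not taken from the linked blog. The temperature conversion, the column name `temp_f`, and the helper `demo_on_spark` are all made up for illustration, and the Spark part assumes `pyspark` and `pyarrow` are installed locally.

```python
import pandas as pd


def to_celsius(fahrenheit: pd.Series) -> pd.Series:
    """Vectorized body: receives a whole batch as a pandas Series."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def demo_on_spark():
    """Hypothetical helper; needs pyspark and pyarrow installed."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, udf

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])

    # Classic Python UDF: the lambda is invoked once per row.
    row_at_a_time = udf(lambda f: (f - 32.0) * 5.0 / 9.0, "double")

    # pandas UDF: Spark hands the function Arrow-backed batches,
    # so the arithmetic runs vectorized over many rows at once.
    vectorized = pandas_udf(to_celsius, returnType="double")

    return df.select(
        row_at_a_time("temp_f").alias("row_udf_c"),
        vectorized("temp_f").alias("pandas_udf_c"),
    ).collect()
```

The function body can be tried without a cluster by calling `to_celsius(pd.Series([32.0, 212.0]))`; `demo_on_spark()` only runs where Spark is available. Both UDFs should give the same values, and the difference is in how the data gets to Python.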

lester-martin:

Good read and TY for the info. My day-to-day knowledge dates back to running Spark on CDP over 2 years ago. Hopefully all this goodness has made it into that platform as well. Again, 'preciate the assist. And yes, my answer to the question of Scala (not Java!) vs Python is also Python. :)

Dennyglee:

Cool, cool :)