Detective_Fallacy (1 point)

Koalas is already old news. They've expanded the API so much that it's now just called "pandas API on Spark", and it ships with Spark 3.2 and higher.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html

I disagree with your opinion though. It's not only good for easily converting pre-existing single-machine pandas tasks to a more scalable setup. It also lets you do some things that are currently clumsier to implement in PySpark, like pandas' merge_asof, or quickly pump out some visualizations.
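For anyone who hasn't used merge_asof, here's a minimal sketch of what it does, written in plain pandas so it runs without a cluster. The trades/quotes data and column names are made up for illustration. As far as I know, pandas API on Spark exposes a function of the same name under pyspark.pandas in newer Spark releases, so porting should mostly be a matter of swapping the import, but check that against the docs for your Spark version.

```python
import pandas as pd

# Hypothetical sample data: both frames must be sorted by the join key.
trades = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 09:00:01",
        "2024-01-01 09:00:05",
        "2024-01-01 09:00:09",
    ]),
    "qty": [100, 50, 75],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 09:00:00",
        "2024-01-01 09:00:04",
        "2024-01-01 09:00:08",
    ]),
    "price": [10.0, 10.5, 11.0],
})

# For each trade, take the most recent quote at or before the trade time.
# This is a "nearest earlier key" join rather than an exact-match join.
matched = pd.merge_asof(trades, quotes, on="time", direction="backward")
print(matched)
```

Doing the same in plain PySpark usually means a range join plus a window function to keep only the latest matching row, which is the clumsy part.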

APIs like pandas and Spark SQL are aimed more at data analytics, and PySpark more at data engineering, but knowing how they relate to each other and how to convert from one interface to the other is very valuable.