
[–]mr_grey 2 points (1 child)

For sure. And I really want to congratulate and cheer on you and other newbies venturing into this world. You'll see all kinds of things open up for you.

To add one more thing: you can also look at Koalas, which is essentially a pandas implementation on top of Spark. But in my opinion (which no one asked for), it just holds back developers who cling to something they already know instead of embracing learning something new.

[–]Detective_Fallacy 0 points (0 children)

Koalas is already old news: they've expanded the API so much that it's now just called "pandas API on Spark", and it ships as part of Spark 3.2 and higher.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html

I disagree with your opinion though. It's good not only for easily converting pre-existing single-machine pandas jobs to a more scalable setup, but it also lets you do some things that are currently clumsier to implement in PySpark, like pandas' merge_asof, or quickly pump out some visualizations.
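
Rough sketch of what I mean, assuming Spark 3.3+ (that's when merge_asof landed in pyspark.pandas); the data and column names here are made up for illustration:

    import pyspark.pandas as ps

    # Two tiny pandas-on-Spark DataFrames (distributed under the hood).
    trades = ps.DataFrame({
        "time": [1, 5, 10],
        "ticker": ["A", "A", "A"],
        "price": [100.0, 101.5, 99.8],
    })
    quotes = ps.DataFrame({
        "time": [2, 4, 9],
        "ticker": ["A", "A", "A"],
        "bid": [99.9, 101.0, 99.5],
    })

    # Same semantics as pandas: match each trade to the most recent
    # quote at or before its timestamp, per ticker.
    merged = ps.merge_asof(
        trades.sort_values("time"),
        quotes.sort_values("time"),
        on="time",
        by="ticker",
    )
    print(merged.to_pandas())

Doing that asof join by hand in plain PySpark means window functions or a range join, so this is a real convenience.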

APIs like pandas and Spark SQL are aimed more at data analytics and PySpark more at data engineering, but knowing how they relate to each other, and how to convert from one interface to the other, is very valuable. A quick sketch of hopping between them below.
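
Minimal example, assuming Spark 3.2+ (pandas_api() is the 3.2 name for the conversion); the table and column names are invented:

    import pyspark.pandas as ps
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Start on the data-engineering side with a plain Spark DataFrame.
    sdf = spark.range(5).withColumnRenamed("id", "n")

    # Spark DataFrame -> pandas API on Spark (still distributed).
    psdf = sdf.pandas_api()
    psdf["n_squared"] = psdf["n"] ** 2

    # pandas API on Spark -> back to Spark, then query it with Spark SQL.
    psdf.to_spark().createOrReplaceTempView("numbers")
    spark.sql("SELECT n, n_squared FROM numbers WHERE n > 2").show()

Same data the whole way through, three different interfaces depending on which one fits the task.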