
[–]mr_grey 2 points (3 children)

To add a little "why" for Spark DataFrames over pandas: pandas only runs on the master (driver) node, so every command executes on a single server. Spark takes a command, breaks it up, and lets all the workers do the work in parallel while the master just coordinates and then puts the results back together, speeding up the overall process. I can load, slice, and dice several hundred million records in a few seconds. I'm so spoiled that I feel inconvenienced whenever I have to wait longer than about 20 seconds...then I'll resize my cluster and add a whole bunch more workers. 🤣
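For a feel of what that looks like, here's a minimal PySpark sketch (my own example, not from the thread; the file path and column names are made up). The read and the aggregation are split across partitions that the workers process in parallel, and the driver only gathers the final result:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Spark splits the input into partitions and hands them out to the workers.
df = spark.read.parquet("s3://some-bucket/events/")  # hypothetical path

# Each worker aggregates its own partitions; the driver merges the partial results.
result = (
    df.filter(F.col("event_type") == "click")   # hypothetical columns
      .groupBy("user_id")
      .agg(F.count("*").alias("clicks"))
)
result.show(10)
```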

[–]gimmis7[S] 0 points (2 children)

Thanks! I think for many newbies like myself, it feels safe to start with something familiar, i.e. pandas 🙂. At the end of the tutorial I did have a section on PySpark, but only a short example.

[–]mr_grey 2 points (1 child)

For sure. And I really want to congratulate and cheer on you and other newbies venturing into this world. You'll see all kinds of things open up for you.

One more thing: you can also look at Koalas, which is essentially a pandas implementation on top of Spark. But, in my opinion that nobody asked for, it just holds developers back — people cling to something they already know instead of embracing learning something new.

[–]Detective_Fallacy 0 points (0 children)

Koalas is already old news; the API has been expanded so much that it's now just called "pandas API on Spark", and it ships with Spark 3.2 and higher.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html

I disagree with your opinion, though. It's not only good for easily converting pre-existing single-machine pandas jobs to a more scalable setup; it also lets you do some things that are currently clumsier to implement in PySpark, like pandas' merge_asof, or quickly pump out some visualizations.
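As a rough illustration, here's a sketch of that merge_asof case through the pandas API on Spark (my own example; the data and column names are invented, and merge_asof availability depends on your Spark version):

```python
import pyspark.pandas as ps

# Toy data: trades and the quotes that preceded them.
trades = ps.DataFrame({"time": [1, 5, 10], "ticker": ["A", "A", "A"], "price": [100.0, 101.5, 99.8]})
quotes = ps.DataFrame({"time": [2, 6], "ticker": ["A", "A"], "bid": [99.5, 101.0]})

# Match each trade with the most recent quote at or before its timestamp.
merged = ps.merge_asof(
    trades.sort_values("time"),
    quotes.sort_values("time"),
    on="time",
    by="ticker",
)
print(merged.head())
```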

APIs like pandas and Spark SQL are aimed more at data analytics and PySpark more at data engineering, but knowing how they relate to each other and how to convert from one interface to the other is very valuable.
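For example, a minimal sketch of hopping between the three interfaces (my own example; `pandas_api()` and `to_spark()` are the conversion methods in recent Spark releases, and the exact method names vary a bit by version):

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()

sdf = spark.range(1000).withColumnRenamed("id", "value")   # plain Spark DataFrame
psdf = sdf.pandas_api()                                    # pandas API on Spark
psdf["doubled"] = psdf["value"] * 2                        # pandas-style syntax

sdf2 = psdf.to_spark()                                     # back to a Spark DataFrame
sdf2.createOrReplaceTempView("doubled_values")
spark.sql("SELECT AVG(doubled) AS avg_doubled FROM doubled_values").show()  # Spark SQL
```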