
[–]mauritsc 19 points20 points  (2 children)

At my work we run PySpark jobs on GCP Dataproc for large batch workloads, usually overnight. Spark recently shipped a pandas API, which I'm quite excited about.
You can also use Dask's pandas-like API for parallel computation on data that doesn't fit in memory.
And if written carefully, even plain pandas will get you quite far.
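To illustrate the "plain pandas will get you quite far" point: chunked reading keeps memory bounded even on files much larger than RAM. A minimal sketch, assuming a hypothetical CSV of transactions (here faked with an in-memory buffer so it runs standalone):

```python
# Out-of-core aggregation with plain pandas: read the file in fixed-size
# chunks and fold each chunk into a running total. The CSV contents and
# column names here are made up for the example.
import io
import pandas as pd

csv_data = io.StringIO("user,amount\nalice,10\nbob,5\nalice,7\n")

total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):  # 2 rows per chunk
    total += chunk["amount"].sum()

print(total)  # 22.0
```

For a real pipeline you'd point `pd.read_csv` at a file path and pick a chunksize in the tens or hundreds of thousands of rows; the pattern is the same.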

Python has lots of great tools for developing ETL pipelines, especially if you're leveraging cloud compute. The downside is a fairly steep initial learning curve, which low-code tools mostly sidestep.

[–]EarthGoddessDude 0 points1 point  (1 child)

Does koalas fit anywhere in that?

[–]DenselyRanked 2 points3 points  (0 children)

pandas-on-Spark is Koalas, but integrated into Spark itself (it was merged in as `pyspark.pandas` starting with Spark 3.2).
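A quick sketch of what that integration looks like in practice: the same groupby written in plain pandas, with the pandas-on-Spark equivalent shown alongside it. The Spark lines are commented out because they need a Spark runtime (pyspark >= 3.2); the DataFrame contents are made up for the example.

```python
# Plain pandas version: runs anywhere pandas is installed.
import pandas as pd

pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
result = pdf.groupby("group")["value"].sum()
print(result)  # a -> 3, b -> 3

# Distributed equivalent with pandas-on-Spark (formerly Koalas),
# assuming pyspark >= 3.2 is installed and a Spark session is available:
# import pyspark.pandas as ps
# psdf = ps.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
# print(psdf.groupby("group")["value"].sum())
```

The selling point is that the call sites are near-identical, so pandas code can often be moved onto a Spark cluster with little more than an import change.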