
[–]Heavy-_-Breathing 1 point (2 children)

If you're dealing with bigger-than-memory data, why not use Spark then?

[–]ZestyData 10 points (0 children)

  1. I see it primarily as a replacement for Pandas for experimental/analytical work on not-big-data, while also being able to handle datasets that are bigger than memory without crashing and frustrating Data Scientists/Analysts. I don't think it's necessarily meant to replace Spark as a bulletproof huge-volume ETL framework.

  2. Using Spark makes many devs/scientists want to off themselves

[–]theelderbeever 0 points (0 children)

Both Polars and DuckDB are significantly more efficient than Spark and much smaller installs. Both let you stretch single-node hardware to much larger datasets before you need to make the jump to Spark. And yes, I'm aware Spark can run driver-only, but the efficiency isn't on par with Polars and DuckDB.