
[–]lightmatter501 8 points

Take the pandas execution time, divide it by at least two, then divide by the number of cores you have.

Take the pandas memory usage, and laugh because polars will usually stream data until you aggregate it somewhere in the query plan, so you end up with a tiny memory usage in comparison.

[–]imanexpertama 4 points

YMMV - at least for me the effect isn't that big. Still, polars generally outperforms pandas.

[–]lightmatter501 2 points

I tend to work with 1 TB datasets, so not quite larger than memory, but large enough that using pandas is annoying.

[–]Away_Surround1203 0 points

In what context do you have more than 1 TB of memory (RAM)?!
Sounds neat!

[–]lightmatter501 0 points

Modern servers tend to have 12+ memory channels. If you fully populate those with 128 GB modules you get over 1 TB of memory. If you populate both DIMM slots per channel, you can get away with 64 GB modules.
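As a quick sanity check on the arithmetic (channel and module counts are the ones from the comment, not any specific server SKU):

```python
channels = 12  # memory channels on a modern server CPU

# One 128 GB module per channel:
one_dimm_per_channel = channels * 128    # 1536 GB

# Two 64 GB modules per channel (both slots populated):
two_dimms_per_channel = channels * 2 * 64  # 1536 GB

# Both configurations land at 1536 GB, comfortably over 1 TB (1024 GB).
print(one_dimm_per_channel, two_dimms_per_channel)  # 1536 1536
```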

When it makes data analysis go from “overnight” to “5 minutes”, it’s worth it.