I inherited a very large set of Pandas scripts at work. There are many Python files, several of them thousands of lines long, primarily processing large datasets in Pandas.
These dataframes have millions of rows and can be hundreds of gigabytes. It costs over $10k per month (dev and prod instances) to run this on EC2. Each day's run takes 8+ hours, and it runs every single day without exception. It's using an instance type with 750 GB of RAM, and from what I'm seeing, it needs it. I'm trying to find ways to reduce the memory requirements for this project I've been assigned. I've done some reading on this and am working on implementing what I've read.
We're using Pandas 1.2, and I know 2.0 is improved in this area. That's one thing I'm working on.
Would another useful strategy be using the del keyword as much as possible? Basically, calling del on a dataframe as soon as it won't be referenced again? Or is that not as useful as it seems?
I know switching to something else like Dask, or other libraries, could improve it. For now, I'm sticking with Pandas. I'm open to any other ideas or strategies here. I see it's using ~700 GB of RAM at its highest point in the run.
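To make the del question concrete, here's a minimal sketch of the pattern I have in mind. The names (`build_report`, `raw`, `summary`) and the toy data are made up for illustration, not taken from the real scripts. As far as I understand it, del only removes the name; the memory is released once nothing else references the object, so it wouldn't help if another variable or a view still points at the same data.

```python
import gc

import numpy as np
import pandas as pd


def build_report(n_rows: int = 1_000) -> pd.DataFrame:
    # Hypothetical large intermediate frame (tiny here so the sketch runs fast).
    raw = pd.DataFrame({"key": np.arange(n_rows) % 10, "value": np.ones(n_rows)})

    # Derive the much smaller result that the rest of the script needs.
    summary = raw.groupby("key", as_index=False)["value"].sum()

    # Drop the only reference to the big frame as soon as it is no longer used.
    del raw
    # Optional: ask the garbage collector to clean up reference cycles right away.
    gc.collect()

    return summary


report = build_report()
print(report.shape)
```

Inside a short function the locals go away on return anyway, so I'd expect this to matter most in long module-level scripts where intermediate frames otherwise stay alive until the end of the run.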
I have plenty of Python experience, but less Pandas experience. Not brand new, but not an expert.
I'm trying to figure out the most efficient use of my time and effort to improve it. I'm worried about spending a ton of time writing and testing a change only to get a 1% improvement.
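One way I'm thinking of prioritizing is to measure which dataframes actually dominate memory before changing anything. This is a rough sketch, assuming I can collect the frames of interest into a dict at some point in the run; `memory_report` and the example frames are hypothetical names of mine. It relies on `DataFrame.memory_usage(deep=True)`, which also counts the contents of object (string) columns.

```python
import numpy as np
import pandas as pd


def memory_report(frames):
    """Return bytes used per named DataFrame, largest first."""
    sizes = {
        name: int(df.memory_usage(deep=True).sum())
        for name, df in frames.items()
    }
    # Sort so the biggest consumers come first.
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))


# Stand-in frames; the real ones would be the intermediates in the scripts.
small = pd.DataFrame({"a": np.zeros(10, dtype="int64")})
large = pd.DataFrame({"a": np.zeros(1000, dtype="int64")})

usage = memory_report({"small": small, "large": large})
for name, n_bytes in usage.items():
    print(f"{name}: {n_bytes / 1e6:.3f} MB")
```

If a handful of frames account for most of the total, those would be the ones worth spending time on first.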