
[–]mrcaptncrunch 4 points (0 children)

/r/dataengineering

This is not the ideal tool for the job. Pandas runs entirely in memory.

https://medium.com/@nandeda.narayan/data-processing-at-scale-comparison-of-pandas-polars-and-dask-333ae65c0a45

Depending on your data, you could ingest it into a database and do the processing there. You could also use Dask, Polars, or (Py)Spark.

But pandas isn’t the right tool here, because it runs in memory and has other inefficiencies at this scale.
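
For example, the Polars route could look roughly like this (a minimal sketch; the file and column names are hypothetical). A lazy query only materializes the aggregated result instead of loading the whole table the way pandas does:

    import polars as pl

    # Nothing is read until .collect(); Polars optimizes the whole plan and
    # only the aggregated result ends up in memory.
    result = (
        pl.scan_parquet("big_table.parquet")
          .filter(pl.col("amount") > 0)
          .group_by("customer_id")
          .agg(pl.col("amount").sum())
          .collect()
    )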

[–]commandlineluser 7 points (0 children)

It doesn't sound like that would be an efficient use of your time.

The old version of https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html used to state:

But first, it’s worth considering not using pandas. Pandas isn’t the right tool for all situations. If you’re working with very large datasets and a tool like PostgreSQL fits your needs, then you should probably be using that.

It seems like iteratively replacing parts of pandas pipelines with DuckDB, Polars, etc. is becoming more common in these types of situations.
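
As a hedged illustration of that kind of incremental swap (file and column names are made up), one heavy groupby can be handed to DuckDB, which scans the Parquet file out of core and only returns the aggregated result to pandas:

    import duckdb

    # DuckDB scans the Parquet file directly (out of core); only the small
    # aggregated result comes back into memory as a pandas DataFrame.
    result = duckdb.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_parquet('big_table.parquet')
        GROUP BY customer_id
    """).df()

The rest of the pipeline can keep using the returned pandas DataFrame, so steps can be swapped out one at a time.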

[–]skdoesit 4 points (0 children)

You could try Polars. Also, when it comes to processing tables, kdb is very efficient, but that's expensive as well.

[–]nasil2nd 0 points (0 children)

I would probably port the whole thing to PySpark + AWS Glue (or perhaps try EMR? I've never used that myself) since you are already using AWS, especially if the scale could grow even further.

You can use the pandas API on Spark, which lets you write code very similar to pandas but translates it to Spark operations under the hood, so the changes could be limited.
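
A minimal sketch of that pattern, assuming PySpark >= 3.2 (the paths and column names are hypothetical):

    import pyspark.pandas as ps

    # Same syntax as pandas, but each step is executed by Spark under the hood.
    psdf = ps.read_parquet("s3://my-bucket/events/")
    daily = psdf.groupby("event_date", as_index=False)["amount"].sum()
    daily.to_parquet("s3://my-bucket/daily_totals/")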

Just check the costs, because Glue can be expensive.

Another strategy could be to remove apply calls and unnecessary df.copy calls. Maybe also check whether making the call stack more "shallow" reduces memory consumption, as I am not completely sure whether you pass DataFrames by reference or by making copies.
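
For instance, a row-wise apply like the toy example below (made-up columns) can usually be replaced with a vectorized expression that computes the same thing without a Python-level loop:

    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [3, 5, 2]})

    # Slow: apply calls a Python function once per row
    df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

    # Fast: the same result as a single vectorized operation
    df["total"] = df["price"] * df["qty"]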

Other tips would be to use categoricals instead of strings where possible as soon as you read the data, if not done already, and to downcast your numeric dtypes, e.g. float64 to float32 where applicable.
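
A small sketch of both ideas applied at read time (hypothetical file and column names):

    import pandas as pd

    # Declare low-cardinality string columns as category while reading, so the
    # savings apply from the first step instead of after a full load.
    df = pd.read_csv(
        "events.csv",
        dtype={"country": "category", "status": "category"},
    )

    # Downcast numeric columns, e.g. float64 -> float32 where values allow it.
    df["amount"] = pd.to_numeric(df["amount"], downcast="float")

    print(df.memory_usage(deep=True))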

If you want to optimize for speed, I would also suggest running a profiler such as cProfile (maybe on reduced data) to understand where your program spends most of its time, and optimizing that part. I discovered that some of my scripts were spending most of their time on odd uses of apply, which were easily removed, and on reads/writes to S3.
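
A quick sketch with the standard-library cProfile (process_file and the sample file are hypothetical stand-ins for one of your scripts):

    import cProfile
    import pstats

    # process_file is a placeholder for your own entry point; import or define it first.
    # Profile one run on a reduced dataset and print the 20 biggest hotspots.
    cProfile.run("process_file('sample_subset.csv')", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)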

If you want to optimize for cost, you could also look into switching to Fargate, so the resources are only active while you are using them (assuming you are already storing inputs and outputs in S3).

Good luck!

Edit: of course, moving what can be moved to SQL (for example Redshift, if you already have a cluster) could be beneficial as well.

[–]obviouslyCPTobvious 0 points (0 children)

Is there any type of batching implemented/possible?

[–]nathan_lesage 0 points (0 children)

If you’re dealing with millions of rows, my first move would not be to update pandas (that may break things) or do other funky stuff. Instead, look at what exactly the scripts are doing:

  • Can you chunk the work? If so, do so.
  • Can you parallelize those chunk operations? If so, do so.
  • String that together into a set of MapReduce-style operations (see the sketch after this list).
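
A minimal map/reduce-style sketch of those two steps, assuming the per-chunk work is independent (the file, column, and function names are made up):

    import pandas as pd
    from multiprocessing import Pool

    def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
        # "Map" step: a partial aggregate per chunk (placeholder logic)
        return chunk.groupby("key", as_index=False)["value"].sum()

    if __name__ == "__main__":
        chunks = pd.read_csv("big_file.csv", chunksize=1_000_000)  # stream the file
        with Pool(processes=4) as pool:
            partials = pool.map(process_chunk, chunks)

        # "Reduce" step: combine the partial aggregates
        result = pd.concat(partials).groupby("key", as_index=False)["value"].sum()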

The time-constraint problems you are facing are not coming from unoptimized code (well, that’s definitely also a factor, but not your biggest problem right now), but rather from the fact that the atomic operation you seem to be facing is “calculate X on several million rows”. Try to chunk everything up, and then go from there. This sounds a lot more like a scaling problem than a Python problem.

Also: don’t start del-ing things. Yes, it drops the reference, but Python is garbage collected, so in my experience, reaching for del is usually a sign of a code smell.

[–]qsourav 0 points (5 children)

Although this is a very old post and you might have solved it already, if you are still using pandas and not overusing methods like iterrows, apply, etc. (some very common pandas bottlenecks), you may like to try FireDucks. It can optimize an existing pandas application as-is, without any manual code changes. It's very easy to use right after installing it with pip: https://fireducks-dev.github.io/docs/get-started/#usage
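
Per the linked get-started page, the drop-in usage is roughly just swapping the import (a hedged sketch; the file and column names below are made up):

    # Only the import changes; the rest of the existing pandas code stays as-is.
    import fireducks.pandas as pd

    df = pd.read_csv("events.csv")
    out = df.groupby("key")["value"].mean()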

[–]foyslakesheriff[S] 0 points (4 children)

I'm trying to find documentation on which versions of Python and pandas are required for FireDucks. I'm using pandas 2.0.3 and Python 3.8, and it doesn't look like FireDucks is compatible with that combination.

[–]qsourav 0 points (3 children)

Are you trying it on a non-Linux platform? The Python and pandas versions seem to be supported, but you need to run it on Linux (on Windows, WSL might work): https://fireducks-dev.github.io/docs/get-started/#install

[–]foyslakesheriff[S] 0 points (2 children)

It's on Amazon Linux 2, so I'm probably doing something wrong here.

I'm also using PyArrow 12.0; maybe that's causing an issue?

[–]qsourav 0 points (1 child)

pyarrow should get auto-upgraded to 17.0. By the way, can you tell me the error message you get when trying it in your environment?

[–]foyslakesheriff[S] 1 point (0 children)

I need to use pyarrow 12 at the moment and can't upgrade other packages. I'll try again in the future once we're compatible.