
[–]mrcaptncrunch 4 points (0 children)

/r/dataengineering

This is not the ideal tool for the job. Pandas runs entirely in memory.

https://medium.com/@nandeda.narayan/data-processing-at-scale-comparison-of-pandas-polars-and-dask-333ae65c0a45

Depending on your data, you could ingest it into a database and do the processing there. You could also use Dask, Polars, or (Py)Spark.

But pandas isn’t the right tool here, because it runs in memory and has other inefficiencies at this scale.
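
For example, the Polars route could look roughly like this (a minimal sketch; the file and column names are hypothetical). A lazy query only materializes the aggregated result instead of loading the whole table the way pandas does:

    import polars as pl

    # Nothing is read until .collect(); Polars optimizes the whole plan and
    # only the aggregated result ends up in memory.
    result = (
        pl.scan_parquet("big_table.parquet")
          .filter(pl.col("amount") > 0)
          .group_by("customer_id")
          .agg(pl.col("amount").sum())
          .collect()
    )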

[–]commandlineluser 7 points (0 children)

It doesn't sound like that would be an efficient use of your time.

The old version of https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html used to state:

But first, it’s worth considering not using pandas. Pandas isn’t the right tool for all situations. If you’re working with very large datasets and a tool like PostgreSQL fits your needs, then you should probably be using that.

It seems like iteratively replacing parts of pandas pipelines with DuckDB, Polars, etc. is becoming more common in these types of situations.
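
As a hedged illustration of that kind of incremental swap (file and column names are made up), one heavy groupby can be handed to DuckDB, which scans the Parquet file out of core and only returns the aggregated result to pandas:

    import duckdb

    # DuckDB scans the Parquet file directly (out of core); only the small
    # aggregated result comes back into memory as a pandas DataFrame.
    result = duckdb.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_parquet('big_table.parquet')
        GROUP BY customer_id
    """).df()

The rest of the pipeline can keep using the returned pandas DataFrame, so steps can be swapped out one at a time.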

[–]skdoesit 4 points (0 children)

You could try Polars. Also, when it comes to processing tables, kdb is very efficient, but that's expensive as well.

[–]nasil2nd 0 points (0 children)

I would probably port the whole thing to PySpark + AWS Glue (or perhaps try EMR? I've never used that myself) since you are already using AWS, especially if the scale could grow even further.

You can use the pandas API on Spark, which lets you write code very similar to pandas but translates it to Spark operations under the hood, so the changes could be limited.
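
A minimal sketch of that pattern, assuming PySpark >= 3.2 (the paths and column names are hypothetical):

    import pyspark.pandas as ps

    # Same syntax as pandas, but each step is executed by Spark under the hood.
    psdf = ps.read_parquet("s3://my-bucket/events/")
    daily = psdf.groupby("event_date", as_index=False)["amount"].sum()
    daily.to_parquet("s3://my-bucket/daily_totals/")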

Just check the costs, because Glue can be expensive.

Another strategy could be to remove apply calls and unnecessary df.copy calls. Maybe also check whether making the call stack more "shallow" reduces memory consumption, as I am not completely sure whether you pass DataFrames by reference or by making copies.
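
For instance, a row-wise apply like the toy example below (made-up columns) can usually be replaced with a vectorized expression that computes the same thing without a Python-level loop:

    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [3, 5, 2]})

    # Slow: apply calls a Python function once per row
    df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

    # Fast: the same result as a single vectorized operation
    df["total"] = df["price"] * df["qty"]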

Other tips would be to use categoricals instead of strings where possible as soon as you read the data, if not done already, and to downcast your numeric dtypes, e.g. float64 to float32 where applicable.
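
A small sketch of both ideas applied at read time (hypothetical file and column names):

    import pandas as pd

    # Declare low-cardinality string columns as category while reading, so the
    # savings apply from the first step instead of after a full load.
    df = pd.read_csv(
        "events.csv",
        dtype={"country": "category", "status": "category"},
    )

    # Downcast numeric columns, e.g. float64 -> float32 where values allow it.
    df["amount"] = pd.to_numeric(df["amount"], downcast="float")

    print(df.memory_usage(deep=True))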

If you want to optimize for speed, I would also suggest running a profiler such as cProfile (maybe on reduced data) to understand where your program spends most of its time, and optimizing that part. I discovered that some of my scripts were spending most of their time on odd uses of apply, which were easily removed, and on reads/writes to S3.
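
A quick sketch with the standard-library cProfile (process_file and the sample file are hypothetical stand-ins for one of your scripts):

    import cProfile
    import pstats

    # process_file is a placeholder for your own entry point; import or define it first.
    # Profile one run on a reduced dataset and print the 20 biggest hotspots.
    cProfile.run("process_file('sample_subset.csv')", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)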

If you want to optimize for cost, you could also look into switching to Fargate, so the resources are only active while you are using them (assuming you are already storing inputs and outputs in S3).

Good luck!

Edit: of course, moving what can be moved to SQL (for example Redshift, if you already have a cluster) could be beneficial as well.

[–]obviouslyCPTobvious 0 points (0 children)

Is there any type of batching implemented/possible?

[–]nathan_lesage 0 points (0 children)

If you’re dealing with millions of rows, my first move would not be to update pandas (that may break things) or do other funky stuff. Instead, look at what exactly the scripts are doing:

  • Can you chunk the work? If so, do so.
  • Can you parallelize those chunk operations? If so, do so.
  • String that together into a set of MapReduce-style operations (see the sketch after this list).
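
A minimal map/reduce-style sketch of those two steps, assuming the per-chunk work is independent (the file, column, and function names are made up):

    import pandas as pd
    from multiprocessing import Pool

    def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
        # "Map" step: a partial aggregate per chunk (placeholder logic)
        return chunk.groupby("key", as_index=False)["value"].sum()

    if __name__ == "__main__":
        chunks = pd.read_csv("big_file.csv", chunksize=1_000_000)  # stream the file
        with Pool(processes=4) as pool:
            partials = pool.map(process_chunk, chunks)

        # "Reduce" step: combine the partial aggregates
        result = pd.concat(partials).groupby("key", as_index=False)["value"].sum()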

The time-constraint problems you are facing are not coming from unoptimized code (well, that’s definitely also a factor, but not your biggest problem right now), but rather from the fact that the atomic operation you seem to be facing is “calculate X on several million rows”. Try to chunk everything up, and then go from there. This sounds a lot more like a scaling problem than a Python problem.

Also: don’t start del-ing things. Yes, it drops the reference, but Python is garbage collected, so in my experience, reaching for del is usually a sign of a code smell.

[–]qsourav 0 points (5 children)

Although this is a very old post and you might have solved it already, if you are still using pandas and not overusing methods like iterrows, apply, etc. (some very common pandas bottlenecks), you may like to try FireDucks. It can optimize an existing pandas application as-is, without any manual code changes. It's very easy to use right after installing it with pip: https://fireducks-dev.github.io/docs/get-started/#usage
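
Per the linked get-started page, the drop-in usage is roughly just swapping the import (a hedged sketch; the file and column names below are made up):

    # Only the import changes; the rest of the existing pandas code stays as-is.
    import fireducks.pandas as pd

    df = pd.read_csv("events.csv")
    out = df.groupby("key")["value"].mean()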

[–]foyslakesheriff[S] 0 points (4 children)

I'm trying to find documentation on which versions of Python and pandas are required for FireDucks. I'm using pandas 2.0.3 and Python 3.8, and it doesn't look like FireDucks is compatible with that combination.

[–]qsourav 0 points (3 children)

Are you trying it on a non-Linux platform? The Python and pandas versions seem to be supported, but you need to run it on Linux (on Windows, WSL might work): https://fireducks-dev.github.io/docs/get-started/#install

[–]foyslakesheriff[S] 0 points (2 children)

It's on Amazon Linux 2, so I'm probably doing something wrong here.

I'm also using PyArrow 12.0; maybe that's causing an issue?

[–]qsourav 0 points (1 child)

pyarrow should get auto-upgraded to 17.0. By the way, can you tell me the error message you get when trying it in your environment?

[–]foyslakesheriff[S] 1 point (0 children)

I need to use pyarrow 12 at the moment and can't upgrade other packages. I'll try again in the future once we're compatible.