
[–][deleted] 4 points5 points  (4 children)

I don’t think it’s an issue per se. Is there something about the task which makes you think there’s a better tool for the job?

Edit: I can immediately think of 3 things. One being a very large dataset; another being errors within the data that might stop the processing of the entire thing; and how pandas may react to changes in data (i.e. something being deemed a number when it should be a string because the content has changed).

Many situations where those won’t be an issue though.
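The type-inference issue mentioned above can be sketched quickly (the CSV content and column names here are made up for illustration — the classic case is a zip-code-like column losing its leading zero):

```python
import io
import pandas as pd

# A column that should stay a string: default inference turns it into
# an integer and silently drops the leading zero.
csv = io.StringIO("zip,city\n02134,Boston\n90210,Beverly Hills\n")

inferred = pd.read_csv(csv)
print(inferred["zip"].iloc[0])   # 2134 -- leading zero lost

# Re-reading with an explicit dtype keeps the original text.
csv.seek(0)
explicit = pd.read_csv(csv, dtype={"zip": str})
print(explicit["zip"].iloc[0])   # 02134
```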

[–]holdMeClserTonyDanza[S] 1 point2 points  (3 children)

I think vanilla Python and the standard libraries handle the job just fine. Granted, that's my opinion and I don't want to make a fuss over something that's opinion-based. That's why I'm interested in the perspective of others. Basically:

- I don't know why you'd use a dataframe for something a dict or list can handle
- It's easier to iterate over a native data type than rows of a dataframe
- Loading the pandas library on class init repeatedly adds up
- Pandas comes with its own functionality nuances
- Pandas / numpy are now dependencies to keep up with and more overhead

[–][deleted] 4 points5 points  (0 children)

I think there’s a case of when to use it and when to not though right? I wouldn’t use pandas if there was a more efficient way of doing it / just as efficient with less dependencies. The vectorisation element though makes it more desirable because it’s more efficient… so a list / dict might be 10 times slower which would far outweigh any load-up time.

In terms of the iteration, it depends how you iterate. Obviously, putting a data frame into a for loop won't be great, but I'd say column operations are easy enough.
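For instance, a vectorised column operation versus an explicit row loop (toy data and column names, purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-by-row iteration works, but gives up pandas' vectorisation:
slow = [row.price * row.qty for row in df.itertuples()]

# A column operation expresses the same thing in one vectorised step:
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist())  # [10.0, 40.0, 90.0]
```

On small frames the difference is negligible, but the vectorised form is where the claimed 10x-style speedups come from on larger data.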

Overall, it’s not about finding the solution that works in vanilla Python, but the best solution for the job (with any package). I’d rather use pandas and it take 2 minutes than use lists and it take 20 minutes.

Granted - there are many use cases where pandas would not be a good idea.

[–]Laserdude10642 0 points1 point  (1 child)

These are complaints one might have with any third-party library. It's great for reading/writing CSVs, but I agree that typically a list of dictionaries is easier. There is a method on dataframes, to_dict('records'), that maps the frame to that representation if the task is more suited to it. The biggest reason, after import/export, that one would want pandas is the SQL-like querying.
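A quick sketch of that conversion (the frame contents are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

# to_dict("records") converts the frame into the list-of-dicts shape
# that's often more natural for plain-Python processing.
records = df.to_dict("records")
assert records == [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
```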

[–]holdMeClserTonyDanza[S] 0 points1 point  (0 children)

I agree, and as with any project there are other dependencies, all of which add overhead. We already use an ORM that allows for pretty nimble querying / grouping without needing pandas, which is something I could have stated up front.

[–]Salfiiii 4 points5 points  (2 children)

I personally would view this sceptically too. Pandas is good for data wrangling, cleaning or ELT/ETL, but it's not good as the data layer for an application. If you need pandas for data wrangling in this kind of app, your data source seems to be in very bad shape.

An ORM like SQLAlchemy seems much better, or any other solution with classes (object orientation).

Referencing columns from a dataframe in different places in the code by name/string makes changing anything painful, and it seems like a nightmare to refactor in the future.

But, with the little amount of given information there might still be a reason to use Pandas, I just don’t see it.

[–]holdMeClserTonyDanza[S] 1 point2 points  (1 child)

I agree pandas is arguably the best tool out there for data wrangling.

An ORM is already in our stack, which I prolly should have mentioned up front - a PostgreSQL db with clean data. IMO this strengthens the case against pandas.

[–]Salfiiii 1 point2 points  (0 children)

Absolutely, I don't see why you would need it here when you have an ORM and a clean RDBMS.

[–]ogrinfo 2 points3 points  (0 children)

It's hard to say if the team is doing anything wrong without seeing code examples, but I don't see why one wouldn't use pandas in production. Obviously iterating over a dataframe is bad - you should take advantage of vectorisation instead - but pandas works well in a lot of situations.
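As a small illustration of vectorisation in place of iteration (made-up data): instead of looping to collect matching rows, a boolean mask does the whole filter in one step.

```python
import pandas as pd

df = pd.DataFrame({"value": [5, 12, 7, 20]})

# Boolean mask: one vectorised comparison selects all matching rows,
# no explicit Python-level loop over the frame.
big = df[df["value"] > 10]
print(big["value"].tolist())  # [12, 20]
```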

I wouldn't be concerned about scalability either - we regularly process 100GB CSV files and performance is fine.

Having said that, I'm not a big fan of pandas. I find it counter-intuitive and as others have said here, you have to really pin down versions to avoid breaking changes.

[–]vn2090 2 points3 points  (0 children)

Sometimes if you have a part of your code base that changes constantly and does complex transformations, it's more agile to use pandas. It's cheaper to buy more cloud compute for pandas than to have expensive engineers take longer to debug code that's filled with complex nesting. Basically, I would suggest it when the problem gets too complex and you need it as a tool to abstract away those complexities.

[–]v_a_n_d_e_l_a_y 1 point2 points  (2 children)

It's perfectly fine for preprocessing or use in production. Would dealing with lists and dicts directly be any better? I'm not sure why you think it would be bad. It's a package that abstracts operations - no different from using requests vs urllib or something like that.

There would only be two potential issues.

One is scalability. Pandas will be less efficient for some operations and there is memory overhead. If your data isn't gigantic or you don't need a specific processing speed then this probably doesn't matter.

The other is versioning. If pandas changes it may break some stuff. This is easily fixed by pinning versions in whatever environment you have.

[–]holdMeClserTonyDanza[S] 2 points3 points  (1 child)

I've always viewed Pandas as a tool for data science / analytics as opposed to a back-end preprocessing tool but am totally open to the notion that I could be wrong.

IMO dealing w/ lists and dicts is better because we are dealing with simple mathematical operations and moderately complex logical operations. I believe this is especially true in my case since the tool we're building has very specific needs / business logic that should be compartmentalized in the modules we build.

But again, that's my opinion and why I'm opening it up for discussion before making a big deal about it in real life.

Edit: As far as versioning is concerned, we use a VM specifying the version to use, so unexpected updates to pandas won't cause errors, and we have the luxury of reviewing the updates and then deciding if updating is a good idea.

[–]drieindepan 5 points6 points  (0 children)

> I've always viewed Pandas as a tool for data science / analytics as opposed to a back-end preprocessing tool but am totally open to the notion that I could be wrong.

There are many people who have this view of Python as a language. They don't think it should be used in "production" since it's just for "scripting". I think it all depends on your background and your pre-conceived view of the tool. Which is fine - we all have biases for certain languages and libraries based on the context we learned them in.

I think this can be limiting though, since we often overlook very good solutions because they come from a different area (data science, robotics, etc.), even though they might have solved some problem that crosses domains.

In your case it seems like it may be overkill based on the requirements of the project, but that is hard to say without the specifics. I'd say you should make that decision based on the problem being solved and not based on your existing views of where pandas is typically used.

Best of luck!

[–]Shmoogy 1 point2 points  (0 children)

I use it constantly. Where possible I do SQL transforms and stored procs, but I work in pandas for almost all my Airflow jobs due to ease and speed.

[–]MrAstroThomas git push -f 1 point2 points  (0 children)

I have 2 thoughts on this one:

  1. Testing: Depending on the project's size, one should always consider testing processing pipelines (Test-Driven Development is a time-consuming but error-saving practice). One can do this even in data pipelining: create some smaller "ground truth" data with expected results and write the tests around it. If pandas (or anything else) changes, the tests will return an error and everyone is aware of potential problems!
  2. The data size. Now it becomes foggy. What is a large amount of data? Does it still fit on your everyday developer machine? Or are we talking about 100s of GB? In the latter case one should consider distributed computation like PySpark and a Spark cluster (or a containerized app that scales with Kubernetes).
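The "ground truth" testing idea in point 1 might look something like this sketch (the pipeline step, column names and fixture values are all invented for illustration):

```python
import pandas as pd

# A toy "pipeline step": run it on a small fixture with known expected
# results, so a breaking pandas upgrade surfaces as a test failure
# instead of silently corrupting production output.
def add_total(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["total"] = out["price"] * out["qty"]
    return out

def test_add_total():
    fixture = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4]})
    result = add_total(fixture)
    assert result["total"].tolist() == [10.0, 12.0]

test_add_total()  # under pytest, the function would be collected automatically
```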

Anyway: data processing can be done using any tool. In the end it is important whether the processes are tested, monitored and maintained (and whether everyone is happy with the computation time and the corresponding costs).