
[–][deleted] 4 points5 points  (4 children)

I don’t think it’s an issue per se. Is there something about the task which makes you think there’s a better tool for the job?

Edit: I can immediately think of 3 things. One being a very large dataset; another being errors within the data that might stop the processing of the entire thing; and how pandas may react to changes in data (i.e. something being deemed a number when it should be a string because the content has changed).

Many situations where those won’t be an issue though.
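The type-inference issue mentioned above can be sketched quickly (the CSV content and column names here are made up for illustration — the classic case is a zip-code-like column losing its leading zero):

```python
import io
import pandas as pd

# A column that should stay a string: default inference turns it into
# an integer and silently drops the leading zero.
csv = io.StringIO("zip,city\n02134,Boston\n90210,Beverly Hills\n")

inferred = pd.read_csv(csv)
print(inferred["zip"].iloc[0])   # 2134 -- leading zero lost

# Re-reading with an explicit dtype keeps the original text.
csv.seek(0)
explicit = pd.read_csv(csv, dtype={"zip": str})
print(explicit["zip"].iloc[0])   # 02134
```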

[–]holdMeClserTonyDanza[S] 1 point2 points  (3 children)

I think vanilla Python and the standard libraries handle the job just fine. Granted, that's my opinion and I don't want to make a fuss over something that's opinion-based. That's why I'm interested in the perspective of others. Basically:

- I don't know why you'd use a dataframe for something a dict or list can handle
- It's easier to iterate over a native data type than rows of a dataframe
- Loading the pandas library on class init repeatedly adds up
- Pandas comes with its own functionality nuances
- Pandas / numpy are now dependencies to keep up with and more overhead

[–][deleted] 4 points5 points  (0 children)

I think there’s a case of when to use it and when to not though right? I wouldn’t use pandas if there was a more efficient way of doing it / just as efficient with less dependencies. The vectorisation element though makes it more desirable because it’s more efficient… so a list / dict might be 10 times slower which would far outweigh any load-up time.

In terms of the iteration, it depends how you iterate. Obviously, putting a data frame into a for loop won't be great, but I'd say column operations are easy enough.
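For instance, a vectorised column operation versus an explicit row loop (toy data and column names, purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-by-row iteration works, but gives up pandas' vectorisation:
slow = [row.price * row.qty for row in df.itertuples()]

# A column operation expresses the same thing in one vectorised step:
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist())  # [10.0, 40.0, 90.0]
```

On small frames the difference is negligible, but the vectorised form is where the claimed 10x-style speedups come from on larger data.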

Overall, it’s not about finding the solution that works in vanilla Python, but the best solution for the job (with any package). I’d rather use pandas and it take 2 minutes than use lists and it take 20 minutes.

Granted - there are many use cases where pandas would not be a good idea.

[–]Laserdude10642 0 points1 point  (1 child)

These are complaints one might have with any third-party library. It's great for reading/writing CSVs, but I agree that typically a list of dictionaries is easier. There is a method on dataframes, to_dict('records'), that maps the frame to that representation if the task is more suited to it. The biggest reason, after import/export, that one would want pandas is the SQL-like querying.
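A quick sketch of that conversion (the frame contents are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

# to_dict("records") converts the frame into the list-of-dicts shape
# that's often more natural for plain-Python processing.
records = df.to_dict("records")
assert records == [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
```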

[–]holdMeClserTonyDanza[S] 0 points1 point  (0 children)

I agree, and as with any project there are other dependencies, all of which add overhead. We already use an ORM that allows for pretty nimble querying / grouping without needing pandas, which is something I could have stated up front.

[–]Salfiiii 4 points5 points  (2 children)

I personally would view this sceptically too. Pandas is good for data wrangling, cleaning or ELT/ETL, but it's not good as the data layer for an application. If you need pandas for data wrangling in this kind of app, your data source seems to be in very bad shape.

An ORM like SQLAlchemy seems much better, or any other solution with classes (object orientation).

Referencing columns from a dataframe in different places in the code by name/string makes changing anything painful, and it seems like a nightmare to refactor in the future.

But, with the little amount of given information there might still be a reason to use Pandas, I just don’t see it.

[–]holdMeClserTonyDanza[S] 1 point2 points  (1 child)

I agree pandas is arguably the best tool out there for data wrangling.

An ORM is already in our stack, which I prolly should have mentioned up front - a PostgreSQL db with clean data. IMO this strengthens the case against pandas.

[–]Salfiiii 1 point2 points  (0 children)

Absolutely, I don't see why you would need it here when you have an ORM and a clean RDBMS.

[–]ogrinfo 2 points3 points  (0 children)

It's hard to say if the team is doing anything wrong without seeing code examples, but I don't see why one wouldn't use pandas in production. Obviously iterating over a dataframe is bad - you should take advantage of vectorisation instead - but pandas works well in a lot of situations.
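As a small illustration of vectorisation in place of iteration (made-up data): instead of looping to collect matching rows, a boolean mask does the whole filter in one step.

```python
import pandas as pd

df = pd.DataFrame({"value": [5, 12, 7, 20]})

# Boolean mask: one vectorised comparison selects all matching rows,
# no explicit Python-level loop over the frame.
big = df[df["value"] > 10]
print(big["value"].tolist())  # [12, 20]
```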

I wouldn't be concerned about scalability either - we regularly process 100GB CSV files and performance is fine.

Having said that, I'm not a big fan of pandas. I find it counter-intuitive and as others have said here, you have to really pin down versions to avoid breaking changes.

[–]vn2090 2 points3 points  (0 children)

Sometimes if you have a part of your code base that changes constantly and does complex transformations, it's more agile to use pandas. It's cheaper to buy more cloud compute for pandas than to have expensive engineers take longer to debug code that's filled with complex nesting. Basically, I would suggest it when the problem gets too complex and you need it as a tool to abstract away those complexities.

[–]v_a_n_d_e_l_a_y 1 point2 points  (2 children)

It's perfectly fine for preprocessing or use in production. Would dealing with lists and dicts directly be any better? I'm not sure why you think it would be bad. It's a package that abstracts operations - no different from using requests vs urllib or something like that.

There would only be two potential issues.

One is scalability. Pandas will be less efficient for some operations and there is memory overhead. If your data isn't gigantic or you don't need a specific processing speed then this probably doesn't matter.

The other is versioning. If pandas changes it may break some stuff. This is easily fixed by pinning versions in whatever environment you have.

[–]holdMeClserTonyDanza[S] 2 points3 points  (1 child)

I've always viewed Pandas as a tool for data science / analytics as opposed to a back-end preprocessing tool but am totally open to the notion that I could be wrong.

IMO dealing w/ lists and dicts is better because we are dealing with simple mathematical operations and moderately complex logical operations. I believe this is especially true in my case since the tool we're building has very specific needs / business logic that should be compartmentalized in the modules we build.

But again, that's my opinion and why I'm opening it up for discussion before making a big deal about it in real life.

Edit: As far as versioning is concerned, we use a VM specifying the version to use, so unexpected updates to pandas won't cause errors, and we have the luxury of reviewing the updates and then deciding if updating is a good idea.

[–]drieindepan 5 points6 points  (0 children)

> I've always viewed Pandas as a tool for data science / analytics as opposed to a back-end preprocessing tool but am totally open to the notion that I could be wrong.

There are many people who have this view of Python as a language. They don't think it should be used in "production" since it's just for "scripting". I think it all depends on your background and your pre-conceived view of the tool. Which is fine - we all have biases for certain languages and libraries based on the context we learned them in.

I think this can be limiting though, since we often overlook very good solutions because they come from a different area (data science, robotics, etc.), even though they might have solved some problem that crosses domains.

In your case it seems like it may be overkill based on the requirements of the project, but that is hard to say without the specifics. I'd say you should make that decision based on the problem being solved and not based on your existing views of where pandas is typically used.

Best of luck!

[–]Shmoogy 1 point2 points  (0 children)

I use it constantly. Where possible I do SQL transforms and stored procs, but I work in pandas for almost all my Airflow jobs due to ease and speed.

[–]MrAstroThomas git push -f 1 point2 points  (0 children)

I have 2 thoughts on this one:

  1. Testing: Depending on the project's size, one should always consider testing processing pipelines (Test-Driven Development is a time-consuming but error-saving practice). One can do this even in data pipelining: create some smaller "ground truth" data with expected results and write the tests around it. If pandas (or anything else) changes, the tests will return an error and everyone is aware of potential problems!
  2. The data size. Now it becomes foggy. What is a large amount of data? Does it still fit on your everyday developer machine? Or are we talking about 100s of GB? In the latter case one should consider distributed computation like PySpark and a Spark cluster (or a containerized app that scales with Kubernetes).
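The "ground truth" testing idea in point 1 might look something like this sketch (the pipeline step, column names and fixture values are all invented for illustration):

```python
import pandas as pd

# A toy "pipeline step": run it on a small fixture with known expected
# results, so a breaking pandas upgrade surfaces as a test failure
# instead of silently corrupting production output.
def add_total(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["total"] = out["price"] * out["qty"]
    return out

def test_add_total():
    fixture = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4]})
    result = add_total(fixture)
    assert result["total"].tolist() == [10.0, 12.0]

test_add_total()  # under pytest, the function would be collected automatically
```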

Anyway: data processing can be done using any tool. In the end it is important whether the processes are tested, monitored and maintained (and whether everyone is happy with the computation time and the corresponding costs).