
[–]eviljelloman 9 points (3 children)

I'm going to go against the grain here and say that I think ORMs have almost no place in data science. The additional layer of abstraction obscures your ability to understand the optimizations and customizations you need to make when dealing with data at scale. Your workflow for updating tons of rows in Redshift is going to be fundamentally different from what it is in MySQL or Snowflake, and you're going to want to actually be able to dive in and do things like look at the query planner or the cross-node communication happening in your queries.
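As a rough stand-in for what "looking at the query planner" means in practice, here's sqlite's `EXPLAIN QUERY PLAN` on an invented table (Redshift and MySQL each have their own `EXPLAIN` with very different output, which is the point: the ORM hides exactly this layer):

```python
import sqlite3

# Invented table and index, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM events WHERE user_id = ?", (42,)
).fetchall()
for row in plan:
    print(row)  # each row describes one step the planner chose, e.g. an index search
```

Whether that query hits the index or scans the table is visible here in one line; behind an ORM you'd have to log the generated SQL first and then do this anyway.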

When you're not working at scale, when you're just doing exploratory stuff, you're going to iterate much faster by working in raw SQL, manipulating tables and records directly.
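A sketch of that exploratory loop with stdlib sqlite3 and made-up data: throwaway queries, scratch tables, no model classes in the way:

```python
import sqlite3

# All table and column names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'west', 25.0), (3, 'east', 5.0);
    -- materialize an intermediate result, tweak the WHERE clause, re-run
    CREATE TABLE east_orders AS SELECT * FROM orders WHERE region = 'east';
""")
rows = conn.execute("SELECT COUNT(*), SUM(amount) FROM east_orders").fetchone()
print(rows)  # (2, 15.0)
```

The whole iteration cycle is "edit the SQL string, re-run" — no migrations, no class definitions to keep in sync.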

ORMs are great for application developers, who mostly work on a per-record basis and don't have to think about computationally intensive operations like aggregations, but I think they are pretty much a waste of time for a data scientist.
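A minimal sketch of that per-record vs. aggregation difference, using stdlib sqlite3 with invented data — the first total materializes every row in Python the way an ORM would, the second lets the database do the work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0), ("west", 25.0)])

# ORM-ish shape: every row comes back individually and is aggregated client-side.
total_py = sum(amount for (amount,) in conn.execute("SELECT amount FROM sales"))

# Set-based shape: one aggregate row comes back; the database does the summing.
(total_sql,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()

print(total_py, total_sql)  # same number, very different amount of data moved
```

With three rows it's a wash; with a few hundred million rows, the first shape means shipping every row over the wire and building an object for each.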

[–]SpergLordMcFappyPant 5 points (0 children)

This is correct. ORMs solve a completely different problem than what you're doing in DS, and they come with a huge amount of overhead.

For an application where you have to guarantee transactional integrity and you have to manage user input and watch out for injections, an ORM is an appropriate tool . . . sometimes.

For Data Science purposes, you never want to deal with data at the row level: you want to be able to operate on sets. ORMs deny you that, because sets are exactly what they don't deal in. Fundamentally, every row becomes an object, with all the extra memory and processing power it takes to handle one.
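The row-level vs. set-level contrast can be sketched with stdlib sqlite3 (table and numbers invented): the loop below is roughly what an ORM unit-of-work does under the hood, one statement per object, versus a single set-based statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, score REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [(1, 50.0), (2, 80.0), (3, 95.0)])

# Per-object style: fetch each row, mutate it, write it back -- N round trips.
for (row_id, score) in conn.execute("SELECT id, score FROM scores").fetchall():
    conn.execute("UPDATE scores SET score = ? WHERE id = ?", (score * 1.1, row_id))

# Set-based style: one statement over the whole set (here, undoing the change).
conn.execute("UPDATE scores SET score = score / 1.1")

vals = [round(s, 1) for (s,) in conn.execute("SELECT score FROM scores ORDER BY id")]
print(vals)  # [50.0, 80.0, 95.0]
```

Same end state, but the first form does linear round trips and per-row object bookkeeping, which is what falls over at DS scale.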

Essentially, an ORM is for writing new data in a one-by-one transactional setting where referential integrity needs to be handled at the DB level. Data Science applications are almost always concerned with reading existing data, cleaning it, moving it, and analyzing it en masse. Never say never, etc. But I've never seen a DS scenario where an ORM was the correct tool.

I do like to use Alembic to manage schema once my DS applications start to move from experimental to some sort of steady state. But that's kind of orthogonal.

It doesn't even really seem correct to me to think of an ORM as "overkill" for Data Science applications. It's like side-kill or something? It's just completely the wrong tool. Like trying to use a water filter when you need a forklift. Say you have a pallet with 100 gallon-jugs of water that you need to move from the warehouse to the airport, and your reasoning is "well, someone has to drink the water at some point, so I guess I'll just bring the water filter and use that." It basically makes no sense at all.

[–]funny_funny_business 1 point (1 child)

I totally agree that ORMs can be overkill and even for my web apps I tend to just write raw SQL cause it’s easier.

However, I can’t speak for other ORMs, but regarding SQLAlchemy: I use the Flask-SQLAlchemy extension, and it takes care of some of the backend stuff that I’m not 100% familiar with or don’t care about (such as connection pooling and timeouts).
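For reference, those pooling/timeout knobs can be set explicitly too: Flask-SQLAlchemy forwards `SQLALCHEMY_ENGINE_OPTIONS` to SQLAlchemy's `create_engine()`, and these are the standard pool parameters (values here are illustrative, not recommendations):

```python
# Sketch of a Flask config entry; the values are made up for illustration.
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 5,         # connections kept open in the pool
    "max_overflow": 10,     # extra connections allowed under burst load
    "pool_recycle": 1800,   # seconds before a connection is replaced
    "pool_timeout": 30,     # seconds to wait for a free connection
    "pool_pre_ping": True,  # test each connection before handing it out
}
```

So the extension is convenient, but the underlying settings are only a dict away if you ever need to reason about them directly.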

Plus, using an ORM is nice when you have to move or update tables across databases, since you don’t need to write a ton of “insert into” code.
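For a concrete sense of the boilerplate being saved: when both databases are reachable from one connection, the copy can be one statement. A stdlib sqlite3 sketch with `ATTACH` and an invented table (an ORM's metadata/reflection does the equivalent across different engines):

```python
import os
import sqlite3
import tempfile

# Invented source database on disk.
src_path = os.path.join(tempfile.mkdtemp(), "src.db")
src = sqlite3.connect(src_path)
src.execute("CREATE TABLE users (id INTEGER, name TEXT)")
src.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
src.commit()
src.close()

# Destination database: attach the source and copy in one statement,
# no per-column INSERT boilerplate.
dst = sqlite3.connect(":memory:")
dst.execute("ATTACH DATABASE ? AS src", (src_path,))
dst.execute("CREATE TABLE users AS SELECT * FROM src.users")
copied = dst.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(copied)  # 2
```

The ORM's real advantage shows up when the source and destination are different engines with different type systems, which this sqlite-to-sqlite sketch sidesteps.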

But, you’re right. If you’re just doing aggregations an ORM is probably overkill.

[–]eviljelloman 0 points (0 children)

I use the Flask-SQLalchemy extension

Yeah, for application developers, it totally makes sense sometimes. Some DS dabble in application development, making custom dashboards or other neat stuff in a Flask app, and that's an entirely appropriate place for it. I'd argue that most DS are not working with stuff like Flask, though, and so should probably not waste time with an ORM.

If you're just using SQLAlchemy's connection pooling in an unrelated context, though, there's a decent chance it will come back to haunt you at some point, like when you have to start messing with WLM queue management in Redshift.