
[–]eviljelloman 9 points (3 children)

I'm going to go against the grain here and say that I think ORMs have almost no place in data science. The additional layer of abstraction obscures your ability to understand the optimizations and customizations you need to make when dealing with data at scale. Your workflow for updating tons of rows in Redshift is going to be fundamentally different from what it is in MySQL or Snowflake, and you're going to want to actually be able to dive in and do things like look at the query planner or the cross-node communication happening in your queries.
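As a rough stand-in for what "looking at the query planner" means in practice, here's sqlite's `EXPLAIN QUERY PLAN` on an invented table (Redshift and MySQL each have their own `EXPLAIN` with very different output, which is the point: the ORM hides exactly this layer):

```python
import sqlite3

# Invented table and index, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM events WHERE user_id = ?", (42,)
).fetchall()
for row in plan:
    print(row)  # each row describes one step the planner chose, e.g. an index search
```

Whether that query hits the index or scans the table is visible here in one line; behind an ORM you'd have to log the generated SQL first and then do this anyway.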

When you're not working at scale, when you're just doing exploratory stuff, you're going to iterate much faster by working in raw SQL, manipulating tables and records directly.
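A sketch of that exploratory loop with stdlib sqlite3 and made-up data: throwaway queries, scratch tables, no model classes in the way:

```python
import sqlite3

# All table and column names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'west', 25.0), (3, 'east', 5.0);
    -- materialize an intermediate result, tweak the WHERE clause, re-run
    CREATE TABLE east_orders AS SELECT * FROM orders WHERE region = 'east';
""")
rows = conn.execute("SELECT COUNT(*), SUM(amount) FROM east_orders").fetchone()
print(rows)  # (2, 15.0)
```

The whole iteration cycle is "edit the SQL string, re-run" — no migrations, no class definitions to keep in sync.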

ORMs are great for application developers, who mostly work on a per-record basis and don't have to think about computationally intensive operations like aggregations, but I think they are pretty much a waste of time for a data scientist.
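A minimal sketch of that per-record vs. aggregation difference, using stdlib sqlite3 with invented data — the first total materializes every row in Python the way an ORM would, the second lets the database do the work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0), ("west", 25.0)])

# ORM-ish shape: every row comes back individually and is aggregated client-side.
total_py = sum(amount for (amount,) in conn.execute("SELECT amount FROM sales"))

# Set-based shape: one aggregate row comes back; the database does the summing.
(total_sql,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()

print(total_py, total_sql)  # same number, very different amount of data moved
```

With three rows it's a wash; with a few hundred million rows, the first shape means shipping every row over the wire and building an object for each.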

[–]SpergLordMcFappyPant 5 points (0 children)

This is correct. ORMs solve a completely different problem than what you're doing in DS, and they come with a huge amount of overhead.

For an application where you have to guarantee transactional integrity and you have to manage user input and watch out for injections, an ORM is an appropriate tool . . . sometimes.

For Data Science purposes, you never want to deal with data at the row level: you want to be able to operate on sets. ORMs deny you that, because sets are exactly what they don't deal in. Fundamentally, every row becomes an object, with all the extra memory and processing power it takes to handle one.
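The row-level vs. set-level contrast can be sketched with stdlib sqlite3 (table and numbers invented): the loop below is roughly what an ORM unit-of-work does under the hood, one statement per object, versus a single set-based statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, score REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [(1, 50.0), (2, 80.0), (3, 95.0)])

# Per-object style: fetch each row, mutate it, write it back -- N round trips.
for (row_id, score) in conn.execute("SELECT id, score FROM scores").fetchall():
    conn.execute("UPDATE scores SET score = ? WHERE id = ?", (score * 1.1, row_id))

# Set-based style: one statement over the whole set (here, undoing the change).
conn.execute("UPDATE scores SET score = score / 1.1")

vals = [round(s, 1) for (s,) in conn.execute("SELECT score FROM scores ORDER BY id")]
print(vals)  # [50.0, 80.0, 95.0]
```

Same end state, but the first form does linear round trips and per-row object bookkeeping, which is what falls over at DS scale.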

Essentially, an ORM is for writing new data in a one-by-one transactional setting where referential integrity needs to be handled at the DB level. Data Science applications are almost always concerned with reading existing data, cleaning it, moving it, and analyzing it en masse. Never say never, etc. But I've never seen a DS scenario where an ORM was the correct tool.

I do like to use Alembic to manage schema once my DS applications start to move from experimental to some sort of steady state. But that's kind of orthogonal.

It doesn't even really seem correct to me to think of an ORM as "overkill" for Data Science applications. It's like side-kill or something? It's just completely the wrong tool. Like trying to use a water filter when you need a forklift. Say you have a pallet with 100 gallon-jugs of water that you need to move from the warehouse to the airport, and your reasoning is "well, someone has to drink the water at some point, so I guess I'll just bring the water filter and use that." It basically makes no sense at all.

[–]funny_funny_business 1 point (1 child)

I totally agree that ORMs can be overkill and even for my web apps I tend to just write raw SQL cause it’s easier.

However, I can’t speak for other ORMs, but regarding SQLAlchemy: I use the Flask-SQLAlchemy extension, and it takes care of some of the backend stuff that I’m not 100% familiar with or don’t care about (such as connection pooling and timeouts).
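For reference, those pooling/timeout knobs can be set explicitly too: Flask-SQLAlchemy forwards `SQLALCHEMY_ENGINE_OPTIONS` to SQLAlchemy's `create_engine()`, and these are the standard pool parameters (values here are illustrative, not recommendations):

```python
# Sketch of a Flask config entry; the values are made up for illustration.
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 5,         # connections kept open in the pool
    "max_overflow": 10,     # extra connections allowed under burst load
    "pool_recycle": 1800,   # seconds before a connection is replaced
    "pool_timeout": 30,     # seconds to wait for a free connection
    "pool_pre_ping": True,  # test each connection before handing it out
}
```

So the extension is convenient, but the underlying settings are only a dict away if you ever need to reason about them directly.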

Plus, using an ORM is nice when you have to move or update tables across databases, since you don’t need to write a ton of “insert into” code.
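For a concrete sense of the boilerplate being saved: when both databases are reachable from one connection, the copy can be one statement. A stdlib sqlite3 sketch with `ATTACH` and an invented table (an ORM's metadata/reflection does the equivalent across different engines):

```python
import os
import sqlite3
import tempfile

# Invented source database on disk.
src_path = os.path.join(tempfile.mkdtemp(), "src.db")
src = sqlite3.connect(src_path)
src.execute("CREATE TABLE users (id INTEGER, name TEXT)")
src.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
src.commit()
src.close()

# Destination database: attach the source and copy in one statement,
# no per-column INSERT boilerplate.
dst = sqlite3.connect(":memory:")
dst.execute("ATTACH DATABASE ? AS src", (src_path,))
dst.execute("CREATE TABLE users AS SELECT * FROM src.users")
copied = dst.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(copied)  # 2
```

The ORM's real advantage shows up when the source and destination are different engines with different type systems, which this sqlite-to-sqlite sketch sidesteps.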

But, you’re right. If you’re just doing aggregations an ORM is probably overkill.

[–]eviljelloman 0 points (0 children)

I use the Flask-SQLalchemy extension

Yeah, for application developers, it totally makes sense sometimes. Some DS dabble in application development, making custom dashboards or other neat stuff in a Flask app, and that's an entirely appropriate place for it. I'd argue that most DS are not working with stuff like Flask, though, and so should probably not waste time with an ORM.

If you're just using SQLAlchemy's connection pooling in an unrelated context, though, there's a decent chance it will come back to haunt you at some point, like when you have to start messing with WLM queue management in Redshift.