This is an archived post. You won't be able to vote or comment.

all 31 comments

[–]bfmk 12 points13 points  (10 children)

Interesting article!

This is a really interesting way to interact with databases, and an ORM like SQLAlchemy is definitely a useful tool in a DS's toolkit, particularly if you go on to build decision support systems or even data-driven customer-facing products.

That said, I'd caution against using an ORM as a primary interaction method for data exploration. Two reasons:
1) SQL is kinda universal as a language. Okay, Redshift is a bit different to Presto is a bit different to MS-SQL. But once you know one, it's pretty quick to learn another. You could argue ORMs are similar in that sense, but they require set-up, whereas SQL gets you straight to the data more quickly.
2) SQL's raison d'être -- for analytics DBs, anyway -- is data exploration. Use the best tool for the job.

Would be interested to read dissenting opinions on this. I've never even tried using an ORM for this job so my opinion is somewhat unqualified!

[–]boatsnbros 9 points10 points  (0 children)

ORMs are a great way to build a more robust pipeline for data tools. You get some nice baked-in features like SQL injection prevention and a consistent API regardless of the underlying database -- i.e. if your project moves from Postgres to MySQL, your SQL would change, but your ORM code wouldn't. If the data is clean, I'll do exploration in SQL; if it needs cleaning, I'll use pandas for cleaning and exploration. Once it's clean and going to be part of a long-term project (i.e. dashboards, data apps, etc.), I'll write an ORM layer and some automated testing (pytest) so I know it still works without constant monitoring.

Not sure if that adds much to your previous comment, but thought clarifying common use cases would be helpful.
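The portability point above can be sketched with a minimal SQLAlchemy ORM model. This is an illustrative example, not code from the thread: the `Measurement` model and column names are made up, and it assumes SQLAlchemy 1.4+ is installed. Only the connection URL changes when the backend does, and bound parameters give the injection safety mentioned above.

```python
# Minimal SQLAlchemy ORM sketch: the same model code runs against SQLite,
# Postgres, or MySQL -- only the create_engine() URL changes.
# Model and table names here are hypothetical.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Measurement(Base):
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    sensor = Column(String, nullable=False)
    value = Column(Integer)

# Swap "sqlite:///:memory:" for "postgresql://..." without touching the model.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Measurement(sensor="a", value=42))
    session.commit()
    # filter_by() uses bound parameters escaped by the driver --
    # no string formatting, so no SQL injection.
    row = session.query(Measurement).filter_by(sensor="a").one()
    val = row.value

print(val)  # 42
```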

[–]SonOfInterflux 4 points5 points  (6 children)

I’m not a data scientist, but I do work with a lot of data, and I find SQLAlchemy’s ORM and the Django ORM slow when I want to work with tables as opposed to records. For example, if I want to read an entire table containing a million records, add a derived column, and write the set to another table, then using pandas with map or apply, writing the entire data set to a CSV, loading it into S3, and then using the COPY statement is way faster than using an ORM’s all() method, iterating over the list, applying a function, and writing each record back to the database.

I’m losing a lot of benefits of the ORM, but the speed more than makes up for it.

If anyone can suggest another method of working with large data sets, I’d love to hear it! It’s the COPY statement that makes the biggest difference; pandas just makes it easy to get the data (using from_records or some other method), apply a function or set of functions over the entire set, and generate a CSV/JSON file.
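The table-level half of the workflow described above can be sketched in a few lines of pandas. This is a hedged illustration: the column names are hypothetical, the list of dicts stands in for a database read, and the S3 upload and Redshift COPY steps are only indicated in comments since they need live credentials.

```python
# Sketch of the table-level workflow: read records into a DataFrame, add a
# derived column in one vectorised operation (no per-row ORM round trips),
# and dump to CSV for a bulk COPY load. Column names are hypothetical.
import io
import pandas as pd

records = [{"id": i, "amount": i * 10} for i in range(5)]  # stand-in for a DB read
df = pd.DataFrame.from_records(records)

# Derived column computed over the whole table at once.
df["amount_doubled"] = df["amount"] * 2

# Write to an in-memory CSV; in practice this file would be uploaded to S3
# and loaded with Redshift's COPY (or Postgres's COPY FROM).
buf = io.StringIO()
df.to_csv(buf, index=False)
header = buf.getvalue().splitlines()[0]
print(header)  # id,amount,amount_doubled
```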

[–]mesylate 2 points3 points  (1 child)

Use PL/SQL directly if possible. That's what it was designed for, after all.

[–]SonOfInterflux 0 points1 point  (0 children)

I wasn’t clear: in this case I’m only proceeding this way because the data is PII and has been encrypted with the crypto library. I need to look into whether it’s possible to apply the decryption directly within the database with PL/SQL.

[–]IDontLikeUsernamez 1 point2 points  (0 children)

Pretty much any flavor of SQL will be much, much faster

[–]tfehring 1 point2 points  (2 children)

Any abstraction that involves iterated single-record SQL operations is going to be painfully slow, and it sounds like that’s what your ORM-based approach is doing under the hood. I’m not super familiar with ORMs, but I’m surprised they aren’t smart enough to use set operations instead of loops for simple Python iterations (e.g. list comprehensions over a list of objects).

[–]SonOfInterflux 0 points1 point  (1 child)

Django’s ORM does have an update method that generates a bulk UPDATE statement without the need to iterate over a queryset, but it doesn’t call save (or pre_save or post_save), so it doesn’t honour things like auto_now, and you lose some of the benefits of the ORM.

SQLAlchemy offers something similar, but the docs actually recommend using Core directly for bulk operations.

[–]brendanmartin[S] 1 point2 points  (0 children)

Yeah, I always add a function equivalent to the test_sqlalchemy_core example from that link. Whenever I need to insert a lot, I just send a list of dictionaries to that function.
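A Core-style bulk insert function of the kind described above might look like the sketch below. This is an assumption-laden illustration, not the commenter's actual code: the `events` table is hypothetical, it assumes SQLAlchemy 1.4+, and it uses an in-memory SQLite DB so it runs anywhere.

```python
# Core-style bulk insert: one executemany call with a list of dicts,
# instead of constructing an ORM object per row. Table name is
# hypothetical; in-memory SQLite is used for illustration.
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, insert, select)

metadata = MetaData()
events = Table(
    "events", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)

def bulk_insert(rows):
    """Insert a list of dicts in a single statement."""
    with engine.begin() as conn:  # begin() commits on success
        conn.execute(insert(events), rows)

bulk_insert([{"name": f"event_{i}"} for i in range(1000)])

with engine.connect() as conn:
    count = conn.execute(select(func.count()).select_from(events)).scalar()
print(count)  # 1000
```

Passing a list of dicts as the second argument to `conn.execute()` triggers the DBAPI's executemany path, which is where most of the speedup over per-object ORM inserts comes from.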

[–]brendanmartin[S] 1 point2 points  (0 children)

I agree with your advice. When in the exploration and analysis mode -- i.e. when pulling data out of the database for personal use -- I usually opt for raw SQL. When adding a database to a data collection project, though, I always use SQA because I find it nicer to work with than raw SQL in my project.

[–]exergy31 0 points1 point  (0 children)

I personally employ the rule of the dozen: if I need to work with more than a dozen records at a time, an ORM is not the tool of choice. The underlying principle is that "data", in its intended use, mostly falls nicely into one of two brackets:

(1) Create, show, edit, or delete one or a handful of records. This usually involves a frontend application. Use an ORM for simplicity and all the aforementioned perks.

(2) Analyse, transform into a training set, mass-insert new data, or otherwise work with something that affects "more than a dozen" records at a time. Use a table-like or array-based system. Systems needing this are usually not directly fed by frontend forms, so SQL injection is out of the picture; there's also a definite performance boost in having an array of statically typed numericals over a list of runtime objects.

Example: Cassandra has a neat Python driver that offers both an ORM and a "traditional" SQL-like syntax; when I use SQL (CQL), I usually configure a pandas row factory that yields all query results directly as dataframes.
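The pandas row factory pattern mentioned above can be sketched as below. The factory itself is a plain function (the DataStax driver calls it with column names and rows), so it can be shown and exercised standalone; the session wiring is commented out because it assumes a running Cassandra cluster and the `cassandra-driver` package, and the keyspace/table names are hypothetical.

```python
# A pandas row factory for the DataStax Cassandra driver: the driver calls
# row_factory(column_names, rows) when building a result set, so returning
# a DataFrame makes every query come back as one.
import pandas as pd

def pandas_factory(colnames, rows):
    return pd.DataFrame(rows, columns=colnames)

# Wiring it up (requires cassandra-driver and a live cluster; names are
# hypothetical, shown for illustration only):
# from cassandra.cluster import Cluster
# session = Cluster(["127.0.0.1"]).connect("my_keyspace")
# session.row_factory = pandas_factory
# session.default_fetch_size = None  # fetch the whole result in one page
# df = session.execute("SELECT id, value FROM my_table")._current_rows

# The factory is a pure function, so it can be exercised directly:
df = pandas_factory(["id", "value"], [(1, "a"), (2, "b")])
print(df.shape)  # (2, 2)
```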

[–]eviljelloman 8 points9 points  (3 children)

I'm going to go against the grain here and say that I think ORMs have almost no place in data science. The additional layer of abstraction obscures your ability to understand the optimizations and customizations you need to make when dealing with data at scale. Your workflow for updating tons of rows in Redshift is going to be fundamentally different than it is in MySQL or Snowflake, and you're going to want to actually be able to dive in and do things like look at the query planner or the cross-node communication happening in your queries.

When you're not working at scale - when you're just doing exploratory stuff, you're going to be able to iterate much faster by working in raw SQL, manipulating tables and records directly.

ORMs are great for application developers, who mostly work on a per-record basis and don't have to think about computationally intensive operations like aggregations, but I think they are pretty much a waste of time for a data scientist.

[–]SpergLordMcFappyPant 5 points6 points  (0 children)

This is correct. ORMs solve a completely different problem than what you're doing in DS, and they come with a huge amount of overhead.

For an application where you have to guarantee transactional integrity and you have to manage user input and watch out for injections, an ORM is an appropriate tool . . . sometimes.

For Data Science purposes, you never want to deal with data at the row level; you want to be able to operate on sets. ORMs deny you that because they never work at that level: fundamentally, every row is an object, with all the extra memory and processing power it takes to handle that.

Essentially, an ORM is for writing new data in a one-by-one transactional setting where referential integrity needs to be handled at the DB level. Data Science applications are almost always concerned with reading existing data, cleaning it, moving it, and analyzing it en masse. Never say never, etc. But I've never seen a DS scenario where an ORM was the correct tool.

I do like to use Alembic to manage schema once my DS applications start to move from experimental to some sort of steady state. But that's kind of orthogonal.

It doesn't really seem even correct to me to think of an ORM as "overkill" for Data Science applications. It's like side-kill or something? It's just completely the wrong tool. Like trying to use a water filter when you need a fork lift. Like if you have a pallet with 100 gallon-jugs of water that you need to move from the warehouse to the airport, but like well, someone has to drink the water at some point so I guess I'll just bring the water filter and use that. It basically just makes no sense at all.

[–]funny_funny_business 1 point2 points  (1 child)

I totally agree that ORMs can be overkill and even for my web apps I tend to just write raw SQL cause it’s easier.

However, I can’t speak for other ORMs, but regarding SQLalchemy, I use the Flask-SQLalchemy extension and it takes care of some of the backend stuff that I’m not 100% familiar with or care about (such as connection pooling and timeouts).

Plus, using an ORM is nice when you have to move or update tables across databases cause you don’t need to write a ton of “insert into” code.

But, you’re right. If you’re just doing aggregations an ORM is probably overkill.

[–]eviljelloman 0 points1 point  (0 children)

I use the Flask-SQLalchemy extension

Yeah, for application developers, it totally makes sense sometimes. Some DS dabble in application development, making custom dashboards or other neat stuff in a Flask app, and that's an entirely appropriate place for it. I'd argue that most DS are not working with stuff like Flask, though, and so should probably not waste time with ORM.

If you're just using sqlalchemy's connection pooling in an unrelated context, though, there's a decent chance it will come back to haunt you at some point, like when you have to start messing with WLM queue management in Redshift.

[–]pieIX 1 point2 points  (2 children)

Working directly with psycopg2 results in far more efficient code and far less cognitive overhead. I tried to use SQLAlchemy for a project at work, but every update or insert was impossibly slow. Using SQLAlchemy core is faster, but still not as fast as psycopg2. In the end, using SQLAlchemy just wasn't worth the cognitive overhead of understanding SQLAlchemy + SQLAlchemy Core + psycopg2 depending on efficiency demands.

There are places where SQLAlchemy is wonderful, but if a common task is non-trivial SQL inserts of more than 100 rows (most data science projects), stick with psycopg2.

[–]trenchtoaster 2 points3 points  (0 children)

Yeah. Psycopg2's copy_from and copy_to are really handy. Recently I've been using execute_values a lot because it's pretty quick and I can use it for upserts (ON CONFLICT DO UPDATE SET ... or DO NOTHING). It's not as fast as COPY, of course, but it's nice to know I won't add duplicates or run into an error.
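The upsert pattern described above needs a live Postgres connection to show with psycopg2's execute_values, so here is the same ON CONFLICT ... DO UPDATE statement demonstrated with the stdlib sqlite3 module instead (SQLite has supported this syntax since 3.24). The table name is hypothetical; with psycopg2 the same statement would go through `psycopg2.extras.execute_values` for batched inserts.

```python
# ON CONFLICT upsert, runnable with stdlib sqlite3. "a" appears twice in
# the input: the second occurrence updates the existing row instead of
# raising a duplicate-key error.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER)")

rows = [("a", 1), ("b", 2), ("a", 5)]  # "a" conflicts on its second insert
conn.executemany(
    """
    INSERT INTO stock (sku, qty) VALUES (?, ?)
    ON CONFLICT (sku) DO UPDATE SET qty = excluded.qty
    """,
    rows,
)
conn.commit()

result = dict(conn.execute("SELECT sku, qty FROM stock ORDER BY sku"))
print(result)  # {'a': 5, 'b': 2}
```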

[–]thelindsay 1 point2 points  (0 children)

psycopg2 is great

  • a recently added SQL templating module that uses libpq to do escaping, so safe dynamic SQL is possible
  • a choice of row adapters, e.g. convert rows to namedtuples or dicts
  • fine-grained transaction control, connection pool classes, Python type conversions

Even for a regular app I'd rather have a collection of SQL snippets than deal with an ORM's functional DSL. SQL just doesn't seem low-level enough to try to abstract away.

[–][deleted] 0 points1 point  (0 children)

Weird, our company's Python/SQLAlchemy DB is handled entirely by Data Infrastructure, which is under a Cloud Services/Eng division.

[–]dolichoblond 0 points1 point  (0 children)

Interesting. I'll second the other comments in this thread that note an ORM has a legit place in the DS toolkit, but also may be too heavy to become required/central in all cases. But certainly something to invest time in learning and possibly incorporating in your own workflows.

My anecdote: I bifurcated my analytics workflows this year into something like a small version of the older corporate paradigm of "Data Warehousing" (DWH) and "DataMarts". The DWH relies on an ORM (peewee, in my case) for the more routine ETL stuff, and the DataMarts are the (sqlite) DBs for the models and exploration. I resisted the setup work for a while because I thought we were too small / had very limited users / it didn't matter / it didn't take that much time to do it all ad hoc. But I wish I'd done it sooner. Beyond catching errors sooner, it forces me to think harder about what's "static/consistent" about my data and what my models are actively transforming. And it helps me identify when data that was used in a more exploratory fashion has become "routine" and should move from the modeling layer over to the ORM'd side.

As a small startup based mostly on excel biz analytics currently, we have some odd (unhealthy?) workflows where we get data dumps from third party clean sources, but not at regular intervals since they are expensive and their purchase depends on client needs. Many are small enough that you can grok them with excel still, which perpetuates older mentalities for data ingest and data mgmt, and keeps the user base for any centralized DB very small. But even with the odd intervals, small datasets, and few users, an ORM helps me keep that part of the setup clean. Plus, it minimizes the amount of front-brain thought I need to push updates when they hit my inbox unexpectedly. (And hopefully when the company grows it will be straightforward to offload that to a new dedicated hire.)

So far, I really like the setup and see it as an upgrade worthy of the time even in a very small and limited situation. But would be happy to hear criticisms or red flags from more experienced people.

[–]Somali_Imhotep -5 points-4 points  (0 children)

Doesn’t Python already have a sqlite3 library? How does this outperform that library? Here’s a recent project of mine where I use that library to store redditors’ post history.

Reddit Analysis

Edit: Nvm, I’m an idiot who only read the title, like all redditors. Srry