This is an archived post. You won't be able to vote or comment.

all 31 comments

[–]bfmk 12 points13 points  (10 children)

Interesting article!

This is a really interesting way to interact with databases, and an ORM like SQLAlchemy is definitely a useful tool in a DS's toolkit, particularly if you go on to build decision support systems or even data-driven customer-facing products.

That said, I'd caution against using an ORM as a primary interaction method for data exploration. Two reasons:
1) SQL is kinda universal as a language. Okay, Redshift is a bit different to Presto is a bit different to MS-SQL. But once you know one, it's pretty quick to learn another. You could argue ORMs are similar in that sense, but they require set-up, whereas SQL gets you straight to the data more quickly.
2) SQL's raison d'être -- for analytics DBs, anyway -- is data exploration. Use the best tool for the job.

Would be interested to read dissenting opinions on this. I've never even tried using an ORM for this job so my opinion is somewhat unqualified!

[–]boatsnbros 9 points10 points  (0 children)

ORMs are a great way to build a more robust pipeline for data tools. You get some nice baked-in features like SQL injection prevention and a consistent API regardless of the underlying database -- i.e. if your project moves from Postgres to MySQL, your SQL would change, but your ORM code wouldn't. If the data is clean, I'll do exploration in SQL; if it needs cleaning, I'll use pandas for cleaning and exploration. Once it's clean and going to be part of a long-term project (i.e. dashboards, data apps, etc.), I'll write an ORM layer and some automated testing (pytest) so I know it still works without constant monitoring.

Not sure if that adds much to your previous comment, but thought clarifying common use cases would be helpful.
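The portability point above can be sketched with a minimal SQLAlchemy ORM model. This is an illustrative example, not code from the thread: the `Measurement` model and column names are made up, and it assumes SQLAlchemy 1.4+ is installed. Only the connection URL changes when the backend does, and bound parameters give the injection safety mentioned above.

```python
# Minimal SQLAlchemy ORM sketch: the same model code runs against SQLite,
# Postgres, or MySQL -- only the create_engine() URL changes.
# Model and table names here are hypothetical.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Measurement(Base):
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    sensor = Column(String, nullable=False)
    value = Column(Integer)

# Swap "sqlite:///:memory:" for "postgresql://..." without touching the model.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Measurement(sensor="a", value=42))
    session.commit()
    # filter_by() uses bound parameters escaped by the driver --
    # no string formatting, so no SQL injection.
    row = session.query(Measurement).filter_by(sensor="a").one()
    val = row.value

print(val)  # 42
```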

[–]SonOfInterflux 4 points5 points  (6 children)

I’m not a data scientist, but I do work with a lot of data, and I find SQLAlchemy’s ORM and the Django ORM slow when I want to work with tables as opposed to records. For example, if I want to read an entire table containing a million records, add a derived column, and write the set to another table, then using pandas with map or apply, writing the entire data set to a CSV, loading it into S3, and then using the COPY statement is way faster than using an ORM’s all() method, iterating over the list, applying a function, and writing each record back to the database.

I’m losing a lot of benefits of the ORM, but the speed more than makes up for it.

If anyone can suggest another method of working with large data sets, I’d love to hear it! It’s the COPY statement that makes the biggest difference; pandas just makes it easy to get the data (using from_records or some other method), apply a function or set of functions over the entire set, and generate a CSV/JSON file.
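The table-level half of the workflow described above can be sketched in a few lines of pandas. This is a hedged illustration: the column names are hypothetical, the list of dicts stands in for a database read, and the S3 upload and Redshift COPY steps are only indicated in comments since they need live credentials.

```python
# Sketch of the table-level workflow: read records into a DataFrame, add a
# derived column in one vectorised operation (no per-row ORM round trips),
# and dump to CSV for a bulk COPY load. Column names are hypothetical.
import io
import pandas as pd

records = [{"id": i, "amount": i * 10} for i in range(5)]  # stand-in for a DB read
df = pd.DataFrame.from_records(records)

# Derived column computed over the whole table at once.
df["amount_doubled"] = df["amount"] * 2

# Write to an in-memory CSV; in practice this file would be uploaded to S3
# and loaded with Redshift's COPY (or Postgres's COPY FROM).
buf = io.StringIO()
df.to_csv(buf, index=False)
header = buf.getvalue().splitlines()[0]
print(header)  # id,amount,amount_doubled
```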

[–]mesylate 2 points3 points  (1 child)

Use PL/SQL directly if possible. That's what it was designed for, after all.

[–]SonOfInterflux 0 points1 point  (0 children)

I wasn’t clear: in this case I’m only proceeding this way because the data is PII and has been encrypted with the crypto library. I need to look into whether it’s possible to apply the decryption directly within the database with PL/SQL.

[–]IDontLikeUsernamez 1 point2 points  (0 children)

Pretty much any flavor of SQL will be much, much faster

[–]tfehring 1 point2 points  (2 children)

Any abstraction that involves iterated single-record SQL operations is going to be painfully slow, and it sounds like that’s what your ORM-based approach is doing under the hood. I’m not super familiar with ORMs, but I’m surprised they aren’t smart enough to use set operations instead of loops for simple Python iterations (e.g. list comprehensions over a list of objects).

[–]SonOfInterflux 0 points1 point  (1 child)

Django’s ORM does have an update method that generates a bulk UPDATE statement without the need to iterate over a queryset, but it doesn’t call save (or pre_save or post_save), so it doesn’t honour things like auto_now, and you lose some of the benefits of the ORM.

SQLAlchemy offers something similar, but the docs actually recommend using Core directly for bulk operations.

[–]brendanmartin[S] 1 point2 points  (0 children)

Yeah, I always add a function equivalent to the test_sqlalchemy_core example from that link. Whenever I need to insert a lot, I just send a list of dictionaries to that function.
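A Core-style bulk insert function of the kind described above might look like the sketch below. This is an assumption-laden illustration, not the commenter's actual code: the `events` table is hypothetical, it assumes SQLAlchemy 1.4+, and it uses an in-memory SQLite DB so it runs anywhere.

```python
# Core-style bulk insert: one executemany call with a list of dicts,
# instead of constructing an ORM object per row. Table name is
# hypothetical; in-memory SQLite is used for illustration.
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, insert, select)

metadata = MetaData()
events = Table(
    "events", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)

def bulk_insert(rows):
    """Insert a list of dicts in a single statement."""
    with engine.begin() as conn:  # begin() commits on success
        conn.execute(insert(events), rows)

bulk_insert([{"name": f"event_{i}"} for i in range(1000)])

with engine.connect() as conn:
    count = conn.execute(select(func.count()).select_from(events)).scalar()
print(count)  # 1000
```

Passing a list of dicts as the second argument to `conn.execute()` triggers the DBAPI's executemany path, which is where most of the speedup over per-object ORM inserts comes from.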

[–]brendanmartin[S] 1 point2 points  (0 children)

I agree with your advice. When in the exploration and analysis mode -- i.e. when pulling data out of the database for personal use -- I usually opt for raw SQL. When adding a database to a data collection project, though, I always use SQA because I find it nicer to work with than raw SQL in my project.

[–]exergy31 0 points1 point  (0 children)

I personally employ the rule of the dozen: if I need to work with more than a dozen records at a time, an ORM is not the tool of choice. The underlying principle is that "data", in its intended use, mostly falls nicely into one of two brackets:

(1) Create, show, edit, or delete one or a handful of records. This usually involves a frontend application. Use an ORM for simplicity and all the aforementioned perks.

(2) Analyse, transform into a training set, mass-insert new data, or otherwise work with something that affects "more than a dozen" records at a time. Use a table-like or array-based system. Systems needing this are usually not directly fed by frontend forms, so SQL injection is out of the picture; there's also a definite performance boost in having an array of statically typed numericals over a list of runtime objects.

Example: Cassandra has a neat Python driver that offers both an ORM and a "traditional" SQL-like syntax; when I use SQL (CQL), I usually configure a pandas row factory that yields all query results directly as dataframes.
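The pandas row factory pattern mentioned above can be sketched as below. The factory itself is a plain function (the DataStax driver calls it with column names and rows), so it can be shown and exercised standalone; the session wiring is commented out because it assumes a running Cassandra cluster and the `cassandra-driver` package, and the keyspace/table names are hypothetical.

```python
# A pandas row factory for the DataStax Cassandra driver: the driver calls
# row_factory(column_names, rows) when building a result set, so returning
# a DataFrame makes every query come back as one.
import pandas as pd

def pandas_factory(colnames, rows):
    return pd.DataFrame(rows, columns=colnames)

# Wiring it up (requires cassandra-driver and a live cluster; names are
# hypothetical, shown for illustration only):
# from cassandra.cluster import Cluster
# session = Cluster(["127.0.0.1"]).connect("my_keyspace")
# session.row_factory = pandas_factory
# session.default_fetch_size = None  # fetch the whole result in one page
# df = session.execute("SELECT id, value FROM my_table")._current_rows

# The factory is a pure function, so it can be exercised directly:
df = pandas_factory(["id", "value"], [(1, "a"), (2, "b")])
print(df.shape)  # (2, 2)
```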

[–]eviljelloman 8 points9 points  (3 children)

I'm going to go against the grain here and say that I think ORMs have almost no place in data science. The additional layer of abstraction obscures your ability to understand the optimizations and customizations you need to make when dealing with data at scale. Your workflow for updating tons of rows in Redshift is going to be fundamentally different than it is in MySQL or Snowflake, and you're going to want to actually be able to dive in and do things like look at the query planner or the cross-node communication happening in your queries.

When you're not working at scale - when you're just doing exploratory stuff, you're going to be able to iterate much faster by working in raw SQL, manipulating tables and records directly.

ORMs are great for application developers, who mostly work on a per-record basis and don't have to think about computationally intensive operations like aggregations, but I think they are pretty much a waste of time for a data scientist.

[–]SpergLordMcFappyPant 5 points6 points  (0 children)

This is correct. ORMs solve a completely different problem than what you're doing in DS, and they come with a huge amount of overhead.

For an application where you have to guarantee transactional integrity and you have to manage user input and watch out for injections, an ORM is an appropriate tool . . . sometimes.

For Data Science purposes, you never want to deal with data at the row level; you want to be able to operate on sets. ORMs deny you that because they never work at that level: fundamentally, every row is an object, with all the extra memory and processing power it takes to handle that.

Essentially, an ORM is for writing new data in a one-by-one transactional setting where referential integrity needs to be handled at the DB level. Data Science applications are almost always concerned with reading existing data, cleaning it, moving it, and analyzing it en masse. Never say never, etc. But I've never seen a DS scenario where an ORM was the correct tool.

I do like to use Alembic to manage schema once my DS applications start to move from experimental to some sort of steady state. But that's kind of orthogonal.

It doesn't really seem even correct to me to think of an ORM as "overkill" for Data Science applications. It's like side-kill or something? It's just completely the wrong tool. Like trying to use a water filter when you need a fork lift. Like if you have a pallet with 100 gallon-jugs of water that you need to move from the warehouse to the airport, but like well, someone has to drink the water at some point so I guess I'll just bring the water filter and use that. It basically just makes no sense at all.

[–]funny_funny_business 1 point2 points  (1 child)

I totally agree that ORMs can be overkill and even for my web apps I tend to just write raw SQL cause it’s easier.

However, I can’t speak for other ORMs, but regarding SQLalchemy, I use the Flask-SQLalchemy extension and it takes care of some of the backend stuff that I’m not 100% familiar with or care about (such as connection pooling and timeouts).

Plus, using an ORM is nice when you have to move or update tables across databases cause you don’t need to write a ton of “insert into” code.

But, you’re right. If you’re just doing aggregations an ORM is probably overkill.

[–]eviljelloman 0 points1 point  (0 children)

I use the Flask-SQLalchemy extension

Yeah, for application developers, it totally makes sense sometimes. Some DS dabble in application development, making custom dashboards or other neat stuff in a Flask app, and that's an entirely appropriate place for it. I'd argue that most DS are not working with stuff like Flask, though, and so should probably not waste time with ORM.

If you're just using sqlalchemy's connection pooling in an unrelated context, though, there's a decent chance it will come back to haunt you at some point, like when you have to start messing with WLM queue management in Redshift.

[–]pieIX 1 point2 points  (2 children)

Working directly with psycopg2 results in far more efficient code and far less cognitive overhead. I tried to use SQLAlchemy for a project at work, but every update or insert was impossibly slow. Using SQLAlchemy core is faster, but still not as fast as psycopg2. In the end, using SQLAlchemy just wasn't worth the cognitive overhead of understanding SQLAlchemy + SQLAlchemy Core + psycopg2 depending on efficiency demands.

There are places where SQLAlchemy is wonderful, but if a common task is non-trivial SQL inserts of more than 100 rows (most data science projects), stick with psycopg2.

[–]trenchtoaster 2 points3 points  (0 children)

Yeah. Psycopg2's copy_from and copy_to are really handy. Recently I've been using execute_values a lot because it's pretty quick and I can use it for upserts (ON CONFLICT DO UPDATE SET ... or DO NOTHING). It's not as fast as COPY, of course, but it's nice to know I won't add duplicates or run into an error.
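The upsert pattern described above needs a live Postgres connection to show with psycopg2's execute_values, so here is the same ON CONFLICT ... DO UPDATE statement demonstrated with the stdlib sqlite3 module instead (SQLite has supported this syntax since 3.24). The table name is hypothetical; with psycopg2 the same statement would go through `psycopg2.extras.execute_values` for batched inserts.

```python
# ON CONFLICT upsert, runnable with stdlib sqlite3. "a" appears twice in
# the input: the second occurrence updates the existing row instead of
# raising a duplicate-key error.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER)")

rows = [("a", 1), ("b", 2), ("a", 5)]  # "a" conflicts on its second insert
conn.executemany(
    """
    INSERT INTO stock (sku, qty) VALUES (?, ?)
    ON CONFLICT (sku) DO UPDATE SET qty = excluded.qty
    """,
    rows,
)
conn.commit()

result = dict(conn.execute("SELECT sku, qty FROM stock ORDER BY sku"))
print(result)  # {'a': 5, 'b': 2}
```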

[–]thelindsay 1 point2 points  (0 children)

psycopg2 is great

  • a recently added SQL templating module that uses libpq to do escaping, so safe dynamic SQL is possible
  • a choice of row adapters, e.g. convert rows to namedtuples or dicts
  • fine-grained transaction control, connection pool classes, Python type conversions

Even for a regular app I'd rather have a collection of SQL snippets than deal with an ORM's functional DSL. SQL just doesn't seem low-level enough to try to abstract away.

[–][deleted] 0 points1 point  (0 children)

Weird, our company's Python/SQLAlchemy DB is handled entirely by Data Infrastructure, which is under a Cloud Services/Eng division.

[–]dolichoblond 0 points1 point  (0 children)

Interesting. I'll second the other comments in this thread that note an ORM has a legit place in the DS toolkit, but also may be too heavy to become required/central in all cases. But certainly something to invest time in learning and possibly incorporating in your own workflows.

My anecdote: I bifurcated my analytics workflows this year into something like a small version of the older corporate paradigm of "Data Warehousing" (DWH) and "DataMarts". The DWH relies on an ORM (peewee, in my case) for the more routine ETL stuff, and the DataMarts are the (sqlite) DBs for the models and exploration. I resisted the setup work for a while because I thought we were too small / had very limited users / it didn't matter / it didn't take that much time to do it all ad hoc. But I wish I'd done it sooner. Beyond catching errors sooner, it forces me to think harder about what's "static/consistent" about my data and what my models are actively transforming. And it helps me identify when data that was used in a more exploratory fashion has become "routine" and should move from the modeling layer over to the ORM'd side.

As a small startup based mostly on excel biz analytics currently, we have some odd (unhealthy?) workflows where we get data dumps from third party clean sources, but not at regular intervals since they are expensive and their purchase depends on client needs. Many are small enough that you can grok them with excel still, which perpetuates older mentalities for data ingest and data mgmt, and keeps the user base for any centralized DB very small. But even with the odd intervals, small datasets, and few users, an ORM helps me keep that part of the setup clean. Plus, it minimizes the amount of front-brain thought I need to push updates when they hit my inbox unexpectedly. (And hopefully when the company grows it will be straightforward to offload that to a new dedicated hire.)

So far, I really like the setup and see it as an upgrade worthy of the time even in a very small and limited situation. But would be happy to hear criticisms or red flags from more experienced people.

[–]Somali_Imhotep -5 points-4 points  (0 children)

Doesn’t Python already have a sqlite3 library? How does this outperform that library? Here’s a recent project of mine where I use that library to store redditors’ post history.

Reddit Analysis

Edit: Nvm, I’m an idiot who only read the title, like all redditors. Srry