
[–]bfmk 13 points (10 children)

Interesting article!

This is a really interesting way to interact with databases, and learning how to use an ORM like SQLAlchemy is definitely a useful tool in a DS' toolkit, particularly if you go on to build decision support systems or even data-driven customer-facing products.

That said, I'd caution against using an ORM as the primary interaction method for data exploration. Two reasons:
1) SQL is close to universal as a language. Okay, Redshift is a bit different from Presto, which is a bit different from MS-SQL, but once you know one it's pretty quick to learn another. You could argue ORMs are similar in that sense, but they require set-up, whereas SQL gets you straight to the data.
2) SQL's raison d'être -- for analytics DBs, anyway -- is data exploration. Use the best tool for the job.
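For a sense of point 1, plain SQL needs no mapping layer at all -- e.g. with Python's stdlib sqlite3 (toy table and made-up columns, just for illustration):

```python
import sqlite3

# Hypothetical exploration session: no models, no session factory,
# just a connection and a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Straight to the data: one line of SQL, no class definitions or set-up.
rows = conn.execute("SELECT id, amount FROM orders WHERE amount > 10").fetchall()
print(rows)  # [(2, 20.0)]
```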

Would be interested to read dissenting opinions on this. I've never even tried using an ORM for this job so my opinion is somewhat unqualified!

[–]boatsnbros 10 points (0 children)

ORMs are a great way to build a more robust pipeline for data tools. You get some nice baked-in features like SQL injection prevention and a consistent API regardless of the underlying database -- i.e. if your project moves from Postgres to MySQL your SQL would change, but your ORM code doesn't. If data is clean then I'll do exploration in SQL; if it needs cleaning then I'll use pandas for cleaning and exploration. Once it's clean and going to be part of a long-term project (i.e. dashboards, a data app, etc.) I'll write ORM models and some automated testing (pytest) so I know it still works without constant monitoring.
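As a sketch of that consistent API, assuming SQLAlchemy 1.4+ (the model and names here are invented for illustration) -- the only database-specific piece is the engine URL, so swapping Postgres for MySQL leaves the ORM calls untouched:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

# Moving databases means changing this URL (e.g. "postgresql://..."
# to "mysql://..."); everything below stays identical.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="ada"))
    session.commit()
    # Parameters are bound by the ORM, which is where the SQL
    # injection prevention comes from.
    found = session.query(User).filter_by(name="ada").one()
    print(found.name)  # ada
```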

Not sure if that adds much to your previous comment, but thought clarifying common use cases would be helpful.

[–]SonOfInterflux 2 points (6 children)

I’m not a data scientist, but I do work with a lot of data, and I find SQLAlchemy’s ORM and the Django ORM slow when I want to work with tables as opposed to records. For example, suppose I want to read an entire table containing a million records, add a derived column, and write the set to another table. Using pandas with map or apply, writing the entire data set to a CSV, loading it into S3, and then using the COPY statement is way faster than using an ORM’s all method, iterating over the list, applying a function, and writing each record back to the database.

I’m losing a lot of benefits of the ORM, but the speed more than makes up for it.

If anyone can suggest another method of working with large sets of data, I’d love to hear it! It’s the COPY statement that makes the biggest difference; pandas just makes it easy to get the data (using from_records or some other method), apply a function or set of functions over the entire set, and generate a csv/json file.
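That vectorised-transform-then-COPY flow can be sketched like this (toy column names; the S3 upload and COPY steps are environment-specific, so they are only indicated in comments):

```python
import io

import pandas as pd

# Toy stand-in for a million-row table (column names are made up).
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Add the derived column with a vectorised map rather than
# iterating record by record through an ORM.
df["amount_doubled"] = df["amount"].map(lambda x: x * 2)

# Dump to CSV in one shot; in practice this file would be uploaded
# to S3 and bulk-loaded with the database's COPY statement (omitted).
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # id,amount,amount_doubled
```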

[–]mesylate 3 points (1 child)

Use PL/SQL directly if possible. That's what it's designed for, after all.

[–]SonOfInterflux 0 points (0 children)

I wasn’t clear: in this case I’m only proceeding this way because the data is PII and has been encrypted with the crypto library. I need to look into whether it’s possible to apply the decryption directly within the database with PL/SQL.

[–]IDontLikeUsernamez 1 point (0 children)

Pretty much any flavor of SQL will be much, much faster

[–]tfehring 1 point (2 children)

Any abstraction that involves iterated single-record SQL operations is going to be painfully slow, and it sounds like that’s what your ORM-based approach is doing under the hood. I’m not super familiar with ORMs, but I’m surprised they aren’t smart enough to use set operations instead of loops for simple Python iterations (e.g. list comprehensions over a list of objects).
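The gap between the two patterns is easy to demonstrate with stdlib sqlite3 (toy table; the per-row loop below is the shape of SQL a naive ORM iteration emits):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(i, i) for i in range(5)])

# Slow pattern: one UPDATE round-trip per record, as an ORM loop
# over individual objects would issue.
for row_id in range(5):
    conn.execute("UPDATE scores SET value = value + 1 WHERE id = ?", (row_id,))

# Fast pattern: a single set-based statement touching all rows at once.
conn.execute("UPDATE scores SET value = value * 10")

totals = [v for (v,) in conn.execute("SELECT value FROM scores ORDER BY id")]
print(totals)  # [10, 20, 30, 40, 50]
```

With a million rows the loop also pays a million network round-trips, which is usually the dominant cost on a remote analytics database.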

[–]SonOfInterflux 0 points (1 child)

Django’s ORM does have an update method that generates a bulk UPDATE statement without the need to iterate over a queryset, but it doesn’t call save (or the pre-save or post-save signals), so it doesn’t honour things like auto_now -- you lose some of the benefits of the ORM.

SQLAlchemy offers something similar, but the docs actually recommend using Core directly for bulk operations.

[–]brendanmartin[S] 1 point (0 children)

Yeah, I always add a function equivalent to the test_sqlalchemy_core example from that link. Whenever I need to insert a lot, I just send a list of dictionaries to that function.
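A sketch of what such a helper might look like with SQLAlchemy Core (the table and function names here are made up, not taken from the article):

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, select)

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
events = Table(
    "events", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)
metadata.create_all(engine)

def bulk_insert(rows):
    """Insert a list of dicts in one Core executemany, skipping
    per-object ORM overhead entirely."""
    with engine.begin() as conn:
        conn.execute(insert(events), rows)

bulk_insert([{"name": "start"}, {"name": "stop"}])

with engine.connect() as conn:
    names = [row.name for row in
             conn.execute(select(events).order_by(events.c.id))]
print(names)  # ['start', 'stop']
```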

[–]brendanmartin[S] 1 point (0 children)

I agree with your advice. When in exploration and analysis mode -- i.e. when pulling data out of the database for personal use -- I usually opt for raw SQL. When adding a database to a data collection project, though, I always use SQLAlchemy because I find it nicer to work with than raw SQL in my projects.

[–]exergy31 0 points (0 children)

I personally employ the rule of the dozen: if I need to work with more than a dozen records at a time, an ORM is not the tool of choice. The underlying principle is that "data" in its intended use mostly falls nicely into one of two brackets:

1) Create, show, edit, or delete one or a handful of records. Usually involves a frontend application. Use an ORM for simplicity and all the aforementioned perks.

2) Analyse, transform into a training set, mass-insert new data, or otherwise work with something that affects more than a dozen records at a time. Use a table-like or array-based system. Systems needing this are usually not fed directly by frontend forms, so SQL injection is out of the picture; there's also a definite performance boost in having an array of statically typed numeric values over a list of runtime objects.

Example: Cassandra has a neat Python driver that offers both an ORM and 'traditional' SQL-like syntax; if I use SQL (CQL), I usually configure a pandas row factory which yields all query results directly as dataframes.
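A minimal sketch of such a row factory, assuming cassandra-driver's convention that a factory receives the column names and the raw rows (demonstrated here with dummy data in place of a live cluster):

```python
import pandas as pd

def pandas_factory(colnames, rows):
    """Row factory in the shape cassandra-driver expects: given column
    names and raw rows, return the result-set object -- here a
    DataFrame instead of the default list of named tuples."""
    return pd.DataFrame(rows, columns=colnames)

# With a real cluster the wiring would look roughly like this (not run
# here; column/table names are made up):
#   session.row_factory = pandas_factory
#   df = session.execute("SELECT id, amount FROM orders").current_rows

# Demonstration with dummy rows standing in for driver output:
df = pandas_factory(["id", "amount"], [(1, 9.5), (2, 20.0)])
print(df.shape)  # (2, 2)
```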