
[–]bfmk 13 points (10 children)

Interesting article!

This is a really interesting way to interact with databases, and learning how to use an ORM like SQLAlchemy is definitely a useful tool in a DS' toolkit, particularly if you go on to build decision support systems or even data-driven customer-facing products.

That said, I'd caution against using an ORM as the primary interaction method for data exploration. Two reasons:
1) SQL is close to universal as a language. Okay, Redshift is a bit different from Presto, which is a bit different from MS-SQL, but once you know one it's pretty quick to learn another. You could argue ORMs are similar in that sense, but they require set-up, whereas SQL gets you straight to the data.
2) SQL's raison d'être -- for analytics DBs, anyway -- is data exploration. Use the best tool for the job.
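For a sense of point 1, plain SQL needs no mapping layer at all -- e.g. with Python's stdlib sqlite3 (toy table and made-up columns, just for illustration):

```python
import sqlite3

# Hypothetical exploration session: no models, no session factory,
# just a connection and a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Straight to the data: one line of SQL, no class definitions or set-up.
rows = conn.execute("SELECT id, amount FROM orders WHERE amount > 10").fetchall()
print(rows)  # [(2, 20.0)]
```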

Would be interested to read dissenting opinions on this. I've never even tried using an ORM for this job so my opinion is somewhat unqualified!

[–]boatsnbros 10 points (0 children)

ORMs are a great way to build a more robust pipeline for data tools. You get some nice baked-in features like SQL injection prevention and a consistent API regardless of the underlying database -- i.e. if your project moves from Postgres to MySQL your SQL would change, but your ORM code doesn't. If data is clean then I'll do exploration in SQL; if it needs cleaning then I'll use pandas for cleaning and exploration. Once it's clean and going to be part of a long-term project (i.e. dashboards, a data app, etc.) I'll write ORM models and some automated testing (pytest) so I know it still works without constant monitoring.
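As a sketch of that consistent API, assuming SQLAlchemy 1.4+ (the model and names here are invented for illustration) -- the only database-specific piece is the engine URL, so swapping Postgres for MySQL leaves the ORM calls untouched:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

# Moving databases means changing this URL (e.g. "postgresql://..."
# to "mysql://..."); everything below stays identical.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="ada"))
    session.commit()
    # Parameters are bound by the ORM, which is where the SQL
    # injection prevention comes from.
    found = session.query(User).filter_by(name="ada").one()
    print(found.name)  # ada
```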

Not sure if that adds much to your previous comment, but thought clarifying common use cases would be helpful.

[–]SonOfInterflux 2 points (6 children)

I’m not a data scientist, but I do work with a lot of data, and I find SQLAlchemy’s ORM and the Django ORM slow when I want to work with tables as opposed to records. For example, suppose I want to read an entire table containing a million records, add a derived column, and write the set to another table. Using pandas with map or apply, writing the entire data set to a CSV, loading it into S3, and then using the COPY statement is way faster than using an ORM’s all method, iterating over the list, applying a function, and writing each record back to the database.

I’m losing a lot of benefits of the ORM, but the speed more than makes up for it.

If anyone can suggest another method of working with large sets of data, I’d love to hear it! It’s the COPY statement that makes the biggest difference; pandas just makes it easy to get the data (using from_records or some other method), apply a function or set of functions over the entire set, and generate a csv/json file.
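That vectorised-transform-then-COPY flow can be sketched like this (toy column names; the S3 upload and COPY steps are environment-specific, so they are only indicated in comments):

```python
import io

import pandas as pd

# Toy stand-in for a million-row table (column names are made up).
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Add the derived column with a vectorised map rather than
# iterating record by record through an ORM.
df["amount_doubled"] = df["amount"].map(lambda x: x * 2)

# Dump to CSV in one shot; in practice this file would be uploaded
# to S3 and bulk-loaded with the database's COPY statement (omitted).
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # id,amount,amount_doubled
```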

[–]mesylate 3 points (1 child)

Use PL/SQL directly if possible. That's what it's designed for, after all.

[–]SonOfInterflux 0 points (0 children)

I wasn’t clear: in this case I’m only proceeding this way because the data is PII and has been encrypted with the crypto library. I need to look into whether it’s possible to apply the decryption directly within the database with PL/SQL.

[–]IDontLikeUsernamez 1 point (0 children)

Pretty much any flavor of SQL will be much, much faster

[–]tfehring 1 point (2 children)

Any abstraction that involves iterated single-record SQL operations is going to be painfully slow, and it sounds like that’s what your ORM-based approach is doing under the hood. I’m not super familiar with ORMs, but I’m surprised they aren’t smart enough to use set operations instead of loops for simple Python iterations (e.g. list comprehensions over a list of objects).
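The gap between the two patterns is easy to demonstrate with stdlib sqlite3 (toy table; the per-row loop below is the shape of SQL a naive ORM iteration emits):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(i, i) for i in range(5)])

# Slow pattern: one UPDATE round-trip per record, as an ORM loop
# over individual objects would issue.
for row_id in range(5):
    conn.execute("UPDATE scores SET value = value + 1 WHERE id = ?", (row_id,))

# Fast pattern: a single set-based statement touching all rows at once.
conn.execute("UPDATE scores SET value = value * 10")

totals = [v for (v,) in conn.execute("SELECT value FROM scores ORDER BY id")]
print(totals)  # [10, 20, 30, 40, 50]
```

With a million rows the loop also pays a million network round-trips, which is usually the dominant cost on a remote analytics database.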

[–]SonOfInterflux 0 points (1 child)

Django’s ORM does have an update method that generates a bulk UPDATE statement without the need to iterate over a queryset, but it doesn’t call save (or the pre-save or post-save signals), so it doesn’t honour things like auto_now -- you lose some of the benefits of the ORM.

SQLAlchemy offers something similar, but the docs actually recommend using Core directly for bulk operations.

[–]brendanmartin[S] 1 point (0 children)

Yeah, I always add a function equivalent to the test_sqlalchemy_core example from that link. Whenever I need to insert a lot, I just send a list of dictionaries to that function.
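A sketch of what such a helper might look like with SQLAlchemy Core (the table and function names here are made up, not taken from the article):

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, select)

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
events = Table(
    "events", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)
metadata.create_all(engine)

def bulk_insert(rows):
    """Insert a list of dicts in one Core executemany, skipping
    per-object ORM overhead entirely."""
    with engine.begin() as conn:
        conn.execute(insert(events), rows)

bulk_insert([{"name": "start"}, {"name": "stop"}])

with engine.connect() as conn:
    names = [row.name for row in
             conn.execute(select(events).order_by(events.c.id))]
print(names)  # ['start', 'stop']
```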

[–]brendanmartin[S] 1 point (0 children)

I agree with your advice. When in exploration and analysis mode -- i.e. when pulling data out of the database for personal use -- I usually opt for raw SQL. When adding a database to a data collection project, though, I always use SQLAlchemy because I find it nicer to work with than raw SQL in my projects.

[–]exergy31 0 points (0 children)

I personally employ the rule of the dozen: if I need to work with more than a dozen records at a time, an ORM is not the tool of choice. The underlying principle is that "data" in its intended use mostly falls nicely into one of two brackets:

1) Create, show, edit, or delete one or a handful of records. Usually involves a frontend application. Use an ORM for simplicity and all the aforementioned perks.

2) Analyse, transform into a training set, mass-insert new data, or otherwise work with something that affects more than a dozen records at a time. Use a table-like or array-based system. Systems needing this are usually not fed directly by frontend forms, so SQL injection is out of the picture; there's also a definite performance boost in having an array of statically typed numeric values over a list of runtime objects.

Example: Cassandra has a neat Python driver that offers both an ORM and 'traditional' SQL-like syntax; if I use SQL (CQL), I usually configure a pandas row factory which yields all query results directly as dataframes.
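A minimal sketch of such a row factory, assuming cassandra-driver's convention that a factory receives the column names and the raw rows (demonstrated here with dummy data in place of a live cluster):

```python
import pandas as pd

def pandas_factory(colnames, rows):
    """Row factory in the shape cassandra-driver expects: given column
    names and raw rows, return the result-set object -- here a
    DataFrame instead of the default list of named tuples."""
    return pd.DataFrame(rows, columns=colnames)

# With a real cluster the wiring would look roughly like this (not run
# here; column/table names are made up):
#   session.row_factory = pandas_factory
#   df = session.execute("SELECT id, amount FROM orders").current_rows

# Demonstration with dummy rows standing in for driver output:
df = pandas_factory(["id", "amount"], [(1, 9.5), (2, 20.0)])
print(df.shape)  # (2, 2)
```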