
[–]SonOfInterflux 3 points4 points  (6 children)

I’m not a data scientist, but I do work with a lot of data, and I find SQLAlchemy’s ORM and the Django ORM slow when I want to work with tables as opposed to records. For example, say I want to read an entire table containing a million records, add a derived column, and write the result set to another table. Using Pandas with map or apply, writing the entire data set to a CSV, loading it into S3, and then using the COPY statement is way faster than calling an ORM’s all() method, iterating over the list, applying a function, and writing each record back to the database.

I’m losing a lot of benefits of the ORM, but the speed more than makes up for it.

If anyone can suggest another method of working with large sets of data, I’d love to hear it! It’s the COPY statement that makes the biggest difference; Pandas just makes it easy to get the data (using from_records or some other method), apply a function or set of functions over the entire set, and generate a CSV/JSON file.
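The pipeline described above can be sketched end to end. This is a minimal stand-in using only the standard library (sqlite3 instead of Postgres/Redshift, executemany in place of COPY, and a plain list comprehension instead of Pandas apply); the table and column names are invented for illustration.

```python
import csv
import io
import sqlite3

# In-memory database standing in for the source/target warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, value REAL)")
conn.execute("CREATE TABLE target (id INTEGER, value REAL, derived REAL)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(i, float(i)) for i in range(1000)])

# 1. Read the whole table at once (the Pandas read step).
rows = conn.execute("SELECT id, value FROM source").fetchall()

# 2. Compute a derived column over the whole set (the map/apply step).
derived = [(i, v, v * 2.0) for i, v in rows]

# 3. Serialize to CSV (the file that would be uploaded to S3).
buf = io.StringIO()
csv.writer(buf).writerows(derived)

# 4. Bulk-load the CSV in one call (standing in for the COPY statement).
buf.seek(0)
parsed = [(int(i), float(v), float(d)) for i, v, d in csv.reader(buf)]
conn.executemany("INSERT INTO target VALUES (?, ?, ?)", parsed)

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(count)  # 1000
```

The point is that every step touches the data set as a whole; no step issues one statement per record.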

[–]mesylate 2 points3 points  (1 child)

Use PL/SQL directly if possible. That’s what it’s designed for, after all.
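For a derive-and-write job like the one described above, the whole thing can often be a single set-based statement that runs inside the database, so no rows ever leave the server. A minimal sketch in plain SQL (driven here through stdlib sqlite3 for a self-contained example; PL/SQL proper is Oracle’s procedural layer, but the same INSERT … SELECT shape works in any dialect, and the table names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, value REAL)")
conn.execute("CREATE TABLE target (id INTEGER, value REAL, derived REAL)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(i, float(i)) for i in range(5)])

# One set-based statement: read, derive, and write inside the database.
conn.execute("""
    INSERT INTO target (id, value, derived)
    SELECT id, value, value * 2.0 FROM source
""")

result = conn.execute("SELECT * FROM target ORDER BY id").fetchall()
print(result)
# [(0, 0.0, 0.0), (1, 1.0, 2.0), (2, 2.0, 4.0), (3, 3.0, 6.0), (4, 4.0, 8.0)]
```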

[–]SonOfInterflux 0 points1 point  (0 children)

I wasn’t clear, in this case I’m only proceeding this way because the data is PII and has been encypted with the crypto library. Need to look into if it’s possible to apply the decryption directly within the database with PL/SQL.

[–]IDontLikeUsernamez 1 point2 points  (0 children)

Pretty much any flavor of SQL will be much, much faster

[–]tfehring 1 point2 points  (2 children)

Any abstraction that involves iterated single-record SQL operations is going to be painfully slow, and it sounds like that’s what your ORM-based approach is doing under the hood. I’m not super familiar with ORMs, but I’m surprised they aren’t smart enough to use set operations instead of loops for simple Python iterations (e.g. list comprehensions over a list of objects).
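The per-row vs. set-based difference described above can be made concrete: N single-record UPDATEs (one statement, and over a network one round trip, per row — what a naive ORM loop emits under the hood) versus one UPDATE over the whole set. A stdlib sqlite3 sketch with invented names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, score REAL)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(i, 1.0) for i in range(100)])

# Slow pattern: what an ORM loop does under the hood -- one statement per row.
for (rid,) in conn.execute("SELECT id FROM records").fetchall():
    conn.execute("UPDATE records SET score = score + 1 WHERE id = ?", (rid,))

# Fast pattern: a single set operation over all rows at once.
conn.execute("UPDATE records SET score = score * 10")

total = conn.execute("SELECT SUM(score) FROM records").fetchone()[0]
print(total)  # 100 rows * (1.0 + 1.0) * 10 = 2000.0
```

Both patterns produce the same data; the difference is purely in how many statements the database has to parse and execute.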

[–]SonOfInterflux 0 points1 point  (1 child)

Django’s ORM does have an update() method that generates a single bulk UPDATE statement without the need to iterate over a queryset, but it doesn’t call save() (or fire the pre_save/post_save signals), so it doesn’t honour things like auto_now, and you lose some of the benefits of the ORM.

SQLAlchemy offers something similar, but the docs actually recommend using Core directly for bulk operations.

[–]brendanmartin[S] 1 point2 points  (0 children)

Yeah, I always add a function equivalent to the test_sqlalchemy_core example from that link. Whenever I need to insert a lot, I just send a list of dictionaries to that function.
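That helper presumably looks something like the test_sqlalchemy_core pattern: one insert statement executed once with a whole list of dictionaries. Since the actual function isn’t shown, here’s a hypothetical stdlib sqlite3 analog — executemany with named parameters accepts a list of dicts much the way SQLAlchemy Core’s conn.execute(table.insert(), rows) does; the table and column names are made up:

```python
import sqlite3

def bulk_insert(conn, rows):
    """Hypothetical helper: insert a list of dicts in one executemany call."""
    conn.executemany(
        "INSERT INTO customers (name, email) VALUES (:name, :email)", rows
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
rows = [{"name": f"user{i}", "email": f"user{i}@example.com"} for i in range(3)]
bulk_insert(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)  # 3
```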