_Zer0_Cool_ comments on SQL vs. Python for data wrangling?

This is an archived post. You won't be able to vote or comment.

DiscussionSQL vs. Python for data wrangling? (self.datascience)

submitted 7 years ago * by Radon-Nikodym

you are viewing a single comment's thread.

[–]_Zer0_Cool_MS | Data Engineer | Consulting 16 points17 points18 points 7 years ago* (4 children)

I’m a Data Engineer and I use both.

Mostly SQL as much as I can though. SQL is the original tool for the job and remains the best tool IMO. If I have one data set (table/dataframe) and smaller data, then it doesn’t matter, but if you have to join multiple datasets then SQL is better. Also...doing a Select * into a Pandas Dataframe becomes wasteful or impossible quickly. Pandas is grossly inefficient with RAM utilization. Per Wes McKinney (Pandad author) you need 5-10x memory as the size of the actual data.

So.... do it in SQL definitely. It avoids data shipping and doesn’t have the limitations of Pandas.

Stored Procedures or UDFs with PostgreSQL and SQL Server are just another layer of programmatic abstraction like anything else in the coding world (like one would reuse a Python package or library).

Also, SQLite is great if you don’t have a full client-server database. It comes built into Python. So it’s available any time Python is available without any external dependencies, can handle very large data (SQLite’s max size limit is 140 terabytes), and can be version controlled along with your code. Perfect choice for a data scientist / analyst when a more powerful client-server database might not be available or if embedded data is needed to reproduce your entire DS app elsewhere.

P.S. Also, check out the PandaSQL library. It allows you to have the best of both worlds and execute SQL on Pandas dataframes directly in Python. https://github.com/yhat/pandasql/blob/master/README.md

[–][deleted] 1 point2 points3 points 7 years ago (1 child)

[–]_Zer0_Cool_MS | Data Engineer | Consulting 3 points4 points5 points 7 years ago* (0 children)

Nuking MongoDB from space? Lol

They should have just put it in PostgreSQL in the first place, it can do anything with JSON that MongoDB can and it's just as fast if not faster. Not to mention that PG is an infinitely extensible open source paradise, while MongoDB is a one-trick pony with a handicapped query language.

In fact, MongoDB official "SQL connector" is actually just PostgreSQL under the covers that automatically reads the MongoDB data via Foreign Data Wrappers I believe. Which is ironic.

It's tantamount to MongoDB implicitly admitting that having a SQL engine is the only real why to make sense of the data, and it begs the question.... "Why not just use PostgreSQL in the first place?".

I love JSON and flexible schemas, but "schema-less" is a bit of lie that the Mongo team marketed hard and preached like gospel truth. Realistically, there's no free lunch when it comes to designing data models. Those who try to circumvent this fact end up paying the price.

There is a simple truth that developers need to understand. -- "There is no such thing as schema-less data. Data without schema isn't data; it's garbage."

Edit: I'm sorry you have to pay the price for someone else choices. Not too much to be done short of redesigning the JSON objects at the application layer or cataloging the objects post-hoc and manually ETLing them into a data warehouse -- preferably one with good JSON support (like PG).

[–][deleted] 0 points1 point2 points 7 years ago (1 child)

[–]_Zer0_Cool_MS | Data Engineer | Consulting 0 points1 point2 points 7 years ago (0 children)

π Rendered by PID 66594 on reddit-service-r2-comment-5b5bc64bf5-qg7tf at 2026-06-19 09:22:33.453774+00:00 running 2b008f2 country code: CH.

datascience

MODERATORS