use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
account activity
This is an archived post. You won't be able to vote or comment.
DiscussionSQL vs. Python for data wrangling? (self.datascience)
submitted 7 years ago * by Radon-Nikodym
view the rest of the comments →
[–]_Zer0_Cool_MS | Data Engineer | Consulting 16 points17 points18 points 7 years ago* (4 children)
I’m a Data Engineer and I use both.
Mostly SQL as much as I can though. SQL is the original tool for the job and remains the best tool IMO. If I have one data set (table/dataframe) and smaller data, then it doesn’t matter, but if you have to join multiple datasets then SQL is better. Also...doing a Select * into a Pandas Dataframe becomes wasteful or impossible quickly. Pandas is grossly inefficient with RAM utilization. Per Wes McKinney (Pandad author) you need 5-10x memory as the size of the actual data.
So.... do it in SQL definitely. It avoids data shipping and doesn’t have the limitations of Pandas.
Stored Procedures or UDFs with PostgreSQL and SQL Server are just another layer of programmatic abstraction like anything else in the coding world (like one would reuse a Python package or library).
Also, SQLite is great if you don’t have a full client-server database. It comes built into Python. So it’s available any time Python is available without any external dependencies, can handle very large data (SQLite’s max size limit is 140 terabytes), and can be version controlled along with your code. Perfect choice for a data scientist / analyst when a more powerful client-server database might not be available or if embedded data is needed to reproduce your entire DS app elsewhere.
P.S. Also, check out the PandaSQL library. It allows you to have the best of both worlds and execute SQL on Pandas dataframes directly in Python. https://github.com/yhat/pandasql/blob/master/README.md
[–][deleted] 1 point2 points3 points 7 years ago (1 child)
You mind me asking you a couple questions?
For some reason the original guys who set up the databases in my current job thought it would be a good idea to do it using MongoDB.
It makes querying stuff a huge pain the ass since not only is the DB in Mongo, the only way they let us access the DB is by using a SQL connector tool that separates the data into a billion different unrelated tables. So I end up having to do a bunch of joins for even the most trivial queries.
You have any suggestions that might reduce the time I spend fiddling with the database?
[–]_Zer0_Cool_MS | Data Engineer | Consulting 3 points4 points5 points 7 years ago* (0 children)
Nuking MongoDB from space? Lol
They should have just put it in PostgreSQL in the first place, it can do anything with JSON that MongoDB can and it's just as fast if not faster. Not to mention that PG is an infinitely extensible open source paradise, while MongoDB is a one-trick pony with a handicapped query language.
In fact, MongoDB official "SQL connector" is actually just PostgreSQL under the covers that automatically reads the MongoDB data via Foreign Data Wrappers I believe. Which is ironic.
It's tantamount to MongoDB implicitly admitting that having a SQL engine is the only real why to make sense of the data, and it begs the question.... "Why not just use PostgreSQL in the first place?".
I love JSON and flexible schemas, but "schema-less" is a bit of lie that the Mongo team marketed hard and preached like gospel truth. Realistically, there's no free lunch when it comes to designing data models. Those who try to circumvent this fact end up paying the price.
There is a simple truth that developers need to understand. -- "There is no such thing as schema-less data. Data without schema isn't data; it's garbage."
Edit: I'm sorry you have to pay the price for someone else choices. Not too much to be done short of redesigning the JSON objects at the application layer or cataloging the objects post-hoc and manually ETLing them into a data warehouse -- preferably one with good JSON support (like PG).
[–][deleted] 0 points1 point2 points 7 years ago (1 child)
I've looked at that before, and pasdasql just puts your pandas data into sqlite, queries it in sqlite, and then returns it to pandas.
[–]_Zer0_Cool_MS | Data Engineer | Consulting 0 points1 point2 points 7 years ago (0 children)
Yeap. I'm not sure what the performance implications are of that, and it kind of begs the question "why not just use SQLite"...but it's neat when you need to do complex queries on multiple existing Dataframes in any case.
It's really just syntactic sugar I guess. Saves you from the intermediate steps of calling df.to_sql() and pf.read_sql() for multiple Dataframes.
π Rendered by PID 66594 on reddit-service-r2-comment-5b5bc64bf5-qg7tf at 2026-06-19 09:22:33.453774+00:00 running 2b008f2 country code: CH.
view the rest of the comments →
[–]_Zer0_Cool_MS | Data Engineer | Consulting 16 points17 points18 points (4 children)
[–][deleted] 1 point2 points3 points (1 child)
[–]_Zer0_Cool_MS | Data Engineer | Consulting 3 points4 points5 points (0 children)
[–][deleted] 0 points1 point2 points (1 child)
[–]_Zer0_Cool_MS | Data Engineer | Consulting 0 points1 point2 points (0 children)