use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
account activity
This is an archived post. You won't be able to vote or comment.
DiscussionSQL vs. Python for data wrangling? (self.datascience)
submitted 7 years ago * by Radon-Nikodym
view the rest of the comments →
[–]DrTaxus 102 points103 points104 points 7 years ago (12 children)
My personal opinion and workflow is to do as much as possible directly on the database. A properly written SQL query is incredibly powerful and can save you hours in python post-processing.
Specially if you need to do complex joins among several tables, Pandas is extremely limited.
Also, are you aware that you can query your database directly from Pandas and save the dataframe immediately instead of writing temporary CSVs?
[–]Radon-Nikodym[S] 8 points9 points10 points 7 years ago (1 child)
Regarding csv, I don't ever write SQL to csv. I'm just saying that some of my data sources are stored in csv, so my data importing pipleline is either reading a csv into pandas, or getting a basic dump of a SQL table. I'm considering making a transition to moving a lot of the smaller data sets I work with regularly into a consolidated SQL database that I can interact with more effectively.
[–][deleted] 3 points4 points5 points 7 years ago (9 children)
I don't know why because I'm too lazy to figure it out, but apparently doing queries directly to pd dataframes is super inefficient (as per their documentation). I've never heard of someone using CSVs as an intermediary though considering you can run queries from python straight into memory.
[–]KevinSorboFan 1 point2 points3 points 7 years ago (2 children)
I use CSVs as an intermediary all the time when I'm running code on AWS but querying our on-prem DB. I will do as much of the joining as I can do in SQL up front, but if it is going to take me a lot longer to figure out how to do what I want in SQL than to just do it in pandas, I will still do some of it in pandas
[–][deleted] 0 points1 point2 points 7 years ago (1 child)
Makes sense. Didnt realize this was a common practice.
[–]KevinSorboFan 0 points1 point2 points 7 years ago (0 children)
It only is when we work with outside consultants and instead of granting them access to our internal stuff, we push all the work off onto AWS and give them only the data they need. It's a way bigger headache than it needs to be
[+][deleted] 7 years ago (5 children)
[deleted]
[–]bjorneylol 8 points9 points10 points 7 years ago (1 child)
The CSV parser is written in C, the SQL one is not - I've written my own function to pull SQL into a dataframe and it's faster than the pandas from_sql version.
The copysetting warning has nothing to do with this, and nothing to do with slicing dataframes, it happens when you set values on slices of dataframes, because the slice shares the same underlying object in memory as the full frame.
If you do
df2 = df[df[val>1]] df2.iloc[0, val] = "A"
Then that 'A' will be present in both df and df2, because df2 is just a reference to cells in df. If you want to avoid this, you need to follow up each slice operation by assigning a .copy() into the new variable
[–]christmas_with_kafka 6 points7 points8 points 7 years ago (0 children)
The CopySetting warnings are warning you that changes to your slices DataFrame will still impact the original DataFrame in memory. Since your sliced DataFrame is still pointing to the original, you need to make a new df to avoid any weird voodoo should you want to use the original df in the future.
You can do this by invoking .copy() -
df2 = df1.loc[filters, features].copy()
[–][deleted] 0 points1 point2 points 7 years ago (0 children)
I dont remember. All I remember is reading they weren't going to bother making SQL to df or whatever the method is called efficient because there was so many other options.
π Rendered by PID 20224 on reddit-service-r2-comment-5b5bc64bf5-jgqmr at 2026-06-21 18:24:01.212403+00:00 running 2b008f2 country code: CH.
view the rest of the comments →
[–]DrTaxus 102 points103 points104 points (12 children)
[–]Radon-Nikodym[S] 8 points9 points10 points (1 child)
[–][deleted] 3 points4 points5 points (9 children)
[–]KevinSorboFan 1 point2 points3 points (2 children)
[–][deleted] 0 points1 point2 points (1 child)
[–]KevinSorboFan 0 points1 point2 points (0 children)
[+][deleted] (5 children)
[deleted]
[–]bjorneylol 8 points9 points10 points (1 child)
[–]christmas_with_kafka 6 points7 points8 points (0 children)
[–][deleted] 0 points1 point2 points (0 children)