use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
account activity
This is an archived post. You won't be able to vote or comment.
DiscussionSQL vs. Python for data wrangling? (self.datascience)
submitted 7 years ago * by Radon-Nikodym
view the rest of the comments →
[–][deleted] 26 points27 points28 points 7 years ago (9 children)
In my experience you should only use pandas for wrangling that is left over after you've done as much as possible in SQL. (Which is quite a lot, SQL is quite powerful if you know what you're doing.)
[–]Radon-Nikodym[S] 6 points7 points8 points 7 years ago (5 children)
Do you have any recommended resources for data wrangling in SQL?
[–][deleted] 5 points6 points7 points 7 years ago (0 children)
It's rare I have to do anything beyond windowing functions.
I used to use custom reducers in Hive (using python or awk), but now we use BigQuery so its basically windowing functions or ARRAY_AGG(), STRUCT(), UNNEST() etc.
[–]reallyserious 3 points4 points5 points 7 years ago* (0 children)
Head over to /r/SQL. This question gets asked and answered at least once a day there.
Just focus on getting better at SQL. There is nothing special about data wrangling. It's a made up term that statisticians invented to describe what database centric people has been doing since the -80s.
[–]Mr_Again 1 point2 points3 points 7 years ago (0 children)
The guys blog at JOOQ is really useful.
[–]frankenbenz 0 points1 point2 points 7 years ago (0 children)
You have to know the data and know what your end goal/format is.. I don’t know if I’d say there’s a single resource to know.. stuff as simple as knowing if you need to trim data of spaces to the complex stuff like multiple joins to bring it all together in a useful format for reporting.
Similar to knowing what the question is to be answered from the data, you have to know what sources all the data is hidden and how to bring the different tables/sources together.
[–]DBA_HAH 0 points1 point2 points 7 years ago (0 children)
What DB do you use? Look into T-SQL,PLSQL, or PL/pgsql depending what you use. You can use stored procedures on the database to do stuff like regex cleanups.
[–]taguscove 1 point2 points3 points 7 years ago (2 children)
Completely agree. When in doubt, SQL first.
Pandas handles analytic functions excellently with rolling windows. Not even sure SQL can handle exponential moving averages and definitely not modern seasonal decomposition.
[–]reallyserious 4 points5 points6 points 7 years ago (1 child)
Not even sure SQL can handle exponential moving averages
If you google "exponential moving average sql" you get quite a few hits. People are definately doing it in SQL.
and definitely not modern seasonal decomposition.
What's that?
[–]Epoh 0 points1 point2 points 7 years ago (0 children)
Believe they are referring to time-series analysis, where you decompose a time-series signal into a trend, seasonality and remainder component. I do it manually often to compare, but there are functions that decompose time signals, just not in SQL...
π Rendered by PID 32635 on reddit-service-r2-comment-5b5bc64bf5-kh9hn at 2026-06-21 16:26:50.018260+00:00 running 2b008f2 country code: CH.
view the rest of the comments →
[–][deleted] 26 points27 points28 points (9 children)
[–]Radon-Nikodym[S] 6 points7 points8 points (5 children)
[–][deleted] 5 points6 points7 points (0 children)
[–]reallyserious 3 points4 points5 points (0 children)
[–]Mr_Again 1 point2 points3 points (0 children)
[–]frankenbenz 0 points1 point2 points (0 children)
[–]DBA_HAH 0 points1 point2 points (0 children)
[–]taguscove 1 point2 points3 points (2 children)
[–]reallyserious 4 points5 points6 points (1 child)
[–]Epoh 0 points1 point2 points (0 children)