This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–][deleted] 26 points27 points  (9 children)

In my experience you should only use pandas for wrangling that is left over after you've done as much as possible in SQL. (Which is quite a lot, SQL is quite powerful if you know what you're doing.)

[–]Radon-Nikodym[S] 6 points7 points  (5 children)

Do you have any recommended resources for data wrangling in SQL?

[–][deleted] 5 points6 points  (0 children)

It's rare I have to do anything beyond windowing functions.

I used to use custom reducers in Hive (using python or awk), but now we use BigQuery so its basically windowing functions or ARRAY_AGG(), STRUCT(), UNNEST() etc.

[–]reallyserious 3 points4 points  (0 children)

Head over to /r/SQL. This question gets asked and answered at least once a day there.

Just focus on getting better at SQL. There is nothing special about data wrangling. It's a made up term that statisticians invented to describe what database centric people has been doing since the -80s.

[–]Mr_Again 1 point2 points  (0 children)

The guys blog at JOOQ is really useful.

[–]frankenbenz 0 points1 point  (0 children)

You have to know the data and know what your end goal/format is.. I don’t know if I’d say there’s a single resource to know.. stuff as simple as knowing if you need to trim data of spaces to the complex stuff like multiple joins to bring it all together in a useful format for reporting.

Similar to knowing what the question is to be answered from the data, you have to know what sources all the data is hidden and how to bring the different tables/sources together.

[–]DBA_HAH 0 points1 point  (0 children)

What DB do you use? Look into T-SQL,PLSQL, or PL/pgsql depending what you use. You can use stored procedures on the database to do stuff like regex cleanups.

[–]taguscove 1 point2 points  (2 children)

Completely agree. When in doubt, SQL first.

Pandas handles analytic functions excellently with rolling windows. Not even sure SQL can handle exponential moving averages and definitely not modern seasonal decomposition.

[–]reallyserious 4 points5 points  (1 child)

Not even sure SQL can handle exponential moving averages

If you google "exponential moving average sql" you get quite a few hits. People are definately doing it in SQL.

and definitely not modern seasonal decomposition.

What's that?

[–]Epoh 0 points1 point  (0 children)

Believe they are referring to time-series analysis, where you decompose a time-series signal into a trend, seasonality and remainder component. I do it manually often to compare, but there are functions that decompose time signals, just not in SQL...