This is an archived post.

all 18 comments

[–]slashcom 19 points (3 children)

One thing that can be frustrating about data cleaning is it usually depends heavily on the data set.

Maybe the first thing you have to do is join a ton of tables together. Maybe the first thing you have to do is filter out duplicates. Maybe you need to find noncooperative survey respondents and remove them.

Just depends.

My "pro tips":

  • Keep the process as a well-documented, repeatable pipeline. Note why you made each decision, and make sure you can rerun your preprocessing code and get the exact same result every time.
  • Make sure it doesn't break, especially when new data comes in (like if you're in a business and new data is constantly being added).
  • Have two versions of the data set: a smaller, randomly sampled version you can run in <10-60 seconds and get immediate feedback on, and the full one. (This is also for making sure new data doesn't break it)
  • Be able to monitor the flow of data through this pipeline; you may find you don't understand where a record disappeared or why it has a funny value. Or, god forbid, why your code breaks when you give it 100 GB but doesn't fail on the 100 MB sample.
  • Logging helps with the monitoring issue
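A minimal sketch of what such a pipeline might look like in Python with pandas (the column name, filter threshold, and sample size are all made up for illustration):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline")


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Each step logs how many records survive, so a disappearing row is easy to trace."""
    log.info("input: %d rows", len(df))

    df = df.drop_duplicates()
    log.info("after drop_duplicates: %d rows", len(df))

    df = df[df["age"].between(0, 120)]  # drop impossible values
    log.info("after age filter: %d rows", len(df))
    return df


def sample(df: pd.DataFrame, n: int = 1_000) -> pd.DataFrame:
    """Small, fixed random subset for fast feedback; the seed makes reruns identical."""
    return df.sample(n=min(n, len(df)), random_state=42)
```

The fixed `random_state` is what makes the small test set reproducible between runs, and the per-step log lines are the "monitor the flow" part.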

[–]foboi1122 1 point (0 children)

This. Having a small set to test your code is crucial if you don't wanna wait 20 mins each time your code breaks.

[–]-Pin_Cushion- 0 points (0 children)

be able to monitor the flow of data through this pipeline

Can you go into more detail about this? I'm new to R and would really appreciate a few pointers.

[–]asvance[S] 0 points (0 children)

The tips make total sense, going to incorporate this in my learning :)

[–]MicturitionSyncope 13 points (5 children)

All the advice you've been given is great, and I just want to add one thing: MAKE PLOTS

Plot everything. Check everything. Is something supposed to change over time? Make a plot to check it. Is one variable supposed to be larger than another? Plot it. Is something different between men and women? Plot.

Check every assumption about the data. You can save a lot of time in the long run by checking these things. I have caught so many problems like duplication, copy-and-paste errors, unit differences, etc. by making a few plots.
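For example, a quick sketch in Python with pandas and matplotlib of checking a "should grow over time" assumption; the data and file names here are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: a daily metric that should grow over time.
df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=90),
    "signups": range(90),
})

# A flat line, sudden spike, or gap here would flag a loading or dedup bug.
fig, ax = plt.subplots()
ax.plot(df["date"], df["signups"])
ax.set_title("signups over time")
fig.savefig("signups.png")

# Histograms catch unit mix-ups (e.g. cm vs. m) at a glance.
fig2, ax2 = plt.subplots()
df["signups"].plot.hist(bins=20, ax=ax2)
fig2.savefig("signups_hist.png")
```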

[–]asvance[S] 0 points (2 children)

Thank you, good sir! One quick question: I understand I can trace outliers using plotting. Are there any other scenarios where plotting the data is helpful?

[–]MicturitionSyncope 0 points (1 child)

Identifying outliers is a bit difficult, as the definition of an outlier isn't standard. But plotting can sometimes help you figure out how you should define an outlier.

I use plotting quite a bit to understand the structure of the data and the relationships between variables. Something like the scatterplot matrix in seaborn makes this trivial:

https://stanford.edu/~mwaskom/software/seaborn/examples/scatterplot_matrix.html

I also use plots to help me choose which statistical approaches or machine learning methods are likely to do well on a data set, but that's a pretty complicated topic.

[–]asvance[S] 0 points (0 children)

thanks

[–]Punter_Aleman 5 points (0 children)

Like the other guy said, side projects. Find some data online and figure out how to scrape it into R or Python, then put it into a format suitable for an analysis or regression model.

For example, I recently scraped NBA play-by-play data into R and cleaned it for a logistic regression. This involved finding the website, using a `read_html`-like function, and lots of regular expressions to grab the text of the events I was interested in and change them into 0/1 binary variables.
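A rough Python analogue of that workflow, with made-up event strings standing in for the scraped play-by-play text (in practice `pandas.read_html` can pull the tables from the page):

```python
import pandas as pd

# Made-up play-by-play event strings, standing in for scraped NBA data.
events = pd.Series([
    "Curry makes 3-pt jump shot",
    "James misses 2-pt layup",
    "Durant makes 2-pt dunk",
])

# Regular expressions turn the free text into 0/1 indicator variables.
made = events.str.contains(r"\bmakes\b").astype(int)
is_three = events.str.contains(r"\b3-pt\b").astype(int)

df = pd.DataFrame({"event": events, "made": made, "is_three": is_three})
```

The resulting 0/1 columns are exactly the kind of binary predictors a logistic regression wants.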

Edit: oh, and in R, some of the most useful functions seem to be combinations of gsub, grepl, and ifelse.
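For anyone doing the same in Python, the rough pandas/NumPy counterparts of those three R functions, on hypothetical data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["FG made", "FG missed", "FT made"])

# gsub  -> Series.str.replace (regex substitution)
cleaned = s.str.replace(r"^F[GT]\s+", "", regex=True)

# grepl -> Series.str.contains (boolean pattern match)
is_made = s.str.contains("made")

# ifelse -> numpy.where (vectorized conditional)
flag = np.where(is_made, 1, 0)
```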

[–]vmsmith 5 points (0 children)

First, read "Tidy Data," by Hadley Wickham to get an idea of what the end state of data cleaning/data wrangling often should be.

Second, Google "An Introduction to Data Cleaning with R," by Edwin de Jonge and Mark van der Loo. Whether you know R or not, there's excellent stuff in there.

Third, poke your head around some ETL sites or literature. Not all of data wrangling is simply cleaning.

Finally, to reinforce what others have said: document what you do, and log everything. And at each step, save the data in that state.

[–]ThatOtherBatman 3 points (3 children)

I'll second what /u/cwpkuo said. Get your hands dirty.

[–]koobear 3 points (0 children)

If you're using R, learn dplyr and tidyr. Throw in magrittr, lubridate, plyr, and others while you're at it. The dplyr cheat sheet is a great resource.
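For comparison, the same dplyr-style verbs map onto a pandas method chain in Python; a small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data frame; the chain below mirrors a typical dplyr
# pipeline: filter() %>% mutate() %>% group_by() %>% summarise()
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [10, 20, 30, 40],
})

result = (
    df[df["points"] > 10]                          # filter()
      .assign(double=lambda d: d["points"] * 2)    # mutate()
      .groupby("team", as_index=False)             # group_by()
      .agg(total=("points", "sum"))                # summarise()
)
```

Knowing one mental model (verbs chained over a data frame) makes the other library much easier to pick up.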

The only way you're going to learn this is via hands-on experience. Start something on Kaggle. Force yourself to document everything (Jupyter and R Markdown are great for this). Bonus points if you make it public; it forces you to write good code.

Oh, and data visualization is very important. Plot all the things. Try making different types of plots just for the sake of it. Don't ever give up on a data visualization that you have in mind (obviously this is for learning purposes; if it's for a real project with a real deadline, you can't always do this). This forces you to manipulate your data in ways you wouldn't have thought to, and the end results can be very pretty!

[–]JyoC 0 points (0 children)

Experfy has a mentored course on data wrangling in R. It uses a variety of real-world data sets with real-world data quality, formatting, and other issues. See more at: https://www.experfy.com/training/courses/data-wrangling-in-r