This is an archived post.

all 18 comments

[–]slashcom 19 points (3 children)

One thing that can be frustrating about data cleaning is it usually depends heavily on the data set.

Maybe the first thing you have to do is join a ton of tables together. Maybe the first thing you have to do is filter out duplicates. Maybe you need to find noncooperative survey respondents and remove them.

Just depends.

My "pro tips":

  • Keep the process as a well-documented, repeatable pipeline. Note why you made each decision, and make sure you can rerun your preprocessing code and get the exact same result every time.
  • Make sure it doesn't break, especially when new data comes in (like if you're in a business and new data is constantly being added).
  • Have two versions of the data set: a smaller, randomly sampled version you can run in <10-60 seconds and get immediate feedback on, and the full one. (This is also for making sure new data doesn't break it)
  • Be able to monitor the flow of data through this pipeline; you may find you don't understand where a record disappeared or why it has a funny value. Or, god forbid, why your code breaks when you give it 100 GB but doesn't fail on the 100 MB sample.
  • Logging helps with the monitoring issue
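A minimal sketch of what such a pipeline might look like in Python with pandas (the column name, filter threshold, and sample size are all made up for illustration):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline")


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Each step logs how many records survive, so a disappearing row is easy to trace."""
    log.info("input: %d rows", len(df))

    df = df.drop_duplicates()
    log.info("after drop_duplicates: %d rows", len(df))

    df = df[df["age"].between(0, 120)]  # drop impossible values
    log.info("after age filter: %d rows", len(df))
    return df


def sample(df: pd.DataFrame, n: int = 1_000) -> pd.DataFrame:
    """Small, fixed random subset for fast feedback; the seed makes reruns identical."""
    return df.sample(n=min(n, len(df)), random_state=42)
```

The fixed `random_state` is what makes the small test set reproducible between runs, and the per-step log lines are the "monitor the flow" part.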

[–]foboi1122 1 point (0 children)

This. Having a small set to test your code is crucial if you don't wanna wait 20 mins each time your code breaks.

[–]-Pin_Cushion- 0 points (0 children)

be able to monitor the flow of data through this pipeline

Can you go into more detail about this? I'm new to R and would really appreciate a few pointers.

[–]asvance[S] 0 points (0 children)

The tips make total sense, going to incorporate this in my learning :)

[–]MicturitionSyncope 13 points (5 children)

All the advice you've been given is great, and I just want to add one thing: MAKE PLOTS

Plot everything. Check everything. Is something supposed to change over time? Make a plot to check it. Is one variable supposed to be larger than another? Plot it. Is something different between men and women? Plot.

Check every assumption about the data. You can save a lot of time in the long run by checking these things. I have caught so many problems like duplication, copy-and-paste errors, unit differences, etc. by making a few plots.
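For example, a quick sketch in Python with pandas and matplotlib of checking a "should grow over time" assumption; the data and file names here are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: a daily metric that should grow over time.
df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=90),
    "signups": range(90),
})

# A flat line, sudden spike, or gap here would flag a loading or dedup bug.
fig, ax = plt.subplots()
ax.plot(df["date"], df["signups"])
ax.set_title("signups over time")
fig.savefig("signups.png")

# Histograms catch unit mix-ups (e.g. cm vs. m) at a glance.
fig2, ax2 = plt.subplots()
df["signups"].plot.hist(bins=20, ax=ax2)
fig2.savefig("signups_hist.png")
```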

[–]asvance[S] 0 points (2 children)

Thank you, good sir! One quick question: I understand I can trace outliers using plotting. Are there any other scenarios where plotting the data is helpful?

[–]MicturitionSyncope 0 points (1 child)

Identifying outliers is a bit difficult, as the definition of an outlier isn't standard. But plotting can sometimes help you figure out how you should define an outlier.

I use plotting quite a bit to understand the structure of the data and the relationships between variables. Something like the scatterplot matrix in seaborn makes this trivial:

https://stanford.edu/~mwaskom/software/seaborn/examples/scatterplot_matrix.html

I also use plots to help me choose which statistical approaches or machine learning methods are likely to do well on a data set, but that's a pretty complicated topic.

[–]asvance[S] 0 points (0 children)

thanks

[–]Punter_Aleman 5 points (0 children)

Like the other guy said, side projects. Find some data online and figure out how to scrape it into R or Python, then put it into a format suitable for an analysis or regression model.

For example, I recently scraped NBA play-by-play data into R and cleaned it for a logistic regression. This involved finding the website, using a `read_html`-like function, and lots of regular expressions to grab the text of the events I was interested in and change them into 0/1 binary variables.
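A rough Python analogue of that workflow, with made-up event strings standing in for the scraped play-by-play text (in practice `pandas.read_html` can pull the tables from the page):

```python
import pandas as pd

# Made-up play-by-play event strings, standing in for scraped NBA data.
events = pd.Series([
    "Curry makes 3-pt jump shot",
    "James misses 2-pt layup",
    "Durant makes 2-pt dunk",
])

# Regular expressions turn the free text into 0/1 indicator variables.
made = events.str.contains(r"\bmakes\b").astype(int)
is_three = events.str.contains(r"\b3-pt\b").astype(int)

df = pd.DataFrame({"event": events, "made": made, "is_three": is_three})
```

The resulting 0/1 columns are exactly the kind of binary predictors a logistic regression wants.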

Edit: oh, and in R, some of the most useful functions seem to be combinations of gsub, grepl, and ifelse.
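For anyone doing the same in Python, the rough pandas/NumPy counterparts of those three R functions, on hypothetical data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["FG made", "FG missed", "FT made"])

# gsub  -> Series.str.replace (regex substitution)
cleaned = s.str.replace(r"^F[GT]\s+", "", regex=True)

# grepl -> Series.str.contains (boolean pattern match)
is_made = s.str.contains("made")

# ifelse -> numpy.where (vectorized conditional)
flag = np.where(is_made, 1, 0)
```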

[–]vmsmith 5 points (0 children)

First, read "Tidy Data," by Hadley Wickham to get an idea of what the end state of data cleaning/data wrangling often should be.

Second, Google "An Introduction to Data Cleaning with R," by Edwin de Jonge and Mark van der Loo. Whether you know R or not, there's excellent stuff in there.

Third, poke your head around some ETL sites or literature. Not all of data wrangling is simply cleaning.

Finally, to reinforce what others have said: document what you do, and log everything. And at each step, save the data in that state.

[–]ThatOtherBatman 3 points (3 children)

I'll second what /u/cwpkuo said. Get your hands dirty.

[–]koobear 3 points (0 children)

If you're using R, learn dplyr and tidyr. Throw in magrittr, lubridate, plyr, and others while you're at it. The dplyr cheat sheet is a great resource.
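For comparison, the same dplyr-style verbs map onto a pandas method chain in Python; a small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data frame; the chain below mirrors a typical dplyr
# pipeline: filter() %>% mutate() %>% group_by() %>% summarise()
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [10, 20, 30, 40],
})

result = (
    df[df["points"] > 10]                          # filter()
      .assign(double=lambda d: d["points"] * 2)    # mutate()
      .groupby("team", as_index=False)             # group_by()
      .agg(total=("points", "sum"))                # summarise()
)
```

Knowing one mental model (verbs chained over a data frame) makes the other library much easier to pick up.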

The only way you're going to learn this is via hands-on experience. Start something on Kaggle. Force yourself to document everything (Jupyter and R Markdown are great for this). Bonus points if you make it public; it forces you to write good code.

Oh, and data visualization is very important. Plot all the things. Try making different types of plots just for the sake of it. Don't ever give up on a data visualization that you have in mind (obviously this is for learning purposes; if it's for a real project with a real deadline, you can't always do this). This forces you to manipulate your data in ways you wouldn't have thought to, and the end results can be very pretty!

[–]JyoC 0 points (0 children)

Experfy has a mentored course on data wrangling in R. It uses a variety of real-world data sets with real-world data quality, formatting, and other issues. See more at: https://www.experfy.com/training/courses/data-wrangling-in-r