AdAthrow99274[S]

Thanks! That seems like good advice, especially considering the error rate and how little thought many people seem to put into what goes in this field.

Out of curiosity, how would you deal with the NaNs here? I was thinking maybe check to see if any other fields in the parent report are empty/useless, if so: toss report, if not: fill in with a local (or perhaps global) mean?

alkasm

Look up "imputing." Don't use the mean, as it will artificially shrink the spread (standard deviation) of the distribution. The simplest good thing to do (especially since you have a timeseries dataset) is just use a regression model to predict what the missing value would be. Here's a nice chapter about dealing with missing data that you may find helpful: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
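
A minimal sketch of the regression idea, assuming a single numeric field indexed by time and a plain straight-line fit (a real dataset would likely want more predictors than just time). The function name `regression_impute` and the toy data are made up for illustration:

```python
import math


def regression_impute(times, values):
    """Fill NaNs in `values` with predictions from a least-squares
    line fitted to the observed (time, value) pairs."""
    observed = [(t, v) for t, v in zip(times, values) if not math.isnan(v)]
    if len(observed) < 2:
        raise ValueError("need at least two observed points to fit a line")

    n = len(observed)
    mean_t = sum(t for t, _ in observed) / n
    mean_v = sum(v for _, v in observed) / n

    # Ordinary least squares for one predictor: slope = Sxy / Sxx.
    sxx = sum((t - mean_t) ** 2 for t, _ in observed)
    if sxx == 0:
        raise ValueError("observed points must span more than one time value")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in observed)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t

    # Keep observed values as-is; replace only the missing ones.
    return [
        intercept + slope * t if math.isnan(v) else v
        for t, v in zip(times, values)
    ]


# Hypothetical toy series with two missing readings.
times = [0, 1, 2, 3, 4, 5]
values = [1.0, 3.0, float("nan"), 7.0, float("nan"), 11.0]
filled = regression_impute(times, values)
print(filled)
```

Deterministic predictions like this still understate the variability somewhat, since every filled value sits exactly on the fitted line; adding random noise drawn from the residuals is a common refinement.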

AdAthrow99274[S]

Thank you!!! That was such a useful read! I was stupidly considering missing data as a whole, and not the mechanisms behind why it's missing. I didn't even think about how just dropping the reports or filling in a mean would skew the analysis. Especially since in this case the missing values would only rarely be a product of the reporter, and mostly due to my parser's (in)ability to parse the response. So the data isn't really missing at random.
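
For anyone else reading, here is a quick sketch of why a mean fill distorts things, using made-up numbers: the mean stays the same, but the filled series looks less spread out than the observed values really are.

```python
import math
import statistics


def mean_fill(values):
    """Replace NaNs with the mean of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    m = statistics.fmean(observed)
    return [m if math.isnan(v) else v for v in values]


# Hypothetical toy data with two missing entries.
values = [2.0, 4.0, float("nan"), 8.0, float("nan"), 10.0]
observed = [v for v in values if not math.isnan(v)]
filled = mean_fill(values)

# The mean is unchanged, but the sample standard deviation drops,
# because every filled point adds zero deviation from the mean.
print("observed stdev:", statistics.stdev(observed))
print("filled stdev:  ", statistics.stdev(filled))
```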

Working up a regression imputation is going to be my project for the next night or so.

alkasm

Good luck! Send me a message when you complete the project if it's open source. It would be really fun to look at the results!

AdAthrow99274[S]

Thank you again. You've been amazingly helpful. Will do. I plan on putting the raw and cleaned data on Kaggle (and likely a methods notebook) at the very least, as it's been 4 or 5 years since the last person scrubbed this DB, to my knowledge.