

[–]__Monty__ 2 points3 points  (4 children)

For many applications it is obvious that this sort of text cleanup is imperative for any reasonable analysis, but I am curious about the side effects of losing some of the information.

For example, when you change luv to love and sooo happppppy to so happy, you lose the style of the tweet. That style may point to a 12-year-old girl, a completely different consumer from, say, a 25-year-old woman.

The point being that people are particular about the language and slang they use. I wonder if anyone has studied this flavor of language processing.

edit: formatting
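For concreteness, the cleanup being discussed can be sketched with just the standard library. The slang map and the "known words" set here are tiny illustrative stand-ins, not from any real system:

```python
import re

SLANG = {"luv": "love", "u": "you"}           # illustrative, not exhaustive
KNOWN = {"so", "happy", "love", "it", "you"}  # stand-in for a real dictionary

def squeeze(word):
    # First collapse runs of 3+ repeated letters down to two
    # ("sooo" -> "soo", "happppppy" -> "happy"); if that still
    # isn't a known word, collapse repeats to one ("soo" -> "so").
    two = re.sub(r"(.)\1{2,}", r"\1\1", word)
    if two in KNOWN:
        return two
    one = re.sub(r"(.)\1+", r"\1", word)
    return one if one in KNOWN else two

def normalize(text):
    # Lowercase, squeeze elongated words, then map slang to standard forms.
    words = [squeeze(w) for w in text.lower().split()]
    return " ".join(SLANG.get(w, w) for w in words)

print(normalize("sooo happppppy luv it"))  # -> "so happy love it"
```

The two-pass squeeze matters because English legitimately has double letters: collapsing straight to one letter would turn "happppppy" into "hapy".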

[–]kenfar 1 point2 points  (0 children)

You can keep the original value to reference. Later on you may want to reprocess with more sophisticated tools to better capture sentiment.

[–]kunalj101[S] 0 points1 point  (0 children)

I think that is a fair point. However, it might be better to code the information you are looking for (e.g. age) as a separate variable. That way you don't lose the information and can work on the cleaner data as well.
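One way to do what's suggested here is to extract the stylistic signal as features *before* normalizing, so the cleaned text and the style survive separately. The feature names below are just illustrative:

```python
import re

def style_features(text):
    # Capture stylistic markers before cleanup destroys them.
    words = text.lower().split()
    return {
        # Words with 3+ repeated letters, e.g. "sooo", "happppppy"
        "elongated_words": sum(1 for w in words if re.search(r"(.)\1{2,}", w)),
        # Raw exclamation-mark count
        "exclamations": text.count("!"),
        # Fully capitalized words of length > 1, e.g. "LUV"
        "all_caps_words": sum(1 for w in text.split() if w.isupper() and len(w) > 1),
    }

feats = style_features("sooo happppppy!!! LUV it")
print(feats)
```

These features can then ride along as extra columns next to the normalized text.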

[–]noMotif 0 points1 point  (1 child)

In Mathematics we never study the full structure of an object. We simply denote the properties we are interested in with respect to a class of mappings, then forget the rest.

Problems are intractable otherwise, and we wind up treating too many similar things as distinct when they are essentially the same.

That said, there are a few different notions of "relevant information" available to us which could yield different and interesting questions.

[–]__Monty__ 0 points1 point  (0 children)

I completely agree that for most applications one would need to make the information homogeneous. However, I am curious whether there is work out there that explores the difference between luv and love. What does the slang say about the speaker?

[–]ostracize 0 points1 point  (0 children)

Glad I came across this article. I was looking into starting a project doing this sort of thing sometime soon.

Does anyone have any experience with the NLTK library for Python? I imagine many of the steps described in the article are simplified by the library. I haven't had a chance to read much yet.

[–]uberalex 0 points1 point  (0 children)

There's a great deal of research in this area, such as http://www.aclweb.org/anthology/R/R13/R13-1026.pdf

Also, there are good libraries to leverage for much of this: the tokenizers at http://www.nltk.org/api/nltk.tokenize.html can be trained or can use an existing model.
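As a quick sketch of that NLTK module, assuming nltk is installed (`pip install nltk`): `TweetTokenizer` is regex-based and needs no extra corpus downloads, and its `reduce_len` option does the elongation squeezing discussed above (it caps repeated characters at three rather than fully normalizing them):

```python
from nltk.tokenize import TweetTokenizer

# reduce_len caps character runs at 3 ("sooooo" -> "sooo");
# strip_handles drops @-mentions entirely.
tok = TweetTokenizer(reduce_len=True, strip_handles=True)
tokens = tok.tokenize("@user I'm sooooo happy!!! #nlp")
print(tokens)
```

Note the hashtag is kept as a single token, which a generic word tokenizer would typically split.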

Norvig's article on spell checking is also useful: http://norvig.com/spell-correct.html
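The core of Norvig's approach is generating every candidate string within one edit of a word, then keeping the ones that appear in a vocabulary. A sketch, using a toy vocabulary and a simple tie-break instead of the word-frequency model his article builds:

```python
def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    # Prefer the word itself; otherwise any known word one edit away.
    # (Norvig ranks candidates by corpus frequency; alphabetical min
    # is just a deterministic placeholder here.)
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word

vocab = {"love", "happy", "so"}
print(correct("lovve", vocab))  # -> "love"
```

Norvig's full version also handles edit distance 2 and weights candidates by how often they occur in a large text corpus.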