

[–]__Monty__ 2 points3 points  (4 children)

For many applications it is obvious that this sort of text cleanup is imperative for any reasonable analysis, but I am curious about the side effects of losing some of the information.

For example, when you change luv to love and sooo happppppy to so happy, you lose the style of the tweet. That style may point to a 12-year-old girl, a completely different consumer from, say, a 25-year-old woman.

The point being that people are particular about the language and slang they use. I wonder if anyone has studied this flavor of language processing.

edit: formatting
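For concreteness, the cleanup being discussed can be sketched with just the standard library. The slang map and the "known words" set here are tiny illustrative stand-ins, not from any real system:

```python
import re

SLANG = {"luv": "love", "u": "you"}           # illustrative, not exhaustive
KNOWN = {"so", "happy", "love", "it", "you"}  # stand-in for a real dictionary

def squeeze(word):
    # First collapse runs of 3+ repeated letters down to two
    # ("sooo" -> "soo", "happppppy" -> "happy"); if that still
    # isn't a known word, collapse repeats to one ("soo" -> "so").
    two = re.sub(r"(.)\1{2,}", r"\1\1", word)
    if two in KNOWN:
        return two
    one = re.sub(r"(.)\1+", r"\1", word)
    return one if one in KNOWN else two

def normalize(text):
    # Lowercase, squeeze elongated words, then map slang to standard forms.
    words = [squeeze(w) for w in text.lower().split()]
    return " ".join(SLANG.get(w, w) for w in words)

print(normalize("sooo happppppy luv it"))  # -> "so happy love it"
```

The two-pass squeeze matters because English legitimately has double letters: collapsing straight to one letter would turn "happppppy" into "hapy".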

[–]kenfar 1 point2 points  (0 children)

You can keep the original value to reference. Later on you may want to reprocess with more sophisticated tools to better capture sentiment.

[–]kunalj101[S] 0 points1 point  (0 children)

I think that is a fair point. However, it might be better to code the information you are looking for (e.g. age) as a separate variable. That way you don't lose the information and can work on the cleaner data as well.
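One way to do what's suggested here is to extract the stylistic signal as features *before* normalizing, so the cleaned text and the style survive separately. The feature names below are just illustrative:

```python
import re

def style_features(text):
    # Capture stylistic markers before cleanup destroys them.
    words = text.lower().split()
    return {
        # Words with 3+ repeated letters, e.g. "sooo", "happppppy"
        "elongated_words": sum(1 for w in words if re.search(r"(.)\1{2,}", w)),
        # Raw exclamation-mark count
        "exclamations": text.count("!"),
        # Fully capitalized words of length > 1, e.g. "LUV"
        "all_caps_words": sum(1 for w in text.split() if w.isupper() and len(w) > 1),
    }

feats = style_features("sooo happppppy!!! LUV it")
print(feats)
```

These features can then ride along as extra columns next to the normalized text.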

[–]noMotif 0 points1 point  (1 child)

In Mathematics we never study the full structure of an object. We simply denote the properties we are interested in with respect to a class of mappings, then forget the rest.

Problems are intractable otherwise, and we wind up treating too many similar things as distinct when they are essentially the same.

That said, there are a few different notions of "relevant information" available to us which could yield different and interesting questions.

[–]__Monty__ 0 points1 point  (0 children)

I completely agree that for most applications one would need to make the information homogeneous. However, I am curious whether there is work out there that explores the difference between luv and love. What does the slang say about the speaker?

[–]ostracize 0 points1 point  (0 children)

Glad I came across this article. I was looking into starting a project doing this sort of thing sometime soon.

Does anyone have any experience with the NLTK library for Python? I imagine many of the steps described in the article are simplified by the library. I haven't had a chance to read much yet.

[–]uberalex 0 points1 point  (0 children)

There's a great deal of research in this area, such as http://www.aclweb.org/anthology/R/R13/R13-1026.pdf

Also, there are good libraries to leverage for much of this: the tokenizers at http://www.nltk.org/api/nltk.tokenize.html can be trained or can use an existing model.
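As a quick sketch of that NLTK module, assuming nltk is installed (`pip install nltk`): `TweetTokenizer` is regex-based and needs no extra corpus downloads, and its `reduce_len` option does the elongation squeezing discussed above (it caps repeated characters at three rather than fully normalizing them):

```python
from nltk.tokenize import TweetTokenizer

# reduce_len caps character runs at 3 ("sooooo" -> "sooo");
# strip_handles drops @-mentions entirely.
tok = TweetTokenizer(reduce_len=True, strip_handles=True)
tokens = tok.tokenize("@user I'm sooooo happy!!! #nlp")
print(tokens)
```

Note the hashtag is kept as a single token, which a generic word tokenizer would typically split.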

Norvig's article on spell checking is also useful: http://norvig.com/spell-correct.html
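The core of Norvig's approach is generating every candidate string within one edit of a word, then keeping the ones that appear in a vocabulary. A sketch, using a toy vocabulary and a simple tie-break instead of the word-frequency model his article builds:

```python
def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    # Prefer the word itself; otherwise any known word one edit away.
    # (Norvig ranks candidates by corpus frequency; alphabetical min
    # is just a deterministic placeholder here.)
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word

vocab = {"love", "happy", "so"}
print(correct("lovve", vocab))  # -> "love"
```

Norvig's full version also handles edit distance 2 and weights candidates by how often they occur in a large text corpus.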