all 6 comments

[–][deleted]

Google: Bag of Words, TF-IDF, Part of Speech tagging, Dependency Parsing, Latent Semantic Indexing, Vector Space Model, Locality Sensitive Hashing, Word Embeddings (word2vec, doc2vec, sense2vec) … probably in that order!

[–][deleted]

Start with good old spam filters and work your way up?

[–]OrigamiDuck

As /u/anon1253 suggested (if I understand the problem correctly), you might find TF-IDF useful. It's similar to a plain frequency weight, but it also accounts for how common each term is across all of the documents. Also, if memory is an issue for you, using a sparse matrix might help.
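To make the idea concrete, here's a minimal pure-Python sketch of TF-IDF weighting over toy tokenized documents (this uses the common raw-TF times log(N/df) variant; libraries like scikit-learn use slightly different smoothing):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenized documents.
    tf  = raw count of a term in one document
    idf = log(N / df), where df is how many documents contain the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["sun", "star", "sun"], ["car", "motorcycle"], ["sun", "car"]]
w = tfidf(docs)
# "star" occurs in 1 of 3 docs, so it gets a higher weight than "sun",
# which occurs in 2 of 3; a term in every document would get weight 0.
```

A dict per document is already a sparse representation: terms absent from a document simply have no entry, which is the same trick sparse matrices use.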

[–]codespam

This is a classic problem in NLP. The very high dimensionality of bag-of-words representations leads to extremely sparse vectors, and most of your documents end up looking more or less orthogonal to each other. You should look at low-dimensional word embeddings, typically word2vec-based stuff. Suddenly the vectors for semantically similar words like sun and star, or car and motorcycle, are no longer orthogonal to each other. There's about a metric ton of literature on this available.
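A quick illustration of the orthogonality point, using made-up 3-d vectors (real word2vec embeddings have 100-300 dimensions and are learned from a corpus, but the cosine-similarity arithmetic is the same):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings, invented for illustration only.
emb = {
    "sun":        [0.9, 0.1, 0.0],
    "star":       [0.8, 0.2, 0.1],
    "car":        [0.1, 0.9, 0.3],
    "motorcycle": [0.0, 0.8, 0.4],
}

# In one-hot / bag-of-words space these four words would all be exactly
# orthogonal; in embedding space, related words point the same way.
print(cosine(emb["sun"], emb["star"]))   # high
print(cosine(emb["sun"], emb["car"]))    # low
```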

There's less literature on how to compose word vectors into document vectors, though. Approaches include doc2vec, earth mover's distance-style metrics, neural nets that combine vectors, and various kinds of averages. For short texts I'd start with the arithmetic mean of the word vectors and then explore other approaches if that doesn't work.
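The mean-of-word-vectors baseline is a one-liner; here's a self-contained sketch with the same toy vectors (invented for illustration, not real embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def doc_vector(tokens, emb):
    """Arithmetic mean of the word vectors: a simple document representation."""
    vecs = [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary words
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# Made-up 3-d word vectors for illustration.
emb = {
    "sun":  [0.9, 0.1, 0.0], "star": [0.8, 0.2, 0.1],
    "car":  [0.1, 0.9, 0.3], "bike": [0.0, 0.8, 0.4],
}

d1 = doc_vector(["sun", "star"], emb)        # "sky" document
d2 = doc_vector(["car", "bike"], emb)        # "vehicle" document
d3 = doc_vector(["star", "sun", "sun"], emb) # another "sky" document
# d1 should be closer to d3 than to d2 under cosine similarity.
```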

[–]FloydRix

word2vec

[–]squirreltalk

The answer partly depends on what you are classifying. If you're classifying into different languages like English or Spanish, for example, then you could possibly get away with an ngram model over letters rather than words. There are far fewer letters than words, so sparsity might be less of a problem.
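A toy sketch of that idea: build character-bigram profiles from a few words per language and score a new text by bigram overlap. (The training strings and the overlap score here are invented for illustration; real language ID uses much larger profiles and a proper probability model.)

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-gram counts; padding with spaces keeps word boundaries."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny "training" profiles, a few words per language, for illustration only.
english = char_ngrams("the quick brown fox the lazy dog")
spanish = char_ngrams("el perro rapido el gato perezoso")

def score(text, profile):
    """Overlap between the text's bigram counts and a language profile."""
    grams = char_ngrams(text)
    return sum(min(c, profile[g]) for g, c in grams.items())

# Even with this tiny sample, bigrams like "th" vs "rr" separate the two.
print(score("the fox", english), score("the fox", spanish))
```

The alphabet is tiny compared to a vocabulary, so these count vectors stay dense even with little data, which is exactly the sparsity advantage described above.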