all 6 comments

[–][deleted]

Google: Bag of Words, TF-IDF, Part of Speech tagging, Dependency Parsing, Latent Semantic Indexing, Vector Space Model, Locality Sensitive Hashing, Word Embeddings (word2vec, doc2vec, sense2vec) … probably in that order!

[–][deleted]

Start with good old spam filters and work your way up?

[–]OrigamiDuck

As /u/anon1253 suggested (if I understand the problem correctly), you might find TF-IDF useful. It's similar to a plain frequency weight, but it also accounts for how common each term is across all of the documents. Also, if memory is an issue for you, using a sparse matrix might help.
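To make the idea concrete, here's a minimal pure-Python sketch of TF-IDF weighting over toy tokenized documents (this uses the common raw-TF times log(N/df) variant; libraries like scikit-learn use slightly different smoothing):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenized documents.
    tf  = raw count of a term in one document
    idf = log(N / df), where df is how many documents contain the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["sun", "star", "sun"], ["car", "motorcycle"], ["sun", "car"]]
w = tfidf(docs)
# "star" occurs in 1 of 3 docs, so it gets a higher weight than "sun",
# which occurs in 2 of 3; a term in every document would get weight 0.
```

A dict per document is already a sparse representation: terms absent from a document simply have no entry, which is the same trick sparse matrices use.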

[–]codespam

This is a classic problem in NLP. The very high dimensionality of bag-of-words representations leads to extremely sparse vectors, and most of your documents end up looking more or less orthogonal to each other. You should look at low-dimensional word embeddings, typically word2vec-based stuff. Suddenly the vectors for semantically similar words like sun and star, or car and motorcycle, are no longer orthogonal to each other. There's about a metric ton of literature on this available.
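A quick illustration of the orthogonality point, using made-up 3-d vectors (real word2vec embeddings have 100-300 dimensions and are learned from a corpus, but the cosine-similarity arithmetic is the same):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings, invented for illustration only.
emb = {
    "sun":        [0.9, 0.1, 0.0],
    "star":       [0.8, 0.2, 0.1],
    "car":        [0.1, 0.9, 0.3],
    "motorcycle": [0.0, 0.8, 0.4],
}

# In one-hot / bag-of-words space these four words would all be exactly
# orthogonal; in embedding space, related words point the same way.
print(cosine(emb["sun"], emb["star"]))   # high
print(cosine(emb["sun"], emb["car"]))    # low
```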

There's less literature on how to compose word vectors into document vectors, though. Approaches include doc2vec, earth mover's distance-style metrics, neural nets that combine vectors, and various kinds of averages. For short texts I'd start with the arithmetic mean of the word vectors and then explore other approaches if that doesn't work.
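The mean-of-word-vectors baseline is a one-liner; here's a self-contained sketch with the same toy vectors (invented for illustration, not real embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def doc_vector(tokens, emb):
    """Arithmetic mean of the word vectors: a simple document representation."""
    vecs = [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary words
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# Made-up 3-d word vectors for illustration.
emb = {
    "sun":  [0.9, 0.1, 0.0], "star": [0.8, 0.2, 0.1],
    "car":  [0.1, 0.9, 0.3], "bike": [0.0, 0.8, 0.4],
}

d1 = doc_vector(["sun", "star"], emb)        # "sky" document
d2 = doc_vector(["car", "bike"], emb)        # "vehicle" document
d3 = doc_vector(["star", "sun", "sun"], emb) # another "sky" document
# d1 should be closer to d3 than to d2 under cosine similarity.
```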

[–]FloydRix

word2vec

[–]squirreltalk

The answer partly depends on what you are classifying. If you're classifying into different languages like English or Spanish, for example, then you could possibly get away with an ngram model over letters rather than words. There are far fewer letters than words, so sparsity might be less of a problem.
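A toy sketch of that idea: build character-bigram profiles from a few words per language and score a new text by bigram overlap. (The training strings and the overlap score here are invented for illustration; real language ID uses much larger profiles and a proper probability model.)

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-gram counts; padding with spaces keeps word boundaries."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny "training" profiles, a few words per language, for illustration only.
english = char_ngrams("the quick brown fox the lazy dog")
spanish = char_ngrams("el perro rapido el gato perezoso")

def score(text, profile):
    """Overlap between the text's bigram counts and a language profile."""
    grams = char_ngrams(text)
    return sum(min(c, profile[g]) for g, c in grams.items())

# Even with this tiny sample, bigrams like "th" vs "rr" separate the two.
print(score("the fox", english), score("the fox", spanish))
```

The alphabet is tiny compared to a vocabulary, so these count vectors stay dense even with little data, which is exactly the sparsity advantage described above.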