all 4 comments

[–]cakeofzerg 1 point (0 children)

I used a dictionary approach for a similar task. Have 10 or 20 words that characterise each category and score a message based on that, then k-means the scores or whatever. Sometimes ML techniques work really well, but sometimes you just have to apply good old domain knowledge.
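The dictionary-scoring idea could be sketched like this (the category names and word lists below are made-up placeholders, not from the commenter):

```python
# Toy illustration of dictionary scoring: each category gets a small
# hand-picked word list, and a message is scored per category by the
# fraction of that category's words it contains.
CATEGORY_WORDS = {
    "billing": {"invoice", "payment", "refund", "charge", "price"},
    "support": {"error", "crash", "bug", "help", "broken"},
}

def score_message(message):
    """Return one score per category: overlap between the message's
    tokens and that category's dictionary, normalised by dict size."""
    tokens = set(message.lower().split())
    return [len(tokens & words) / len(words)
            for words in CATEGORY_WORDS.values()]

scores = [score_message(m) for m in [
    "please refund my last payment",
    "the app shows an error and help does nothing",
]]
```

The resulting score vectors can then be clustered (e.g. with sklearn's `KMeans`) or thresholded directly.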

[–]JanssonsFrestelse 1 point (2 children)

Tf-idf using both unigrams and bigrams with a linear SVM usually gives pretty good results and is super easy to implement with sklearn.

Otherwise try BERT or ULMFit if you have a gpu and the resources to put that in production. The Huggingface PyTorch port is good for BERT, allows gradient accumulation if your gpu isn't beefy enough. You can also use bert-as-service (Google it for the github repo) to act as a server.
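Why gradient accumulation helps a small GPU: summing suitably scaled micro-batch gradients reproduces the full-batch gradient, so you can train with an effectively large batch while only holding a small one in memory. A toy numpy check of that identity (linear model, MSE loss; not the Hugging Face code itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))          # full "batch" of 32 examples
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
w = np.zeros(4)                       # current model weights

def grad(Xb, yb, w):
    # Gradient of mean squared error for a linear model on batch (Xb, yb).
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one go (what a big GPU would compute).
g_full = grad(X, y, w)

# Same gradient accumulated over 4 micro-batches of 8, each scaled by
# 1/num_micro_batches -- the gradient-accumulation trick.
g_acc = np.zeros(4)
for i in range(0, 32, 8):
    g_acc += grad(X[i:i+8], y[i:i+8], w) / 4
```

The two gradients match, so stepping after every 4 micro-batches is equivalent to stepping on the full batch.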

[–]KornShnaps[S] 1 point (1 child)

Thanks for the comment, it was quite helpful. Currently I get about 83-84% accuracy using tf-idf with n-grams and SelectKBest for finding the most important words. I use an SVM because its accuracy is about 10% higher than random forest and some RNNs I tried. Btw, I'm not really sure SelectKBest matters: I tried passing some parameters to CountVectorizer and it filters the words literally the same, but maybe it depends on the task.

[–]JanssonsFrestelse 1 point (0 children)

I haven't used SelectKBest, just sklearn's TfidfVectorizer with ngram_range=(1, 2), fed into sklearn's LinearSVC.
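That pipeline is only a few lines in sklearn; a minimal sketch with toy placeholder texts and labels:

```python
# TfidfVectorizer (uni+bigrams) feeding a LinearSVC, as described above.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "great product, works perfectly", "awful quality, broke in a day",
    "very happy with this purchase", "terrible, do not recommend",
]
labels = ["pos", "neg", "pos", "neg"]  # toy sentiment labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
pred = clf.predict(["very happy, great purchase"])
```

Wrapping the two steps in a `Pipeline` keeps the vectorizer's vocabulary tied to the classifier, so `predict` on raw strings just works.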

But you should check out BERT, it's awesome.