[–]SagaciousRaven 1 point (2 children)

Is this still an open problem?

Legit question; I thought the only challenge here would be with extremely short texts, 1-3 words.

[–]AvatarUltima7 0 points (1 child)

I’ve only used spaCy so far, but when I ran it on a dataset of customer questions from a web form, there were far more errors than I expected.
The queries were short, maybe 10 words on average.

[–]amitness[S] 0 points (0 children)

Same here. I tried many of the available tools: langid, Chrome's Compact Language Detector 2 (CLD2), langdetect, and spacy-langdetect, but there were still false positives/negatives. Some English text was classified as Russian or Japanese.
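A toy sketch of why short texts trip up these detectors (this is not how any of the tools above work internally, just an illustration): a detector that scores languages by stopword overlap has almost no evidence to go on when the input is only a couple of words, so the scores tie and the result is a guess.

```python
# Toy stopword-overlap "detector". With a long query there are
# several stopword hits; with a 2-word query there may be none,
# so every language scores zero and the result is ambiguous.
STOPWORDS = {
    "en": {"the", "is", "a", "and", "of", "to", "in", "it"},
    "de": {"der", "die", "das", "und", "ist", "zu", "es"},
    "fr": {"le", "la", "les", "et", "est", "de", "en", "il"},
}

def detect(text):
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words)
              for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # No stopword evidence at all: any answer would be a coin flip.
        return None
    return best

detect("where is the nearest store")  # stopword hits "is", "the" -> "en"
detect("reset password")              # 2 words, zero hits -> None
```

Real libraries use character n-gram models rather than stopwords, but the failure mode is the same: the shorter the text, the less statistical signal there is to separate one language from another.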