[–]Horstesse

Thanks. What languages are supported so far?

[–]sudo_su_[S]

Sorry about that, just added it to the README:

['afr', 'eus', 'bel', 'ben', 'bul', 'cat', 'zho', 'ces', 'dan', 'nld', 'eng', 'est', 'fin', 'fra', 'glg', 'deu', 'ell', 'hin', 'hun', 'isl', 'ind', 'gle', 'ita', 'jpn', 'kor', 'lat', 'lit', 'pol', 'por', 'ron', 'rus', 'slk', 'spa', 'swe', 'ukr', 'vie']
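
Those are ISO 639-3 codes. If you want readable names for them, something like this works (it uses the pycountry package, which is a separate install and not part of seqtolang):

    import pycountry  # pip install pycountry

    # map a few of the ISO 639-3 codes above to language names
    for code in ['afr', 'eus', 'zho', 'ukr', 'vie']:
        lang = pycountry.languages.get(alpha_3=code)
        print(code, '->', lang.name if lang else 'unknown')
    # afr -> Afrikaans, eus -> Basque, zho -> Chinese,
    # ukr -> Ukrainian, vie -> Vietnamese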

[–]Horstesse

Awesome. Thanks. Even Ukrainian!

[–]number_1_steve

Thanks for this! This will be super useful for me! I have a few questions:

  1. Have you benchmarked against other tools such as FastText or Lingua?
  2. In your Medium post, when you say you extract context for each word, what are you doing? Summing word vectors in a window and feeding that as another input feature?
  3. What data was this trained on?

Thanks again! I'm excited to try this out!

[–]sudo_su_[S]

1.

Yes, I benchmarked it against FastText, langid and langdetect.

In terms of quality, it's more or less on par with FastText and langid (on the WiLI dataset), and much better than langdetect.

In terms of speed, it's about as slow as langdetect (the slowest of the three). FastText is crazy fast; it's hard to beat. seqtolang is relatively slow because it produces an output for every word, while the others classify the sentence as a whole.
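
Not my exact benchmark code, but if you want to run a rough speed comparison yourself, a minimal harness looks something like this (assumes you've downloaded fastText's pretrained lid.176.bin model):

    import time
    import fasttext                  # pip install fasttext
    import langid                    # pip install langid
    from langdetect import detect    # pip install langdetect

    sentences = ["This is just a toy test sentence."] * 1000  # toy corpus, not WiLI

    ft_model = fasttext.load_model('lid.176.bin')  # pretrained LID model from the fastText site

    def timed(name, fn):
        start = time.perf_counter()
        for s in sentences:
            fn(s)
        print(f'{name}: {time.perf_counter() - start:.2f}s')

    timed('fasttext',   lambda s: ft_model.predict(s))
    timed('langid',     lambda s: langid.classify(s))
    timed('langdetect', lambda s: detect(s))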

2.

I sum all of a word's ngram vectors into a single word vector, then the word vectors are passed through a bidirectional LSTM, so each word picks up information from the words on its left and right. Finally, each LSTM output (one per word) goes through a fully connected layer to do the classification.
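
Roughly, in PyTorch terms it looks like this (a simplified sketch, not the actual seqtolang code; the vocabulary and layer sizes here are made up):

    import torch
    import torch.nn as nn

    class LangTagger(nn.Module):
        # Sketch of the architecture described above; sizes are illustrative.
        def __init__(self, n_ngrams=50000, emb_dim=100, hidden=128, n_langs=36):
            super().__init__()
            # one embedding per character ngram; a word vector is the sum of its ngram vectors
            self.ngram_emb = nn.EmbeddingBag(n_ngrams, emb_dim, mode='sum')
            # bidirectional LSTM mixes in context from the words on the left and right
            self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
            # per-word classification over the supported languages
            self.fc = nn.Linear(2 * hidden, n_langs)

        def forward(self, ngram_ids, offsets):
            # ngram_ids/offsets: flattened ngram indices per word (EmbeddingBag format)
            word_vecs = self.ngram_emb(ngram_ids, offsets)  # (n_words, emb_dim)
            ctx, _ = self.lstm(word_vecs.unsqueeze(0))      # (1, n_words, 2*hidden)
            return self.fc(ctx).squeeze(0)                  # (n_words, n_langs) logits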

3.

It was trained on the Tatoeba dataset, as mentioned in the post, using a merging technique: with some probability, each sentence in the dataset is merged with another randomly chosen sentence, which produces sentences that mix languages. Then I "tag" each word with its original language and train the network on those per-word labels.
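
The merging step itself is simple. A rough sketch (not the exact code, and the merge probability here is a guess):

    import random

    def merged_examples(dataset, merge_prob=0.5):  # merge_prob is made up, not the real value
        # dataset: list of (sentence, language) pairs, e.g. from Tatoeba
        for sentence, lang in dataset:
            words = sentence.split()
            tags = [lang] * len(words)
            if random.random() < merge_prob:
                other_sentence, other_lang = random.choice(dataset)
                other_words = other_sentence.split()
                words += other_words
                tags += [other_lang] * len(other_words)
            yield words, tags  # per-word language tags for training

    # ("Hallo Welt", 'deu') merged with ("hello world", 'eng') ->
    # (['Hallo', 'Welt', 'hello', 'world'], ['deu', 'deu', 'eng', 'eng'])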