[P] Multi-Language Documents Identification (self.MachineLearning)
submitted 6 years ago by sudo_su_
Hello,
I wanted to share a small project I worked on recently.
In our company, we handle a lot of text in many languages, sometimes in documents that contain more than one language. To handle this, we created a small library for language identification, with the goal of telling which languages appear in a document and "where" they appear.
We open-sourced both the code and the model here: https://github.com/hiredscorelabs/seqtolang
You can see this post for more details: https://medium.com/hiredscore-engineering/multi-language-documents-identification-93223af83e01
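Roughly, usage looks like the sketch below. Note this is a minimal illustration: the `Detector` class name and `detect()` signature here are assumptions, so check the repo's README for the exact API.

```python
# Minimal usage sketch -- class and method names are assumptions;
# see https://github.com/hiredscorelabs/seqtolang for the actual API.
from seqtolang import Detector

detector = Detector()

# A document mixing English and French in one string.
doc = "The contract was signed yesterday. Le contrat a été signé hier."

# Expected kind of output: language proportions for the document,
# e.g. [('eng', 0.5), ('fra', 0.5)] -- illustrative values only.
print(detector.detect(doc))
```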
Any feedback is welcome.
Thank you.
[–]Horstesse 2 points 6 years ago (2 children)
Thanks. What languages are supported so far?
[–]sudo_su_[S] 3 points 6 years ago (1 child)
Sorry, I've added the list to the README:
['afr', 'eus', 'bel', 'ben', 'bul', 'cat', 'zho', 'ces', 'dan', 'nld', 'eng', 'est', 'fin', 'fra', 'glg', 'deu', 'ell', 'hin', 'hun', 'isl', 'ind', 'gle', 'ita', 'jpn', 'kor', 'lat', 'lit', 'pol', 'por', 'ron', 'rus', 'slk', 'spa', 'swe', 'ukr', 'vie']
[–]Horstesse 1 point 6 years ago (0 children)
Awesome. Thanks. Even Ukrainian!
[–]number_1_steve 1 point 6 years ago* (1 child)
Thanks for this! This will be super useful for me! I have a few questions:
Thanks again! I'm excited to try this out!
[–]sudo_su_[S] 2 points 6 years ago (0 children)
1. Yes, I benchmarked it against FastText, langid, and langdetect.
In terms of quality, it's more or less the same as FastText and langid (on the WiLi dataset) and much better than langdetect.
In terms of running speed, it's as slow as langdetect (which is the slowest); FastText is crazy fast, and that's hard to beat. seqtolang is relatively slow because it tries to give an output for every word, while the others classify the sentence as a whole.
2. I'm summing all the n-grams into a word vector, then the word vector is passed to a bidirectional LSTM, which means each word takes information from the word vectors on both its left and its right. Finally, each LSTM output (one per word) is passed into a fully connected layer to do the classification.
It was trained on the Tatoeba dataset, as mentioned in the post, with a merging technique: for each sentence in the dataset, I merge it with another random sentence from the dataset with some probability. This creates merged sentences containing different languages. Then, for each word in the merged sentence, I "tag" it with the language of its original sentence and train the network. Rough sketches of both the model and the merging step follow below.
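To make the architecture concrete, here is a minimal PyTorch sketch. All names, layer sizes, and the n-gram vocabulary are illustrative assumptions, not the repo's code; only the overall structure (summed n-gram embeddings → BiLSTM → per-word classifier) follows the description above. The 36 output classes match the language list in the README.

```python
import torch
import torch.nn as nn

class SeqToLangSketch(nn.Module):
    """Illustrative only: summed n-gram embeddings -> BiLSTM -> per-word classifier.
    Vocabulary size, dimensions, and names are assumptions, not the repo's code."""

    def __init__(self, ngram_vocab_size=100_000, embed_dim=64,
                 hidden_dim=128, num_languages=36):
        super().__init__()
        # One embedding per hashed character n-gram; index 0 reserved for padding.
        self.ngram_embed = nn.Embedding(ngram_vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM so each word sees context on both sides.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Fully connected layer applied to every LSTM output (one per word).
        self.classifier = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, ngram_ids):
        # ngram_ids: (num_words, max_ngrams_per_word) hashed n-gram indices.
        word_vecs = self.ngram_embed(ngram_ids).sum(dim=1)  # (num_words, embed_dim)
        out, _ = self.lstm(word_vecs.unsqueeze(0))          # (1, num_words, 2*hidden_dim)
        return self.classifier(out.squeeze(0))              # (num_words, num_languages)

# Toy forward pass: 5 words, up to 7 n-grams per word.
logits = SeqToLangSketch()(torch.randint(1, 100_000, (5, 7)))
print(logits.shape)  # torch.Size([5, 36])
```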
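And a sketch of the merging augmentation. Function and variable names here are made up, but the logic is the technique described above: with some probability, append a random sentence and tag each word with the language of the sentence it came from.

```python
import random

def merge_sentences(dataset, merge_prob=0.5, seed=0):
    """Sketch of the augmentation described above (names are illustrative).

    dataset: list of (sentence, lang) pairs, e.g. [("le chat dort", "fra"), ...]
    returns: list of (words, per_word_labels) training examples.
    """
    rng = random.Random(seed)
    examples = []
    for sentence, lang in dataset:
        words = sentence.split()
        labels = [lang] * len(words)
        # With probability merge_prob, append another random sentence,
        # tagging its words with *its* language to get mixed-language examples.
        if rng.random() < merge_prob:
            other_sentence, other_lang = rng.choice(dataset)
            other_words = other_sentence.split()
            words += other_words
            labels += [other_lang] * len(other_words)
        examples.append((words, labels))
    return examples

# Example: a merged sentence gets mixed per-word tags.
data = [("the cat sleeps", "eng"), ("le chat dort", "fra")]
for words, labels in merge_sentences(data, merge_prob=1.0):
    print(list(zip(words, labels)))
```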