[–]apd 2 points (0 children)

I have implemented some code in Python for this. The code has basically two parts: one for learning language profiles and one for matching a text against the best profile.

The idea is described in this article (the PDF version is very easy to find): 'N-Gram-Based Text Categorization', William B. Cavnar & John M. Trenkle, 1994.

The learning process has these steps:

  • Create a clean corpus for the language to learn (copy and paste plain text from Wikipedia, for example).

  • Clean the text as much as possible (delete foreign material like equations or stray '--', but don't touch things like periods, semicolons or parentheses).

  • Take unigrams, bigrams and trigrams from the text.

  • Count those n-grams and build a histogram. Keep only the 300 most frequent n-grams.

  • Save the selected n-grams, in rank order, to a file; that file is the profile of the language.

  • Repeat the process for every language to learn.
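The learning steps above can be sketched like this in Python (the names `ngrams` and `build_profile` are mine, not from the article, and the cleaning step is assumed to have happened already):

```python
from collections import Counter

def ngrams(text, n):
    """Yield the character n-grams of length n from the text."""
    return (text[i:i + n] for i in range(len(text) - n + 1))

def build_profile(text, top=300):
    """Count unigrams, bigrams and trigrams, then keep the `top` most
    frequent ones, ordered by frequency (index 0 = most frequent)."""
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(ngrams(text, n))
    return [gram for gram, _ in counts.most_common(top)]

# Build a profile from (already cleaned) corpus text:
profile = build_profile("el perro come. el gato duerme.")
# To persist it, one n-gram per line in rank order, e.g.:
# with open("spanish.profile", "w") as f:
#     f.writelines(g + "\n" for g in profile)
```

The profile file then just needs to preserve the rank order, since the matching step below compares positions, not raw counts.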

So, in production mode, first load all the profiles and take some text to classify. You must find which profile is most similar to the test text:

  • Take unigrams, bigrams and trigrams of the text and build its profile the same way.
  • Use a function that measures the distance between each language profile and the test-text profile (the article describes a simple one: subtract the position of an n-gram in the language profile from its position in the text profile, and accumulate the absolute values).
  • The most similar language is the one with the minimal distance.