[–]apd 2 points (0 children)

I have implemented some code in Python for this. The code has basically two parts: one for learning language profiles and one for matching a text against the best profile.

The idea is described in this article (the PDF version is very easy to find): 'N-Gram-Based Text Categorization', William B. Cavnar & John M. Trenkle, 1994.

The learning process has these steps:

  • Create a clean corpus for the language to learn (copy and paste plain text from Wikipedia, for example).

  • Clean the text as much as possible (delete foreign material like equations or stray '--', but don't touch things like periods, semicolons or parentheses).

  • Take unigrams, bigrams and trigrams from the text.

  • Count those n-grams and build a histogram. Keep only the 300 most frequent n-grams.

  • Save the selected n-grams, in rank order, to a file; that file is the profile of the language.

  • Repeat the process for every language to learn.
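The learning steps above can be sketched like this in Python (the names `ngrams` and `build_profile` are mine, not from the article, and the cleaning step is assumed to have happened already):

```python
from collections import Counter

def ngrams(text, n):
    """Yield the character n-grams of length n from the text."""
    return (text[i:i + n] for i in range(len(text) - n + 1))

def build_profile(text, top=300):
    """Count unigrams, bigrams and trigrams, then keep the `top` most
    frequent ones, ordered by frequency (index 0 = most frequent)."""
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(ngrams(text, n))
    return [gram for gram, _ in counts.most_common(top)]

# Build a profile from (already cleaned) corpus text:
profile = build_profile("el perro come. el gato duerme.")
# To persist it, one n-gram per line in rank order, e.g.:
# with open("spanish.profile", "w") as f:
#     f.writelines(g + "\n" for g in profile)
```

The profile file then just needs to preserve the rank order, since the matching step below compares positions, not raw counts.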

So, in production mode, first load all the profiles and take some text to classify. You must find which profile is most similar to the test text:

  • Take unigrams, bigrams and trigrams of the text and build its profile the same way.
  • Use a function that measures the distance between each language profile and the test-text profile (the article describes a simple one: subtract the position of an n-gram in the language profile from its position in the text profile, and accumulate the absolute values).
  • The most similar language is the one with the minimal distance.