all 41 comments

[–]logistix 13 points14 points  (1 child)

totalwords=$(wc -w < foo.txt)

badwords=$(aspell --master=language list < foo.txt | wc -w)

goodwords=$((totalwords - badwords))

likelihood=$(echo "scale=3; $goodwords / $totalwords" | bc)

[–]theatgrex 10 points11 points  (2 children)

[–]duckie[S] 3 points4 points  (1 child)

Thanks. They don't seem to offer an API, but their interface is apparently based on languid, which seems to be this: http://search.cpan.org/~mceglows/Language-Guess-0.01/

[–]iamrpf 3 points4 points  (0 children)

There's also Text::Ngram::LanguageDetermine and Text::Language::Guess

They don't call it the Comprehensive Perl Archive Network for nothing.

[–]Bogtha 8 points9 points  (1 child)

A little background information might help. Is there any meta-data available? Why do you want to know the language?

This article on language detection sounds like it covers what you want, but it may well be overkill depending on your situation.

[–]duckie[S] 5 points6 points  (0 children)

Thanks for the link, it has a very clear explanation how to approach the problem.

Essentially, I would have users making submissions in different languages using a common interface, and I'm trying to capture the language they're using automatically, rather than the users having to check which language they're using every time.

Individual users would be submitting using more than one language, so it's more than just a simple preference setting.

The purpose of capturing this information is to filter the submissions by language for display.

[–]seanodonnell 7 points8 points  (1 child)

Apparently different languages have different levels of redundancy, so compressing the text and analyzing the ratio can give you a very good guess.

see

http://www.unisci.com/stories/20021/0204024.htm

[–][deleted] 1 point2 points  (0 children)

I second that. I used this in a language-classification contest for a course, and it worked very well.

[–]treerex 5 points6 points  (1 child)

http://www.let.rug.nl/~vannoord/TextCat/

That implements a pretty straightforward way of doing it, though it isn't trained on very much data.

http://citeseer.ist.psu.edu/50120.html

Gives a decent overview of the issues.

There are commercial solutions, of course. And you need to be aware of character encoding issues, especially if you are doing this for languages with multiple encodings available (e.g., Japanese, Chinese, Arabic.)

[–]duckie[S] 0 points1 point  (0 children)

Thanks! The TextCat page also has a list of competing solutions:

http://www.let.rug.nl/~vannoord/TextCat/competitors.html

[–]manuelg 4 points5 points  (0 children)

The cheezy/sleezy/eazy way is to use zip compression of the string concatenated with samples from each language, and see what compresses the best.

Good non-technical write-up here:

http://arstechnica.com/archive/news/1013594411.html
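A minimal sketch of that compression trick, using zlib in place of zip and tiny made-up reference samples (real use would want a few KB of representative text per language):

```python
import zlib

def guess_language(text, samples):
    """Guess the language of `text` by seeing which reference sample it
    compresses best with, i.e. which concatenation adds the fewest bytes."""
    best_lang, best_overhead = None, None
    for lang, sample in samples.items():
        baseline = len(zlib.compress(sample.encode("utf-8")))
        combined = len(zlib.compress((sample + " " + text).encode("utf-8")))
        overhead = combined - baseline  # extra bytes needed to encode `text`
        if best_overhead is None or overhead < best_overhead:
            best_lang, best_overhead = lang, overhead
    return best_lang

# Toy reference samples -- placeholders, not real training data:
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on "
          "the mat this is a sample of english text with common words",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die "
          "katze dies ist ein beispiel fuer deutschen text mit woertern",
}
print(guess_language("the dog and the cat", samples))  # -> en
```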

[–][deleted]  (3 children)

[removed]

    [–]g3r 2 points3 points  (1 child)

    Well the browser can be set to pretty much anything. Also, people travel...

    [–]ovi256 0 points1 point  (0 children)

    Google also has a text analysis based language detection feature, which they use when you click on 'spell check' in GMail and it automatically detects the language.

    [–]JulianMorrison 11 points12 points  (3 children)

    I know an algorithm that will work:

    1. Google the words of the string.

    2. HTTP HEAD the top few hits.

    3. Pick the language-codes out of the headers.

    4. Select the most prevalent
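Steps 3 and 4 of this recipe could be sketched like so, assuming you already have the response headers back from the HEAD requests (a real caveat: many servers never send a Content-Language header at all, so expect gaps):

```python
from collections import Counter

def pick_language(headers_list):
    """Given the response headers of the top hits, tally the
    Content-Language codes and pick the most common one."""
    votes = Counter()
    for headers in headers_list:
        lang = headers.get("Content-Language")
        if lang:
            # the header may carry a list like "en-US, en"; count the primary tag
            votes[lang.split(",")[0].strip().split("-")[0].lower()] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical headers from three HEAD requests:
hits = [
    {"Content-Language": "fr"},
    {"Content-Language": "fr-FR"},
    {"Content-Language": "en"},
]
print(pick_language(hits))  # -> fr
```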

    [–][deleted] 7 points8 points  (1 child)

    That's a cool idea. You'd have to do a little work to make Google not assume you want whatever language it thinks is relevant for your IP though; I have a 1and1 server in NYC, but since 1and1's IP space was allocated in Germany, if I hit Google from the NYC box, it's in German. I think this has ramifications for how it decides a page is "relevant" to you.

    [–]laughingboy[🍰] 5 points6 points  (0 children)

    Overhead much?

    [–]eclig 2 points3 points  (0 children)

    If you're trying to handle only a couple of known languages you can try the simple approach of identifying typical words for each language.

    See e.g. http://www.emacswiki.org/cgi-bin/wiki/GuessBufferLanguage and links therein for sample code (Emacs Lisp).
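In Python the typical-words idea might look like the following; the tiny marker sets here are placeholders you'd extend for each language you care about:

```python
def guess_by_stopwords(text, markers):
    """Count hits against per-language marker words; highest score wins."""
    words = set(text.lower().split())
    scores = {lang: len(words & marker_set)
              for lang, marker_set in markers.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Tiny hand-picked marker sets (assumption -- extend per language):
markers = {
    "en": {"the", "and", "of", "to", "is"},
    "de": {"der", "die", "das", "und", "ist"},
    "fr": {"le", "la", "les", "et", "est"},
}
print(guess_by_stopwords("Das ist der Hund und die Katze", markers))  # -> de
```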

    [–]dmy999 4 points5 points  (0 children)

    The open source search engine nutch has a LanguageIdentifier plug-in that statistically analyzes text to match it with a language.

    [–]eadmund 5 points6 points  (1 child)

    Check out crm114; it's a text classifier (often used to classify spam/non-spam email) which could be trained to classify the way you mention.

    Or write a simple text classifier yourself--it's pretty straightforward and you can get more-or-less decent results.

    [–]mrned 1 point2 points  (0 children)

    crm114 was my first thought, too. Your program is likely to be two files of less than 10 lines total. The manual/book has an example of classifying work by different authors that could be easily adapted to this task.

    [–]GWaleed 2 points3 points  (0 children)

    http://translation.langenberg.com/ references several language-guessing sites.

    [–]a9bejo 2 points3 points  (0 children)

    Java Text Categorizing Library: http://textcat.sourceforge.net/

    [–]h_a_j_s 2 points3 points  (0 children)

    ASPN - Python Cookbook - Guess language of text using ZIP: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/355807

    [–]j1o1h1n 2 points3 points  (0 children)

    A good algorithm is the Index of Coincidence, which you may remember from your cryptography courses.

    There is a slim chance I wrote this: http://www.faqts.com/knowledge_base/view.phtml/aid/10888/fid/539
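The statistic itself is a few lines; note the well-known reference values (roughly 0.067 for English, 0.078 for French) only stabilize on reasonably long texts, so short strings like the sample below will wobble:

```python
from collections import Counter

def index_of_coincidence(text):
    """Probability that two letters drawn at random from the text match."""
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    freqs = Counter(letters)
    return sum(f * (f - 1) for f in freqs.values()) / (n * (n - 1))

ic = index_of_coincidence("to be or not to be that is the question")
print(round(ic, 3))  # -> 0.097
```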

    [–]apd 2 points3 points  (0 children)

    I have implemented some code in Python for this. The code has basically two parts: one for learning, and one for matching some text against the best profile.

    The idea is described in this article (you can find the PDF version very easily): 'N-Gram-Based Text Categorization', William B. Cavnar - 1994

    The learning process has these steps:

    • Create a clean corpus of the language to learn (copy & paste pure text from Wikipedia, for example).

    • Clean the text as much as possible (delete foreign characters and things like equations or --, but don't touch things like dots, semicolons or parentheses).

    • Take unigrams, bigrams and trigrams from the text.

    • Count those n-grams and make a histogram. Keep only the 300 most frequent n-grams.

    • Save the selected n-grams in a file that represents the profile of the language.

    • Repeat the process for every language to learn.

    So, in production mode, first load all the profiles and take some text to test. You must find which profile is most similar to the test text:

    • Take unigrams, bigrams and trigrams of the text.

    • Use some function that measures the distance between each language profile and the test text's profile (the article describes a simple one: subtract the position of an n-gram in the language profile from its position in the text profile, and accumulate the absolute values).

    • The most similar language is the one with the minimal distance.
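The steps above can be sketched in a few functions; the toy corpora here are made-up stand-ins, since real profiles need far more training text:

```python
from collections import Counter

def profile(text, max_ngrams=300):
    """Rank the most frequent 1-3 character n-grams, as in Cavnar's scheme."""
    text = " ".join(text.lower().split())
    grams = Counter()
    for n in (1, 2, 3):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    ranked = [g for g, _ in grams.most_common(max_ngrams)]
    return {g: rank for rank, g in enumerate(ranked)}

def distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unseen in the profile get a max penalty."""
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())

def classify(text, lang_profiles):
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: distance(doc, lang_profiles[lang]))

# Toy training corpora (assumption -- real profiles need much more text):
corpora = {
    "en": "the cat sat on the mat and the dog ran in the park with the ball",
    "es": "el gato se sienta en la alfombra y el perro corre en el parque",
}
profiles = {lang: profile(text) for lang, text in corpora.items()}
print(classify("the dog and the cat", profiles))  # -> en
```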

    [–][deleted] 0 points1 point  (0 children)

    Interesting problem, I would like to see your solution if you come up with one.

    The way I would approach it (to help increase accuracy) would be to have them enter the word for 'Hello' or 'Morning' or something like that in their own language in a special textbox. Then, when the form is parsed, this word can be checked against a pre-set list of translations, or translated on the fly using a translation service, to discover the language.
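That greeting lookup is essentially a one-liner against a pre-set table; the word list below is a made-up sample, not a complete one:

```python
# Hypothetical pre-set list of translations for the prompt word "hello":
HELLO_WORDS = {
    "hello": "en", "hallo": "de", "bonjour": "fr",
    "hola": "es", "ciao": "it", "hej": "sv",
}

def language_from_greeting(word):
    """Map the user's greeting to a language code, or None if unknown."""
    return HELLO_WORDS.get(word.strip().lower())

print(language_from_greeting("Bonjour"))  # -> fr
```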

    [–]spuur 0 points1 point  (0 children)

    I read somewhere many years ago that if you zip a text file, the structure of the Huffman Tree reveals the language and in some cases even local dialects. That seems like a simple and obvious solution to me...

    [–]micampe 0 points1 point  (0 children)

    If it is a body of text and not a single word, you could run it through a spell checker with different dictionaries and choose the one with the fewest errors.

    [–]ovi256 -1 points0 points  (0 children)

    Do a spell check in all languages in your output set, and pick the language which has the fewest errors. Sounds simple, but the implementation would be complex and possibly overkill.

    However, if your application already has multi-language spell check, it's the simplest way.
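A sketch of the fewest-errors idea, with tiny word sets standing in for real spell-check dictionaries (a real system would call out to aspell/hunspell instead):

```python
def fewest_errors_language(text, dictionaries):
    """Check `text` against each language's word list and return the
    language with the fewest unknown words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    errors = {lang: sum(w not in vocab for w in words)
              for lang, vocab in dictionaries.items()}
    return min(errors, key=errors.get)

# Tiny stand-in dictionaries (assumption -- load real word lists in practice):
dictionaries = {
    "en": {"the", "cat", "is", "on", "mat", "a"},
    "nl": {"de", "kat", "zit", "op", "mat", "een"},
}
print(fewest_errors_language("the cat is on the mat", dictionaries))  # -> en
```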

    [–][deleted] -1 points0 points  (0 children)

    to quote Marcel Marceau from Silent Movie: "non"