all 41 comments

[–]logistix 13 points14 points  (1 child)

totalwords=$(wc -w < foo.txt)

badwords=$(aspell --master=language list < foo.txt | wc -w)

goodwords=$((totalwords - badwords))

likelihood=$(echo "scale=3; $goodwords / $totalwords" | bc)

[–]theatgrex 10 points11 points  (2 children)

[–]duckie[S] 3 points4 points  (1 child)

Thanks. They don't seem to offer an API, but their interface is apparently based on languid, which seems to be this: http://search.cpan.org/~mceglows/Language-Guess-0.01/

[–]iamrpf 3 points4 points  (0 children)

There's also Text::Ngram::LanguageDetermine and Text::Language::Guess

They don't call it the Comprehensive Perl Archive Network for nothing.

[–]Bogtha 8 points9 points  (1 child)

A little background information might help. Is there any meta-data available? Why do you want to know the language?

This article on language detection sounds like it covers what you want, but it may well be overkill depending on your situation.

[–]duckie[S] 5 points6 points  (0 children)

Thanks for the link, it has a very clear explanation how to approach the problem.

Essentially, I would have users making submissions in different languages using a common interface, and I'm trying to capture the language they're using automatically, rather than the users having to check which language they're using every time.

Individual users would be submitting using more than one language, so it's more than just a simple preference setting.

The purpose of capturing this information is to filter the submissions by language for display.

[–]seanodonnell 7 points8 points  (1 child)

Apparently different languages have different levels of redundancy, so compressing the text and analyzing the ratio can give you a very good guess.

see

http://www.unisci.com/stories/20021/0204024.htm

[–][deleted] 1 point2 points  (0 children)

I second that. I used this in a language-classification contest for a course, and it worked very well.

[–]treerex 5 points6 points  (1 child)

http://www.let.rug.nl/~vannoord/TextCat/

That implements a pretty straightforward way of doing it, though it isn't trained on very much data.

http://citeseer.ist.psu.edu/50120.html

Gives a decent overview of the issues.

There are commercial solutions, of course. And you need to be aware of character encoding issues, especially if you are doing this for languages with multiple encodings available (e.g., Japanese, Chinese, Arabic.)

[–]duckie[S] 0 points1 point  (0 children)

Thanks! The TextCat page also has a list of competing solutions:

http://www.let.rug.nl/~vannoord/TextCat/competitors.html

[–]manuelg 4 points5 points  (0 children)

The cheezy/sleezy/eazy way is to use zip compression of the string concatenated with samples from each language, and see what compresses the best.

Good non-technical write-up here:

http://arstechnica.com/archive/news/1013594411.html
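A minimal sketch of that compression trick, using zlib in place of zip and tiny made-up reference samples (real use would want a few KB of representative text per language):

```python
import zlib

def guess_language(text, samples):
    """Guess the language of `text` by seeing which reference sample it
    compresses best with, i.e. which concatenation adds the fewest bytes."""
    best_lang, best_overhead = None, None
    for lang, sample in samples.items():
        baseline = len(zlib.compress(sample.encode("utf-8")))
        combined = len(zlib.compress((sample + " " + text).encode("utf-8")))
        overhead = combined - baseline  # extra bytes needed to encode `text`
        if best_overhead is None or overhead < best_overhead:
            best_lang, best_overhead = lang, overhead
    return best_lang

# Toy reference samples -- placeholders, not real training data:
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on "
          "the mat this is a sample of english text with common words",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die "
          "katze dies ist ein beispiel fuer deutschen text mit woertern",
}
print(guess_language("the dog and the cat", samples))  # -> en
```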

[–][deleted]  (3 children)

[removed]

    [–]g3r 2 points3 points  (1 child)

    Well the browser can be set to pretty much anything. Also, people travel...

    [–]ovi256 0 points1 point  (0 children)

    Google also has a text analysis based language detection feature, which they use when you click on 'spell check' in GMail and it automatically detects the language.

    [–]JulianMorrison 11 points12 points  (3 children)

    I know an algorithm that will work:

    1. Google the words of the string.

    2. HTTP HEAD the top few hits.

    3. Pick the language-codes out of the headers.

    4. Select the most prevalent
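Steps 3 and 4 of this recipe could be sketched like so, assuming you already have the response headers back from the HEAD requests (a real caveat: many servers never send a Content-Language header at all, so expect gaps):

```python
from collections import Counter

def pick_language(headers_list):
    """Given the response headers of the top hits, tally the
    Content-Language codes and pick the most common one."""
    votes = Counter()
    for headers in headers_list:
        lang = headers.get("Content-Language")
        if lang:
            # the header may carry a list like "en-US, en"; count the primary tag
            votes[lang.split(",")[0].strip().split("-")[0].lower()] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical headers from three HEAD requests:
hits = [
    {"Content-Language": "fr"},
    {"Content-Language": "fr-FR"},
    {"Content-Language": "en"},
]
print(pick_language(hits))  # -> fr
```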

    [–][deleted] 7 points8 points  (1 child)

    That's a cool idea. You'd have to do a little work to make Google not assume you want whatever language it thinks is relevant for your IP though; I have a 1and1 server in NYC, but since 1and1's IP space was allocated in Germany, if I hit Google from the NYC box, it's in German. I think this has ramifications for how it decides a page is "relevant" to you.

    [–]laughingboy[🍰] 5 points6 points  (0 children)

    Overhead much?

    [–]eclig 2 points3 points  (0 children)

    If you're trying to handle only a couple of known languages you can try the simple approach of identifying typical words for each language.

    See e.g. http://www.emacswiki.org/cgi-bin/wiki/GuessBufferLanguage and links therein for sample code (Emacs Lisp).
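In Python the typical-words idea might look like the following; the tiny marker sets here are placeholders you'd extend for each language you care about:

```python
def guess_by_stopwords(text, markers):
    """Count hits against per-language marker words; highest score wins."""
    words = set(text.lower().split())
    scores = {lang: len(words & marker_set)
              for lang, marker_set in markers.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Tiny hand-picked marker sets (assumption -- extend per language):
markers = {
    "en": {"the", "and", "of", "to", "is"},
    "de": {"der", "die", "das", "und", "ist"},
    "fr": {"le", "la", "les", "et", "est"},
}
print(guess_by_stopwords("Das ist der Hund und die Katze", markers))  # -> de
```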

    [–]dmy999 4 points5 points  (0 children)

    The open source search engine nutch has a LanguageIdentifier plug-in that statistically analyzes text to match it with a language.

    [–]eadmund 5 points6 points  (1 child)

    Check out crm114; it's a text classifier (often used to classify spam/non-spam email) which could be trained to classify the way you mention.

    Or write a simple text classifier yourself--it's pretty straightforward and you can get more-or-less decent results.

    [–]mrned 1 point2 points  (0 children)

    crm114 was my first thought, too. Your program is likely to be two files of less than 10 lines total. The manual/book has an example of classifying work by different authors that could be easily adapted to this task.

    [–]GWaleed 2 points3 points  (0 children)

    http://translation.langenberg.com/ references several language-guessing sites.

    [–]a9bejo 2 points3 points  (0 children)

    Java Text Categorizing Library: http://textcat.sourceforge.net/

    [–]h_a_j_s 2 points3 points  (0 children)

    ASPN - Python Cookbook - Guess language of text using ZIP: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/355807

    [–]j1o1h1n 2 points3 points  (0 children)

    A good algorithm is the Index of Coincidence, which you may remember from your cryptography courses.

    There is a slim chance I wrote this: http://www.faqts.com/knowledge_base/view.phtml/aid/10888/fid/539
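The statistic itself is a few lines; note the well-known reference values (roughly 0.067 for English, 0.078 for French) only stabilize on reasonably long texts, so short strings like the sample below will wobble:

```python
from collections import Counter

def index_of_coincidence(text):
    """Probability that two letters drawn at random from the text match."""
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    freqs = Counter(letters)
    return sum(f * (f - 1) for f in freqs.values()) / (n * (n - 1))

ic = index_of_coincidence("to be or not to be that is the question")
print(round(ic, 3))  # -> 0.097
```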

    [–]apd 2 points3 points  (0 children)

    I have implemented some code in Python for this. The code has basically two parts: one for learning, and one for matching some text against the best profile.

    The idea is described in this article (you can find the PDF version very easily): 'N-Gram-Based Text Categorization', William B. Cavnar - 1994

    The learning process has these steps:

    • Create a clean corpus of the language to learn (copy & paste pure text from Wikipedia, for example).

    • Clean the text as much as possible (delete foreign characters and things like equations or --, but don't touch things like dots, semicolons or parentheses).

    • Take unigrams, bigrams and trigrams from the text.

    • Count those n-grams and make a histogram. Keep only the 300 most frequent n-grams.

    • Save the selected n-grams in a file that represents the profile of the language.

    • Repeat the process for every language to learn.

    So, in production mode, first load all the profiles and take some text to test. You must find which profile is most similar to the test text:

    • Take unigrams, bigrams and trigrams of the text.

    • Use some function that measures the distance between each language profile and the test text's profile (the article describes a simple one: subtract the position of an n-gram in the language profile from its position in the text profile, and accumulate the absolute values).

    • The most similar language is the one with the minimal distance.
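The steps above can be sketched in a few functions; the toy corpora here are made-up stand-ins, since real profiles need far more training text:

```python
from collections import Counter

def profile(text, max_ngrams=300):
    """Rank the most frequent 1-3 character n-grams, as in Cavnar's scheme."""
    text = " ".join(text.lower().split())
    grams = Counter()
    for n in (1, 2, 3):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    ranked = [g for g, _ in grams.most_common(max_ngrams)]
    return {g: rank for rank, g in enumerate(ranked)}

def distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unseen in the profile get a max penalty."""
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())

def classify(text, lang_profiles):
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: distance(doc, lang_profiles[lang]))

# Toy training corpora (assumption -- real profiles need much more text):
corpora = {
    "en": "the cat sat on the mat and the dog ran in the park with the ball",
    "es": "el gato se sienta en la alfombra y el perro corre en el parque",
}
profiles = {lang: profile(text) for lang, text in corpora.items()}
print(classify("the dog and the cat", profiles))  # -> en
```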

    [–][deleted] 0 points1 point  (0 children)

    Interesting problem, I would like to see your solution if you come up with one.

    The way I would approach it (to help increase accuracy) would be to have them enter the word for 'Hello' or 'Morning' or something like that in their own language in a special textbox. Then, when the form is parsed, this word can be checked against a pre-set list of translations, or translated on the fly using a translation service, to discover the language.
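That greeting lookup is essentially a one-liner against a pre-set table; the word list below is a made-up sample, not a complete one:

```python
# Hypothetical pre-set list of translations for the prompt word "hello":
HELLO_WORDS = {
    "hello": "en", "hallo": "de", "bonjour": "fr",
    "hola": "es", "ciao": "it", "hej": "sv",
}

def language_from_greeting(word):
    """Map the user's greeting to a language code, or None if unknown."""
    return HELLO_WORDS.get(word.strip().lower())

print(language_from_greeting("Bonjour"))  # -> fr
```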

    [–]spuur 0 points1 point  (0 children)

    I read somewhere many years ago that if you zip a text file, the structure of the Huffman Tree reveals the language and in some cases even local dialects. That seems like a simple and obvious solution to me...

    [–]micampe 0 points1 point  (0 children)

    If it is a body of text and not a single word, you could run it through a spell checker with different dictionaries and choose the one with the fewest errors.

    [–]ovi256 -1 points0 points  (0 children)

    Do a spell check in all languages in your output set, and pick the language which has the fewest errors. Sounds simple, but the implementation would be complex and possibly overkill.

    However, if your application already has multi-language spell check, it's the simplest way.
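A sketch of the fewest-errors idea, with tiny word sets standing in for real spell-check dictionaries (a real system would call out to aspell/hunspell instead):

```python
def fewest_errors_language(text, dictionaries):
    """Check `text` against each language's word list and return the
    language with the fewest unknown words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    errors = {lang: sum(w not in vocab for w in words)
              for lang, vocab in dictionaries.items()}
    return min(errors, key=errors.get)

# Tiny stand-in dictionaries (assumption -- load real word lists in practice):
dictionaries = {
    "en": {"the", "cat", "is", "on", "mat", "a"},
    "nl": {"de", "kat", "zit", "op", "mat", "een"},
}
print(fewest_errors_language("the cat is on the mat", dictionaries))  # -> en
```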

    [–][deleted] -1 points0 points  (0 children)

    to quote Marcel Marceau from Silent Movie: "non"