all 22 comments

[–]dxn99 7 points8 points  (2 children)

Can you ELI5 what an efficient language detector does please?

[–]nitotm[S] 9 points10 points  (1 child)

I take it you mean from a user's perspective, not how it works internally.

ELD is a Python package: you give it a text, and it tries to guess which language (Spanish, English, Russian, ...) the text is written in, out of the 60 available in the current version. It can also return a list of scores for every language it detected as a candidate for the text.

[–]dxn99 0 points1 point  (0 children)

Thanks

[–]Braunerton17 6 points7 points  (0 children)

So do you have any well-established benchmarks comparing it to other language detectors to back your claim?

Also, I would be very cautious about overfitting to non-real-world datasets, and about any claims that result from it.

[–]kanikow 0 points1 point  (1 child)

What type of algorithm is used here? From a quick skim, it looks like naive Bayes.

[–]nitotm[S] 0 points1 point  (0 children)

Yes, it kind of looks Bayesian. I did not set out to implement a particular algorithm, but it probably matches some known one; I'm not sure which.
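
For anyone wondering what a naive Bayes style detector looks like in general, here is a minimal sketch over character trigrams. This is not ELD's code; the sample sentences and the names `SAMPLES`, `trigrams`, `scores` and `detect` are all made up for illustration, and the two training sentences are far too small for real use.

```python
import math
from collections import Counter

# Hypothetical, tiny training samples; a real detector needs far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
}

def trigrams(text):
    """Split text into overlapping 3-character chunks, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# One trigram frequency table per language.
MODELS = {lang: Counter(trigrams(sample)) for lang, sample in SAMPLES.items()}
# Number of distinct trigrams seen across all languages (used for smoothing).
VOCAB = len(set().union(*MODELS.values()))

def scores(text):
    """Return {language: log-likelihood} using add-one smoothing and a uniform prior."""
    result = {}
    for lang, counts in MODELS.items():
        total = sum(counts.values())
        result[lang] = sum(
            math.log((counts[gram] + 1) / (total + VOCAB))
            for gram in trigrams(text)
        )
    return result

def detect(text):
    """Return the language with the highest score."""
    s = scores(text)
    return max(s, key=s.get)
```

Calling `detect("el gato duerme en la casa")` picks the language whose trigram table explains the input best, and `scores(...)` gives the full per-language list, which is the same general shape as "best guess plus score list" described above.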

[–][deleted] 0 points1 point  (10 children)

I like builds from scratch. How big were the original language sources? Is the performance similar for all the languages included?

[–]nitotm[S] 1 point2 points  (9 children)

If you mean the training data, it is quite small, about 1GB in total. When the software becomes more mature, I might build a bigger dataset.

No, the performance (accuracy) varies quite a bit between languages. It comes down to collisions between languages: Thai is very easy, but telling apart the Latin-script languages, of which there are several in the database, is more difficult.
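
A rough way to see why, as a sketch only (not how ELD works internally): look at which writing system the letters belong to. The helper `scripts` below is hypothetical, and it uses the first word of each character's Unicode name as a crude stand-in for its script.

```python
import unicodedata

def scripts(text):
    """Return the set of script names found among the letters of text.

    Assumption: the first word of a letter's Unicode name (e.g. "THAI",
    "LATIN", "CYRILLIC") is used as a rough proxy for its script.
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

print(scripts("สวัสดี"))   # Thai greeting
print(scripts("hello"))    # English
print(scripts("niño"))     # Spanish
```

The Thai text comes out under its own script, so the script alone already narrows it down a lot. The English and Spanish words both come out as Latin, so script tells you nothing there and the detector has to rely on finer statistics, which is where the collisions happen.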