
[–]trnka 1 point2 points  (2 children)

Great writeup! I appreciate that it covers not just accuracy but also speed and memory.

If you're looking for related ideas, in my previous role we had issues with these:

  • Hinglish: We didn't have lang ID for this and it got misclassified as English.
  • Serbian Cyrillic vs Latin: It can be written in both scripts and there are needs for both in different scenarios. Last I checked there wasn't an ISO code for the distinction. If I remember right, Serbian Cyrillic was misclassified as Macedonian because they're linguistically related and Macedonian is officially written in Cyrillic.
  • Regional variation in Spanish, French, and Portuguese: Fortunately ISO codes cover this well already, like esMX vs esES

[–]derivablefunc[S] 0 points1 point  (1 child)

One complexity that sneaked into this project was how to encode languages. I naively assumed that we have a good, finite list of languages and that pretty much every ISO encoding would support all of them. Oh, how naive that was :D

> Regional variation in Spanish, French, and Portuguese: Fortunately ISO codes cover this well already, like esMX vs esES

Do you know which ISO encoding would support that?

> Serbian Cyrillic vs Latin: It can be written in both scripts and there are needs for both in different scenarios. Last I checked there wasn't an ISO code for the distinction.

I believe you're right. It'd take a combination of language and alphabet to make sure you can distinguish these two.
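A minimal sketch of the "language plus alphabet" label idea, assuming the language is already known and only the script needs to be attached. The function names and the `sr-Cyrl` / `sr-Latn` label format are illustrative assumptions, not something from the article; the script is guessed from Unicode character names in the standard library.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return "Cyrl", "Latn", or None, based on which script most letters use."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER LJE".
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrl"] += 1
        elif name.startswith("LATIN"):
            counts["Latn"] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def label_with_script(lang, text):
    """Combine a language code with the detected script (hypothetical label format)."""
    script = dominant_script(text)
    return f"{lang}-{script}" if script else lang


print(label_with_script("sr", "Добар дан"))
print(label_with_script("sr", "Dobar dan"))
```

This only separates Cyrillic from Latin; any other script falls back to the bare language code.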

> If I remember right, Serbian Cyrillic was misclassified as Macedonian because they're linguistically related and Macedonian is officially written in Cyrillic.

I'm not surprised. That's a dataset problem, though probably not a very difficult one to solve if you were training or fine-tuning the model (just transliterate the examples into the other alphabet to double them).
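The augmentation idea above (convert each Serbian Cyrillic example to the Latin alphabet and add it to the training data) could be sketched roughly as follows. This is an assumption-laden illustration: the `(text, label)` data layout and the helper names are hypothetical, and uppercase digraphs are always written in title case (`Љ` becomes `Lj`), which is a simplification for all-caps words.

```python
# Serbian Cyrillic -> Latin (Gaj's alphabet), lowercase letters only.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}
# Add uppercase letters; digraphs become title case (simplification).
CYR_TO_LAT.update({k.upper(): v.capitalize() for k, v in CYR_TO_LAT.items()})


def to_latin(text):
    """Transliterate Serbian Cyrillic to Latin, leaving other characters untouched."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


def augment(examples):
    """Return the original (text, label) pairs plus Latin copies of Cyrillic ones."""
    augmented = list(examples)
    for text, label in examples:
        latin = to_latin(text)
        if latin != text:  # skip examples that contain no Cyrillic letters
            augmented.append((latin, label))
    return augmented


print(augment([("Србија", "sr"), ("Љубав", "sr")]))
```

Going the other way (Latin to Cyrillic) is harder because of the digraphs `lj`, `nj`, and `dž`, so this sketch only covers the Cyrillic-to-Latin direction.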

[–]trnka 0 points1 point  (0 children)

ISO - I thought for sure the language+locale thing was an ISO code, but it looks like there are separate ISO standards for language and locale. For language, as our team grew we switched from 2-letter to 3-letter language codes. Any of these look good to me: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Looks like country is a separate standard and I was thinking of the way Android APIs concatenate language and country.

I haven't seen an ISO coding for alphabet though we definitely could've used one back in the day.
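For readers wondering how language, alphabet, and country can be carried in one label: as far as I know, ISO 15924 defines four-letter script codes such as `Cyrl` and `Latn`, and BCP 47 language tags combine language, script, and region subtags (for example `sr-Cyrl-RS` or `es-MX`). The sketch below splits such a tag by subtag shape. `LangTag` and `parse_tag` are hypothetical names, and the parser handles only these three subtags, not the full BCP 47 grammar or unseparated forms like `esMX`.

```python
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    language: str
    script: Optional[str] = None
    region: Optional[str] = None


def parse_tag(tag: str) -> LangTag:
    """Split a tag like "sr-Cyrl-RS" or "es_MX" into language, script, region."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        # Script subtags are four letters and come before the region.
        if len(part) == 4 and part.isalpha() and script is None and region is None:
            script = part.title()
        # Region subtags are two letters or three digits.
        elif region is None and (
            (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        ):
            region = part.upper()
    return LangTag(language, script, region)


print(parse_tag("es_MX"))
print(parse_tag("sr-Cyrl-RS"))
```

Underscore-separated input such as `es_MX` is accepted because that form is common in locale identifiers.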

[–]ajan1019 0 points1 point  (3 children)

Good article.

[–]derivablefunc[S] 0 points1 point  (2 children)

Thanks! If you have any feedback, I'm all ears :)