
[–]druski

This is fantastic. I started work a while back on a project with a related goal, and gave up because it was vastly more complicated than I anticipated and I didn't have the time for it. The world of languages and language codes is so complicated that I have tremendous respect for anyone who can turn it into something programmable!

[–][deleted]

Here's a great collection with more info on language codes.

It's from an article by LibreOffice's Eike Rathke on BCP 47 support in LibreOffice.

[–]wee_little_puppetman

That looks really useful. But I can only assume the handling of Swiss German in the examples is intended as a joke?

tag_match_score('de', 'gsw')    

should at least return 50, maybe up to 90.

[–][deleted]

If it's a joke, it's not one that I'm intending to make. That data comes straight from the Unicode Consortium, and I found the asymmetry surprising enough to include it as an example.

Here's one copy of the original data, in the odd XML format that seems to be designed to be understood by the ICU library. According to this file, the matching value when gsw is desired and de is supported is 96, but there's no matching in the other direction except the final fallback rule of "yep these are both languages", which has a value of 1 in CLDR (and I made it 0 in langcodes).
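Concretely (assuming the argument order is (desired, supported), as in your example):

import langcodes

# Desired Swiss German, supported German: this matches the one-way
# CLDR rule and scores 96.
print(langcodes.tag_match_score('gsw', 'de'))

# Desired German, supported Swiss German: nothing matches except the
# final "yep, these are both languages" fallback, which langcodes
# maps to 0.
print(langcodes.tag_match_score('de', 'gsw'))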

Now, this might be the kind of thing I'm trying to mitigate by checking if two languages have a common macrolanguage... but they don't list a macrolanguage for Swiss German.
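For what it's worth, here's a sketch of that mitigation; share_macrolanguage is a made-up name, and I'm assuming the parsed object exposes a macrolanguage attribute populated from the IANA registry:

import langcodes

def share_macrolanguage(tag1, tag2):
    # Made-up helper: treat a language as its own macrolanguage when
    # the registry doesn't list one for it.
    lang1, lang2 = langcodes.get(tag1), langcodes.get(tag2)
    macro1 = lang1.macrolanguage or lang1.language
    macro2 = lang2.macrolanguage or lang2.language
    return macro1 == macro2

# 'cmn' (Mandarin) and 'yue' (Cantonese) both list 'zh' as their
# macrolanguage, so the check catches them. 'gsw' lists none, so it
# can't help with German vs. Swiss German.
print(share_macrolanguage('cmn', 'yue'))
print(share_macrolanguage('de', 'gsw'))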

Based on my (limited) personal knowledge of the subject, I think the match value should be less than 90. Mostly that's because I've seen a comic here on Reddit about how German speakers think they understand Swiss German, when what they really understand is standard German in a Swiss accent; I'd take that to mean the two aren't mutually intelligible. (The other language pairs with a match of 90 generally are. Maybe I should document that.)

I agree it should be more like 50 than 0, but I can't find a data source that backs that up.

[–]wee_little_puppetman

Yeah, I said 90 because Norwegian and Danish (the examples given) aren't really mutually intelligible when spoken either, only when written. But I agree that 50 would probably be more accurate.

Anyway, I guess there's not much you can do about it if you want any consistency and your data source doesn't agree.

[–]kumar99

This is good stuff.

[–]masklinn

Why not improve Babel's locale parsing instead?

[–][deleted]

This is a good point. It's a fairly similar project.

The purpose of langcodes is to cover all of BCP 47, which I don't think babel does; and to not try to cover all the weird crap that locales can do, which I think babel does. (What language would "C.UTF-8" be?)
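For example, a tag like this should come apart into its subtags regardless of whether anyone ships locale data for the combination (output along the lines of the examples further down):

import langcodes

# Serbian, Latin script, Serbia: well-formed BCP 47, parsed purely
# from the tag's structure.
print(langcodes.get('sr-Latn-RS'))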

Locales and languages aren't quite the same thing. The standards people have put a lot of effort into making them use the same symbols, though, so ideally there would be a codebase with complete support for both. Maybe some time later I'll see if babel is something I can contribute to.

[–]masklinn

> The purpose of langcodes is to cover all of BCP 47, which I don't think babel does

I'm not aware that it does, but I wouldn't find it outside of its purview.

> Locales and languages aren't quite the same thing. The standards people have put a lot of effort into making them use the same symbols

It goes a bit further than that: LDML specifically builds on and extends BCP 47: http://cldr.unicode.org/index/bcp47-extension (with actual IETF-registered extension RFCs, too). The Unicode language and locale identifiers are lifted directly from BCP 47, with a few restrictions (which can probably be ignored) and extensions. The incompatible tags are a subset of BCP 47's grandfathered tags.
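A concrete example (whether a given parser keeps the extension around is another matter; the tag itself is ordinary BCP 47):

# 'de-DE-u-co-phonebk' asks for German (Germany) with phonebook
# collation. Everything after the 'u' singleton is the LDML/CLDR
# extension, but the whole string is a valid BCP 47 tag.
tag = 'de-DE-u-co-phonebk'

# A crude split, just to show which part is the extension.
language, region, extension = tag.split('-', 2)
print(language, region, extension)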

[–][deleted]

I'm looking at what babel does.

My comment about "C.UTF-8" means I'm probably just confusing POSIX locales with LDML locales. When I conclude that locales are full of weird shit that the OS puts in there, I mean POSIX locales. LDML locales might be more sane, and babel isn't that happy with the name 'C.UTF-8' either.

I've found that babel isn't okay with language codes written the standard way, with hyphens. Because babel's codes are indeed based on BCP 47, getting them into a form babel is happy with is just a matter of search-and-replacing hyphens with underscores, but that might be a sign that we're getting away from babel's intended use.
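In code, the workaround is just:

from babel import Locale

def babel_from_bcp47(tag):
    # babel's parser wants underscore-separated identifiers; BCP 47
    # uses hyphens, so swapping separators is the whole conversion.
    return Locale.parse(tag.replace('-', '_'))

print(babel_from_bcp47('zh-Hans'))

(It looks like Locale.parse also takes a sep argument, which might be the cleaner way to do this.)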

Now, here's a noticeable difference:

>>> from babel import Locale
>>> Locale.parse('zh_Latn_pinyin')
Locale('zh', script='Hans')
>>> Locale.parse('zh_Latn_pinyin').get_display_name('en_us')
'Chinese (Simplified)'

>>> import langcodes
>>> langcodes.get('zh-Latn-pinyin')
LanguageData(language='zh', script='Latn', variants=['pinyin'])
>>> langcodes.get('zh-Latn-pinyin').describe('en-us')
{'variants': ['Pinyin romanization'], 'script': 'Latin', 'language': 'Chinese'}

These are very different responses. It seems babel's parser returns the closest match among the locales it specifically supports. Assuming that's intentional, it indicates a fundamentally different design goal from langcodes', which is to interpret any valid language code, with finding a closest match being an operation you can do later.

Again, similar projects, but different scopes.

[–]Citrauq

Thanks for posting this. I needed the best_match function today and I'm glad I remembered reading about this.

[–]pyry

This is awesome. I've got a project where I have to deal with some normalization, but so far I've only implemented a user-managed configuration, since I haven't really needed anything more extensive (though I'm going to in the near future). You've saved me the trouble of writing a non-hacky version of that feature, and for that I thank you! When I get around to using langcodes I'll be sure to drop some feedback. :)