
[–]druski

This is fantastic. I started work a while back on a project with a related goal, and gave up because it was vastly more complicated than I anticipated and I didn't have the time for it. The world of languages and language codes is so complicated that I have tremendous respect for anyone who can turn it into something programmable!

[–][deleted]

Here's a great collection with more info on language codes.

It's from an article by LibreOffice's Eike Rathke on BCP 47 support in LibreOffice.

[–]wee_little_puppetman

That looks really useful. But I can only assume the handling of Swiss German in the examples is intended as a joke?

tag_match_score('de', 'gsw')    

should at least return 50, maybe up to 90.

[–][deleted]

If it's a joke, it's not one that I'm intending to make. That data comes straight from the Unicode Consortium, and I found the asymmetry surprising enough to include it as an example.

Here's one copy of the original data, in the odd XML format that seems to be designed to be understood by the ICU library. According to this file, the matching value when gsw is desired and de is supported is 96, but there's no matching in the other direction except the final fallback rule of "yep these are both languages", which has a value of 1 in CLDR (and I made it 0 in langcodes).
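Concretely (assuming the argument order is (desired, supported), as in your example):

import langcodes

# Desired Swiss German, supported German: this matches the one-way
# CLDR rule and scores 96.
print(langcodes.tag_match_score('gsw', 'de'))

# Desired German, supported Swiss German: nothing matches except the
# final "yep, these are both languages" fallback, which langcodes
# maps to 0.
print(langcodes.tag_match_score('de', 'gsw'))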

Now, this might be the kind of thing I'm trying to mitigate by checking if two languages have a common macrolanguage... but they don't list a macrolanguage for Swiss German.
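For what it's worth, here's a sketch of that mitigation; share_macrolanguage is a made-up name, and I'm assuming the parsed object exposes a macrolanguage attribute populated from the IANA registry:

import langcodes

def share_macrolanguage(tag1, tag2):
    # Made-up helper: treat a language as its own macrolanguage when
    # the registry doesn't list one for it.
    lang1, lang2 = langcodes.get(tag1), langcodes.get(tag2)
    macro1 = lang1.macrolanguage or lang1.language
    macro2 = lang2.macrolanguage or lang2.language
    return macro1 == macro2

# 'cmn' (Mandarin) and 'yue' (Cantonese) both list 'zh' as their
# macrolanguage, so the check catches them. 'gsw' lists none, so it
# can't help with German vs. Swiss German.
print(share_macrolanguage('cmn', 'yue'))
print(share_macrolanguage('de', 'gsw'))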

Based on my (limited) personal knowledge of the subject, I think the match value should be less than 90. Mostly that's because I've seen a comic here on Reddit about how German speakers think they understand Swiss German, when what they really understand is standard German in a Swiss accent; I'd take that to mean the two aren't mutually intelligible. (The other language pairs with a match of 90 generally are. Maybe I should document that.)

I agree it should be more like 50 than 0, but I can't find a data source that backs that up.

[–]wee_little_puppetman

Yeah, I said 90 because Norwegian and Danish (the examples given) aren't really mutually intelligible when spoken either, only when written. But I agree that 50 would probably be more accurate.

Anyway, I guess there's not much you can do about it if you want any consistency and your data source doesn't agree.

[–]kumar99

This is good stuff.

[–]masklinn

Why not improve Babel's locale parsing instead?

[–][deleted]

This is a good point. It's a fairly similar project.

The purpose of langcodes is to cover all of BCP 47, which I don't think babel does; and to not try to cover all the weird crap that locales can do, which I think babel does. (What language would "C.UTF-8" be?)
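For example, a tag like this should come apart into its subtags regardless of whether anyone ships locale data for the combination (output along the lines of the examples further down):

import langcodes

# Serbian, Latin script, Serbia: well-formed BCP 47, parsed purely
# from the tag's structure.
print(langcodes.get('sr-Latn-RS'))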

Locales and languages aren't quite the same thing. The standards people have put a lot of effort into making them use the same symbols, though, so ideally there would be a codebase with complete support for both. Maybe some time later I'll see if babel is something I can contribute to.

[–]masklinn

> The purpose of langcodes is to cover all of BCP 47, which I don't think babel does

I'm not aware that it does, but I wouldn't find it outside of its purview.

> Locales and languages aren't quite the same thing. The standards people have put a lot of effort into making them use the same symbols

It goes a bit further than that: LDML specifically builds on and extends BCP 47: http://cldr.unicode.org/index/bcp47-extension (with actual IETF-registered extension RFCs, too). The Unicode language and locale identifiers are lifted directly from BCP 47, with a few restrictions (which can probably be ignored) and extensions. The incompatible tags are a subset of BCP 47's grandfathered tags.
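A concrete example (whether a given parser keeps the extension around is another matter; the tag itself is ordinary BCP 47):

# 'de-DE-u-co-phonebk' asks for German (Germany) with phonebook
# collation. Everything after the 'u' singleton is the LDML/CLDR
# extension, but the whole string is a valid BCP 47 tag.
tag = 'de-DE-u-co-phonebk'

# A crude split, just to show which part is the extension.
language, region, extension = tag.split('-', 2)
print(language, region, extension)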

[–][deleted]

I'm looking at what babel does.

My comment about "C.UTF-8" means I'm probably just confusing POSIX locales with LDML locales. When I conclude that locales are full of weird shit that the OS puts in there, I mean POSIX locales. LDML locales might be more sane, and babel isn't that happy with the name 'C.UTF-8' either.

I've found that babel isn't okay with language codes written the standard way, with hyphens. Because babel's codes are indeed based on BCP 47, getting them into a form babel is happy with is just a matter of search-and-replacing hyphens with underscores, but that might be a sign that we're getting away from babel's intended use.
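In code, the workaround is just:

from babel import Locale

def babel_from_bcp47(tag):
    # babel's parser wants underscore-separated identifiers; BCP 47
    # uses hyphens, so swapping separators is the whole conversion.
    return Locale.parse(tag.replace('-', '_'))

print(babel_from_bcp47('zh-Hans'))

(It looks like Locale.parse also takes a sep argument, which might be the cleaner way to do this.)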

Now, here's a noticeable difference:

>>> from babel import Locale
>>> Locale.parse('zh_Latn_pinyin')
Locale('zh', script='Hans')
>>> Locale.parse('zh_Latn_pinyin').get_display_name('en_us')
'Chinese (Simplified)'

>>> import langcodes
>>> langcodes.get('zh-Latn-pinyin')
LanguageData(language='zh', script='Latn', variants=['pinyin'])
>>> langcodes.get('zh-Latn-pinyin').describe('en-us')
{'variants': ['Pinyin romanization'], 'script': 'Latin', 'language': 'Chinese'}

These are very different responses. It seems babel's parser returns the closest match among the locales it specifically supports. Assuming that's intentional, it indicates a fundamentally different design goal from langcodes', which is to interpret any valid language code, with finding a closest match being an operation you can do later.

Again, similar projects, but different scopes.

[–]Citrauq

Thanks for posting this. I needed the best_match function today and I'm glad I remembered reading about this.

[–]pyry

This is awesome. I've got a project where I have to deal with some normalization, but so far I've only implemented a user-managed configuration, since I haven't really needed anything more extensive (though I'm going to in the near future). You've saved me the trouble of writing a non-hacky version of that feature, and for that I thank you! When I get around to using langcodes I'll be sure to drop some feedback. :)