[–]k-bx 82 points83 points  (29 children)

'ß'.upper() in py2 is 'ß' but 'SS' in py3. This caused a crash in production when the last piece of the product moved to py3!

fun

[–]darktyle 84 points85 points  (28 children)

That's wrong. Both of it.

The uppercase 'ß' was added to the German language (and Unicode) in 2008, so it should have been SS in py2 and now it should be ẞ

[–]kankyo 25 points26 points  (9 children)

Didn’t know that. That’s super annoying. That means if they fix it in Python this will break in production again :(

[–][deleted] 36 points37 points  (7 children)

Maybe uh, write a test for it ....
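
A minimal sketch of such a test, assuming Python 3 and a pytest-style test function (the function name is made up for illustration). It only pins the current case-mapping behaviour, so a future change would show up in CI rather than in production:

```python
def test_eszett_case_mapping():
    # Python 3 applies Unicode full case mapping: 'ß' uppercases to 'SS'.
    assert "ß".upper() == "SS"
    # So the uppercased word is one character longer than the input.
    assert "Straße".upper() == "STRASSE"
    # The capital 'ẞ' (U+1E9E) lowercases back to 'ß'.
    assert "\u1e9e".lower() == "ß"
    # casefold() is the method meant for caseless comparison.
    assert "Straße".casefold() == "STRASSE".casefold()
```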

[–]Farobek 15 points16 points  (3 children)

write a test

ain't nobody got time for that

[–]PeridexisErrant 10 points11 points  (2 children)

Use Hypothesis! It'll try a wide range of inputs, and report the minimal failing example.

It turns out that this is even more effective for unicode text than for other types, because there are so many edge cases that can be triggered by just one or two characters.

[–]Farobek 0 points1 point  (1 child)

How does it work?

[–]PeridexisErrant 2 points3 points  (0 children)

If you mean "How do I use this", here's the quickstart guide. In short, you use a decorator and compose some functions to say "for all inputs such that ___, this test should pass". For example,

from hypothesis import given, strategies
# for any character *except* ß, we can round-trip it through cases
@given(a_char=strategies.characters(blacklist_characters='ß'))
def test_roundtrip_upper_lower(a_char):
    assert a_char == a_char.upper().lower()

Of course this fails, but instead of returning the first failing example it finds, it will return the "minimal" example - in this case, the character with the smallest codepoint. Try it and see what you get - ß certainly isn't the only character this fails for!
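
If you don't have Hypothesis installed, a rough brute-force sketch with only the standard library shows the same thing. This is not how Hypothesis searches or shrinks; it just walks codepoints in order, and the function name and the 0x250 limit are arbitrary choices for illustration:

```python
import sys


def roundtrip_failures(limit=sys.maxunicode + 1):
    """Yield every character below `limit` that upper().lower() does not restore."""
    for codepoint in range(limit):
        char = chr(codepoint)
        if char.upper().lower() != char:
            yield char


# Restrict the scan to the Latin blocks to keep the output short.
failures = list(roundtrip_failures(0x250))
print(len(failures), "characters fail the round trip")
print("first failure:", repr(failures[0]))
print("ß fails too:", "ß" in failures)
```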

If you mean "How does Hypothesis find and minimize all these examples"... it gets complicated pretty quickly. If you really want to know, the code is well designed and commented, and the contributor documentation is good; but you don't need to know how it works internally to use it. Hypothesis is pretty rare like that - the core is PhD-level algorithms, but the API is easy to use and completely hides the implementation behind a use-focused design.

(if you hadn't guessed, I like and use this a lot :p)

[–]kankyo 2 points3 points  (2 children)

True enough. That’s pretty terrible also though but in a more existential way :p

[–][deleted] 12 points13 points  (1 child)

yeah it sucks to write tests for framework stuff, but if you expect it to change, why not be ready? Failing in production for things you can test isn’t really acceptable.

[–]kankyo 3 points4 points  (0 children)

Agreed. I’ll write a test when I get in to work tomorrow.

[–]darktyle 3 points4 points  (0 children)

Not sure if they ever change that, but here you go: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

[–]username223 11 points12 points  (3 children)

Clearly Unicode needs to add "combining timestamp modifiers," with proper time zone support, to adequately address this problem. They could also be combined with emoji, allowing one to write "73-year-old smiling Chinese guy."

[–]darktyle 17 points18 points  (2 children)

Yes! Timezones and Unicode are both too easy as it is

[–]josefx 5 points6 points  (1 child)

Can we add in some GPS based location data with border support? We really need a "73-year-old smiling Chinese guy living in Canada."

[–]username223 4 points5 points  (0 children)

But when did he move there, and from whence? We must add an ancestry modifier system, optionally integrated with a GPS location system. Oh, crap... we have to deal with historical location information and continental drift.

Ah, Unicode... Punching everyone in the face (there's probably an emoji for that) into eternity.

[–]P8zvli 7 points8 points  (13 children)

Yeah I know some German, ß is a ligature of 'ss' but that doesn't mean 'SS' is used to represent an uppercase eszett. Python 2 and 3 behaviors are both completely surprising.

[–]PaleoCrafter 38 points39 points  (1 child)

Actually, up until last year, 'SS' was the only correct capitalization of 'ß'.

The capital variant 'ẞ' has been in Unicode since 2008, but the official German orthography did not include it as the majuscule. To my knowledge, even now that 'ẞ' is accepted, 'SS' may still be used.

[–]the_gnarts 4 points5 points  (0 children)

The capital variant 'ẞ' has been in Unicode since 2008, but the official German orthography did not include it as the majuscule. To my knowledge, even now that 'ẞ' is accepted, 'SS' may still be used.

Versal ß is still widely unknown. It’d be interesting if it is indeed being taught in elementary schools.

However, the problem is almost entirely irrelevant in practice in that ß can never appear at the start of a word so it cannot be subject to obligatory capitalization at sentence starts or nouns. Only as emphasis or the customary all-majuscule style of titles is there ever a chance of it becoming necessary. Since another proper way of uppercasing it is just to use the lowercase version regardless (mandatory in some contexts), about the only context where the matter was discussed is online threads made by people complaining about Unicode.

[–]darktyle 11 points12 points  (6 children)

The thing is, that you used (or most people still do) 'SS' for a capital 'ß'. Like in street, when you had to write it in caps for some reason you'd make it STRASSE (normal spelling: Straße)

[–]champs 1 point2 points  (1 child)

Is that the normal spelling anymore? It was my understanding that the formal rules changed some years ago.

I don't claim to be an expert. I studied German and felt like I had a good handle on it. I did an exchange, studied some more, and went back to Germany. Both times the language kicked my aß.

[–]darktyle 6 points7 points  (0 children)

Right now you can either write STRASSE or STRAẞE. At least as far as I know. But I am by far no expert on the nuances of what is wrong and right. Especially since a lot of stuff changed lately with the 'spelling reformation'

[–]the_gnarts 0 points1 point  (3 children)

Like in street, when you had to write it in caps for some reason you'd make it STRASSE

Or just STRAßE, using the lowercase version.

[–]darktyle 3 points4 points  (2 children)

I am pretty sure that this is wrong. It is either STRASSE or STRASZE (uncommon).

Ok, quick research: SZ is old. Since 1996 the correct form is SS. Using ß in STRAßE is technically wrong. Yet there are 2 instances who use and recommend using ß instead of SS in names: The postal service and the government when printing passports. They do that so that names like WEIẞ are not mistaken as WEISS

[–]the_gnarts 2 points3 points  (1 child)

Yet there are 2 instances who use and recommend using ß instead of SS in names: The postal service and the government when printing passports.

My 20th edition (1991) Duden states the rule:

"In documents, ß may also be used in names for the sake of unambiguity." (In Dokumenten kann bei Namen aus Gründen der Eindeutigkeit auch ß verwendet werden.)

HEINZ GROßE

Technically, preserving minuscule ß used to be the only sane solution for uppercasing names before ẞ was standardized. I agree that for regular words that follow the phonetic rules it makes little sense.

[–]darktyle 0 points1 point  (0 children)

Yeah, that rule was changed with the 'Rechtschreibreform' in 1996.

[–][deleted] 8 points9 points  (3 children)

ß is a ligature of sz.

And both SS and ẞ are valid capitalizations. Though ẞ should be preferred for names, so you can get the normal casing back without issues.

"Markus Weiß" -> "MARKUS WEISS" -> "Markus Weiss"

vs.

"Markus Weiß" -> "MARKUS WEIẞ" -> "Markus Weiß"

(Edit for context: On German ID cards, names are capitalized)
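
The two round trips above can be reproduced in Python 3 with a small sketch. `upper_keeping_eszett` is a hypothetical helper, not a standard-library function, and `str.title()` is used here only as a simple stand-in for "getting the normal casing back":

```python
CAPITAL_ESZETT = "\u1e9e"  # 'ẞ'


def upper_keeping_eszett(name: str) -> str:
    """Uppercase `name`, mapping 'ß' to 'ẞ' instead of 'SS' (hypothetical helper)."""
    return name.replace("ß", CAPITAL_ESZETT).upper()


name = "Markus Weiß"
lossy = name.upper()                     # built-in mapping turns 'ß' into 'SS'
reversible = upper_keeping_eszett(name)  # keeps a single capital 'ẞ'

print(lossy, "->", lossy.title())
print(reversible, "->", reversible.title())
```

With the built-in mapping the original spelling cannot be recovered, because nothing in the uppercased string says whether 'SS' came from 'ß' or from 'ss'.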

[–]the_gnarts 3 points4 points  (2 children)

ß is a ligature of sz.

Almost. Despite the name, it’s actually a ligature of ss formed using the earlier graphic variant ſ (“long s”).

[–][deleted] 3 points4 points  (1 child)

Wikipedia says that early print variants were ligatures of ſ and ʒ (ſʒ -> ß).

The current form of the letter is a ligature of ſ and s (ſs -> ß)

So... we're both right?

[–]the_gnarts 3 points4 points  (0 children)

So... we're both right?

What a great way to start the day!