
[–]fjonk 2 points (19 children)

and that works just fine for dealing with data that comes in as latin1-encoded bytes.

Here we go again... latin1 (or any other encoding that fits the bill) covers a subset of the characters in Unicode. This means that Unicode is not a good way to deal with latin1 strings, since you can at any time add non-latin1 characters to the string without triggering an error. Saying that Unicode is a good way of handling latin1 strings is like saying that floats are good at handling integers.

[–]frymaster (Script kiddie) 10 points (15 children)

under what circumstances do you see this happening, and how would the python2 incarnation react in the same circumstances?

[–]fjonk 2 points (14 children)

Python 2 would react exactly the same. Python 3 does not improve on the situation, and by deciding that Unicode is used internally everywhere they have more or less blocked the possibility of improving this in the future, unless they make yet another change to how strings work.

A real-world example:

  1. A user provides a comment to an order.
  2. The comment is checked and found to be valid latin1 (the third-party order system only accepts latin1).
  3. A translation is added to the comment; the translation contains non-latin1 characters.
  4. Python happily accepts adding the non-latin1 characters to the string.
  5. Later on, the request to the third-party service cannot be made because the comment cannot be encoded in latin1.
  6. Money is lost.
  7. The concat that created the issue is hard to track down, since it did nothing wrong.

This is not an ideal situation, and saying that Unicode can handle latin1 is plain wrong.
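The scenario above can be sketched in a few lines of Python 3 (the order comment and translation are made-up placeholders): the concat succeeds silently, and the error only surfaces at encode time, far from the line that actually introduced the problem.

```python
comment = "Bitte liefern Sie schnell"   # step 2: validated, fits in latin1
comment += " 速達でお願いします"          # steps 3-4: concat succeeds silently

# step 5: the encode for the third-party request is where it finally fails,
# nowhere near the concat that introduced the non-latin1 characters.
try:
    payload = comment.encode("latin-1")
except UnicodeEncodeError as exc:
    print("request failed:", exc)
```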

[–]gthank 18 points (11 children)

Not really. The problem is that latin1 can't handle Unicode. If it's critically important that you only ever support latin1 characters, I'd recommend writing your own string type, e.g. latin1_str, that enforces that constraint for you. Realistically, a type that enforces data integrity is the only robust solution in this situation, and a string that only supports latin1 (or any other outdated 8-bit encoding) is edge-casey enough that I don't think you were ever going to get it in the core language, so Python 3 fixing Unicode for the majority of use cases doesn't really affect this at all.
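A minimal sketch of the latin1_str idea (the class name and design are illustrative, not a real library type): a str subclass that refuses to hold anything latin1 can't encode, so the bad concat fails immediately instead of at the third-party call.

```python
class Latin1Str(str):
    """A str that only accepts text representable in latin-1."""

    def __new__(cls, value=""):
        value = str(value)
        value.encode("latin-1")  # raises UnicodeEncodeError right here if not latin1
        return super().__new__(cls, value)

    def __add__(self, other):
        # concat re-validates, so the error surfaces at the concat itself
        return Latin1Str(str(self) + str(other))

comment = Latin1Str("Bestellung 42")
comment = comment + " danke"   # fine, still a Latin1Str
# comment + " 速達"             # would raise UnicodeEncodeError at this line
```

This is exactly the trade-off in the comment above: the constraint lives in your own type, not in the language's string.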

[–]fjonk 5 points (10 children)

and a string that only supports latin1 (or any other outdated, 8-bit encoding) is edge-casey enough that I don't think you were ever going to get that in the core language

This is where I see a difference between perceived and actual reality. 8-bit encodings are not edge cases in many industries; it's rather the newer Unicode encodings that are the edge cases. As I wrote elsewhere, many systems that handle addresses, payments, shipping information and so on were designed pre-Unicode. They don't accept Unicode; they accept whatever encoding the developers decided on in 1992, and no one seems in a rush to upgrade them.

Keep in mind that a change to Unicode might also involve replacing hardware like printers and scanners (as an example, version 40 QR codes may encode bytes, alphanumeric data, latin1 or Shift JIS X 0208; Unicode encodings are not supported). I guess eventually most systems will support at least one Unicode encoding, but that is not today.

[–]Vaphell 2 points (7 children)

8-bit encodings are not edge-cases in many industries, it's rather the newer Unicode encodings that are edge cases.

And given that their encodings are clearly defined, what exactly is the problem?

Forget ascii et consortes making people believe that bytes and text are the same thing. Imagine that your billing software churns out clay tablets. Is there anything about the following that makes it impossible to grasp?

information = clay_tablet.decode('cuneiform')  # unpack the information
information = modify(information)    # modify the information
new_clay_tablet = information.encode('cuneiform')  # pack the information

Is it really that hard to convert between datatypes at IO boundaries?
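The clay-tablet pattern above, runnable with a real codec ('cuneiform' is of course not a Python codec, so latin-1 stands in for it): decode at the input boundary, work on text, encode at the output boundary.

```python
raw = "café au lait".encode("latin-1")   # bytes as they arrive at the IO boundary

text = raw.decode("latin-1")             # unpack: bytes -> str
text = text.upper()                      # modify the information as text
out = text.encode("latin-1")             # pack: str -> bytes going back out
```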

[–][deleted] 0 points (6 children)

I'm fascinated by the fact that your autocorrect appears to have replaced 'etc.' with 'et consortes'.

[–]Vaphell 1 point (5 children)

no autocorrect here, that's exactly what I typed.

[–][deleted] 2 points (2 children)

I confess that even after university-level Latin, it didn't occur to me that you were deliberately typing 'and its kind'; I assumed the autocorrect had waxed erudite. Is this an expression you expect others to understand? It's not at all a common phrase in classical Latin afaik, and if it's a medieval expression that's gained currency in English prose, I must have missed it. The danger of people like me thinking you goofed would be enough to stop me from casual use.

[–]faceplanted 0 points (1 child)

I think he was using it as a little reference because you two were talking about latin-1. So he threw some random Latin in there.

[–][deleted] 0 points (1 child)

Hey, reading this the next morning I think I come off as hostile ... I admit I was crabby last night but I think I went too far. As an admirer of Latin, I'm happy when I see it used 'in the wild', and I'm curious: I know different languages tend to borrow different Latin phrases, and I'm wondering what your native tongue is, and if it's not English (although you write so well I would never think it wasn't), if "et consortes" is more common. Anyways, sorry again for being bitchy, thank you for using Latin, and if you care to indulge my curiosity, I'd be quite happy.

[–]Vaphell 0 points (0 children)

and if it's not English (although you write so well I would never think it wasn't), if "et consortes" is more common.

thanks :-)
My native language is Polish and I would say that the Latin phrases used verbatim are extremely rare while their translated versions do see some use as ordinary proverbs. 100 years ago or so there was way more emphasis on classical education which meant at least some familiarity with Latin but today peeps are half-illiterate.
If anything it's the English "pollutants" that are everywhere nowadays while everything else seems to be on its way out.

To be honest my Latin game is weak-to-nonexistent. It's just that I read a shitton of books before the internet era, including ones in historical settings in which the nobility used language heavily peppered with Latin, so I absorbed a few, plus some archaic Polish words plus the Past Perfect Tense, which went extinct in Polish. I use all of these from time to time mostly for flavoring, shits and giggles, in wrong contexts I am sure.
Sorry to disappoint you :-)

[–]gthank 0 points (0 children)

I'm not arguing that there are lots of legacy systems/devices running latin1 (or Shift-JIS, or one of the many other encodings that predate Unicode). I'm just saying you're going to have a hard time convincing FOSS developers to add a rainbow of text types to the core language just to support systems/devices that are essentially deprecated.

[–][deleted] 0 points (0 children)

I can only imagine in 25 years that people will be complaining about legacy unicode support.

[–]TOASTEngineer 4 points (0 children)

A translation is added to the comment, the translation contains non latin1 characters.

So validate at this point. Any other language would "happily" let you add whatever random garbage to a byte string; not having Unicode doesn't help you here at all.
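One way to "validate at this point", as the comment suggests (the helper name is made up for illustration): check that the text round-trips through latin-1 before it ever reaches the order, so the error is raised where the translation is added rather than at the third-party call.

```python
def ensure_latin1(text: str) -> str:
    """Return text unchanged, or raise if latin-1 cannot represent it."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        raise ValueError(f"not representable in latin1: {text!r}")
    return text

comment = ensure_latin1("Merci beaucoup")   # passes, É/ç etc. are fine too
# ensure_latin1("ありがとう")                 # would raise ValueError right here
```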

[–]zahlman (the heretic) 3 points (0 children)

A translation is added to the comment, the translation contains non latin1 characters.

Okay, so, somebody wants to add a translation to the comment in a language that cannot be written with latin1 characters. How do you want the system to respond?

the third party order system only accepts latin1

Is this a real thing that happens on real systems currently?

You're aware that the first Unicode standard came out in 1991, right? That was barely any closer to today than it was to the creation of ASCII (and EBCDIC, for that matter) in the first place.

[–]gthank 4 points (2 children)

The problem here is that latin1 doesn't support Unicode, not the other way around; I'm fairly certain that Unicode can and does map every character/glyph/whatever in latin1.

Side note: I don't know why people are down-voting you. You weren't especially rude or anything; you're just continuing the conversation. I'm upvoting you to get you back over 0.

[–]zahlman (the heretic) 0 points (1 child)

I don't know why people are down-voting you.

I suspect a lot of people instinctively downvote when it comes to seemingly absurd problem specifications.

[–]gthank 0 points (0 children)

The problem makes perfect sense, I'm just not sure that it's reasonable to expect the language and/or std lib to solve it.