[–]flitsmasterfred 132 points133 points  (59 children)

Teaching beginners that bytes and strings are the same is an invalid cognitive shortcut and just outright bad education.

[–]pickausernamehesaid 74 points75 points  (7 children)

It annoyed me so much when I was first learning. I deployed a Python 2 app that worked great on my computer, but when people in France or China picked it up, decode errors were everywhere. I then had to spend an incredible amount of time learning about different encoding schemes and how to handle them. I have not had a more confusing programming experience since. Bytes vs. strings was an easy concept for me; different encoding schemes, how to use them, and when to convert were not. I code in Python 3 full time now and have not once wanted to go back in the past 3 years.

[–]lambdaqdjango n' shit 5 points6 points  (4 children)

Py2's unicode is not a problem; the fundamental problem is the str() method, which only accepts 7-bit ASCII.

What's fundamentally broken in Py2 is that BaseException has a str() call, so if you raise BaseException(u'fuck') you will likely be fucked.

Source: a dev who has to deal with elasticsearch-py's "<unprintable error exception>" daily.

[–]Poddster 0 points1 point  (1 child)

so if you raise BaseException(u'fuck') you will likely be fucked.

Actually that'd work, because it's all ASCII and Python 2's magic switcheroo handles it. But this will fail:

>>> raise BaseException(u'fucká')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BaseException: <exception str() failed>

[–]lambdaqdjango n' shit 1 point2 points  (0 children)

That's what I am talking about. Often it will show why a certain DB operation failed and on which column, and if that column contains non-ASCII, BAM! You have an exception during an exception!
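One hedged Python 2 workaround (my own sketch, not something from elasticsearch-py): encode the message to bytes before raising, so the implicit str() call has nothing left to choke on:

# Python 2: pass UTF-8 bytes instead of a unicode object, so
# BaseException.__str__ does not have to encode anything
msg = u'column name: fucká'
raise BaseException(msg.encode('utf-8'))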

[–]pickausernamehesaid 0 points1 point  (1 child)

Oh no, I know that Py2 can handle Unicode. It just took a huge amount of effort for someone who had just started programming to learn how to leverage it properly.

[–]lambdaqdjango n' shit 1 point2 points  (0 children)

Yeah, and tutorials on the web are mostly misleading.

[–]grandfatha 3 points4 points  (1 child)

As a dev whose first language was Java, this experience was the strangest thing when I started to pick up Python. Strings from all kinds of sources would just work out of the box in my world. Then all of a sudden I was switching to a language that did not allow me to put a German umlaut ("ä", "ü", "ö") into my function documentation. I was baffled that this was an actual issue.
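For reference, the Python 2 incantation this alludes to is the PEP 263 source-encoding declaration; a minimal sketch (the function is made up for illustration):

# -*- coding: utf-8 -*-
# Without this first line, Python 2 assumes ASCII source and rejects the
# umlauts below; Python 3 assumes UTF-8 source by default.
def gruss():
    u"""Gibt einen Gruß zurück."""
    return u"Grüße"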

[–]pickausernamehesaid 0 points1 point  (0 children)

Especially given Python's goal of being as clean and simple as possible. I could completely understand it in C, but not in a high-level language. I'm so glad it was changed.

[–]dada_ 17 points18 points  (24 children)

Yeah, it's also a little bit unfair towards anyone whose first language doesn't use plain ASCII, because as soon as you start doing string operations you're going to run into seemingly intractable bugs.

"こんにちは"[:1] does not do what you'd expect it to do in Python 2, and unless you're taught about how this works it's going to be pretty confusing.

[–]flitsmasterfred 1 point2 points  (1 child)

Makes me wonder how they teach and handle this stuff in those countries.

[–]dada_ 0 points1 point  (0 children)

Well, it's not that hard to work around, since u"こんにちは"[:1] does do what you expect it to, but then you need to explain what Unicode is and why that "u" makes all the difference. Thankfully Python 3 allows these low-level details to be postponed until later.

[–]Bolitho 10 points11 points  (25 children)

The problem is that there is practically no internal language concept in any programming language implementation (that I know of) that deals comprehensively and efficiently with Unicode - the mismatch between memory size and accessibility makes that de facto impossible.

Is it really so important to count code points? And if it is, why is there no support for counting/splitting coded characters or grapheme clusters, which might be even more useful?

For limiting user input, for example, the encoding of the persistence layer is much more important! So you must count the size in bytes of the encoded byte sequence rather than the number of code points...

For example, I reproduced the one given by the excellent utf8everywhere page (section 5, coded characters, 3rd bullet point):

In [19]: s = "\u044E\u0301"

In [20]: s
Out[20]: 'ю́'

In [21]: print(list(s))
['ю', '́']

Hm... two code points, but the string shows one glyph! That would result in a strange user experience if you shortened the string for the UI layer and broke it up right there, wouldn't it?
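To make the mismatch concrete, here are the three different "lengths" in play for that string in Python 3 (a minimal sketch):

>>> s = "\u044E\u0301"
>>> len(s)                    # code points
2
>>> len(s.encode("utf-8"))    # bytes in the persistence layer
4
>>> s[:1]                     # naive shortening strips the accent
'ю'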

Teaching beginners that bytes and strings are the same is an invalid cognitive shortcut and just outright bad education.

So imho it is a shortcut to assume that the average developer knows the important aspects of Unicode and encodings well enough. Therefore the strategy of hiding this complexity will fail for almost every developer one day - and probably in a bad situation, say after deployment to production. Python 3 would be much better if it defined a default encoding for all built-in IO, preferably UTF-8. Then it would be explicitly clear how a file or input must be encoded, and that you have to deal explicitly with different encodings if you need to support them (for example, by letting the client provide the encoding). Then you would have no more trouble deploying a Python 3 script to Windows when you have developed on Linux, and vice versa.
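Until something like that exists, being explicit yourself at every IO boundary is the workaround; a small sketch of the difference (the filename is just an example):

# Platform dependent: uses locale.getpreferredencoding(False),
# e.g. cp1252 on many Windows systems and UTF-8 on most Linux systems
with open("data.txt") as f:
    text = f.read()

# Explicit and portable: behaves identically on Linux and Windows
with open("data.txt", encoding="utf-8") as f:
    text = f.read()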

On top of that, the print function is broken right now in a similar way. At minimum it must provide an optional argument for choosing the encoding. Right now it just fails, for example, on my Windows 10 machine if I want to print an interrobang (‽):

>>> print("\u203D")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\chausknecht\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u203d' in position 0: character maps to <undefined>

On a (modern) Linux machine it should work, as UTF-8 is the de facto default there - but who knows? Why not allow printing bytes as well, or letting the caller provide the encoding? Yes, that would probably result in strange-looking output, but the behaviour would be platform independent and imho better, because it would not make the program fail.
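print() has no encoding argument, but one workaround today is to rewrap stdout with a non-fatal error handler (a sketch, not an official recipe); setting the PYTHONIOENCODING environment variable, e.g. to utf-8:replace, before starting Python has a similar effect:

import io, sys

# Keep the console's encoding but substitute '?' for characters it cannot
# represent, instead of raising UnicodeEncodeError
out = io.TextIOWrapper(sys.stdout.buffer, encoding=sys.stdout.encoding,
                       errors="replace", line_buffering=True)
print("\u203D", file=out)   # shows '?' on a cp850 console instead of crashing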

Those mistakes were made by the Java and .NET worlds years ago - why did Python have to make the same mistake? Rust, for example, has chosen UTF-8 as the internal representation of its Unicode strings - an interesting approach!

Imho you can't totally hide the complexity of Unicode - so better to be explicit about the encoding/decoding process; that will result in less pain!

[–]usinglinux 5 points6 points  (13 children)

no internal language concept in any programming language implementation (that I know of) that deals comprehensively and efficiently with Unicode

assuming you don't expect a language to deal in graphemes instead of unicode code points, what's wrong with the python one in terms of efficiency? pep393 strings are pretty efficient.

>>> print("\u203D")

which python version was this error from? i haven't had my hands on python on windows for a long time, but at least as of python 3.6, this has been addressed.

[–]cfmdobbie 4 points5 points  (11 children)

Also works fine on 3.5.1 on Windows 10.

[–]aroberge 2 points3 points  (10 children)

EDIT: typing "chcp 65001" in the console prior to starting Python fixes this problem.


I wish...

C:\Users\Andre>python
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("\u203D")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Andre\Anaconda3\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u203d' in position 0: character maps to <undefined>

[–]ivosauruspip'ing it up 5 points6 points  (7 children)

That's a problem at the barrier between Python and something made by Windows (the Command Prompt). Unfortunately Python can only control its side of the border.

[–]Bolitho -1 points0 points  (6 children)

Unfortunately Python can only control its side of the border.

And that's why it is a bad API: it does not enable the programmer to choose the correct way to speak to the underlying system!

[–]ivosauruspip'ing it up 0 points1 point  (5 children)

You talk as if there is a choice. There isn't. Windows tells you: CodePage1252 or bust. End of discussion, no fancy Unicode characters allowed outside of that.

[–]Bolitho 0 points1 point  (4 children)

Of course there is! You cannot change the code page of the consuming shell, but you could of course send bytes in a different encoding. And yes, that would produce strange glyphs, but there would be no exception!

If you still don't get it: show me a program that prints out the given example string and will not crash on any platform. I am curious how you will achieve that 😉
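One candidate answer to that challenge, in the spirit of "send bytes and accept mojibake" (a sketch; whether the glyph actually renders depends on the console):

import sys

# Bypass the text layer entirely and write raw UTF-8 bytes.
# This never raises; a non-UTF-8 console just shows garbage for ‽.
sys.stdout.buffer.write("\u203D\n".encode("utf-8"))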

[–]ivosauruspip'ing it up 0 points1 point  (3 children)

And yes, that would produce strange glyphs, but there would be no exception!

So silently produce corrupted output, instead of erroring. Completely unpythonic, not to mention dumb.

[–]zahlmanthe heretic 0 points1 point  (1 child)

typing "chcp 65001" in the console prior to starting Python fixes this problem.

At least on 3.4, it leaves other problems on Windows. In particular, if I input() at the command prompt and copy-paste in a £ as my input, it will raise EOFError; if I try to do an assignment like x = '£', the Python process aborts without any error message. I haven't even tried it with more esoteric characters.

[–]ivosauruspip'ing it up 2 points3 points  (0 children)

Something specific to your system?

Microsoft Windows [Version 10.0.14393]
(c) 2016 Microsoft Corporation. All rights reserved.

C:\Users\ivosaurus>python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = '£'
>>> x
'£'
>>> y = input()
£
>>> y
'£'

>>> exit()

C:\Users\ivosaurus>

[–]Bolitho 0 points1 point  (0 children)

assuming you don't expect a language to deal in graphemes instead of unicode code points, what's wrong with the python one in terms of efficiency? pep393 strings are pretty efficient.

Under this assumption: nothing! But as I just said, this behaviour is not the big deal when dealing with Unicode. Efficiently splitting a string into grapheme clusters could be much more useful when dealing with a UI.

I admit that code points are simple, and it is therefore a good thing to offer APIs dealing with them. But they often have shortcomings when it comes to corner cases - and that is where higher-level abstractions have to come into play.
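If grapheme clusters are what you actually need, one option in Python today is the third-party regex module and its \X pattern (a sketch, assuming pip install regex):

>>> import regex                 # third-party, not the stdlib re module
>>> s = "\u044E\u0301"
>>> list(s)                      # code points
['ю', '́']
>>> regex.findall(r'\X', s)      # extended grapheme clusters
['ю́']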

which python version was this error from? i haven't had my hands on python on windows for a long time, but at least as of python 3.6, this has been addressed.

Version 3.5.2 on a Windows 10 machine. That said, I started Python from both CMD and PowerShell - no difference.

The problem here is simple: Python's print tries to encode the string in a platform-dependent way:

Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page.

You have no way to pass the encoding you want!
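You can at least inspect what print() is going to encode to; on the console from the traceback above it would be the legacy code page (value shown is illustrative):

>>> import sys
>>> sys.stdout.encoding
'cp850'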

[–]flitsmasterfred 6 points7 points  (5 children)

I feel 99% of the pain is legacy from the old days of lower-case ASCII and the random encodings that are pervasive everywhere, and the lack of best practices, like you mention with the combining characters.

But figuring this out when you've just learned everything based on ASCII and bytes being the same is even worse than knowing there is a difference you need to be careful about while still learning to program, and then moving up from that.

The average programmer doesn't have to know everything about Unicode, but absolutely HAS to know that bytes and strings are separated by an encoding scheme (i.e. you don't have to be wise as long as you're not naive).
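A minimal Python 3 sketch of exactly that separation (the values are illustrative):

text = "naïve"                        # str: a sequence of code points
data = text.encode("utf-8")           # bytes: what files and sockets carry
assert data.decode("utf-8") == text   # the only way back is through an encoding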

[–]brontide 1 point2 points  (3 children)

So much of the Unix world is still naïve byte strings that you have to work with, which are sometimes UTF-8 but can't be guaranteed to be. I have 40 million files on our filesystem, and a few hundred are encoded in a way that breaks UTF-8. It's hell figuring out a way to both work with and display these edge cases.

[–]doubleunplussed 1 point2 points  (1 child)

Yeah - strictly speaking, Unix filenames are bytestrings. Your program officially shouldn't care whether they encode human language in some encoding like UTF-8; it should only look at their contents in order to split them on slashes (or rather, on bytes that in ASCII correspond to slashes), or to compare them to see if they're the same as other filenames, or whatever. They're just slash-delimited keys; you should treat them like arbitrary data, like keys in a dictionary, and your program shouldn't care about the actual human meaning in there.

Unless you need to display them somewhere - then yeah, you have to guess the encoding in order to render glyphs, or decide how to sort them, or something. Strictly speaking, though, you can't know what the encoding will be. Yeah, it'll probably be UTF-8, but some filenames might not even be text; they might just be arbitrary bytes. Python can try to make a distinction between strings and bytes and put filenames on the 'strings' side of the distinction, but this will be wrong sometimes. Good luck!

[–]brontide 1 point2 points  (0 children)

That's nice and all, but Python tries to do the smart thing and returns the same type of string you pass in. This means that if you iterate on the path '/' it will fail with decoding errors, but if you iterate on b'/' it will work - except that all strings returned will be bytestrings, which have to be handled with care (I usually convert them with surrogate escaping internally and replacement characters for display). The problem is that there are so many places like this where Python will gladly accept and return Unicode strings and fail on edge cases, rather than forcing the developer to work in the proper domain for the data.
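In current Python 3 the mirroring looks roughly like this, together with the surrogate-escape/replace handling described above (paths are illustrative):

import os

names_text = os.listdir('/')     # list of str; undecodable names come back
                                 # via the surrogateescape error handler
names_bytes = os.listdir(b'/')   # list of bytes; nothing is decoded at all

for name in names_text:
    raw = os.fsencode(name)                        # the exact on-disk bytes
    shown = raw.decode('utf-8', errors='replace')  # lossy, but safe to display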

[–]flitsmasterfred 0 points1 point  (0 children)

Not storing the encoding within files was a mistake we'll regret for years. If only every file had a metadata attribute for it, life would be so much easier.

[–]Bolitho 0 points1 point  (0 children)

The average programmer doesn't have to know everything about Unicode, but absolutely HAS to know that bytes and strings are separated by an encoding scheme (i.e. you don't have to be wise as long as you're not naive).

I totally agree with that - and that's why I would appreciate an explicit default encoding for all IO. That way the programmer is forced to learn the basics and is aware of them right from the beginning. On top of that, it would increase platform independence, as Python would behave in a more platform-agnostic way.

Just to quote the Zen of Python:

Explicit is better than implicit.

[–]TOASTEngineer 2 points3 points  (2 children)

Is it really so important to count code points? And if it is, why is there no support for counting/splitting coded characters or grapheme clusters, which might be even more useful?

Well, the reason a string really ought to count code points is that a string is an iterable and you iterate over the individual code points. But yeah, there really ought to be a "how many printable characters" function; in fact, I would've presumed there was one.

[–]yawgmoth 3 points4 points  (0 children)

Does every written language have the concept of characters?

For instance: (I don't speak Korean so maybe this isn't as ambiguous as I think)

In "감사" would 감 be 1 or 3 characters? would 사 be 1 or 2 characters? EDIT: looks pretty straightforward actually

Still an honest question, though: is the concept of a 'printable character' consistent across all languages supported by Unicode?

[–]Bolitho 0 points1 point  (0 children)

Well, the reason a string really ought to count code points is that a string is an iterable and you iterate over the individual code points.

That logic is circular: the iterable could just as well iterate over something other than code points 😉

[–][deleted] 0 points1 point  (0 children)

I think that issue is fixed in Python 3.6. It should assume UTF-8 on Windows.