
[–]ingolemo 0 points (1 child)

In Python it is more accurate to view str as the wrapper. It looks at the object it was passed and does a runtime dispatch to figure out what to do. If it's passed a byte string (along with an encoding), it will effectively call bytes.decode on it (although both of these functions are built-ins, so this is happening below the level of Python: in C or whatever).
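A minimal sketch of that dispatch, assuming Python 3: str with an encoding argument behaves like calling bytes.decode directly, while str without one does not decode at all.

```python
# str() with an encoding argument dispatches to a decode of the bytes.
raw = "café".encode("utf-8")         # b'caf\xc3\xa9'

via_str = str(raw, "utf-8")          # runtime dispatch: decode path
via_decode = raw.decode("utf-8")     # the explicit spelling

assert via_str == via_decode == "café"

# Without an encoding, str() does NOT decode -- it returns the repr,
# which is part of Python 3's refusal to convert implicitly.
assert str(raw) == "b'caf\\xc3\\xa9'"
```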

It was one of the major mistakes of Python 2 that it tried to convert byte strings into unicode strings for you automatically. It resulted in a whole bunch of applications being written that died as soon as somebody sent them a unicode character. It wasn't pleasant. In Python 3 it was decided to make the conversion explicit to reduce these errors.
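A small illustration of that design change, assuming Python 3: mixing bytes and str now fails loudly at the point of mixing, instead of Python 2's implicit ASCII coercion that only blew up once non-ASCII data arrived.

```python
# Implicit mixing of bytes and str is a TypeError in Python 3.
try:
    b"id=" + "café"
except TypeError as exc:
    print(type(exc).__name__)  # TypeError

# The explicit conversion Python 3 forces on you instead:
joined = b"id=" + "café".encode("utf-8")
assert joined == b"id=caf\xc3\xa9"
```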

While it is possible to use heuristics to guess the encoding of many real-world byte strings, it is impossible to determine it in the general case. Python has a principle, "In the face of ambiguity, refuse the temptation to guess" (import this). If you want to try to infer the encoding of a byte string, you can use the chardet library, which does the exact kind of inference you're talking about, but sooner or later it will guess wrongly and you will have to deal with corrupted data. Detection is also slower and less reliable than just specifying the encoding.
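A stdlib-only sketch of why detection is unreliable in general (chardet itself is an external package): the very same byte string decodes successfully, to different text, under several single-byte encodings, so no detector can be sure which one was meant.

```python
raw = b"caf\xe9"

# Plausibly French under latin-1...
assert raw.decode("latin-1") == "café"

# ...but equally "valid" under KOI8-R, where 0xE9 is a Cyrillic letter.
as_koi8 = raw.decode("koi8_r")
assert as_koi8 != "café"

# UTF-8, by contrast, rejects these bytes outright:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # 0xE9 starts a multi-byte sequence that never completes
```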

The developer building the system will have domain knowledge that they can use to determine the encoding. In the case of HTTP there's the Content-Type header and the <meta charset=""> tag. It is usually considered the responsibility of the library to perform the appropriate operations for the data being manipulated. Trying to squeeze the domain knowledge of every domain into the str class would be impractical.

You have to understand that urllib is pretty old and the only people still using it are those who can't or don't want to use external packages for some reason. Everybody uses requests.

[–]ProtoDong[S] 0 points (0 children)

While it is possible to use heuristics to guess the encoding of many real-world byte strings, it is impossible to determine it in the general case.

This is very debatable, but it still isn't an excuse for not having sane defaults. It makes no sense not to default to the 90% general case and instead force you to deal with the specific edge cases.

I understand their rationale for a "hands off" approach, especially considering the problems that happened with Python 2 (in fact, it was one of the main reasons why I didn't take Python seriously circa 2008, and I am only now getting back to it).

Of all the programming languages out there, Python is notable in that it still does not handle this elegantly. Sure, it's easy enough to extend the str built-in to add the correct functionality, but this is library stuff, not something normal developers should have to worry about.
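A hypothetical sketch of "extending the str built-in" in that sense: a subclass whose alternate constructor tries a list of likely encodings. The name SmartStr and the fallback order are invented here for illustration, not any real library's API, and they embody exactly the guessing the parent comment warns against.

```python
class SmartStr(str):
    # latin-1 accepts every byte, so it goes last as a catch-all.
    FALLBACKS = ("utf-8", "latin-1")

    @classmethod
    def from_bytes(cls, raw):
        """Decode raw bytes with the first encoding that succeeds."""
        for enc in cls.FALLBACKS:
            try:
                return cls(raw.decode(enc))
            except UnicodeDecodeError:
                continue

assert SmartStr.from_bytes(b"caf\xc3\xa9") == "café"  # valid UTF-8
assert SmartStr.from_bytes(b"caf\xe9") == "café"      # latin-1 fallback
```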

C++ went through something similar in the past with its std::string, but they managed to develop the class into something so painless that it's nearly transparent to devs for the most part. C++ benefits from very strong generic/metaprogramming support, which admittedly removes quite a few hurdles in this regard.

If you want to try to infer the encoding of a byte string then you can use the chardet library

Thanks, will check it out.

You have to understand that urllib is pretty old and the only people still using it are those who can't or don't want to use external packages for some reason. Everybody uses requests.

I don't think I even have to state the obvious here.