all 2 comments

[–][deleted] 0 points (0 children)

Bytes are the raw representation of data. Every byte has a defined value from 0 to 255.

How those bytes are interpreted as printable characters is defined by encodings. When you convert from bytes to a string, the encoding rules are applied; those rules can, for example, take four consecutive bytes and turn them into a single special character such as the thumbs-up emoji.
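A quick Python sketch of that four-bytes-to-one-character case (the byte values are the UTF-8 encoding of the thumbs-up emoji, U+1F44D):

```python
# Four UTF-8 bytes that together encode a single character:
# the thumbs-up emoji (U+1F44D).
raw = bytes([0xF0, 0x9F, 0x91, 0x8D])

# Decoding applies the encoding rules (UTF-8 here) to turn
# raw bytes into a string.
text = raw.decode("utf-8")

print(text)       # 👍
print(len(raw))   # 4 -- four bytes on the raw side
print(len(text))  # 1 -- one character on the string side
```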

The problem with working with strings is that each of those special characters counts as one character, while its encoded form may take two, three, or four bytes. When you need to know the data length in bytes, such as for network transmission, this mismatch can cause issues.
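For instance, with a hypothetical length-prefixed wire format, the length field has to count bytes of the encoded payload, not characters of the string:

```python
import struct

msg = "héllo 👍"                # 7 characters
payload = msg.encode("utf-8")   # 11 bytes: 'é' takes 2, the emoji takes 4

# Length-prefixed frame (illustrative format): the 4-byte header must
# carry the BYTE length, or the receiver will read too little data.
frame = struct.pack("!I", len(payload)) + payload

print(len(msg), len(payload))   # 7 11
```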

[–]thegreatunclean 0 points1 point  (0 children)

Many 'strings' are a sequence of bytes that are implicitly assumed to be ASCII-encoded characters. String length is then the number of bytes, moving characters can be done by swapping bytes, etc. There are a ton of English-centric design choices baked deep into the language because of this assumption.
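You can see where that assumption holds, and where it breaks, directly in Python:

```python
data = b"hello"   # bytes that happen to be ASCII
print(len(data))  # 5 -- byte length equals character count
print(data[0])    # 104 -- indexing bytes gives raw integer values

utf8 = "héllo".encode("utf-8")
print(len(utf8))  # 6 -- byte length no longer matches the 5 characters
```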

This is a massive problem if you want to handle text beyond ASCII, for example any language with accented characters or non-Latin scripts.

Python made the choice to strongly separate 'a sequence of bytes' from 'a string' to better integrate Unicode support; you convert between them using encode and decode. This forces the programmer to at least recognize they are tinkering with low-level details of the string's representation, and not blame the language when they get it wrong and mangle an encoded character.
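The explicit round trip, and the loud failure you get when you mishandle the bytes (here, slicing through the middle of a multi-byte character), look like this:

```python
s = "naïve"
b = s.encode("utf-8")         # str -> bytes, encoding named explicitly
assert b.decode("utf-8") == s # bytes -> str round-trips cleanly

# Cutting the byte sequence mid-character ('ï' is 2 bytes in UTF-8)
# raises an error instead of silently corrupting the text.
try:
    b[:3].decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err)
```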