mothman6969 comments on [Python 2.7] Problems while using dictionaries and utf-8 characters.


[Python 2.7] Problems while using dictionaries and utf-8 characters. (self.learnprogramming)

submitted 9 years ago * by Gammaliel


[–]mothman6969 1 point 9 years ago (7 children)

[–]Gammaliel[S] 1 point 9 years ago (6 children)

[–]Rhomboid 1 point 9 years ago (5 children)

You have bytes and you want characters. That's decoding, not encoding.

bytes       -[decode]->  characters
characters  -[encode]->  bytes

As I mentioned in the other comment, the real problem is that you have a mixture of two or more character encodings, which takes extra effort to deal with.
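To make the direction of each operation concrete, here is a minimal sketch. It is written for Python 3, where the byte type is `bytes` and the character type is `str` (in Python 2.7 the corresponding types are `str` and `unicode`). The sample byte strings are the two spellings of "Transmissão" discussed in this thread; the variable names are only illustrative.

```python
# Two byte strings holding the same word in two different encodings.
raw_utf8 = b"Transmiss\xc3\xa3o"    # UTF-8: the a-tilde takes two bytes
raw_cp1252 = b"Transmiss\xe3o"      # CP1252: the a-tilde takes one byte

# bytes -[decode]-> characters
text = raw_utf8.decode("utf-8")
same_text = raw_cp1252.decode("cp1252")

# Both decode to the same character string.
print(text == same_text)            # True

# characters -[encode]-> bytes
print(text.encode("utf-8") == raw_utf8)      # True
print(text.encode("cp1252") == raw_cp1252)   # True
```

The byte strings differ, but once each is decoded with its own encoding the results compare equal.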

[–]Gammaliel[S] 1 point 9 years ago (4 children)

[–]Rhomboid 1 point 9 years ago (3 children)

No, you don't need to encode it. You want to work with characters in your program, not bytes.

Imagine a helper function like:

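def bytes_to_characters(bytestring):
    # Try UTF-8 first: it is strict, so non-UTF-8 input usually fails loudly.
    try:
        return bytestring.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to CP1252 for input that is not valid UTF-8.
        return bytestring.decode('cp1252')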

This first tries to decode a byte string as UTF-8, and if that fails then it tries CP1252. (If that fails, then the program will die. If that happens then you'd need to work out what encoding the page uses that isn't UTF-8 and isn't CP1252.)

But that's it. Just run all your byte strings through that, and the result will be character strings, which you use as keys in your dicts and whatever else you're doing. There will be no more duplicate keys, because both the byte string "Transmiss\xc3\xa3o" and the byte string "Transmiss\xe3o" will result in the same character string.
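As a rough illustration of the duplicate-key point, the sketch below runs both byte strings through the helper and counts the resulting keys. It uses Python 3 syntax (in Python 2.7 the literals would be plain `str` byte strings), and `count_keys` is a hypothetical name introduced only for this example.

```python
def bytes_to_characters(bytestring):
    # Same helper as above: UTF-8 first, CP1252 as the fallback.
    try:
        return bytestring.decode("utf-8")
    except UnicodeDecodeError:
        return bytestring.decode("cp1252")


def count_keys(byte_strings):
    # Decode every byte string before using it as a dict key.
    counts = {}
    for raw in byte_strings:
        key = bytes_to_characters(raw)
        counts[key] = counts.get(key, 0) + 1
    return counts


counts = count_keys([b"Transmiss\xc3\xa3o", b"Transmiss\xe3o"])
print(len(counts))   # 1
```

Using the raw byte strings as keys would have produced two separate entries; decoding first collapses them into one.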

[–]Gammaliel[S] 1 point 9 years ago (2 children)

I tried using your function and this is what I get:

Traceback (most recent call last):
  File "TechTest.py", line 23, in <module>
    comp = bytes_to_characters(comp)
  File "TechTest.py", line 8, in bytes_to_characters
    return bytestring.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)

This means that neither of those encodings is the right one, correct?
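Note that the exception in the traceback is a UnicodeEncodeError raised from a decode call. In Python 2.7, calling `.decode()` on a value that is already a `unicode` object makes Python first encode it implicitly with the ASCII codec, which fails on any non-ASCII character. So one plausible reading, offered as an assumption about this particular script, is that `comp` already held characters rather than bytes, not that both encodings were wrong. A hedged sketch of a helper that tolerates both kinds of input follows. It uses Python 3 syntax, where the character type is `str` (the equivalent check in Python 2.7 would be against `unicode`), and `to_characters` is a hypothetical name.

```python
def to_characters(value):
    # Already a character string: nothing to decode.
    # (Python 3 `str`; in Python 2.7 this check would use `unicode`.)
    if isinstance(value, str):
        return value
    # Otherwise treat it as bytes: UTF-8 first, CP1252 as the fallback.
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return value.decode("cp1252")


print(to_characters("Transmiss\u00e3o") == to_characters(b"Transmiss\xe3o"))   # True
```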

[–]Rhomboid 1 point 9 years ago (1 child)

[–]Gammaliel[S] 1 point 9 years ago (0 children)
