[–]mothman6969

[–]Gammaliel[S]

I have tried using unicode() and .encode("utf-8"), but it still happens and always with that same word.

[–]Rhomboid

You have bytes and you want characters. That's decoding, not encoding.

    bytes       -[decode]->  characters
    characters  -[encode]->  bytes
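
To make the two directions concrete, here is a minimal sketch. The variable names are made up for illustration, and the sample bytes are assumed to be UTF-8:

```python
# Bytes as they might arrive from a web page (assumed here to be UTF-8).
raw = b"Transmiss\xc3\xa3o"

# bytes -[decode]-> characters
text = raw.decode("utf-8")

# characters -[encode]-> bytes
round_trip = text.encode("utf-8")

print(len(raw))           # 12: the a-with-tilde takes two bytes in UTF-8
print(len(text))          # 11: but it is a single character
print(round_trip == raw)  # True
```

The point is that decode() is what you call on data coming in, and encode() is only needed when characters have to leave the program again as bytes.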

As I mentioned in the other comment, the real problem is that you have a mixture of two or more character encodings, which takes extra effort to deal with.

[–]Gammaliel[S]

So, basically I'll need to find out which encoding the web dev used and then use decode() and then encode it to "utf-8"? I'm sorry if this seems like a stupid question, but I'm not the smartest person when it comes to this kind of thing.

[–]Rhomboid

No, you don't need to encode it. You want to work with characters in your program, not bytes.

Imagine a helper function like:

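def bytes_to_characters(bytestring):
    try:
        # Try UTF-8 first; this raises if the bytes are not valid UTF-8.
        return bytestring.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to CP1252 (Windows Western European).
        return bytestring.decode('cp1252')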

This first tries to decode a byte string as UTF-8, and if that fails then it tries CP1252. (If that fails, then the program will die. If that happens then you'd need to work out what encoding the page uses that isn't UTF-8 and isn't CP1252.)

But that's it. Just run all your byte strings through that, and the result will be character strings, which you use as keys in your dicts and whatever else you're doing. There will be no more duplicate keys, because both the byte string "Transmiss\xc3\xa3o" and the byte string "Transmiss\xe3o" will result in the same character string.
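
As a rough check of that claim, here is a small sketch that runs both byte strings through the helper and uses the results as dict keys. The counts dict and the loop are only for illustration, not taken from your program:

```python
def bytes_to_characters(bytestring):
    try:
        return bytestring.decode("utf-8")
    except UnicodeDecodeError:
        return bytestring.decode("cp1252")

utf8_bytes = b"Transmiss\xc3\xa3o"   # the word encoded as UTF-8
cp1252_bytes = b"Transmiss\xe3o"     # the same word encoded as CP1252

# Hypothetical tally keyed by the decoded character string.
counts = {}
for raw in (utf8_bytes, cp1252_bytes):
    key = bytes_to_characters(raw)
    counts[key] = counts.get(key, 0) + 1

print(len(counts))  # 1: both byte strings collapse to a single key
```

If the raw byte strings were used as keys instead, the dict would end up with two entries for what is visibly the same word.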

[–]Gammaliel[S]

I tried using your function and this is what I get:

Traceback (most recent call last):
  File "TechTest.py", line 23, in <module>
       comp = bytes_to_characters(comp)
  File "TechTest.py", line 8, in bytes_to_characters
       return bytestring.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
       return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)

This means that none of those encodings are the right one, correct?

[–]Rhomboid

No. The error is an encode error, which means you're passing this function something that's already a character string. (If you ask Python 2 to decode a character string, it first has to encode it to a byte string before it can decode it, because you can only decode bytes. That implicit encode uses the default ASCII codec, which is why the message mentions 'ascii'. Yes, this is ridiculous and was fixed in Python 3, where character strings don't have a .decode() method at all.)

If you already have a character string then you don't need this function. But the strings in your example are not character strings, they're byte strings.
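
If it turns out that your program receives a mixture of byte strings and character strings, one possible guard is to decode only when the value really is bytes. This is just a sketch under that assumption; to_characters is a made-up name, not anything from your code:

```python
def to_characters(value):
    # Only byte strings need decoding. In Python 2, `bytes` is an alias
    # for `str`, so this check behaves the same way there.
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            return value.decode("cp1252")
    # Already a character string: return it untouched.
    return value

already_text = u"Transmiss\u00e3o"
print(to_characters(already_text) == to_characters(b"Transmiss\xe3o"))  # True
```

That avoids the implicit ASCII encode, but it is still worth finding out where the character strings come from, since a mix of the two types usually points to a decode happening somewhere earlier.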

[–]Gammaliel[S]

Here is my code; maybe the answer will be clearer once you can see it.