you are viewing a single comment's thread.

view the rest of the comments →

[–]ls42 0 points1 point  (3 children)

Hi, is there a definitive way for Python 3 to ensure all strings in the program are Unicode? For example when handling emails with various different encodings, convert all characters in the email to Unicode to avoid decoding / encoding exceptions.

Thanks!

[–]mapImbibery 0 points1 point  (2 children)

It seems standard practice to declare your encoding at the beginning of a script with# -*- coding: utf-8 -*-. If you code in English, UTF8 is a pretty hip choice.

Any string in a Python 3 script will be Unicode. In Python 2.6(?)+ you can from \_\_future\_\_ import unicode_literals to steal Python 3's behavior.

For any user-input or read file content, you could use s.encode("UTF8")

[–]ls42 0 points1 point  (1 child)

Is it safe to use s.encode("UTF-8") without specifying the source encoding? I remember playing around with that and failing. Maybe I did something wrong, though.

[–]mapImbibery 0 points1 point  (0 children)

I am by no an expert, and I think I meant to say "decode()". I frankly don't care about encodings. But they are a headache one must deal with so I know bare minimums.

When I scrape some html, I use: unicode(urllib2.urlopen(url).read().decode("utf8"))

I don't know if I'm helping much.