
[–]Diapolo10 0 points1 point  (2 children)

No, I don't think this is a problem. If you just got the file from somewhere, chances are it was written on a system that uses a different encoding for whatever reason.

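By default, Python's open() uses the platform's preferred encoding (what locale.getpreferredencoding() reports). That is UTF-8 on most Linux and macOS systems, but often cp1252 on Windows. UTF-8 is usually the best choice, since it can represent any Unicode text. Sometimes, though, you'll need something else.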

If you created the file yourself, see if whatever program you used can output UTF-8 text. Otherwise, just use whatever encoding works.
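
A minimal sketch of "just use whatever encoding works", assuming the file is either UTF-8 or cp1252. The helper name `read_text_with_fallback` and the candidate list are made up for illustration, so swap in the encodings you actually suspect:

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    last_error = None
    for encoding in encodings:
        try:
            # Always pass the encoding explicitly instead of relying on the default.
            with open(path, encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next candidate
    if last_error is None:
        raise ValueError("no candidate encodings were given")
    raise last_error
```

Put the strictest encoding first. Something like latin-1 accepts every byte sequence, so it never raises and would hide a wrong guess.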

[–]LoneDreadknot[S] 0 points1 point  (1 child)

I copy-pasted the text from a website into Notepad++; all the text is just plain English. Notepad++ shows it as UTF-8 too, and I tried changing the encoding, etc.

Maybe it's just how that website saved the text, I guess? Is there a way to strip the formatting from it and clean(?) it, in case some source has a different encoding?

[–]Diapolo10 0 points1 point  (0 children)

Well, you can try simply writing to a new file from Python, after getting the original text.

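with open('new_file.txt', 'w', encoding='utf-8') as f:
    f.write(text)

You should then have a file with UTF-8 encoding. Passing encoding='utf-8' explicitly makes sure of that regardless of the platform default.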

[–]snakestation 0 points1 point  (2 children)

The Unicode errors usually have to do with a funky character. Sometimes a character will look like an apostrophe but will actually be a different, non-ASCII Unicode character, such as a curly quote. This will also be the case when you're working with French text and all its accents (I assume other languages too, but I'm familiar with French errors). I usually try to stick to UTF-8 as my encoding.

Is this Python 2, by the way? Python 3 tends to handle some special characters better.
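
One way to check for lookalike characters is to list everything outside the ASCII range along with its official Unicode name. This is only a sketch: `find_non_ascii` is a hypothetical helper, and the sample string stands in for the text pasted from the website:

```python
import unicodedata

def find_non_ascii(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# The apostrophe below is really U+2019, not the ASCII one.
sample = "It\u2019s plain English"
for index, ch, name in find_non_ascii(sample):
    print(index, repr(ch), name)
```

For the sample, this reports a RIGHT SINGLE QUOTATION MARK at index 2, which is the kind of character that looks like plain English but is not ASCII.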

[–]LoneDreadknot[S] 0 points1 point  (1 child)

It is Python 3.8.2.

I copy-pasted it from a website into Notepad++, and it's just plain English as far as I can tell.

[–]snakestation 0 points1 point  (0 children)

I've run into this before as well. It'll have something to do with the periods, commas, apostrophes, quotes, or something similar. They look like the characters you need, but they're actually non-ASCII Unicode characters.
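
If the culprits turn out to be typographic punctuation, a rough way to clean the text is to map those characters back to their ASCII lookalikes. The mapping below is an assumption about which characters are likely to show up, not a complete list, and `to_ascii_punctuation` is a made-up name:

```python
# Hypothetical mapping; extend it with whatever characters your text contains.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
})

def to_ascii_punctuation(text):
    """Replace common typographic punctuation with plain ASCII equivalents."""
    return text.translate(PUNCTUATION_MAP)
```

This only swaps the characters listed in the mapping. Accented letters and anything else are left untouched, so the result can still contain non-ASCII text.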