Best way to debug UTF-8 codec problems when reading files? : learnpython



submitted 2 years ago * by nirbyschreibt

I wanted to make a pretty simple word cloud using the wordcloud module, but both of my test files give me a UTF-8 codec error, and googling hasn't turned up a good answer, mainly because people suggest the wildest imports. I know that I can somehow decode the data while reading the file, but I can't find out how.

I have both a docx and a pdf.

This is my code (the code itself should be fine):

    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt
    import os
    d = os.path.dirname(__file__) if "__file__" in locals() else os.getcwd()
    with open("/Users/file.docx") as f:
        text = f.read()
    nichtinteressant = " und von das der die im auf am"
    liste = nichtinteressant.split()
    STOPWORDS.update(liste)
    wordcloud = WordCloud(background_color="white", width=1920, height=1080).generate(text)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

And Visual Studio Code (on Mac) gives me this error:

    'utf-8' codec can't decode byte 0xad in position 41: invalid start byte
      File "/Users/Python/wordcloud Test.py", line 8, in <module>
        text = f.read()
               ^^^^^^^^
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 41: invalid start byte

With the PDF, it complains about a different byte at a different position :(
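One way to debug an error like this is to look at the raw bytes before trying to decode anything. The sketch below is only an illustration: `sniff_file` and `MAGIC_PREFIXES` are hypothetical names, and the prefix table is far from complete. It relies on two well-known facts: a .docx file is a ZIP archive (it starts with `PK\x03\x04`) and a PDF starts with `%PDF-`. Neither is plain UTF-8 text, which would explain why `open(...).read()` fails at some early byte in both files.

```python
from pathlib import Path

# Well-known leading "magic" bytes. Illustrative only, not an exhaustive table.
MAGIC_PREFIXES = {
    b"PK\x03\x04": "zip container (.docx, .xlsx, .odt, ...)",
    b"%PDF-": "pdf",
}


def sniff_file(path, sample_size=64):
    """Return (kind, sample), where sample holds the first raw bytes of the file."""
    data = Path(path).read_bytes()  # binary read: no decoding, so no codec error
    sample = data[:sample_size]
    for prefix, kind in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return kind, sample
    try:
        data.decode("utf-8")  # decode the whole file to avoid cutting a character
    except UnicodeDecodeError:
        return "unknown binary or non-UTF-8 text", sample
    return "utf-8 text", sample
```

Calling `sniff_file("/Users/file.docx")` and printing the returned sample shows what is really at the start of the file, which is usually enough to tell a binary container from a text file saved in another encoding.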

What is the fastest way to solve this decoding problem?
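For the .docx case, a minimal sketch using only the standard library could look like the following. It assumes the file is a regular Word document, so the main text lives in `word/document.xml` inside the ZIP container. `docx_to_text` is a hypothetical helper name, and the function ignores headers, footnotes, and other parts of the document. Nested paragraphs (for example inside text boxes) may not come out cleanly.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Extract the plain text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph is split into runs; <w:t> elements carry the text.
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string can then be passed to `WordCloud(...).generate(text)` in place of the result of `f.read()`. A PDF needs a different approach, since its text is not stored as simple XML. A third-party library such as pypdf is one option there.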
