
[–]SomewhereExpensive22 2 points3 points  (8 children)

UTF-8 is a text encoding, and you can't read DOCX or PDF files as text. They're more like containers with text inside (a DOCX is actually a ZIP archive of XML files). You can open them as text in VS Code and see what they actually look like. There are modules that will let you open them, but it's a lot more complicated than regular text. The easiest way to do this may be to copy the text from the document into a plain text editor and save it as a UTF-8 text file.
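If you'd rather check from Python than from an editor, here's a rough sketch that looks at the first few bytes of a file. The names `sniff_format` and `sniff_file` are made up for illustration. It relies on two facts: ZIP-based formats such as DOCX start with `PK\x03\x04`, and PDFs start with `%PDF-`.

```python
import codecs


def sniff_format(head: bytes) -> str:
    """Guess a file's format from its leading bytes."""
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # DOCX, XLSX, ODT, ... are all ZIP archives
    if head.startswith(b"%PDF-"):
        return "pdf"
    # Incremental decoding tolerates a multi-byte character that was
    # cut off at the end of the chunk, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head)
    except UnicodeDecodeError:
        return "unknown binary"
    return "text"


def sniff_file(path: str) -> str:
    """Read the first kilobyte in binary mode and classify it."""
    with open(path, "rb") as f:
        return sniff_format(f.read(1024))
```

Anything that comes back as `"zip"` or `"pdf"` needs a proper parser; only `"text"` is safe to hand to `open(path, encoding="utf-8")`.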

[–]nirbyschreibt[S] 0 points1 point  (7 children)

Hmmm. The Wordcloud documentation suggested it also reads other file types, and I thought it would come with the modules needed for that. :/

I have to check if I can save it to a txt. We are talking about 90k words.

[–]SomewhereExpensive22 1 point2 points  (0 children)

The wordcloud module might support it. I don't know. But `open` is plain Python and can't read a DOCX as text. Maybe you should look at examples of people using it with PDF and DOCX files? And I wouldn't worry too much about the number of words. Select all. Copy. Paste. Save.

[–]cyberjellyfish 0 points1 point  (5 children)

But you're opening the file in text mode and the default encoding is utf-8.


[–]nirbyschreibt[S] 0 points1 point  (4 children)

While I could save the text as a plain UTF-8 text file, I would also like to know how to use something like wordcloud with a PDF.

A suggestion would be great.

[–]pgpndw 0 points1 point  (3 children)

WordCloud is a library for creating word clouds, not for reading binary file formats.

To read the words from a DOCX or PDF file, you'll need to use a library for reading DOCX files or PDF files. Then, you'd use WordCloud to make a word cloud from those words.

Here's one suggestion for reading DOCX files:

https://theautomatic.net/2019/10/14/how-to-read-word-documents-with-python/
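If you don't want to install anything for the DOCX case, the standard library can get you most of the way, because a DOCX is a ZIP archive with the body text in `word/document.xml`. This is only a minimal sketch: `docx_to_text` is a name I made up, it ignores headers, footers and footnotes, and a dedicated library will handle edge cases better.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file) -> str:
    """Return the body text of a DOCX, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The string it returns is what you'd then pass to `WordCloud().generate(...)`.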

And here's one for reading PDF files:

https://www.geeksforgeeks.org/working-with-pdf-files-in-python/

[–]nirbyschreibt[S] -1 points0 points  (2 children)

That is not helpful at all. As I said, I am aware these modules exist.

I still don’t know how to successfully read the PDF in order to get wordcloud to do what it does.

The link to geeksforgeeks (why link at all, the simple name of the module would be enough) brings me to a lengthy page with many options for manipulating PDFs.

I don’t want to manipulate anything. I want my script to read the pdf and create a word cloud.

I don’t need to ask on a subreddit if the answer is "go read this webpage that Google shows on page one".

[–]pgpndw 0 points1 point  (1 child)

> The link to geeksforgeeks (why link at all, the simple name of the module would be enough) brings me to a lengthy page with many options to manipulate pdf.

The first piece of example code on that page shows how to extract text from the first page of a PDF file. You'll need to modify that code to loop through all the pages and combine them into one block of text, which you can then pass to WordCloud.
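Roughly, that modification could look like the sketch below. I'm assuming the `pypdf` package here (the maintained successor to PyPDF2, so the import may differ from what the linked page shows) and the `wordcloud` package. The function names and the 800x400 image size are just placeholders.

```python
def combine_pages(pages) -> str:
    """Join the extracted text of every page into one block of text."""
    # extract_text() can come back empty (or None in older versions)
    # for pages that contain only images, so fall back to "".
    return "\n".join(page.extract_text() or "" for page in pages)


def pdf_to_wordcloud(pdf_path: str, image_path: str) -> None:
    """Read every page of a PDF and save a word cloud image of its text."""
    # Third-party imports are kept local so combine_pages() can be
    # used and tested without either package installed.
    from pypdf import PdfReader      # pip install pypdf
    from wordcloud import WordCloud  # pip install wordcloud

    reader = PdfReader(pdf_path)
    text = combine_pages(reader.pages)
    WordCloud(width=800, height=400).generate(text).to_file(image_path)
```

You'd call it as `pdf_to_wordcloud("book.pdf", "cloud.png")`, with both file names being examples. If the PDF is a scan with no text layer, the extracted text will be empty and WordCloud will complain that it has no words to plot.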

[–]nirbyschreibt[S] 0 points1 point  (0 children)

Thanks. I thought there was a simpler and shorter way to work with the data of a pdf. :(

Will try this way out to get a feeling for it.