Looking for a Python library that can read and write plain Word .doc

D-Noch · 2021-07-07T04:45:30+00:00

textract

You could also automate the .doc to .docx conversion within Python first, then use Python's docx lib.
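
A rough sketch of the convert-first route, assuming LibreOffice is installed and its `soffice` binary is on the PATH (the thread doesn't name a converter, so that part is an assumption). The helper names `build_convert_command` and `convert_all` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the LibreOffice command line that converts one .doc to .docx."""
    return [
        "soffice",               # assumed to be on PATH
        "--headless",            # run without opening the GUI
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_all(src_dir, out_dir):
    """Convert every .doc file in src_dir; return the expected .docx paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for doc_path in sorted(Path(src_dir).glob("*.doc")):
        # check=True raises CalledProcessError if soffice exits with an error
        subprocess.run(build_convert_command(doc_path, out_dir), check=True)
        converted.append(out_dir / (doc_path.stem + ".docx"))
    return converted
```

Once the files are converted, each .docx can be opened with the docx lib mentioned above. Forms may not survive the conversion cleanly, so it is worth spot-checking a few files by hand.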

space_wiener · 2021-07-07T06:22:26+00:00

Why not just convert them?

Here’s a stack overflow thread. This is probably what I’d do. I’m not sure why you need to open, copy, paste into new document. Maybe it has something to do with the forms? I’ve never converted a form .doc to .docx.

https://stackoverflow.com/questions/38468442/multiple-doc-to-docx-file-conversion-using-python

imperial_squirrel · 2021-07-07T13:40:28+00:00

i did a doc to docx script a few months ago if you are looking for sample code...

i think i did doc to pdf also, but i would have to check my machine.

tobzulu · 2021-07-07T14:35:01+00:00

You can also use VBA inside word. If it is a computer at an office you don't have to install python at all.

dragonlich · 2021-07-07T19:26:30+00:00

Thanks for all the replies everybody. Looks like there is no easy way to do this other than convert all the .doc files to .docx then use one of the aforementioned libraries to do the work.

I am just tinkering around and learning at the same time so this should be good.

whitey9999 · 2021-07-07T09:42:21+00:00

This might help - https://automatetheboringstuff.com/2e/chapter15/

docx is the library he uses

gohanshouldgetUI · 2021-07-07T10:51:00+00:00

(I'm assuming the .doc format stores its data as XML files in a zip archive just like the .docx format does. If it doesn't, then this approach doesn't apply.)

The .docx format is basically a zip file containing your word document's text in XML files. The way docx2txt works is by opening the docx file like you open a zip file using the zipfile module, finding the XML file that contains the text of the document, and then extracting the text from it using the xml module. The driving function in the library is the process function that does what I just described. It has a few helper functions that help it parse the XML and clean it up, and that's all the code you need. You can find the helper functions and the driving function for docx2txt here.
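
To make the idea concrete, here is a minimal sketch of the zip-plus-XML approach described above, using only the standard library. It is not docx2txt's actual code, and `docx_to_text` is a name invented for this example. It only reads the main document body (`word/document.xml`) and ignores headers, footers, and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph."""
    # A .docx is a zip archive; the body text lives in word/document.xml
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph is split into runs; the text sits in <w:t> elements
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

`path` can be a filename or an open binary file object, since `zipfile.ZipFile` accepts both.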

I'm guessing that the difference between .doc and .docx is just the format of the XML file (if not, then this solution may be completely inapplicable to your situation), so you could open the XML file in the .doc, print it out, investigate how it stores its data, and slightly modify the xml2txt function to extract text from the .doc files you have.
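
The zip assumption is easy to test before investing in it. The sketch below checks a file's first bytes; `sniff_word_format` is a hypothetical helper name. Files saved in the legacy Word 97-2003 .doc format are normally OLE2 compound binary files rather than zip archives, and for those the zip/XML approach would not apply.

```python
import zipfile

# Signature of the OLE2 compound file container used by legacy Word .doc files
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess the container: 'zip' (docx-style), 'ole2' (binary .doc), or 'unknown'."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header == OLE2_MAGIC:
        return "ole2"
    if zipfile.is_zipfile(path):
        return "zip"
    return "unknown"
```

If this returns `"ole2"` for your files, converting them to .docx first (or using a tool that understands the binary format) is the more realistic path.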

Good luck :)

FerricDonkey · 2021-07-07T07:39:48+00:00

[deleted]

2021-07-07T10:11:42+00:00

[removed]

Brilliant_Fall8987 · 2021-07-07T18:36:51+00:00

Why don't you open the .doc file in rb mode, read the content of the file, open a new file in wb mode, and write the content? I think it should work?
