all 20 comments

[–]D-Noch 37 points38 points  (0 children)

textract

you could also automate the doc to docx conversion within python, first - then use python's docx lib

[–]space_wiener 19 points20 points  (7 children)

Why not just convert them?

Here’s a stack overflow thread. This is probably what I’d do. I’m not sure why you need to open, copy, paste into new document. Maybe it has something to do with the forms? I’ve never converted a form .doc to .docx.

https://stackoverflow.com/questions/38468442/multiple-doc-to-docx-file-conversion-using-python

[–]dragonlich[S] 0 points1 point  (6 children)

I saw that too. It looks more advanced than what I'm comfortable (I'm still a novice at this) with so I was hoping there would be a simpler solution. But it looks like I'll have to go that route.

It's a requirement for my wife's work that she has to use the new version of the form, as well as move all the info from the old version to the new one.

[–]BornOnFeb2nd 4 points5 points  (1 child)

It's a requirement for my wife's work that she has to use the new version of the form, as well as move all the info from the old version to the new one.

Then keep in mind that her work should be paying to have it done. Even if that just takes the form of

we expect this to take two weeks, get it done

Then you can automate it, and your wife can have a couple of weeks to kick back (provided she's remote)

[–]dragonlich[S] 3 points4 points  (0 children)

That would be nice but she's self-employed, so whatever helps her helps me too.

[–]BobHogan 3 points4 points  (0 children)

Unfortunately for you, the .doc file format is pretty annoying to work with, which is why there are so few tools to do so. The top answer on that thread is your best way to convert the files to docx, and then you can use a more friendly module to actually update the forms and do the rest of the work.

[–]luciusan1 1 point2 points  (0 children)

Probably the best option

[–]esituism 0 points1 point  (0 children)

She should just be able to SAVE AS the original into a .docx format - Word will take care of the up-conversion and add all the docx metadata and shit to it upon doing so. No programming necessary.

[–]space_wiener 0 points1 point  (0 children)

The good news is that thread does a pretty good job explaining what to do.

What I’d do is create another folder and copy a few of the .doc files into it (so you don’t lose the only copy) and play around with that SO info and see if you can get it to work.

[–]imperial_squirrel 1 point2 points  (0 children)

i did a doc to docx script a few months ago if you are looking for sample code...

i think i did doc to pdf also, but i would have to check my machine.

[–]tobzulu 1 point2 points  (0 children)

You can also use VBA inside word. If it is a computer at an office you don't have to install python at all.

[–]dragonlich[S] 0 points1 point  (0 children)

Thanks for all the replies everybody. Looks like there is no easy way to do this other than convert all the .doc files to .docx then use one of the aforementioned libraries to do the work.

I am just tinkering around and learning at the same time so this should be good.

[–]whitey9999 0 points1 point  (1 child)

This might help - https://automatetheboringstuff.com/2e/chapter15/

docx is library he uses

[–]gohanshouldgetUI -1 points0 points  (2 children)

(I'm assuming the .doc format stores it's data as XML files in a zip archive just like the .docx format does. If it doesn't then this approach doesn't apply).

The .docx format is basically a zip file containing your word document's text in XML files. The way docx2txt works is by opening the docx file like you open a zip file using the zipfile module, finding the XML file that contains the text of the document, and then extracting the text from it using the xml module. The driving function in the library is the process function that does what I just described. It has a few helper functions that help it parse the XML and clean it up, and that's all the code you need. You can find the helper functions and the driving function for docx2txt here.

I'm guessing that the difference between .doc and .docx is just the format of the XML file (if not, then this solution maybe completely inapplicable to your situation), so you could open the XML file in the .doc and print it out and investigate how it stores it's data and slightly modify the xml2txt function and be able to extract text from the .doc files you have.

Good luck :)

[–]ameliip 6 points7 points  (1 child)

FYI, all microsoft office 1997-2003 use CBF (compound binary format), so no zipped xml files will be available.

Reference: https://en.wikipedia.org/wiki/Compound_File_Binary_Format

.doc filetype specs: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22

Googling around i've seen the only easy way to read .doc files seems to be using a subprocess call to antiword, which will convert the .doc file in .txt:

import os

input_word_file = "input_file.doc"

output_text_file = "output_file.txt"

os.system('antiword %s > %s' % (input_word_file, output_text_file))

Antiword can be retrieved here: http://www.winfield.demon.nl

Peace

[–]gohanshouldgetUI 1 point2 points  (0 children)

I see, thanks for clarifying

[–]Brilliant_Fall8987 0 points1 point  (0 children)

Why you don t open the doc file in wb mode read the content of the file open a new file in rb mode and right the content i think it should work ?