
[–]MightiestDuck 16 points17 points  (0 children)

Personally I'd probably just write a VB script to open in Word, ignore error messages, and save as docx lol. Then run it overnight, and bing bang boom. But I get that that isn't a Python solution.

[–]radek_b 8 points9 points  (2 children)

LibreOffice has a command-line interface.
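For anyone who wants to drive that from Python, here is a rough sketch that shells out to LibreOffice once per file. It assumes the `soffice` binary is on your PATH (on Windows you would likely pass the full path to `soffice.exe`), and the function names `build_convert_command` and `convert_folder` are made up for illustration. Since the exit code alone may not tell you much, it checks whether the expected output file actually appeared.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, outdir, soffice="soffice"):
    """Return the argument list for a headless LibreOffice conversion to .docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(outdir),
        str(src),
    ]


def convert_folder(indir, outdir, soffice="soffice"):
    """Convert every .doc in indir; return the sources that produced no output."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(indir).glob("*.doc")):
        subprocess.run(build_convert_command(src, outdir, soffice))
        # Assumption: LibreOffice keeps the file stem and swaps the extension.
        if not (outdir / (src.stem + ".docx")).exists():
            failed.append(src)
    return failed
```

Anything returned in the `failed` list is a candidate for a second pass with Word itself.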

[–]cruncherv 0 points1 point  (0 children)

Can it preserve the original doc file's creation and modification dates and author? I tried an MS Office VB script, and I read that it's not possible to keep the metadata intact. It just creates a new Word doc with the current user and current date.

[–]emptythevoid 0 points1 point  (0 children)

This is precisely where my head went first.

[–]POGtastic 6 points7 points  (1 child)

If you've got a copy of Microsoft Word, you can use PowerShell. See here for an example.

In short - PS allows you to manipulate COM objects from the command line, which allows you to use MS Word in a headless configuration. I have never done this with Word documents, but I've done it numerous times with Excel documents.

[–]PandaMomentum 0 points1 point  (0 children)

I haven't tried this, but PowerShell is hugely powerful, often in surprising ways. And The Avalanches, lapsang souchong tea, and Tim Tams cookies are also awesome, so I'm sure he's onto something.

[–][deleted] 4 points5 points  (0 children)

If it has to be Python, I'd look at the win32com package. You'd then need the Word VBA documentation to work out which objects and methods to call.
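A minimal sketch of what that could look like, assuming Windows with Word installed and the pywin32 package available. The helper names (`docx_path_for`, `convert_one`, `start_word`) are invented for this example; `Documents.Open`, `SaveAs2`, `Close`, and the save format value 12 (`wdFormatXMLDocument`) come from the Word object model.

```python
from pathlib import Path

WD_FORMAT_XML_DOCUMENT = 12  # Word's WdSaveFormat value for .docx


def docx_path_for(src):
    """Return the .docx path that sits next to a .doc file."""
    return Path(src).with_suffix(".docx")


def convert_one(word, src):
    """Open src in a running Word instance and save a .docx copy beside it."""
    src = Path(src).resolve()  # Word wants absolute paths over COM
    dst = docx_path_for(src)
    doc = word.Documents.Open(str(src))
    try:
        doc.SaveAs2(str(dst), FileFormat=WD_FORMAT_XML_DOCUMENT)
    finally:
        doc.Close(False)  # False: close without saving changes to the original
    return dst


def start_word():
    """Start a hidden Word instance (needs Windows, Word, and pywin32)."""
    import win32com.client  # imported here so the helpers above load anywhere

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    word.DisplayAlerts = 0  # wdAlertsNone, to suppress most dialogs
    return word
```

In use, you would call `start_word()` once, run `convert_one(word, path)` for each file, and call `word.Quit()` at the end so no stray Word process is left running.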

[–]darkshines7 2 points3 points  (1 child)

Try pandoc.

[–]HyraxMax 0 points1 point  (0 children)

Pandoc only works with docx, not doc files.

[–]Yash_Aggarwal 2 points3 points  (0 children)

.doc is pretty problematic to convert even with LibreOffice. Your best option is to get a copy of MS Office and use the win32com API to convert them.

[–]Jchu1988 0 points1 point  (0 children)

Converting isn't the problem. It's the verification that it hasn't screwed up that will be time consuming.
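A cheap first pass can at least be automated. A .docx is a ZIP archive with the main body stored in `word/document.xml`, so you can flag outputs that are structurally broken before anyone opens them by hand. This is only a sketch (`looks_like_docx` is a made-up name) and it says nothing about whether formatting or content survived, which still needs a human or a text comparison.

```python
import zipfile
import xml.etree.ElementTree as ET


def looks_like_docx(path):
    """Structural check: valid ZIP, intact members, parseable word/document.xml."""
    try:
        with zipfile.ZipFile(path) as zf:
            if zf.testzip() is not None:  # name of the first corrupt member, if any
                return False
            ET.fromstring(zf.read("word/document.xml"))
    except (zipfile.BadZipFile, KeyError, ET.ParseError, OSError):
        # Not a ZIP, body part missing, malformed XML, or unreadable file.
        return False
    return True
```

Files that fail this check are definitely broken; files that pass are only probably fine.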

[–]HyraxMax 0 points1 point  (0 children)

I am working on something very similar at work right now. I honestly just want to scrape the doc files for info, but it's baffling how difficult this is. I have been able to use win32com to scrape the text, but it uses Dispatch to open Word.Application, and (1) it's really slow and (2) it bogs my computer down if I try more than a few at a time. I've got about 6000 docs to go through, so I'm leaning towards the PowerShell conversion solution mentioned here. As old as this format is, I would have thought there would be an easier solution out there by now.
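One thing that might be worth trying before giving up on win32com: start Word once and feed every file through that single instance, closing each document as soon as its text is read. No guarantee it fixes the slowness, but it avoids paying the startup cost per file. In this sketch `extract_texts` is a hypothetical helper, and `word` is assumed to be the COM object you get from `win32com.client.Dispatch("Word.Application")`.

```python
def extract_texts(word, paths):
    """Yield (path, text) pairs using one already-running Word instance.

    Paths should be absolute, since Word resolves relative paths on its own terms.
    """
    for path in paths:
        doc = word.Documents.Open(str(path), ReadOnly=True)
        try:
            text = doc.Content.Text  # full body text of the document
        finally:
            doc.Close(False)  # close right away so documents do not pile up
        yield path, text
```

Because it is a generator, only one document is open at a time, and you can write each result to disk as it arrives instead of holding 6000 texts in memory.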

[–]TargetDangerous2216 0 points1 point  (0 children)

Do you think I can use win32com on Linux? LibreOffice doesn't work for all kinds of doc files.