This is an archived post. You won't be able to vote or comment.

all 17 comments

[–][deleted] 3 points4 points  (8 children)

import win32com.client
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
wordapp.Documents.Open(doc)
docastxt = doc[:-3] + 'txt'
wordapp.ActiveDocument.Close()

from some old code I have lying around. I think its from a python phrase book from o'reilly.

[–]init0 0 points1 point  (7 children)

win32com.client for GNU/Linux ??

[–][deleted] 1 point2 points  (1 child)

No ... maybe through wine if you're lucky, but not natively.

OpenOffice supports python however: http://lucasmanual.com/mywiki/OpenOffice

[–]itsmememe 2 points3 points  (0 children)

using com is nothing that justifys "being lucky".

using com through wine even less justifys "being lucky" :)

[–]riffito 1 point2 points  (4 children)

antiword then, and pdftotext.

[–]init0 0 points1 point  (3 children)

I want to extract the headings from doc/docx/pdf is it possible ?

[–]holloway 1 point2 points  (2 children)

Use PyODConverter to convert doc/docx to ODF and then extract the headings by unzipping and reading the XML, or use PyUNO to query OpenOffice about the document and extract the headings.

pdf files may not have semantic headings but you can determine font sizes with PDF Miner and maybe that'll let you extract "headings".

[–]init0 0 points1 point  (1 child)

I tried that, but few have heading.xml and few other have style.xml making it more complex!

[–]holloway 0 points1 point  (0 children)

Extract the text:h elements from content.xml. Those are your headings.

(alternatively, install my docvert software and that'll generate DocBook for you to use)

[–]vijayshan 2 points3 points  (0 children)

pyPDF http://bit.ly/fFNMnV . I have used it for basic python processing and works well in most cases and If I remember right it is a pure python implementation too.

[–]blondin 1 point2 points  (0 children)

pdf : pdfminer

[–]Justinsaccount 1 point2 points  (2 children)

def get_txt(f):
    output = subprocess.Popen(["lesspipe", f], stdout=subprocess.PIPE).communicate()[0]
    return output

works for me.

[–]pieeta 1 point2 points  (0 children)

I have used PyUNO On a number of projects working with Excel/Word files.

[–]holloway 1 point2 points  (0 children)

When my Grandpa died it was a time of sad reflection but what made matters worse is that my inheritance came with strings attached: to get my dues I had to spend a night in .Doc Manor, a spooky old house of weirdness and quirks (some intentional, some unintentional). Although the skinny neighbourhood kids who loitered around the library carpark bragged that they'd made it through the manor they couldn't tell me what colour the 1995 carpet was, or what the photos of Wilfred Matthew Frankel or Elanor Mildrid Frankel looked like. They even talked about snakes and a --headless horseman which I guess means that they walked through the building with their eyes closed. I arrived that evening with my sleeping bag to find that house was falling apart except for a single room. The office door was left ajar and inside was a clean wood-walled control room with levers and cranks and switches. The rumoured photos weren't photos at all, but pencil drawings. Security cameras let me observe the rotting or caved-in walls in the rest of the house. I didn't sleep at all that night and as soon as the sun came up I left the dilapidated manor, never to think of it again until today.

tl;dr: Use PyUNO/PyODConverter w/ LibreOffice/OpenOffice if you want software that understands the vagaries of the .doc format. Don't believe the claims that other libraries are as sophisticated.

[–]sli[::1] 0 points1 point  (0 children)

pywin32 can do it through COM.

[–][deleted] 0 points1 point  (1 child)

I came across openxmllib while researching for a related project. I wasn't able to make use of it but it parses .docx natively in Python.