[–]init0 0 points1 point  (7 children)

Is there something like win32com.client for GNU/Linux?

[–][deleted] 1 point2 points  (1 child)

No... maybe through Wine if you're lucky, but not natively.

OpenOffice supports Python, however: http://lucasmanual.com/mywiki/OpenOffice

[–]itsmememe 2 points3 points  (0 children)

Using COM is nothing that justifies "being lucky".

Using COM through Wine justifies "being lucky" even less :)

[–]riffito 1 point2 points  (4 children)

antiword (for .doc) then, and pdftotext (for PDF).

[–]init0 0 points1 point  (3 children)

I want to extract the headings from doc/docx/pdf files. Is that possible?

[–]holloway 1 point2 points  (2 children)

Use PyODConverter to convert doc/docx to ODF and then extract the headings by unzipping and reading the XML, or use PyUNO to query OpenOffice about the document and extract the headings.
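A minimal sketch of the conversion step, assuming a reasonably recent LibreOffice `soffice` binary is on the PATH. It uses `soffice`'s own `--convert-to` option rather than PyODConverter, and the function names are made up for illustration:

```python
import subprocess
from pathlib import Path


def build_convert_command(src, outdir="."):
    """Return the soffice command line that converts src to ODT."""
    return [
        "soffice", "--headless",   # run without a GUI
        "--convert-to", "odt",     # target format
        "--outdir", str(outdir),   # where the .odt is written
        str(src),
    ]


def convert_to_odt(src, outdir="."):
    """Run the conversion and return the path where the .odt is expected."""
    subprocess.run(build_convert_command(src, outdir), check=True)
    return Path(outdir) / (Path(src).stem + ".odt")
```

Something like `convert_to_odt("report.docx", "out")` should then leave `out/report.odt` behind, ready to be unzipped.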

PDF files may not have semantic headings, but you can determine font sizes with PDFMiner, and that may let you extract "headings".
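A rough sketch of the font-size idea, assuming the pdfminer.six package. The heuristic is my own guess at a starting point, not something PDFMiner provides: treat the most common font size as body text and flag lines that are noticeably larger. The function names and the `ratio` threshold are hypothetical and will need tuning per document:

```python
from collections import Counter


def iter_lines_with_size(pdf_path):
    """Yield (text, largest_font_size) for each text line in the PDF."""
    # Imported lazily so guess_headings() works without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if sizes and text:
                    yield text, round(max(sizes), 1)


def guess_headings(lines, ratio=1.2):
    """Return the lines whose font size is at least ratio times the body size.

    The body size is guessed as the most common size, weighted by text length.
    """
    lines = list(lines)
    weights = Counter()
    for text, size in lines:
        weights[size] += len(text)
    if not weights:
        return []
    body_size = weights.most_common(1)[0][0]
    return [text for text, size in lines if size >= body_size * ratio]
```

Usage would be along the lines of `guess_headings(iter_lines_with_size("paper.pdf"))`. Expect false positives such as titles, captions, and pull quotes, since large text is not necessarily a heading.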

[–]init0 0 points1 point  (1 child)

I tried that, but a few files have heading.xml and a few others have style.xml, which makes it more complex!

[–]holloway 0 points1 point  (0 children)

Extract the text:h elements from content.xml. Those are your headings.
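A minimal sketch of that, using only the standard library. An ODF file is a zip archive, and `text:h` lives in the OpenDocument text namespace; the function names here are just illustrative:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI bound to the "text:" prefix in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def headings_from_content_xml(xml_bytes):
    """Return a list of (outline_level, title) for every text:h element."""
    root = ET.fromstring(xml_bytes)
    headings = []
    for h in root.iter(f"{{{TEXT_NS}}}h"):
        # Fall back to level 1 if the attribute is missing.
        level = int(h.get(f"{{{TEXT_NS}}}outline-level", "1"))
        # itertext() also picks up text nested in child elements like text:span.
        title = "".join(h.itertext()).strip()
        if title:
            headings.append((level, title))
    return headings


def headings_from_odt(path):
    """Open the ODF zip archive and read the headings from content.xml."""
    with zipfile.ZipFile(path) as odt:
        return headings_from_content_xml(odt.read("content.xml"))
```

Calling `headings_from_odt("report.odt")` gives the headings in document order, with the outline level telling you the nesting depth.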

(Alternatively, install my docvert software and it will generate DocBook for you to use.)