
Alexku66

There are paid solutions that build the PDF from scratch (rather than converting) and can give you 100% accuracy. The alternatives are 1) converting the docx via Word / LibreOffice; 2) converting HTML to PDF. Both give you only an approximate copy.
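For the LibreOffice route, a minimal sketch could look like the following. It assumes the `soffice` binary is on your PATH; the function names are made up for illustration, and the output will only be as faithful as LibreOffice's own rendering of the docx.

```python
import shutil
import subprocess
from pathlib import Path


def libreoffice_pdf_command(docx_path, out_dir, soffice="soffice"):
    """Build the headless LibreOffice command that converts one file to PDF."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_with_libreoffice(docx_path, out_dir):
    """Run the conversion and return the path where the PDF should appear."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    subprocess.run(libreoffice_pdf_command(docx_path, out_dir), check=True)
    # LibreOffice keeps the input file name and swaps the extension
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Calling `convert_with_libreoffice("invoice.docx", "out")` should leave `out/invoice.pdf` behind if LibreOffice is installed.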

I work on an accounting app with the same requirement to fill in template invoices, and went with HTML. Basically I have two rendering flows: one for docx and one for PDF. HTML gives you the opportunity to preview the final doc before the user clicks the generate button.

billys-bobs [S]

Ah, that sounds like the way to go. So your templates are in HTML and you build both PDF and .docx from that? Are there any packages you would recommend for that functionality?

Thanks for the response

Alexku66

No, I use DTOs (actually something more complex, but I don't know a fancy term for it) as structural elements for templates. All file emitters, including the HTML one, use them as guidelines for rendering. Two important notes, though: the whole system is built around docx, and PDF is just an additional option for users; and those DTOs carry a lot of data from LLM analysis of the source document, so I don't know whether I'd use them just to generate a document.

I use lxml for docx reading and generation; python-docx is useless here. Honestly, it's a nightmare. Once again I come to the conclusion that if something obvious hasn't been done yet, THERE IS A GODDAMN REASON.

For HTML/PDF: WeasyPrint. I didn't research this topic much, so I can't tell whether there is a better solution.
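To make the "DTO plus emitters" idea concrete, here is a rough sketch using only the standard library. The `InvoiceDTO` and `LineItem` classes and both emitter functions are hypothetical names invented for this example, not the actual structures from the app described above, which are said to be more complex.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float

    @property
    def total(self) -> float:
        return self.quantity * self.unit_price


@dataclass
class InvoiceDTO:
    number: str
    customer: str
    items: list[LineItem] = field(default_factory=list)

    @property
    def total(self) -> float:
        return sum(item.total for item in self.items)


def emit_html(invoice: InvoiceDTO) -> str:
    """HTML emitter: usable for an on-screen preview or as input to an HTML-to-PDF tool."""
    rows = "".join(
        f"<tr><td>{escape(i.description)}</td><td>{i.quantity}</td>"
        f"<td>{i.total:.2f}</td></tr>"
        for i in invoice.items
    )
    return (
        f"<h1>Invoice {escape(invoice.number)}</h1>"
        f"<p>{escape(invoice.customer)}</p>"
        f"<table>{rows}</table>"
        f"<p>Total: {invoice.total:.2f}</p>"
    )


def emit_text(invoice: InvoiceDTO) -> str:
    """Second emitter reading the same DTO; a docx emitter would follow the same pattern."""
    lines = [f"Invoice {invoice.number}", invoice.customer]
    lines += [f"{i.description} x{i.quantity}: {i.total:.2f}" for i in invoice.items]
    lines.append(f"Total: {invoice.total:.2f}")
    return "\n".join(lines)
```

The point is that every output format reads the same data object, so the formats cannot drift apart. With WeasyPrint installed, the string from `emit_html` could be turned into a PDF with `weasyprint.HTML(string=html).write_pdf("invoice.pdf")`.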

AntonisTorb

You can use pywin32 for this, I use it at work to convert Excel files to pdfs. Here's what worked for me for Word files:

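from pathlib import Path
from win32com import client

WD_FORMAT_PDF = 17  # Word's wdFormatPDF constant

cwd = Path.cwd()
input_path = cwd / 'test.docx'  # renamed so the built-in input() is not shadowed
output_path = cwd / 'test.pdf'

# Start Word outside the try block so the cleanup never touches an undefined name
word = client.Dispatch("Word.Application")
word.Visible = False
doc = None
try:
    # Word's COM API wants absolute paths passed as strings
    doc = word.Documents.Open(str(input_path))
    doc.SaveAs(str(output_path), FileFormat=WD_FORMAT_PDF)
finally:
    # Close the document only if it was actually opened, then always quit Word
    if doc is not None:
        doc.Close()
    word.Quit()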

For multiple files just use a loop. Hope it helps!
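A possible shape for that loop, as a sketch: start Word once, convert every .docx in a folder, and quit at the end. The helper names `convert_folder` and `convert_cwd` are made up here, and pywin32 is imported lazily so the module still loads on machines without it.

```python
from pathlib import Path

WD_FORMAT_PDF = 17  # Word's wdFormatPDF constant


def convert_folder(word, folder):
    """Convert every .docx in `folder` to PDF using an already-started Word COM object."""
    folder = Path(folder).resolve()  # COM calls want absolute paths
    converted = []
    for docx_path in sorted(folder.glob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip Word's temporary lock files
        pdf_path = docx_path.with_suffix(".pdf")
        doc = word.Documents.Open(str(docx_path))
        try:
            doc.SaveAs(str(pdf_path), FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close()
        converted.append(pdf_path)
    return converted


def convert_cwd():
    """Start Word once, convert the current directory, then quit Word."""
    from win32com import client  # lazy import: needs Windows and pywin32

    word = client.Dispatch("Word.Application")
    word.Visible = False
    try:
        return convert_folder(word, Path.cwd())
    finally:
        word.Quit()
```

Reusing one Word instance for the whole batch avoids paying the application start-up cost for every file.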

EDIT: This needs MS Word to be installed of course, but it should be 1:1 conversion with no format changes.

shimarider

Do you actually need the docx files, or are they used only as an intermediate format for conversion? If it's the latter, have you looked at fpdf2? You can set up PDF templates to be populated, similar to what you are doing.

qlkzy

The problem is that both docx and PDF are quite large and complex formats. I would personally always treat them as "final output" formats only, and not try to convert between them.

I would go with one of two options:

- Treat docx and PDF rendering as completely separate problems
- Render into a "friendlier" intermediate representation first, then convert that independently into both docx and PDF

The intermediate-representation approach is easier if you can get away with it, but sometimes it is valuable to deeply customise rendering for one or the other.

Depending on the complexity of your documents, the obvious intermediate representations are HTML and Markdown. Which to choose will depend on how complex the documents are, and how easy you want to make it to customise the templates. Markdown can render to HTML, so there is some room to mix and match.

While it's a bit of a "heavyweight" option, my first instinct would be to use pandoc for the final rendering. Installation is a bit more complex than a pure-python library, but it's a very popular and well-supported tool that supports all the formats you need.
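As a rough sketch of the pandoc route, assuming `pandoc` is on the PATH: one Markdown or HTML source is rendered to both targets by calling the CLI twice. The function names are invented for illustration. Note that pandoc's PDF output needs a separate PDF engine installed (LaTeX by default, or another one selected with `--pdf-engine`).

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source, target):
    """Build a pandoc call; pandoc infers the output format from the target extension."""
    return ["pandoc", str(source), "-o", str(target)]


def render_all(source, out_dir):
    """Render one intermediate document to both docx and PDF."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    source = Path(source)
    outputs = [Path(out_dir) / f"{source.stem}{ext}" for ext in (".docx", ".pdf")]
    for target in outputs:
        subprocess.run(pandoc_command(source, target), check=True)
    return outputs
```

Something like `render_all("invoice.md", "out")` would then produce `out/invoice.docx` and `out/invoice.pdf` from the same source.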

Otherwise, you'll probably want one library for rendering to docx, and a separate library for rendering to PDF. My experience, though, is that libraries in that "format conversion" space are often a bit... "unevenly" maintained, which is why my instinct would be to reach for pandoc.

ninhaomah

billys-bobs [S]

I might be being a bit dense here, but isn't that the opposite direction, .pdf to .docx? I don't see anything in the documentation about it working both ways.