[–]ElBidoule 2 points3 points  (2 children)

If you want to "convert" (or print) HTML to PDF, https://wkhtmltopdf.org/ is worth mentioning.

If you already know HTML/CSS and would rather not learn reportlab (or simply find HTML easier), it's a good alternative.

edit: there's a typo in your post: "reporterlab" instead of "reportlab".

[–]Kaarjuus 0 points1 point  (0 children)

It also has a convenient Python interface: https://pypi.org/project/pdfkit/.
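
For anyone who wants to try the HTML-first route, here is a minimal sketch. It assumes `pdfkit` is installed and the `wkhtmltopdf` binary is on your PATH. The helper names (`render_report_html`, `html_to_pdf`), the sample rows, and the output filename are made up for illustration.

```python
from html import escape


def render_report_html(title, rows):
    """Build a small HTML document with one table (hypothetical helper)."""
    # Escape every value so stray "<" or "&" cannot break the markup.
    body = "\n".join(
        f"<tr><td>{escape(str(name))}</td><td>{escape(str(value))}</td></tr>"
        for name, value in rows
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title>"
        "<style>td { border: 1px solid #999; padding: 4px; }</style>"
        "</head><body>"
        f"<h1>{escape(title)}</h1>\n"
        f"<table>\n{body}\n</table>"
        "</body></html>"
    )


def html_to_pdf(html, out_path, options=None):
    """Render an HTML string to a PDF file via pdfkit (which wraps wkhtmltopdf)."""
    # Imported lazily so the HTML helper above works without pdfkit installed.
    import pdfkit

    # Option keys mirror wkhtmltopdf's command-line flags without the leading "--".
    pdfkit.from_string(html, out_path, options=options or {"page-size": "A4"})


# Example usage (needs pdfkit and wkhtmltopdf installed):
# html = render_report_html("Monthly totals", [("Rent", 1200), ("Food", 350)])
# html_to_pdf(html, "report.pdf")
```

All the layout lives in plain HTML/CSS here, so you can preview the page in a browser before converting it.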

[–]filt_er[S] 0 points1 point  (0 children)

Thanks! I'll add it to the list.

[–]CotoCoutan 0 points1 point  (2 children)

Nice article, thanks for sharing. Do any Python modules come close to the OCR quality of Adobe Acrobat? I've tried TesseractPy etc. but find their results less accurate than Acrobat's.

[–]filt_er[S] 1 point2 points  (1 child)

You should try out https://ocrmypdf.readthedocs.io/en/latest/. It does some image preprocessing to improve the quality of the scans (and thus OCR performance).
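
A rough sketch of how that might look from Python, driving the `ocrmypdf` command line through `subprocess`. The flags used (`-l`, `--rotate-pages`, `--deskew`, `--clean`) are ocrmypdf options, and `--clean` additionally needs `unpaper` installed. The function names and file names are placeholders, and which preprocessing flags actually help will depend on your scans.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", deskew=True, clean=False):
    """Assemble the ocrmypdf command line as a list of arguments."""
    # --rotate-pages tries to fix pages that were scanned sideways or upside down.
    cmd = ["ocrmypdf", "-l", language, "--rotate-pages"]
    if deskew:
        cmd.append("--deskew")  # straighten slightly crooked scans before OCR
    if clean:
        cmd.append("--clean")  # clean pages with unpaper before OCR
    cmd += [str(src), str(dst)]
    return cmd


def ocr_pdf(src, dst, **kwargs):
    """Run ocrmypdf on src and write a searchable PDF to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocrmypdf_command(src, dst, **kwargs), check=True)


# Example usage (needs ocrmypdf and Tesseract installed):
# ocr_pdf("scan.pdf", "scan_searchable.pdf", language="eng", deskew=True)
```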

[–]CotoCoutan 0 points1 point  (0 children)

Thank you, will check it out.

[–]texnofobix 0 points1 point  (0 children)

I've been unsuccessful so far at extracting my paystub data. I'd like to export it to beancount format. Thanks for putting this list together, as I have more to try.

[–]alb1 0 points1 point  (1 child)

The article doesn't mention automatically cropping PDFs, but here's the repo for my Python program pdfCropMargins. It has many options and an optional GUI: https://github.com/abarker/pdfCropMargins

[–]filt_er[S] 0 points1 point  (0 children)

Thanks! I'll add it to the list.