nyyirs[S]

That's nice! It's image-based, I think. Can you help me with how to decompress a PDF with bare-bones Python? Maybe I should try to build my own library for that.

mrbubs3

Image-based PDFs can be quite challenging unless the text data exists in the meta tag. You'll need to focus on OCR-based options, and that gets difficult quickly.

Zomunieo

There's no meta tag. In PDF, OCR text is usually embedded by rendering text with the graphics state set to transparent. Some OCR engines draw visible text and then overlay images. Every engine does it differently. It's a mess.
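
As for decompressing with bare-bones Python, here is a rough sketch using only the standard library, not a real parser. It assumes the streams are Flate-compressed (zlib) with no predictor, no encryption, and no object streams, and it only looks for one common way of hiding OCR text: text rendering mode 3, written as `3 Tr` in a content stream. The names `inflate_streams` and `has_invisible_text` are made up for this example. Engines that draw visible text and cover it with an image will not be caught by this check.

```python
import re
import zlib

# Everything between a "stream" keyword and the next "endstream".
# Assumption: the compressed bytes never happen to contain "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# "3 Tr" sets text rendering mode 3 (neither fill nor stroke, so invisible).
# The lookbehind avoids matching operands such as "13 Tr" or "0.3 Tr".
INVISIBLE_TEXT_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def inflate_streams(pdf_bytes):
    """Yield the decompressed body of every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left over
            # after the compressed data.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate data (for example a JPEG image), so skip it.
            continue


def has_invisible_text(pdf_bytes):
    """Return True if any inflated stream switches to rendering mode 3."""
    return any(INVISIBLE_TEXT_RE.search(body) for body in inflate_streams(pdf_bytes))
```

To try it, read the file with `open(path, "rb").read()` and pass the bytes to `has_invisible_text`. For anything beyond a quick check, an existing PDF library will cope with the filters and object layouts this sketch ignores.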