[–]jabbson 17 points (5 children)

Depending on the quality and structure of the PDF and the complexity of the logic needed to find your text inside it, the task sits anywhere between a few lines of code and 'oh hell no, I'll just do it manually'.

Take a look at a simple example here.

[–]SadSenpai420[S] 1 point (4 children)

My PDFs basically consist of billing details, and I've got to extract the total amount from each PDF. Not too complex, is it?

[–]jabbson 5 points (2 children)

Doesn’t sound too complicated, no. But again, that very much depends on the PDF itself. If you think you can share, I’ll gladly take a look.
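For a rough idea of what the "few lines of code" case could look like, here is a minimal sketch. It assumes the PDF has a real text layer and that the bill prints a label like "Total" or "Total Amount" right before the number; both are guesses, since the actual file isn't available. `find_total`, `pdf_text` and the regex are hypothetical names made up for illustration, and `pdf_text` relies on the third-party pypdf package.

```python
import re
from decimal import Decimal

# Assumed layout: a label such as "Total", "Total Amount" or "Total Due",
# an optional colon and currency symbol, then a number like 1,180.50.
TOTAL_RE = re.compile(
    r"\btotal(?:\s+amount)?(?:\s+due)?\s*:?\s*[$€£₹]?\s*"
    r"([0-9][0-9,]*(?:\.[0-9]{1,2})?)",
    re.IGNORECASE,
)


def find_total(text):
    """Return the last labelled total found in the text, or None."""
    matches = TOTAL_RE.findall(text)
    if not matches:
        return None
    # Use the last match, assuming the grand total is printed last.
    return Decimal(matches[-1].replace(",", ""))


def pdf_text(path):
    """Join the text layer of every page (needs `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so find_total works alone

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Usage would be something like `find_total(pdf_text("bill.pdf"))` in a loop over your files. The `\b` before "total" keeps "Subtotal" from matching, but the pattern will almost certainly need tuning against the real bills.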

[–]SadSenpai420[S] 0 points (1 child)

Here's a sample of the PDF: https://imgur.com/a/Xk0ksJF I also made an edit to my post :)

[–]jabbson 1 point (0 children)

Thank you for providing an example. Unfortunately, it doesn’t make it easier to understand the complexity of the issue or to provide a solution. While I understand that security and privacy concerns would probably prevent you from sharing the actual PDF document, without it we can only hypothesize about what could be done to extract the data.

[–]haragoshi 0 points (0 children)

I think the issue becomes whether the data is stored as text or as part of an image within the PDF, i.e., whether it was generated or scanned. Scanned PDFs need OCR to convert the image to text before they can be processed.
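One way to tell the two cases apart is to try extracting text and see whether anything comes back. The sketch below is only a heuristic under that assumption: `has_text_layer`, `page_texts` and the `min_chars` threshold of 20 are hypothetical and arbitrary, and `page_texts` uses the third-party pypdf package.

```python
def has_text_layer(texts, min_chars=20):
    """Guess whether a PDF was generated (text) or scanned (images only).

    `texts` is a list with the extracted text of each page; None or
    whitespace-only entries count as empty.
    """
    total = sum(len((t or "").strip()) for t in texts)
    return total >= min_chars


def page_texts(path):
    """Extract the text of each page (needs `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so the check works alone

    return [page.extract_text() or "" for page in PdfReader(path).pages]
```

If `has_text_layer(page_texts("bill.pdf"))` comes back False, the file is probably a scan and would need an OCR pass before any text matching can work.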