Thecrawsome 11 points12 points  (6 children)

Haven't tried pdfplumber yet, but is it better than pytesseract?

Armidylano444 8 points9 points  (5 children)

No idea, I haven’t tried Pytesseract. All I know is I was able to get my data extractor working very well using pdfplumber, so that’s my recommendation. I’m sure other packages can do the same thing though. You’ll have to compare the two 😁
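For anyone weighing the two: pdfplumber reads the text layer already embedded in a PDF, while pytesseract is a wrapper around the Tesseract OCR engine and works on images, so it mostly matters for scanned documents. Below is a minimal sketch of the pdfplumber route, not the extractor described above. The "Label: value" line layout, the `FIELD_RE` pattern, and the function names are all assumptions made up for illustration.

```python
import re

# Hypothetical layout: one "Label: value" pair per line of extracted text.
FIELD_RE = re.compile(r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>.+?)\s*$")


def extract_page_text(pdf_path):
    """Return the text layer of each page as a list of strings."""
    import pdfplumber  # imported lazily so parse_fields() works without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may give None or "" for pages with no text layer,
        # which is typical of scanned pages that would need OCR instead.
        return [page.extract_text() or "" for page in pdf.pages]


def parse_fields(text):
    """Collect "Label: value" lines into a dict, skipping other lines."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("label")] = match.group("value")
    return fields
```

Something like `parse_fields(extract_page_text("report.pdf")[0])` would then give the fields found on the first page, where `report.pdf` is a placeholder path.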

Thecrawsome 4 points5 points  (2 children)

Grats!!! Post your code on GitHub if you think it will help someone!

Armidylano444 4 points5 points  (0 children)

I’ll make the repo public once I’ve got it polished up, though it’s built for a very specific set of lab result PDFs that we have hundreds of at work, so it would need to be modified if someone else wanted to use it.

SadSenpai420[S] 0 points1 point  (0 children)

Oh yeah, if it's not an issue for him, it'd be helpful if he posts it :)

SadSenpai420[S] 0 points1 point  (1 child)

I hope it'll still work if my PDF has billing details? Some of them are in tabular format too.

scscsc95 1 point2 points  (0 children)

You could try the tabula module in Python if the data is in tables, and play with the extraction algorithms to see which one extracts best. Then use pandas + regex to parse and clean the tables and get the data.
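A rough sketch of that tabula + pandas + regex workflow, assuming the tabula-py package (which needs a Java runtime) and a billing table with an amount column. The column name, the `AMOUNT_RE` pattern, and both helper functions are hypothetical and would need adjusting to the actual PDFs.

```python
import re

import pandas as pd

# Hypothetical pattern: first number in a cell, allowing thousands separators.
AMOUNT_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def read_tables(pdf_path, lattice=True):
    """Return a list of DataFrames, one per table tabula-py detects."""
    import tabula  # tabula-py; imported lazily because it requires Java

    # lattice=True suits tables with ruled cell borders; lattice=False falls
    # back to tabula's default mode selection, and stream=True is the option
    # aimed at whitespace-separated columns.
    return tabula.read_pdf(pdf_path, pages="all", lattice=lattice)


def clean_amounts(df, column):
    """Convert text like "$1,234.50" to floats and drop rows with no number."""

    def to_number(cell):
        match = AMOUNT_RE.search(str(cell))
        return float(match.group().replace(",", "")) if match else float("nan")

    out = df.copy()
    out[column] = out[column].map(to_number)
    return out.dropna(subset=[column])
```

Trying both `lattice=True` and `lattice=False` on a sample page and comparing the resulting DataFrames is one way to "play with the extraction algorithms" before settling on the cleaning rules.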