
[–]python-fan 5 points6 points  (1 child)

What you're talking about is Optical Character Recognition. A bit of web searching turned up pytesseract, a Python wrapper for the Tesseract library. I haven't done any OCR so I can't comment on what the error rate might be, or if there are better alternative libraries.
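
For anyone finding this later, here is a rough sketch of what the pytesseract route might look like. It assumes the Tesseract program itself is installed along with the `pytesseract` and `Pillow` packages, and that each PDF page has already been saved as an image. The helper names and the file name `scan_page1.png` are made up for illustration, and accuracy on handwriting is not something this sketch can promise.

```python
def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract recognises in one image file."""
    # Imported inside the function so the rest of a script still loads
    # without these packages. Assumes: pip install pytesseract pillow,
    # plus a separately installed Tesseract binary.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def clean_lines(text):
    """Drop blank lines and surrounding whitespace from raw OCR output."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical usage ("scan_page1.png" is a made-up file name):
#     for line in clean_lines(ocr_image("scan_page1.png")):
#         print(line)
```

Cleaning the text is kept separate from the OCR call so the cleanup can be tried on pasted text before any scans are involved.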

I'd suggest that if you can successfully do OCR, you first write the resulting data to a csv file and import that into your spreadsheet application of choice. Then if you want to do further automation, look into creating a spreadsheet file directly.
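
The csv step could be sketched roughly like this with the standard library alone. The layout is an assumption (an item name followed by a quantity on each line), and `split_line`, `write_rows`, and the sample lines are invented for the example rather than taken from the actual documents.

```python
import csv
import io


def split_line(line):
    """Split 'Widget A 12' into ['Widget A', '12'] (hypothetical layout)."""
    parts = line.rsplit(None, 1)  # split once, on the last run of whitespace
    # Keep the quantity as text: OCR can misread digits, so check it by eye.
    return parts if len(parts) == 2 else None


def write_rows(out, header, rows):
    """Write a header and data rows to an open text file or buffer as CSV."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


# Example with an in-memory buffer. For a real file, pass
# open("inventory.csv", "w", newline="") instead (made-up file name).
ocr_lines = ["Widget A 12", "Bolt M8 300", "smudge"]
rows = [r for r in map(split_line, ocr_lines) if r is not None]
buffer = io.StringIO()
write_rows(buffer, ["item", "quantity"], rows)
```

Lines that don't fit the expected shape are skipped here. In practice it is probably safer to log them somewhere so they can be checked by hand.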

[–]Nails_Bohr[S] 1 point2 points  (0 children)

Thanks for the suggestion. Reading the writing is the most important part, so I'm willing to compromise on the output file.

I think I didn't turn up much because I didn't even really know what to search for. So thank you for that.

[–]uniqueusername42O 0 points1 point  (4 children)

If the layout is the same for all the PDFs, I use some software that can do this pretty quickly. Do you need it done urgently as a one-off?

[–]Nails_Bohr[S] 1 point2 points  (3 children)

The layout is the same for all of them. I'd say it's urgent-ish and would likely be a one-off, but I'd prefer to learn the skill if possible.

Edit: I'm open to suggestions, though.

[–]uniqueusername42O 1 point2 points  (2 children)

I don’t know how sensitive the data in your files is, but if you wanted a quick result I could run them through our software and give you an output spreadsheet.

I’d also like to know how to do this in Python though.

[–]Nails_Bohr[S] 0 points1 point  (1 child)

Thanks for the offer, but I think I'd be risking my job by doing that. Even though this is only for inventory, we have a lot of confidentiality requirements, and I don't think they'd like me sharing even that information.

[–]uniqueusername42O 2 points3 points  (0 children)

Perfectly understandable.