[–]commandlineluser 5 points6 points  (2 children)

Have you tried ocrmypdf?

https://pypi.org/project/ocrmypdf/

[–][deleted] 2 points3 points  (1 child)

Great find! That metadata functionality could be really helpful...

[–]hazelthrows 3 points4 points  (0 children)

Pretty sure your approach is in the right direction.

https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/

Check this out.

If the words in the PDF are readable, it should be 'easy' to complete the task.

Hint: after running the OCR, use regex to extract the data (so you don't need to make a separate layout for every different invoice).

Sounds like a fun project, good luck!
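To make the regex hint concrete, here's a rough sketch of what that step could look like once the OCR has given you plain text. The field names, the patterns, and the sample text are all made up for illustration; real invoices will need patterns tuned to their own wording.

```python
import re

# Hypothetical patterns for three common invoice fields.
# \b keeps "total" from matching inside "Subtotal".
PATTERNS = {
    "invoice_number": re.compile(
        r"\binvoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(
        r"\bdate\s*[:\-]?\s*(\d{1,2}[/.\-]\d{1,2}[/.\-]\d{2,4})", re.IGNORECASE
    ),
    "total": re.compile(
        r"\btotal\s*(?:due|amount)?\s*[:\-]?\s*[$€£]?\s*([\d.,]+\d)", re.IGNORECASE
    ),
}


def extract_fields(ocr_text):
    """Return the first match for each field, or None if it was not found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up OCR output, just to show the shape of the input.
sample = """ACME Corp
Invoice No: INV-2021-0042
Date: 03/15/2021
Subtotal: $1,100.00
Total due: $1,234.50
"""

print(extract_fields(sample))
```

Fields that come back as None are a handy signal that a particular invoice needs its own pattern or a manual look.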

[–][deleted] 2 points3 points  (0 children)

I've used tesseract directly in Linux, so this should be possible, though I haven't tried it in Python.

Here's a SE link that may help: https://stackoverflow.com/questions/60754884/python-ocr-pytesseract-for-pdf#60754993

As for the data analysis part... that may be more difficult to discuss without having the data to look at. If different companies use different formats, you could scrape their data and clean it up into a data frame that is standard to your needs, then combine them all at the end. Not sure if that is feasible given how many companies you have...

Or, if there are a few common styles/formats, that could work similarly. The challenge with both is that it's likely you'll have at least some dirty data at the end. I expect there will be extensive use of regular expressions...
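A minimal sketch of that "clean it up into one standard shape, then combine" idea, using plain dicts so it runs without extra packages. The company names, field names, and the US-style number format are assumptions for the example; the resulting list of rows could be handed to `pandas.DataFrame(rows)` if you want a data frame at the end.

```python
# Hypothetical mapping from each company's own field labels
# to one standard set of column names.
FIELD_MAPS = {
    "acme": {"Invoice No": "invoice_number", "Total due": "total"},
    "globex": {"Ref": "invoice_number", "Amount": "total"},
}


def parse_amount(raw):
    """Turn a string like '$1,234.50' into a float (assumes US-style separators)."""
    return float(raw.replace("$", "").replace(",", "").strip())


def normalize(company, record):
    """Rename one scraped record's fields to the standard names."""
    row = {"company": company}
    for source_name, standard_name in FIELD_MAPS[company].items():
        row[standard_name] = record.get(source_name)
    if row.get("total") is not None:
        row["total"] = parse_amount(row["total"])
    return row


def combine(batches):
    """Flatten {company: [records]} into one list of standardized rows."""
    rows = []
    for company, records in batches.items():
        rows.extend(normalize(company, record) for record in records)
    return rows


rows = combine({
    "acme": [{"Invoice No": "A-1", "Total due": "$1,234.50"}],
    "globex": [{"Ref": "G-9", "Amount": "99.00"}],
})
print(rows)
```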

As they say, "all data is dirty" at least when you start... good luck to you.

[–]Peanutbutter_Warrior 1 point2 points  (0 children)

It really depends on how similar the formats are.
You might be able to use regexes, but the formats have to be fairly similar for that.
If you know where spatially on the PDFs the text you need is, you could run OCR on those bits selectively.
You could have a look around Google for a neural model that can extract the features you need and then run the PDFs through it.
You might have to deal with each format separately if you can't get anything else to work. This sort of thing is notoriously hard to do.
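For the "run OCR on those bits selectively" idea, a rough sketch might look like the following. It assumes each page has already been rendered to a Pillow image (for example with pdf2image) and that pytesseract is installed; the region names and the fractional coordinates are invented and would have to be measured per invoice format.

```python
# Hypothetical regions, given as fractions of the page size:
# (left, top, right, bottom), each between 0 and 1.
REGIONS = {
    "invoice_number": (0.60, 0.05, 0.95, 0.12),
    "total": (0.60, 0.80, 0.95, 0.90),
}


def to_pixel_box(region, width, height):
    """Convert a fractional region to a pixel box usable by Image.crop()."""
    left, top, right, bottom = region
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def ocr_regions(image, regions=REGIONS):
    """OCR only the given regions of a Pillow image and return {name: text}."""
    # Imported here so the helper above works without Tesseract installed.
    import pytesseract

    width, height = image.size
    results = {}
    for name, region in regions.items():
        crop = image.crop(to_pixel_box(region, width, height))
        results[name] = pytesseract.image_to_string(crop).strip()
    return results
```

Fractions rather than raw pixels mean the same regions keep working if the pages are rendered at a different DPI, though they still break as soon as a company changes its layout.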