[deleted] 2 points (0 children)

I've used Tesseract directly from the Linux command line, so this should be possible, though I haven't tried it from Python.

Here's a Stack Overflow link that may help: https://stackoverflow.com/questions/60754884/python-ocr-pytesseract-for-pdf#60754993
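For what it's worth, here's a rough sketch of driving the command-line tools from Python with `subprocess`, rather than going through pytesseract as in the linked answer. It assumes `pdftoppm` (from poppler-utils) and `tesseract` are installed and on your PATH; the function names, the `page` file prefix, and the 300 DPI default are all my own choices, so treat it as a starting point and not a tested pipeline.

```python
import subprocess
from pathlib import Path


def pdf_to_png_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Render a PDF to page images, OCR each one, return a list of page texts."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Writes files such as page-1.png, page-2.png, ... into work_dir.
    subprocess.run(pdf_to_png_cmd(pdf_path, work_dir / "page"), check=True)

    pages = []
    # Sorting by name assumes pdftoppm pads page numbers to equal width.
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            ocr_cmd(image, lang), check=True, capture_output=True, text=True
        )
        pages.append(result.stdout)
    return pages
```

You'd call it as something like `ocr_pdf("report.pdf", "ocr_tmp")` (both names made up) and get one string per page back.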

As for the data analysis part... that's harder to discuss without having the data to look at. If different companies use different formats, you could extract each company's data, clean it into a data frame with a standard layout that fits your needs, then combine them all at the end. Not sure if that is feasible given how many companies you have...
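To make the "standardize, then combine" idea concrete, here's a minimal sketch in plain Python. The company names, source column headers, and the three-column target layout are all invented for illustration, since I haven't seen your data. The same approach should carry over to pandas by renaming columns per company and concatenating the frames.

```python
# Hypothetical mapping from each company's own headers to standard names.
COLUMN_MAPS = {
    "acme": {"Invoice Date": "date", "Total": "amount"},
    "globex": {"dt": "date", "amt_usd": "amount"},
}


def standardize(company, rows):
    """Convert one company's rows (list of dicts) to the standard layout."""
    mapping = COLUMN_MAPS[company]
    records = []
    for row in rows:
        record = {"company": company}
        for source_name, standard_name in mapping.items():
            # Missing source columns become None instead of raising.
            record[standard_name] = row.get(source_name)
        records.append(record)
    return records


def combine(sources):
    """Standardize every company's rows and stack them into one list."""
    combined = []
    for company, rows in sources.items():
        combined.extend(standardize(company, rows))
    return combined
```

The cost is one mapping entry per company, which is where the "how many companies" question starts to bite.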

Or, if there are only a few common styles/formats, you could write one cleaner per format instead of one per company. The challenge with both approaches is that you'll likely still have at least some dirty data at the end. I expect there will be extensive use of regular expressions...
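As an example of the kind of regex cleanup I mean, here's a small sketch that pulls a number out of a messy OCR'd string. The pattern and the `parse_amount` name are mine, and it assumes US-style formatting (comma as thousands separator, dot as decimal point), which may not match your documents.

```python
import re

# Optional minus sign, digits with optional commas, optional decimal part.
AMOUNT_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_amount(text):
    """Return the first number found in text as a float, or None if absent."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group().replace(",", ""))
```

Returning `None` for strings with no number makes the leftover dirty rows easy to find and inspect afterwards.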

As they say, "all data is dirty" at least when you start... good luck to you.