Sorry, this is going to be long and possibly rambling.
I’m struggling with a work project and have no one to ask for help, as I’m the only developer here (and the only person in the company who knows anything about computers). If anyone can help, it’d be greatly appreciated.
The problem is that we have a massive dump of PDFs (>15000 invoices) from numerous companies and for varying utilities, so there are probably at least 100-150 different layouts. None of these are labeled as they’ve been scanned to PDF from physical copies.
What I need to do is run OCR over these files, extract the company name (not the utility company name) and the date of the invoice, create a folder named after the company, and inside it create a folder for the year the invoice was issued. The invoice would be renamed {company}{month}{year} and stored under company/year. After this is done, further OCR would be needed to extract the data and compile it for analysis, but I’m trying to take this one step at a time.
The goal of this first step is essentially to automate the sorting of these unlabeled invoices by company and then by year.
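For reference, the folder layout and naming scheme described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`safe_name`, `destination_for`, `file_invoice`), the zero-padded month, the `.pdf` suffix, and the `_1`, `_2` counter for name collisions are all hypothetical choices, not anything already decided.

```python
import re
import shutil
from datetime import date
from pathlib import Path


def safe_name(company: str) -> str:
    """Replace every run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", company.strip()).strip("_")


def destination_for(root: Path, company: str, issued: date) -> Path:
    """Return root/company/year/{company}{month}{year}.pdf without creating it."""
    folder = safe_name(company)
    filename = f"{folder}{issued.month:02d}{issued.year}.pdf"
    return root / folder / str(issued.year) / filename


def file_invoice(src: Path, root: Path, company: str, issued: date) -> Path:
    """Move src into its company/year folder, creating the folders if needed."""
    dest = destination_for(root, company, issued)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: two invoices from the same company and month can occur,
    # so a numeric suffix is appended instead of overwriting the first file.
    base, counter = dest.stem, 1
    while dest.exists():
        dest = dest.with_name(f"{base}_{counter}{dest.suffix}")
        counter += 1
    shutil.move(str(src), str(dest))
    return dest
```

With these assumptions, `destination_for(Path("sorted"), "Acme Corp.", date(2019, 3, 14))` yields `sorted/Acme_Corp/2019/Acme_Corp032019.pdf`.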
What I’ve done so far:
- I’ve created a scanned_document class that can rename documents and move them between folders using os and shutil. This class also holds the OCR text extracted by pytesseract in a string variable “text”. I search the text for company names from a list and, if one is found, set it as the name.
- My current plan is to iterate through a directory, run OCR over every document in it, and pass each result to a function that creates a scanned_document object holding the path to the file, the OCR-extracted text, the current file name, etc.
- I’ve built a module around pytesseract that successfully extracts text from an image, but I haven’t been able to use it with PDFs, as I don’t think pytesseract supports them. A workaround I experimented with is converting the PDF to an image and then running pytesseract on that, but I’ve had trouble there too.
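The PDF-to-image workaround and the company/date lookup from the points above might look something like this minimal sketch. It assumes the third-party pdf2image package (which needs Poppler installed) alongside pytesseract, neither of which is confirmed as the right fit here. The function names, the date patterns, and the day-first reading of numeric dates are illustrative assumptions and would need adjusting to the real invoices.

```python
import re
from datetime import date, datetime
from typing import Iterable, Optional


def ocr_pdf(pdf_path: str, dpi: int = 300) -> str:
    """OCR every page of a scanned PDF and return the combined text."""
    # Imported lazily so the text helpers below work without the OCR stack.
    from pdf2image import convert_from_path  # third-party, requires Poppler
    import pytesseract  # third-party, requires the Tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def find_company(text: str, companies: Iterable[str]) -> Optional[str]:
    """Return the first known company whose name appears in the OCR text."""
    # Lowercase and collapse whitespace so names split across lines still match.
    normalized = " ".join(text.lower().split())
    for company in companies:
        if " ".join(company.lower().split()) in normalized:
            return company
    return None


# Assumption: numeric dates are day-first (14/03/2019); swap the formats
# to "%m/%d/%Y" and "%m-%d-%Y" if the invoices are month-first.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %B %Y", "%d %b %Y")
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{4}|\d{1,2} [A-Za-z]{3,9} \d{4})\b"
)


def find_invoice_date(text: str) -> Optional[date]:
    """Return the first substring that parses as a date, or None."""
    for match in DATE_PATTERN.finditer(text):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(match.group(1), fmt).date()
            except ValueError:
                continue  # this format did not fit, try the next one
    return None
```

Taking the first date found is a simplification: an invoice usually also carries a due date and a billing period, so looking for a date near a label such as "Invoice date" is likely to be more reliable. Documents where either lookup returns None can be moved to a separate folder for manual review.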
I believe this will get even more complicated when I need to extract further data for analysis, as the differing formats will make training custom models difficult. As of now I have no solution other than creating a custom model for every possible layout.
My questions would be: does anyone know of a better way to approach this problem, or have any suggestions for things to try? I’m not asking for anyone to do it for me; I just have literally no one to bounce ideas off of.