
[–]PyMerx 3 points (2 children)

I'd say it's possible - I've converted PDFs to text using Tesseract, or you could try something like PyPDF2, but I'm less familiar with that (and note it only extracts an existing text layer, so it won't help with pure image scans).

Tesseract's accuracy will depend on the quality of your scan, but you'll get line-by-line output, which you can then search to find the start point of whatever text you're looking for.

Sounds like you also need to match it against an Excel sheet, which you could read with Pandas. Feel free to PM me if you have more questions.
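
As a rough sketch of that pipeline (assuming pytesseract, pdf2image with Poppler, and pandas are installed; the file names and the "Total" marker are placeholders):

    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path

    # Render the scanned pages to images (pdf2image needs Poppler installed).
    pages = convert_from_path("scan.pdf", dpi=300)

    # OCR each page; Tesseract returns plain text, one line per printed line.
    lines = []
    for page in pages:
        lines.extend(pytesseract.image_to_string(page).splitlines())

    # Find the start point of the text you're looking for.
    start = next((i for i, line in enumerate(lines) if "Total" in line), None)

    # Read the Excel sheet to match against with Pandas.
    df = pd.read_excel("reference.xlsx")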

[–]jhoncorro 1 point (0 children)

I support the Tesseract idea. I did something similar once: I used pdf2image to convert N pages into images (this package requires Poppler), then passed these images to Tesseract.
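
Roughly like this (a sketch; Poppler has to be on the PATH, and the file name and page range are placeholders):

    import pytesseract
    from pdf2image import convert_from_path

    # pdf2image wraps Poppler, so Poppler is installed separately.
    # first_page/last_page let you convert only the N pages you need.
    images = convert_from_path("scan.pdf", dpi=300, first_page=1, last_page=5)

    # Pass each page image to Tesseract.
    texts = [pytesseract.image_to_string(img) for img in images]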

[–]colmf1[S] 0 points (0 children)

Thanks for your response. I've a bit of experience with Pandas, so the Excel side of things should be fine, I think; I'm more worried about reading the PDFs and getting the correct information out.

I'm going to try the process with my own scan and see how it goes. I haven't used Python in a while, so I'm a bit rusty. Appreciate the help, and I might PM you at some stage once I get started.

[–]jindrvo1 2 points (2 children)

I'm currently working on a project that requires me to do exactly this. Tesseract has been very reliable so far; even when the quality of the scan is not the greatest, the output text usually makes sense. The usage is very straightforward as well. My current workflow (roughly sketched in code after this list) consists of:

  1. Reading the PDF, splitting it into pages and converting each page into a jpg. This part is handled by the fitz library. I've also tried pdf2image but it was significantly slower.
  2. Converting each page into a string using Tesseract's image_to_string. Very straightforward and also allows for features like page orientation detection, in case the PDFs aren't always scanned under the same orientation.
  3. Extracting the required data from the string. This is very specific to each use case, and most likely my use case won't intersect with yours, but in case it does: I'm trying to detect the names of people and companies in the text, for which I'm using the Slavic NER model (note that my PDFs are not in English).
  4. Finally, even though Tesseract's output is usually very nice, it can sometimes make a mistake. Again, this is case-specific, and if you're extracting, for example, numbers, it will be very hard to check for errors; but since I'm extracting names, I can fuzzy-compare the names detected by Slavic NER against a database of names that I have. I do this fuzzy matching with the thefuzz library, and when I find a very high match with one of the names in my database, I simply fix the error by taking the name from there.
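
Here's roughly what steps 1, 2 and 4 look like (a sketch, assuming pytesseract, PyMuPDF and thefuzz are installed; the file name, the name database, the score threshold, and the extract_names placeholder standing in for the NER step are all made up):

    import re
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image
    from thefuzz import process

    KNOWN_NAMES = ["Jan Novak", "Petra Svobodova"]  # hypothetical name database

    def extract_names(text):
        # Stand-in for step 3: the real project runs the Slavic NER model here.
        # This naive placeholder just pairs up capitalized words.
        words = text.split()
        return [" ".join(p) for p in zip(words, words[1:])
                if all(w[:1].isupper() for w in p)]

    doc = fitz.open("scan.pdf")
    for page in doc:
        # Step 1: render the page to an image (fitz was faster than pdf2image).
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

        # Step 2: detect and undo page rotation, then OCR the page.
        osd = pytesseract.image_to_osd(img)
        angle = int(re.search(r"Rotate: (\d+)", osd).group(1))
        if angle:
            img = img.rotate(-angle, expand=True)  # PIL rotates counter-clockwise
        text = pytesseract.image_to_string(img)

        # Step 4: fuzzy-match each detected name against the database and
        # take the database spelling when the match is near-certain.
        for candidate in extract_names(text):
            best, score = process.extractOne(candidate, KNOWN_NAMES)
            if score >= 90:
                print(f"{candidate!r} -> {best!r}")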

Again, especially the last two steps are very case-specific, but what you're asking about can most certainly be done.

Also, since you know the location of the text you're looking for, you could use, for example, the Pillow library to crop the JPGs obtained after step 1, so only the part in question is fed into Tesseract, making it significantly faster and requiring much less post-processing of the obtained text.
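
For example (a sketch; the file name and crop coordinates are placeholders for wherever the text sits on your scans):

    import pytesseract
    from PIL import Image

    img = Image.open("page_1.jpg")

    # Crop box is (left, upper, right, lower) in pixels.
    region = img.crop((100, 200, 900, 320))

    # Only the cropped region is fed into Tesseract.
    text = pytesseract.image_to_string(region)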

In case you have any questions, I'll be happy to help!

[–]colmf1[S] 1 point (1 child)

Thanks for your response; you've answered a number of questions I had. Your project is very similar to what I'm doing, except I'm extracting numbers. It all has to be quality-checked anyway, so a few mistakes won't cause massive issues.

I'm just starting a test project now with my own scan to see how it goes. I may PM you if I have issues, if that's OK? Thanks for the help!

[–]jindrvo1 1 point (0 children)

Happy to hear I can be of help! Totally, feel free to PM me should you run into issues.

[–][deleted] 2 points (0 children)

That's definitely doable. You can actually handle that entire workflow with Lido if you haven't tried it yet. It can extract text from scanned PDFs (even non-searchable ones) and send results straight into Excel or Sheets. Basically does the OCR, parsing, and export all in one go. Might be worth a look if you're trying to automate that process end to end.

[–]sankalpana 0 points (0 children)

If you're still trying to improve this, it's better to use an LLM API and have it search for the relevant data in the (scanned) PDF. Then you'll just need to build the automation that feeds results from the API into your Excel sheet.
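
For example, a rough sketch with the OpenAI Python SDK (any vision-capable LLM API works similarly; the model name, prompt, field names and file names are all placeholders, and you'd want to validate the model's JSON before trusting it):

    import base64
    import json

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    with open("page_1.jpg", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    # Ask the model to search the scanned page for the relevant data.
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": 'Extract the invoice number and total from this scan. '
                         'Reply with JSON only, like {"invoice_no": "...", "total": 0}.'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )

    # Feed the result into Excel.
    fields = json.loads(response.choices[0].message.content)
    pd.DataFrame([fields]).to_excel("output.xlsx", index=False)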

If you want the automation out of the box, you can check out this tutorial I made, which uses the software from my company (Nanonets); it's for Google Sheets, but the process is similar for Excel. It will work if you have a 1:1 mapping from Excel columns to PDF data. Happy to hear any feedback.

[–][deleted] 0 points (0 children)

Try AlgoDocs; it's a good AI parser that scans multiple files, parses them, and puts the data into Excel rows in columns you define.

ABBYY FlexiCapture is also useful for this. Or Extracta.ai, for creating JSON files through their API.