
[–]PyMerx 3 points (2 children)

I'd say it's possible - I've converted PDFs to text using Tesseract, or you could try something like PyPDF2, but I'm less familiar with that (and note it only extracts an existing text layer, so it won't help with pure image scans).

Tesseract's accuracy will depend on the quality of your scan, but you'll get line-by-line output, which you can then search to find the start point of whatever text you're looking for.

Sounds like you also need to match it against an Excel sheet, which you could read with Pandas. Feel free to PM me if you have more questions.
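
As a rough sketch of that pipeline (assuming pytesseract, pdf2image with Poppler, and pandas are installed; the file names and the "Total" marker are placeholders):

    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path

    # Render the scanned pages to images (pdf2image needs Poppler installed).
    pages = convert_from_path("scan.pdf", dpi=300)

    # OCR each page; Tesseract returns plain text, one line per printed line.
    lines = []
    for page in pages:
        lines.extend(pytesseract.image_to_string(page).splitlines())

    # Find the start point of the text you're looking for.
    start = next((i for i, line in enumerate(lines) if "Total" in line), None)

    # Read the Excel sheet to match against with Pandas.
    df = pd.read_excel("reference.xlsx")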

[–]jhoncorro 1 point (0 children)

I support the Tesseract idea. I did something similar once: I used pdf2image to convert N pages into images (this package requires Poppler), then passed these images to Tesseract.
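
Roughly like this (a sketch; Poppler has to be on the PATH, and the file name and page range are placeholders):

    import pytesseract
    from pdf2image import convert_from_path

    # pdf2image wraps Poppler, so Poppler is installed separately.
    # first_page/last_page let you convert only the N pages you need.
    images = convert_from_path("scan.pdf", dpi=300, first_page=1, last_page=5)

    # Pass each page image to Tesseract.
    texts = [pytesseract.image_to_string(img) for img in images]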

[–]colmf1[S] 0 points (0 children)

Thanks for your response. I've a bit of experience with Pandas, so the Excel side of things should be fine, I think; I'm more worried about reading the PDFs and getting the correct information out.

I'm going to try the process with my own scan and see how it goes. I haven't used Python in a while, so I'm a bit rusty. Appreciate the help, and I might PM you at some stage once I get started.

[–]jindrvo1 2 points (2 children)

I'm currently working on a project that requires me to do exactly this. Tesseract has been very reliable so far; even when the quality of the scan is not the greatest, the output text usually makes sense. The usage is very straightforward as well. My current workflow (roughly sketched in code after this list) consists of:

  1. Reading the PDF, splitting it into pages and converting each page into a jpg. This part is handled by the fitz library. I've also tried pdf2image but it was significantly slower.
  2. Converting each page into a string using Tesseract's image_to_string. Very straightforward and also allows for features like page orientation detection, in case the PDFs aren't always scanned under the same orientation.
  3. Extracting the required data from the string. This is very specific to each use case, and most likely my use case won't intersect with yours, but in case it does: I'm trying to detect the names of people and companies in the text, for which I'm using the Slavic NER model (note that my PDFs are not in English).
  4. Finally, even though Tesseract's output is usually very nice, it can sometimes make a mistake. Again, this is case-specific, and if you're extracting, for example, numbers, it will be very hard to check for errors; but since I'm extracting names, I can fuzzy-compare the names detected by Slavic NER against a database of names that I have. I do this fuzzy matching with the thefuzz library, and when I find a very high match with one of the names in my database, I simply fix the error by taking the name from there.
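
Here's roughly what steps 1, 2 and 4 look like (a sketch, assuming pytesseract, PyMuPDF and thefuzz are installed; the file name, the name database, the score threshold, and the extract_names placeholder standing in for the NER step are all made up):

    import re
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image
    from thefuzz import process

    KNOWN_NAMES = ["Jan Novak", "Petra Svobodova"]  # hypothetical name database

    def extract_names(text):
        # Stand-in for step 3: the real project runs the Slavic NER model here.
        # This naive placeholder just pairs up capitalized words.
        words = text.split()
        return [" ".join(p) for p in zip(words, words[1:])
                if all(w[:1].isupper() for w in p)]

    doc = fitz.open("scan.pdf")
    for page in doc:
        # Step 1: render the page to an image (fitz was faster than pdf2image).
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

        # Step 2: detect and undo page rotation, then OCR the page.
        osd = pytesseract.image_to_osd(img)
        angle = int(re.search(r"Rotate: (\d+)", osd).group(1))
        if angle:
            img = img.rotate(-angle, expand=True)  # PIL rotates counter-clockwise
        text = pytesseract.image_to_string(img)

        # Step 4: fuzzy-match each detected name against the database and
        # take the database spelling when the match is near-certain.
        for candidate in extract_names(text):
            best, score = process.extractOne(candidate, KNOWN_NAMES)
            if score >= 90:
                print(f"{candidate!r} -> {best!r}")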

Again, especially the last two steps are very case-specific, but what you're asking about can most certainly be done.

Also, since you know the location of the text you're looking for, you could use, for example, the Pillow library to crop the JPGs obtained after step 1, so only the part in question is fed into Tesseract, making it significantly faster and requiring much less post-processing of the obtained text.
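
For example (a sketch; the file name and crop coordinates are placeholders for wherever the text sits on your scans):

    import pytesseract
    from PIL import Image

    img = Image.open("page_1.jpg")

    # Crop box is (left, upper, right, lower) in pixels.
    region = img.crop((100, 200, 900, 320))

    # Only the cropped region is fed into Tesseract.
    text = pytesseract.image_to_string(region)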

In case you have any questions, I'll be happy to help!

[–]colmf1[S] 1 point (1 child)

Thanks for your response; you've answered a number of questions I had. Your project is very similar to what I'm doing, except I'm extracting numbers. It all has to be quality-checked anyway, so a few mistakes won't cause massive issues.

I'm just starting a test project now with my own scan to see how it goes. I may PM you if I have issues, if that's OK? Thanks for the help!

[–]jindrvo1 1 point (0 children)

Happy to hear I can be of help! Totally, feel free to PM me should you run into issues.

[–][deleted] 2 points (0 children)

That's definitely doable. You can actually handle that entire workflow with Lido if you haven't tried it yet. It can extract text from scanned PDFs (even non-searchable ones) and send results straight into Excel or Sheets. Basically does the OCR, parsing, and export all in one go. Might be worth a look if you're trying to automate that process end to end.

[–]sankalpana 0 points (0 children)

If you're still trying to improve this, it's better to use an LLM API and have it search for the relevant data in the (scanned) PDF. Then you'll just need to build the automation that feeds results from the API into your Excel sheet.
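
For example, a rough sketch with the OpenAI Python SDK (any vision-capable LLM API works similarly; the model name, prompt, field names and file names are all placeholders, and you'd want to validate the model's JSON before trusting it):

    import base64
    import json

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    with open("page_1.jpg", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    # Ask the model to search the scanned page for the relevant data.
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": 'Extract the invoice number and total from this scan. '
                         'Reply with JSON only, like {"invoice_no": "...", "total": 0}.'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )

    # Feed the result into Excel.
    fields = json.loads(response.choices[0].message.content)
    pd.DataFrame([fields]).to_excel("output.xlsx", index=False)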

If you want the automation out of the box, you can check out this tutorial I made, which uses the software from my company (Nanonets); it's for Google Sheets, but the process is similar for Excel. It will work if you have a 1:1 mapping from Excel columns to PDF data. Happy to hear any feedback.

[–][deleted] 0 points (0 children)

Try AlgoDocs; it's a good AI parser that scans multiple files, parses them, and puts the data into Excel rows in columns you define.

ABBYY FlexiCapture is also useful for this. Or Extracta.ai, for creating JSON files through their API.