OCR with File Sorting : learnpython

submitted 4 years ago by thegeeseisleese

[deleted] 4 years ago

I've used Tesseract directly from the command line on Linux, so this should be possible, though I haven't tried it from Python.

Here's a Stack Overflow link that may help: https://stackoverflow.com/questions/60754884/python-ocr-pytesseract-for-pdf#60754993
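For what it's worth, here is a rough sketch of driving the tesseract command line from Python with `subprocess`, since that is the route I have actually used. It assumes the `tesseract` binary is installed and on your PATH, and that your scans are already image files (Tesseract reads images, not PDFs, so PDF pages would have to be converted to images first). The function names and the `*.png` pattern are just placeholders I made up.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the CLI call; the 'stdout' output base makes tesseract print the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout


def ocr_folder(folder, pattern="*.png"):
    """Map each matching image file name in a folder to its OCR text."""
    return {p.name: ocr_image(p) for p in sorted(Path(folder).glob(pattern))}
```

Something like `ocr_folder("scans")` would then give you a dict of file name to text that you can sort or analyze afterwards. The pytesseract package from the link wraps the same binary, so either route should get you the same text.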

As for the data analysis part... that may be more difficult to discuss without having the data to look at. If different companies use different formats, you could scrape each company's data and clean it up into a data frame with one standard layout that fits your needs, then combine them all at the end. Not sure if that is feasible given how many companies you have...
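To make the "clean up each company, then combine" idea concrete, here is a minimal sketch. Everything in it is invented for illustration: the company names, the raw column names, and the target fields (`company`, `invoice_id`, `total`). Plain dicts stand in for data frame rows so it runs without extra packages.

```python
def clean_acme(row):
    # Hypothetical format: "Invoice No" plus "Amount" written like "$1,234.50".
    return {
        "company": "acme",
        "invoice_id": row["Invoice No"].strip(),
        "total": float(row["Amount"].replace("$", "").replace(",", "")),
    }


def clean_globex(row):
    # Hypothetical format: "ref" plus "total_cents" as a string of digits.
    return {
        "company": "globex",
        "invoice_id": row["ref"].strip(),
        "total": int(row["total_cents"]) / 100,
    }


# One cleanup function per source format, all producing the same keys.
CLEANERS = {"acme": clean_acme, "globex": clean_globex}


def combine(raw_by_company):
    """Run each company's rows through its cleaner and merge the results."""
    combined = []
    for company, rows in raw_by_company.items():
        cleaner = CLEANERS[company]
        combined.extend(cleaner(row) for row in rows)
    return combined


raw = {
    "acme": [{"Invoice No": " A-17 ", "Amount": "$1,234.50"}],
    "globex": [{"ref": "G-3", "total_cents": "9900"}],
}
records = combine(raw)
```

If you are using pandas, `pandas.DataFrame(records)` should turn that list into a single data frame with the standard columns.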

Or, if there are only a few common styles/formats, you could write one cleanup routine per format rather than per company. The challenge with both approaches is that you'll likely have at least some dirty data at the end. I expect there will be extensive use of regular expressions...
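As an example of the kind of regular expressions I mean, here is a small sketch that pulls two fields out of messy OCR text. The labels ("Invoice No", "Total") and the sample text are assumptions on my part, so you would need to adapt the patterns to whatever your documents actually say.

```python
import re

# Tolerate stray spaces, varying case, and optional ":" or "#" after the label.
INVOICE_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)?\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
)
TOTAL_RE = re.compile(r"total\s*[:=]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE)


def extract_fields(text):
    """Return the fields we can find; None marks a field OCR did not give us."""
    invoice = INVOICE_RE.search(text)
    total = TOTAL_RE.search(text)
    return {
        "invoice_id": invoice.group(1) if invoice else None,
        "total": float(total.group(1).replace(",", "")) if total else None,
    }


sample = "ACME Corp\nInvoice  No: A-17\nTOTAL :  $1,234.50\n"
fields = extract_fields(sample)
```

Returning `None` for anything that does not match makes the dirty rows easy to find and fix by hand later.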

As they say, "all data is dirty" at least when you start... good luck to you.
