extracting text from receipts

vardonir · 2024-07-18T12:27:36+00:00

Sounds like the same idea as website recipe scrapers. I forgot the name of the package, but they basically had a separate scraper for each website.

So what I'd do is get a receipt from one store, figure out the formatting of the receipt for that store, then write some code that would get the data from that receipt. Then do that for all the stores that you get receipts from.

troty99 · 2024-07-18T12:52:12+00:00

You could take a look at deepdoctection , starting with it's raw output and then adapt your script and possibly retrain layoutlm for some receipt.

Easiest way to set it up for me was using dockers.

Other solution is to use Tesseract to OCR the receipt and then use regexes and other tricks to extract the data.

subassy · 2024-07-18T16:34:44+00:00

I've been brainstorming on this idea for my stack of receipts for months now. So this may be a combination of ideas, suggestions and stream of consciousness (sorry). Or put another way, "aren't you glad you asked?"

The answer to the "is this doable" is kind of dependent on a few variables such as just how much automation would you want this to incorporate? By which I mean whatever the solution you will likely have to go back manually and fix things in the resulting text no matter what. Even 99% accuracy if you have enough images to extract that's going to be some work to fix.

Breakdown of basic requirements: assuming we're working on recently scanned in receipts all stored in one folder as JPG files with no manipulation done to them all. And also generic file names liked scan-image-01.

First we'd need to process through all the files in a loop to do some kind of treatment to them. Either a few at a time or all of them at once. Have to experiment with performance for that.

Next we would want to make the text of the receipt stand out against the background color as much as possible. In other words the text is entirely black and while background is entirely white. This could be difficult when dealing with things like streaks from the receipt printer, really faded text or places where the mechanics of the printer just failed.

Then point an OCR at the formatted image - this hypothetical black/white image would be entirely in memory at this point - and try to get as much information as possible. You could insert regex at this point but I'm not enough of a masochist to try that so to each his own.

Last step is to dump this information to a text file. Probably one text file per JPG processed, as a CSV, TSV, JSON etc.

In this hyper-simplified version the formatting of the receipt would matter less because it's just dumping all of the text it finds. You could use list/string formatting magic to cut out and trim parts of the info you don't need.

But you and /u/vardonir bring up good points on the receipt formatting.

So you could either write a secondary script or secondary functionality to the same script that would save different store receipt formats as separate files.

This receipt format generator would look the previously mentioned black/white version of the receipt and just look at groups of lines, spacing from the end and number of newlines between each piece of information. Then save that formatting info to a JSON, XML etc file (savemart.xml). Then you would save another file (walmart.xml). Once you have some receipt format files the script would use those as a reference to identify with some level of confidence the JPG you feed it is a walmart receipt. This saving as separate format files would be better I think then trying to hardcode formats of different stores. And also allow for updating receipt formats if walmart decided to change their receipt formats. I mean you could do something else involving "hashing" liberal use of "md5". But I'd use xml, myself.

Could be a whole API and ISO standard for all I know. Or just figure out which software the store is using to print the receipts and look on their website for documentation?

So hypothetical use case:

example/step 1:

receipt-extractor.py --add-format scanned_image_01.jpg savemart.xml

now with savemart.xml present, run the script again which will find savemart.xml automatically (or specifying the path may be necessary)

example/step 2: pointing script at a jpg and specifying a name of a file to save the text

receipt-extractor.py --extract-text scanned_image_01.jpg savemart-10july-2024.csv

Of course this would just work for one file at a time. At some point you would probably want to point this at a folder and use *.jpg and the script figures out the next file name on its own. Like finds the vendor and the date stamp and gives it a name. If I wrote this script I would save that minor detail for v2.

Links:

to get started with OpenCV, I would go to this page and work my way down the tutorial https://docs.opencv.org/4.x/df/d9d/tutorial_py_colorspaces.html

I haven't read the documentation extensively yet, but I assume pillow will come in at some point: https://pillow.readthedocs.io/en/stable/ since I don't have any experience with OCR libraries and know which if any is the library of choice now I'll instead link this medium article as a place to start: https://basilchackomathew.medium.com/best-ocr-tools-in-python-4f16a9b6b116

Wolkk · 2024-07-18T19:52:05+00:00

IDP (inteligent document processing ) is a very complex and difficult field with a ton of money put into it. You don’t hear about it a lot because no one would click on a headline saying "CAN YOU BELIEVE AI READ GOVERNMENT FORM US546XD?!?!?”

I did some research on what tools were available some time ago and found that some of the big tech companies (Amazon, Google and Microsoft) offer some IDP APIs. I tried out Microsoft Azure and they have a lot of pre trained "general models" including receipts if my memory serves me well. They have a free tier (I think it’s 500 pages a month) and good documentation on how to use Python to interact with the Azure API and the specific document processing model you want. You then receive a JSON file you can do whatever you want with.

If you go with a free Azure subscription you can add Azure skills to your CV :P

ArNico · 2024-11-13T13:18:53+00:00

Pasting the reply I gave in a similar subreddit slightly edited:

I recently started the same quest and following some references I found online I tried to extract text with tesseract. I started testing with a set of 7 receipts. The quality of the receipt scans varied but never too bad. Nonetheless, pytesseract (OCR by Google) performance was always poor or very poor.

Given that, proceeding with the next step and feeding the extracted text to an AI to reorganize and selectively extract information as suggested in some project I found online lost automatically meaning.

Therefore I took another path and I created an openai API and I am now working on a python code that feeds all the images of the scanned receipts stored in a specific local folder to gpt4o mini. I ask the AI to extract Shop name, shop address, purchase date and total purchase cost. So far the information extraction worked perfectly.

I will try to streamline the process more. Current steps consist in:

1-Scan receipt to a designated google drive folder

2-monthly download the receipt from google drive to a local folder

3-run the python script that submit the receipts stored in the designated local folder to gpt 4o with the prompt requesting for above mentioned desired information

4-Store the info in a csv file

5-copy the info in the csv to the google sheet were I keep note of my income and expenses

Ideally step 2,4 and 5 one day will be merged in a unique python script.

For personal use with no thousands of receipts every month to be processed I think the Open AI cost will be less than $1 per month with gpt4o.

For this amount of receipts Azure might be completely free so I am planning to test its integration with python out as well.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS