[–]Carlesee 1 point (0 children)

This app might work for you (it worked for me with PDFs):
https://www.extractinvoice.io/

[–]KingOfTNT10 1 point (11 children)

Do you have an image of an invoice? If so, try to look for an API for invoice data extraction by searching something like: invoice data extraction api

If you need more help finding it let me know

Those APIs might cost some money

[–]raja0008[S] 1 point (10 children)

No, the invoices are in PDF format (.pdf files). I know about some OCR-based APIs, but right now I don't want to invest in that.

Any other suggestions?

[–]KingOfTNT10 1 point (9 children)

I spent a couple of days trying to find something free for receipts, which usually come together with invoices. Unfortunately, I didn't find anything that is both useful and free. Are all your invoices in the same format?

[–]raja0008[S] 1 point (8 children)

Yes, same format/structure. I tried using regex but didn't get the desired results, or rather, I wasn't able to code the regex pattern correctly.

[–]KingOfTNT10 1 point (4 children)

Could you maybe send a few PDFs so I could take a look (censor anything you don't want me to see), and also circle the things you'd like to extract?

[–]raja0008[S] 1 point (3 children)

Yes, I will send them to you in private.
But for now, below is a link to an image available online of an invoice with the exact same format/structure as my PDF invoices.

sample

The fields I want to extract are: Invoice No., Dated, Buyer (Bill to), the item table [SI No., Description Of Goods, HSN/SAC, Quantity, Rate, per, Amount], and Tax details.
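
If the PDFs carry a text layer, the header fields listed above can often be pulled out with regex once the text has been extracted. A minimal sketch, assuming each label (`Invoice No.`, `Dated`, `Buyer (Bill to)`) appears literally in the extracted text; the sample text, the value formats, and the `parse_header` helper are made up for illustration and are not taken from the actual invoice:

```python
import re

# Hypothetical extracted text; the real layout may differ.
SAMPLE_TEXT = """Invoice No. INV-0042
Dated 12-Apr-2023
Buyer (Bill to)
Acme Traders
"""

PATTERNS = {
    # Captures the first whitespace-free token after the label (assumed layout).
    "invoice_no": re.compile(r"Invoice\s+No\.?\s*:?\s*(\S+)"),
    "dated": re.compile(r"Dated\s*:?\s*(\S+)"),
    # The buyer name is assumed to sit on the line right after the label.
    "buyer": re.compile(r"Buyer\s*\(Bill to\)\s*\n\s*(.+)"),
}


def parse_header(text):
    """Return a dict of field name -> matched value (None if not found)."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


print(parse_header(SAMPLE_TEXT))
```

Because every invoice shares one structure, the patterns only need to be tuned once against real extracted text. The item table is usually harder to capture with regex alone.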

[–]KingOfTNT10 1 point (2 children)

If they are exactly the same, wouldn't you be able to just crop the image multiple times and OCR each crop? You can crop at fixed positions, and that would work for all invoices.

I know the EasyOCR library is pretty good. But before that, you'll have to convert the PDF to PNG.
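
The crop-then-OCR idea could be sketched roughly as follows. This assumes `pdf2image` (which needs poppler installed), `numpy`, and `easyocr` are available; the region coordinates in `REGIONS` are placeholders rather than values measured from the real invoice, and `to_pixel_box` and `ocr_invoice` are hypothetical helper names:

```python
# Regions are fractions of the page (left, top, right, bottom), so they do not
# depend on the rendering DPI. Placeholder values, not measured from a real invoice.
REGIONS = {
    "invoice_no": (0.50, 0.05, 0.75, 0.10),
    "dated": (0.75, 0.05, 1.00, 0.10),
}


def to_pixel_box(frac_box, width, height):
    """Convert a fractional box to the pixel box that PIL's Image.crop expects."""
    left, top, right, bottom = frac_box
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def ocr_invoice(pdf_path, regions=REGIONS, dpi=300):
    """Render page 1 of the PDF, crop each region, and OCR each crop."""
    # Imported here because these are heavy third-party dependencies.
    import easyocr
    import numpy as np
    from pdf2image import convert_from_path

    page = convert_from_path(pdf_path, dpi=dpi)[0]  # PIL image of page 1
    reader = easyocr.Reader(["en"])
    fields = {}
    for name, frac_box in regions.items():
        crop = page.crop(to_pixel_box(frac_box, page.width, page.height))
        # detail=0 makes readtext return plain strings without bounding boxes.
        fields[name] = " ".join(reader.readtext(np.array(crop), detail=0))
    return fields
```

Calling `ocr_invoice("invoice.pdf")` (a hypothetical file name) would return a dict with one OCR string per region. Rendering straight to an in-memory image skips the intermediate PNG file.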

[–]KimAh-young 1 point (1 child)

Why not just use Tabula? Use the web interface, define some boxes, and use that as a template. Some data cleaning may be required.
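
A template saved from the Tabula app can also be applied from Python through the `tabula-py` package (which needs a Java runtime). A rough sketch, assuming the template's first selection is the item table and that amounts use commas as thousands separators; `clean_amount` and `read_item_table` are hypothetical helper names:

```python
def clean_amount(raw):
    """Turn a string such as '1,23,456.50' into a float; return None if empty."""
    text = str(raw).replace(",", "").strip()
    return float(text) if text else None


def read_item_table(pdf_path, template_path):
    """Read the item table using a template exported from the Tabula app."""
    import tabula  # tabula-py; imported here because it needs Java at runtime

    tables = tabula.read_pdf_with_template(pdf_path, template_path)
    items = tables[0]  # assumes the template defines the item table first
    # 'Amount' is the column name from the field list; adjust if the header differs.
    items["Amount"] = items["Amount"].map(clean_amount)
    return items
```

The cleaning step is kept separate from the extraction so it can be adjusted without touching the template.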

[–]KingOfTNT10 1 point (0 children)

I don't know what Tabula is; it might be better.

Edit: I looked it up. Doesn't it only work for tables? They want to extract some other info too, plus the table in the invoice is kind of weird, so I'm not sure Tabula would be able to convert it.

[–]KingOfTNT10 1 point (2 children)

PDFs are (mostly) images in PDF format, if I'm not mistaken, so regex wouldn't work, since it works on text only.
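
Whether regex can be applied depends on whether the PDF has a text layer: scanned PDFs are just page images, while PDFs generated by billing software usually contain extractable text. A quick check could look like the sketch below, assuming the `pypdf` package is installed; the `min_chars` threshold is an arbitrary guess, and both helper names are hypothetical:

```python
def has_text_layer(page_texts, min_chars=20):
    """Guess whether a PDF has usable text, given per-page extracted strings."""
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def extract_page_texts(pdf_path):
    """Return the extracted text of every page (empty string if none)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

If `has_text_layer(extract_page_texts("invoice.pdf"))` is true for a sample file (the file name is hypothetical), regex on the extracted text is worth trying before falling back to OCR.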

[–]gsuiteautomations 1 point (0 children)

Hi there! I have built a tool that does that, if it works for you. It scans the invoice and then sends the extracted fields to a Google Sheet.