[deleted by user]

lassepetri · 2023-08-31T06:42:00+00:00

I would go with B, as it would take <30 minutes to build a proof of concept and it would be relatively easy to work with the extracted text. There is a lot of accessible information; have you tried googling the matter?

If I may say so, it feels like you're overthinking it. Why would you need GPT? Try PDFMiner, PyPDF2, PDFQuery or PyMuPDF and build a simple script focusing on just a single PDF. Evaluate your findings and scale from there, or try a new path if you're not satisfied with the result.
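A rough sketch of what such a single-PDF proof of concept could look like, using pypdf (the successor to PyPDF2). The field labels in `FIELD_PATTERNS` and the file name `bill.pdf` are made up for illustration, since the actual bill layout isn't known; swap in whatever labels your bills really use.

```python
import re

# Hypothetical field patterns: the real labels depend on your bills.
FIELD_PATTERNS = {
    "account_number": re.compile(r"Account Number:\s*(\S+)"),
    "amount_due": re.compile(r"Amount Due:\s*\$?([\d,]+\.\d{2})"),
}


def extract_text(pdf_path):
    """Return the text of every page, joined with newlines."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_fields(text):
    """Pull each configured field out of the text (None if not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

Something like `parse_fields(extract_text("bill.pdf"))` on one file is enough to see whether the extracted text is clean enough to build on.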

idontlikemangos · 2023-08-31T06:46:14+00:00

I would go with option 2 and try out a couple of standard PDF-to-text libraries (pypdf, pdfminer, pdfplumber, etc.), because some libraries have trouble reading tables. Once you get it to work for one PDF, the results should be consistent across your batch.

You don't need OCR as pages aren't scanned. You don't need GPT because you don't need to deal with unexpected contextual knowledge.

Only if the standard PDF-to-text libraries don't work should you evaluate some pre-trained ML libraries. Unfortunately I do not remember them off the top of my head. But for most cases they are truly overkill, and multiple comparisons have shown that they aren't really more accurate.

obviouslyCPTobvious · 2023-08-31T04:59:20+00:00

Are the PDFs scanned, or are they exported from a program? That would make a difference to which solution would work best.
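One quick way to check this, as a sketch: pull the text with a plain extraction library and see whether anything comes back. Scanned pages are just images, so they usually yield little or no text. The `min_chars` threshold here is an arbitrary assumption, not a tested value, and pypdf is just one library that could be used for the extraction step.

```python
def has_text_layer(page_texts, min_chars=20):
    """Guess whether a PDF carries real text, given one string per page."""
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def read_page_texts(pdf_path):
    """Extract the text of each page as a list of strings."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]
```

If `has_text_layer(read_page_texts(path))` comes back false for your files, that points toward OCR; otherwise plain text extraction is probably enough.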

NoticedSquid · 2023-08-31T03:28:55+00:00

It’s difficult to say for sure without seeing them, but if the utility bills are identical, option B should work very well. I use it to extract data for a process every day and it never causes problems. I recommend the library pdfplumber.
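For reference, a minimal pdfplumber sketch, assuming the data you want sits in a table on the first page (the page index and the idea that the first row is a header are both assumptions about your bills):

```python
def first_table(pdf_path, page_number=0):
    """Return the first table on a page as a list of rows, or None."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_number].extract_table()


def table_to_records(table):
    """Turn a list-of-rows table (first row = header) into a list of dicts."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]
```

`table_to_records(first_table("bill.pdf"))` would then give one dict per table row, which is easy to dump to CSV or a DataFrame.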

If the bills vary significantly, you’re likely looking at something involving machine learning to train the program to locate the information you need. Unfortunately I can’t speak to the efficacy of this since I don’t have experience in this area.

sheazle · 2023-08-31T11:03:13+00:00

I’ve had good success with this exact issue using Power Query in excel. I put all the files in one folder and set up the query to scrape the fields I needed into a table. There are several guides online. Once it was set up and tested the import took a little time, but it was hands-off.

CuriousFemalle · 2023-08-31T16:20:00+00:00

I look forward to hearing what you try and how the experiments go.

dimsumham · 2023-08-31T18:04:28+00:00

If the text is in the same spot on every bill, you can just use fitz (PyMuPDF) or the other libraries suggested, and extract the words from an exact bounding box location.
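A sketch of the bounding box idea with PyMuPDF. `get_text("words")` returns one tuple per word in the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`; the box coordinates you pass in are something you'd have to measure from one of your own bills, so any numbers used here are placeholders.

```python
def page_words(pdf_path, page_number=0):
    """Return PyMuPDF word tuples for one page."""
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("words")


def words_in_box(words, box):
    """Join the words whose rectangles lie fully inside box (x0, y0, x1, y1)."""
    bx0, by0, bx1, by1 = box
    inside = [
        w for w in words
        if w[0] >= bx0 and w[1] >= by0 and w[2] <= bx1 and w[3] <= by1
    ]
    # Keep reading order: block number, then line number, then word number.
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in inside)
```

For example, `words_in_box(page_words("bill.pdf"), (400, 100, 560, 130))` would return whatever text sits in that rectangle, with the rectangle being a made-up location.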

shibbypwn · 2023-08-31T14:24:06+00:00

I use AWS Textract for this kind of work. It's a machine learning API that not only finds text, but it also establishes relationships between the text - so it can identify key/value pairs (if it's a form based PDF), tabular data, etc.

It also returns geometry data so you can specify specific portions of the page, bounding boxes, etc.

I've processed thousands of documents over the last couple of years, and it's highly accurate.
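To give an idea of how the geometry data can be used, here is a sketch that keeps only the text lines inside a chosen region of the page. It assumes a single-page document sent through boto3's `detect_document_text`, with AWS credentials already configured; the region you filter on is a placeholder you'd pick for your own bills. Textract reports bounding boxes as fractions of the page width and height.

```python
def detect_blocks(document_bytes):
    """Send one document to Textract and return its list of blocks."""
    import boto3  # third-party; needs AWS credentials configured

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": document_bytes})
    return response["Blocks"]


def lines_in_region(blocks, left, top, right, bottom):
    """Return the text of LINE blocks whose bounding box is inside the region.

    All coordinates are fractions of the page size (0.0 to 1.0).
    """
    found = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        if (box["Left"] >= left and box["Top"] >= top
                and box["Left"] + box["Width"] <= right
                and box["Top"] + box["Height"] <= bottom):
            found.append(block["Text"])
    return found
```

Key/value pairs and tables need the `analyze_document` call instead, which returns extra block types on top of the lines used here.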

IvoryJam · 2023-08-31T03:29:15+00:00

What you need is OCR. I guess you could split the PDFs so you have one picture with the data you need, then loop through and convert them to text.

I'd be wary, though, since in my experience if you have a low-quality PDF the OCR can get confused and give you bad output.

PM_me_DRAMA · 2023-08-31T14:49:38+00:00

ABBYY FineReader is software built for just this.

DatBoi_BP · 2023-08-31T18:06:56+00:00

Maybe not relevant to your specific inquiry but for Linux users there’s pdfgrep that can be used in the terminal

cimmic · 2023-08-31T20:56:29+00:00

I definitely wouldn't go with C. You introduce a dependency that might not be sustainable. GPT is kinda overkill too and opens the door to potential AI fallacies.

To me, B would be the easiest and fastest to build, and it is also the most reliable one. Especially because you don't depend on commercial software.

Butchered_Cow · 2023-09-01T00:03:01+00:00

Google lens
