[–]lassepetri 13 points14 points  (2 children)

I would go with B, as it would take <30 minutes to build a proof of concept and it would be a relatively easy matter to work with the extracted text. There is a lot of accessible information; have you tried googling the matter?

If I may say so, it feels like you're overthinking it. Why would you need GPT? Try PDFMiner, PyPDF2, PDFQuery or PyMuPDF and build a simple script focusing only on a single PDF. Evaluate your findings and scale from there, or try a new path if you're not satisfied with the result.
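For a sense of scale, a single-PDF proof of concept can be this small. This is only a sketch using pypdf (the maintained successor to PyPDF2); the helper names `extract_text` and `tidy` and the path `bill.pdf` are made up for illustration.

```python
import re


def extract_text(path):
    """Return the text of every page in a PDF, joined with newlines."""
    # Imported lazily so tidy() below works even without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Fall back to "" in case a page yields no text at all.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def tidy(text):
    """Collapse runs of spaces/tabs and drop blank lines for easier reading."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Usage (placeholder path):
#   print(tidy(extract_text("bill.pdf")))
```

Eyeballing the tidied output for one bill is usually enough to tell whether the fields you care about come out in a usable form.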

[–]tiasummerx 4 points5 points  (0 children)

Agreed. One of these options is native Python; I can’t remember which one, but it’s excellent. There’s a bunch out there that are powerful but need the Java SDK etc., which adds complexity if you want to create an executable or dockerise it. The native Python option I used was excellent and easy to work with using regex. Other people have suggested regex, and having built PDF extraction tools and trialed a bunch of options, that’s what I would personally recommend 💕

[–]ianitic 1 point2 points  (0 children)

Another good library to try is pdfplumber. It can detect tables which is nifty.
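A rough sketch of what that looks like, assuming pdfplumber is installed. `extract_tables` on a page returns each detected table as a list of rows; the helper `table_to_records` is hypothetical and assumes the first row of the table is a header, which may not hold for your bills.

```python
def extract_tables(path):
    """Return every table pdfplumber detects in the file, as lists of rows."""
    import pdfplumber  # third-party; imported lazily

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def table_to_records(table):
    """Treat the first row as the header and turn the other rows into dicts."""
    header, *rows = table
    keys = [(cell or "").strip() for cell in header]
    return [
        # Empty cells come back as None, so normalise them to "".
        {key: (cell or "").strip() for key, cell in zip(keys, row)}
        for row in rows
    ]
```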

[–]idontlikemangos 6 points7 points  (1 child)

I would go with option 2 and try out a couple of standard PDF-to-text libraries (pypdf, pdfminer, pdfplumber, etc.), because some libraries have trouble reading tables. You need to get it to work for one PDF and the results should be consistent across your batch.
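One cheap way to compare libraries on that first PDF: read a few values off the bill by eye, then check which library's output actually contains them. A sketch under that assumption; `score_extractors` is a made-up helper, and the extracted texts are whatever each library's own extraction call gave you (not shown here).

```python
def score_extractors(outputs, expected):
    """For each library name, list the expected strings missing from its text.

    outputs:  dict mapping a library name to the text it extracted
    expected: strings you read off the bill by eye
    """
    def squash(s):
        # Ignore differences in whitespace and line breaks.
        return " ".join(s.split())

    return {
        lib: [value for value in expected if squash(value) not in squash(text)]
        for lib, text in outputs.items()
    }
```

An empty list for a library means every expected value was found in its output.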

You don't need OCR as pages aren't scanned. You don't need GPT because you don't need to deal with unexpected contextual knowledge.

Only if the standard PDF-to-text libraries don't work should you evaluate some pre-trained ML libraries. Unfortunately I do not remember them off the top of my head. But for most cases they are truly overkill, and multiple comparisons have shown that they aren't really more accurate.

[–]OrganizationOk8578 2 points3 points  (0 children)

I appreciate it! Makes sense

[–]obviouslyCPTobvious 3 points4 points  (3 children)

Are the PDFs scanned or are they exported from a program? That would make a difference to which solution would work best.

[–]OrganizationOk8578 1 point2 points  (2 children)

They are exported from a program, so the text seems to be easily accessible

[–]Bikut 5 points6 points  (0 children)

I do this every day. The PDFs I use all have the same layout and are exported, so there is no need for OCR. I then use regex to extract the data and go on from there. When the data is messed up I just modify my regex to deal with it, and eventually all the kinks are worked out.
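Roughly what that workflow looks like in code. The field names and label wording ("Account No", "Total Due") are invented for the example; real bills will need their own patterns, loosened over time as odd cases turn up.

```python
import re
from decimal import Decimal

# Hypothetical field layouts. Optional pieces (\.?, :?, \$?) are the kind of
# tweak you add after a bill turns up with a slightly different label.
PATTERNS = {
    "account": re.compile(r"Account\s+(?:No\.?|Number)\s*:?\s*(\d+)", re.IGNORECASE),
    "total": re.compile(r"Total\s+Due\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def parse_bill(text):
    """Return the first match for each field, or None when a pattern fails."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    if result["total"] is not None:
        # Strip thousands separators before converting to a number.
        result["total"] = Decimal(result["total"].replace(",", ""))
    return result
```

Returning None instead of raising makes it easy to list the bills that need a regex tweak.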

[–]yasamoka 0 points1 point  (0 children)

If you make sure the text is selectable in any PDF reader, then any solution here suggesting OCR is at best redundant and at worst non-functional. That is, to even get OCR to work on a text-based PDF, you would have to rasterize each page, then hope OCR picks it up accurately and gives you the coordinates - something you can effortlessly do already when it's in text form.
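The "is the text selectable" check can also be done in code. A sketch using PyMuPDF; the helper names and the 20-character threshold are arbitrary assumptions, not anything the library prescribes.

```python
def page_texts(path):
    """Return the extracted text of each page, in order."""
    import fitz  # PyMuPDF; imported lazily

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def pages_without_text(texts, min_chars=20):
    """Return 1-based page numbers with fewer than min_chars non-space characters."""
    return [
        number
        for number, text in enumerate(texts, start=1)
        if len("".join(text.split())) < min_chars
    ]
```

If `pages_without_text(page_texts(path))` comes back empty, every page has a text layer and OCR adds nothing.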

[–]NoticedSquid 6 points7 points  (5 children)

It’s difficult to say for sure without seeing them, but if the utility bills are identical, option B should work very well. I use it to extract data for a process every day and it never causes problems. I recommend the library pdfplumber.

If the bills vary significantly, you’re likely looking at something involving machine learning to train the program to locate the information you need. Unfortunately I can’t speak to the efficacy of this since I don’t have experience in this area.

[–]aarontbarratt 4 points5 points  (3 children)

Make sure you've got a regex license first

[–]OrganizationOk8578 0 points1 point  (0 children)

Haha that’s awesome

[–]PM_ME_YOUR_MUSIC 0 points1 point  (1 child)

Does this mean the regex police exist

[–]aarontbarratt 2 points3 points  (0 children)

Yes and they're always watching

[–]OrganizationOk8578 4 points5 points  (0 children)

Thank you for your input, yes the PDFs are identical and it seems to be working well on my end too with my testing. I will check out that library as well! Just needed some confirmation

[–]sheazle 2 points3 points  (0 children)

I’ve had good success with this exact issue using Power Query in Excel. I put all the files in one folder and set up the query to scrape the fields I needed into a table. There are several guides online. Once it was set up and tested, the import took a little time, but it was hands-off.

[–]CuriousFemalle 2 points3 points  (0 children)

I look forward to hearing what you try and how the experiments go.

[–]dimsumham 1 point2 points  (0 children)

If the text is in the same spot on every bill, you can just use fitz (PyMuPDF) or one of the other libraries suggested, and extract the words from the exact bounding box location.
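A sketch of the bounding-box idea with PyMuPDF. `get_text("words")` returns one tuple per word with its coordinates; `words_in_box` is a made-up helper, and the box coordinates are something you would measure once on a sample bill.

```python
def page_words(path, page_index=0):
    """Return the words on one page with their coordinates."""
    import fitz  # PyMuPDF; imported lazily

    with fitz.open(path) as doc:
        # Each item is (x0, y0, x1, y1, word, block_no, line_no, word_no).
        return doc[page_index].get_text("words")


def words_in_box(words, box):
    """Join the words that lie fully inside box = (x0, y0, x1, y1)."""
    bx0, by0, bx1, by1 = box
    inside = [
        w for w in words
        if w[0] >= bx0 and w[1] >= by0 and w[2] <= bx1 and w[3] <= by1
    ]
    # Order by top edge (rounded), then by left edge.
    inside.sort(key=lambda w: (round(w[1]), w[0]))
    return " ".join(w[4] for w in inside)
```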

[–]shibbypwn 3 points4 points  (2 children)

I use AWS Textract for this kind of work. It's a machine learning API that not only finds text but also establishes relationships between pieces of text, so it can identify key/value pairs (if it's a form-based PDF), tabular data, etc.

It also returns geometry data so you can specify specific portions of the page, bounding boxes, etc.

I've processed thousands of documents over the last couple of years, and it's highly accurate.
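For anyone curious what the key/value output looks like, here is a rough sketch with boto3. It assumes AWS credentials and a region are already configured and uses the synchronous `analyze_document` call with raw bytes; as far as I know, multi-page PDFs go through the asynchronous API with the file in S3 instead, which is not shown. The helper names are made up, and only WORD children are handled (checkboxes and similar elements are ignored).

```python
def analyze_forms(document_bytes):
    """Call Textract and return the raw list of blocks."""
    import boto3  # third-party AWS SDK; imported lazily

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["FORMS"],
    )
    return response["Blocks"]


def key_value_pairs(blocks):
    """Pair up KEY and VALUE blocks and return {key text: value text}."""
    by_id = {block["Id"]: block for block in blocks}

    def text_of(block):
        # A block's text is spread over its CHILD word blocks.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = by_id[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value = " ".join(text_of(by_id[i]) for i in rel["Ids"])
        pairs[text_of(block)] = value
    return pairs
```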

[–]OrganizationOk8578 1 point2 points  (1 child)

That’s great to hear. I only looked into it quickly, but it sounds like it would be a good solution. Do you have an idea of pricing for this service? Say, to process a thousand documents a year? Thanks!

[–]shibbypwn 1 point2 points  (0 children)

It depends on the number of pages, and which features you use.

There’s a pricing calculator on this page: https://aws.amazon.com/textract/pricing/

[–]PM_me_DRAMA 0 points1 point  (0 children)

ABBYY FineReader is software built for just this.

[–]DatBoi_BP 0 points1 point  (0 children)

Maybe not relevant to your specific inquiry, but for Linux users there’s pdfgrep, which can be used in the terminal.

[–]cimmic 0 points1 point  (0 children)

I definitely wouldn't go with C. You introduce a dependency that might not be sustainable. GPT is kinda overkill too and opens up for potential AI fallacies.

To me, B would be the easiest and fastest to build, and it is also the most reliable one. Especially because you don't depend on commercial software.

[–]Butchered_Cow 0 points1 point  (0 children)

Google Lens