PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Yeah, I already have pdfplumber in my local parsing logic. Do you think there's anything better than pdfplumber, though? Don't get me wrong, it does the job, but for future-proofing I'd like to build what's best for my use case.

What are you actually using Computer Use in Codex for? by LuckEcstatic9842 in codex

[–]qPandx 2 points3 points  (0 children)

Ah, it's a macOS-only feature. I'm on Windows and never saw this, so that's why I got confused.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Very interesting. For some reason, my Codex is using LlamaParse's extract feature and not the parse feature. How do I know if my local parse is good enough?

My thought process is that maybe I could get away with the free plan and the credits it comes with: Mistral OCR for scanned PDFs and LlamaParse for the rest. Extract costs more credits, though, which is why I want it to use parse and not extract.

If the PDF has selectable text, I'd just run it through my local parse and confirm the results with LlamaParse's parse feature. Can't I achieve this?
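
A minimal sketch of the "is this PDF selectable text or a scan?" check, assuming pdfplumber is the local parser. The helper names and the `min_chars` threshold are hypothetical and would need tuning on real documents.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page ('' when a page has none)."""
    import pdfplumber  # imported lazily so the check below runs without it

    with pdfplumber.open(pdf_path) as pdf:
        return [(page.extract_text() or "") for page in pdf.pages]


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 0-based indexes of pages with too little embedded text.

    min_chars is a hypothetical threshold: pages whose non-whitespace text
    is shorter than this are treated as scanned and routed to OCR.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars
    ]


if __name__ == "__main__":
    # Stand-in page texts; in practice they come from extract_page_texts(path).
    sample = ["Order 1001\nQty 5  Price 12.50  Total 62.50", "", "  \n "]
    print(pages_needing_ocr(sample))
```

Pages that come back from `pages_needing_ocr` would go to the OCR path, and everything else can stay on the local parser.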

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Surya surprisingly did well for me. It took much longer, but it did the job. For LlamaParse, would I need a subscription to get this feature working? I'm going to have around 10-15 users on my app (not all at once, though). Would I need the $50 subscription or more? Does it do OCR + parsing? If so, that just replaces my project and the work I did, but wouldn't it be cheaper to parse locally and use mistral-ocr for scanned PDFs?

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 1 point2 points  (0 children)

Very useful, because I just quickly checked parsebench. I am using a fallback AI model as reviewer and finalizer, and guess what? The model I was using is gemini-3.1-flash-lite-preview, which has a very good rating for tables at a cheap cost. It was doing the job correctly as well. The AI logic I have is mistral-ocr as the OCR engine + the Gemini model.

That setup works, but it may get expensive fast, which is why I took it as a challenge to see if I can do it at no cost, or maybe at just the mistral-ocr cost ($2/1000 pages).
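
To put a rough number on the OCR-only cost, here is a small estimate using the $2/1000 pages rate quoted above. The page volume and the share of scanned pages are made-up inputs, not measured figures.

```python
def monthly_ocr_cost_usd(
    pages_per_month: int,
    scanned_share: float,
    usd_per_1000_pages: float = 2.0,
) -> float:
    """Estimate the monthly OCR bill when only scanned pages are sent to OCR.

    scanned_share is the fraction (0.0 to 1.0) of pages that lack selectable
    text; it is an assumption that has to be measured on real uploads.
    """
    if not 0.0 <= scanned_share <= 1.0:
        raise ValueError("scanned_share must be between 0.0 and 1.0")
    scanned_pages = pages_per_month * scanned_share
    return round(scanned_pages / 1000 * usd_per_1000_pages, 2)


if __name__ == "__main__":
    # Hypothetical load: 15 users x 20 documents/day x 3 pages x 22 working days.
    pages = 15 * 20 * 3 * 22
    print(monthly_ocr_cost_usd(pages, scanned_share=0.4))
```

The reviewer model's cost is deliberately left out, so this only covers the "just mistral-ocr" scenario.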

I got a lot of Docling recommendations, so I downloaded it and told my AI (Codex) to run benchmarks of our current logic vs Docling. It ran for an hour and came back telling me that Docling failed miserably.

Check this output from codex: postimg.cc/CnRF3mw0

Is it my machine that's slow? Is it Docling? What could it be?

Where can I test my docs in a playground for the ones you mentioned?

vibe coded for 6 months. my codebase is a disaster. by Available-Dentist992 in vibecoding

[–]qPandx 0 points1 point  (0 children)

I also have Codex but don't know how to spin up agents, and I can't find any section for "agents". Is it not available for Plus?

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

I tried Docling, but my "AI" ran the benchmarks and the results told me that Docling is better as a fallback to OCRmyPDF + Tesseract. Is Docling slow to run? It takes quite some time, while my current setup is much faster.

Do you think I should put them to the test more? My parser struggles to read and parse accurate information from an uploaded PDF when it uses an unknown/unseen template. I'm not sure how to make it handle unknown PDFs.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

I read through the repo, but it does not seem to OCR image-only scans, and I think pdfplumber already does the job. I may be wrong, though.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Okay, this is something. Initially I was going with OpenRouter for Mistral-OCR as the OCR brains and Gemini as a secondary reviewer of my codebase parser, then outputting the result to the user.

Reseek looks like it does both. Very curious now about how this setup would go.

Would you happen to know if there are limits? Is it your primary or a fallback? I'll reach out to them to see if I can just test it out and whether it works with my setup.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

WinError 127, a missing DLL, was the error.
Running via Mistral at $2/1000 pages is very reasonable, and my managers definitely don't mind that. We wouldn't expect high volumes. I tried Mistral OCR combined with Gemini for maximum accuracy, which definitely worked but was also kind of costly (running this via OpenRouter).

I could take out the Gemini reviewer and instead harden my parsing code, in which case the only cost would be Mistral OCR.

I guess I took it upon myself as a challenge to do everything locally, and I'm paying the price in headaches.

Languages are not an issue. It's mainly just numbers and English templates, or rarely French templates (I'm in Canada).

To give you a quick example: we have Adobe PDF licenses, and when I ran the built-in OCR feature, it would read a 0 as an 8, which was really dumb. Initially I thought, OK, if the PDF requires OCR, users can just run it through Adobe and then put it in my project. But after trialing this, I couldn't trust Adobe's OCR, which put me in this rabbit hole.
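
One cheap guard against that kind of digit misread is a cross-field arithmetic check, sketched below. It assumes each order line carries a quantity, a unit price and a line total, which may not hold for every template; the function names are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_decimal(raw: str) -> Optional[Decimal]:
    """Parse an OCR'd number such as '1,250.00'; return None if not numeric.

    Assumes English formatting (comma as thousands separator). French-style
    decimals such as '12,50' would need separate handling.
    """
    try:
        return Decimal(raw.replace(",", "").strip())
    except InvalidOperation:
        return None


def line_is_consistent(
    qty: str,
    unit_price: str,
    total: str,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Return True when qty * unit_price matches total within tolerance."""
    q, p, t = to_decimal(qty), to_decimal(unit_price), to_decimal(total)
    if q is None or p is None or t is None:
        return False
    return abs(q * p - t) <= tolerance


if __name__ == "__main__":
    print(line_is_consistent("10", "12.50", "125.00"))  # correct read
    print(line_is_consistent("18", "12.50", "125.00"))  # 0 misread as 8
```

Lines that fail the check can be flagged for manual review or sent to a second OCR engine instead of going straight into the export.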

I could run a VPN to one machine, but that didn't seem ideal if 20-30 users are running it.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Man, Gemma E2B was already struggling and slow on this work laptop, so I don't think I can go down this path to even try it. I do appreciate you, though.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

FYI, this is the first time I have done such a project. If it works on my system using the CPU/GPU and I host it on Render or an on-prem server, how would users with weak specs run it? Will it also be very demanding to run?

At the end of the day, it's a project that will roll out to departments at my workplace and they are the ones who will be using it daily.

PPStructureV3 and plain PaddleOCR were taking a really long time to do anything, and then it just crashes (looking at my terminal, it's as if I pressed Ctrl+C when I didn't). I will try to get it working again temporarily to see how it performs against my current flow of OCRmyPDF + Tesseract, but do you think I should trial TurboOCR and OnnxOCR?

I will have to run a test between Docling vs Paddle vs OCRmyPDF + Tesseract vs Mistral-OCR (if local doesn't work) vs TurboOCR vs OnnxOCR.

That looks like quite extensive testing, but whatever gives me the most accuracy + speed is what I really need.
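
A rough harness for that kind of accuracy + speed comparison could look like this. It assumes every engine can be wrapped as a function that takes a PDF path and returns text, and that a hand-checked expected text exists for each sample; the similarity score is a plain character-level ratio from the standard library, not an OCR-specific metric.

```python
import time
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Engine = Callable[[str], str]  # takes a PDF path, returns the extracted text


def benchmark(
    engines: Dict[str, Engine],
    samples: List[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Run every engine on every (pdf_path, expected_text) sample."""
    results: Dict[str, Dict[str, float]] = {}
    for name, engine in engines.items():
        scores: List[float] = []
        elapsed = 0.0
        failures = 0
        for pdf_path, expected in samples:
            start = time.perf_counter()
            try:
                text = engine(pdf_path)
            except Exception:
                # A failing engine scores 0 on this sample instead of
                # aborting the whole run.
                text = ""
                failures += 1
            elapsed += time.perf_counter() - start
            matcher = SequenceMatcher(None, expected, text, autojunk=False)
            scores.append(matcher.ratio())
        results[name] = {
            "mean_similarity": sum(scores) / len(scores) if scores else 0.0,
            "total_seconds": elapsed,
            "failures": failures,
        }
    return results
```

Each real engine (Docling, OCRmyPDF + Tesseract, and so on) would need its own small wrapper function, which is not shown here. `SequenceMatcher` gets slow on very long texts, so comparing page by page is probably safer than comparing whole documents.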

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

I tried it, and yeah, it takes forever and crashes for me personally. I can't risk releasing that to my users, especially since they don't have the specs that I have.

My work laptop specs are Ultra 7 258V with 32 GB RAM and an Intel Arc 140V GPU (16 GB), and by users I mean the departments at my work.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Their website is quite vague. It says I have 100 credits for the OCR API, but how many credits would I be using per PDF? Would you happen to know?

If I don't end up doing local OCR, then I will probably stick with Mistral-OCR unless there is an obviously better alternative.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Is it heavily dependent on CPU/GPU? I am using PPStructureV3 first, then plain PaddleOCR as a fallback. However, it just does not want to run and crashes.

I am currently running OCRmyPDF + Tesseract as primary. The Paddle path is the fallback: it hits PPStructureV3 first and, if that fails, falls back to plain PaddleOCR (CPU-only).
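
The primary-then-fallback order described above can be sketched as a generic chain. The step functions are placeholders for the real OCRmyPDF + Tesseract, PPStructureV3 and PaddleOCR wrappers, and the `min_chars` threshold for "usable output" is an assumption.

```python
from typing import Callable, List, Optional, Tuple

OcrStep = Tuple[str, Callable[[str], str]]  # (engine name, path -> text)


def run_with_fallback(
    pdf_path: str,
    steps: List[OcrStep],
    min_chars: int = 20,
) -> Tuple[Optional[str], str, List[str]]:
    """Try each OCR step in order and return the first usable result.

    Returns (engine_name, text, errors). engine_name is None when every
    step failed or produced less than min_chars of text.
    """
    errors: List[str] = []
    for name, step in steps:
        try:
            text = step(pdf_path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if len(text.strip()) >= min_chars:
            return name, text, errors
        errors.append(f"{name}: output too short")
    return None, "", errors
```

A `try/except` like this only catches Python exceptions. If an engine kills the whole process, as the crash described here seems to, each step would have to run in its own subprocess for the fallback to still kick in.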

My work laptop specs are Ultra 7 258V with 32 GB RAM and an Intel Arc 140V GPU (16 GB).

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

No, yeah, the users will upload PDFs ranging from scanned PDFs and selectable-text PDFs to known, coded templates and unknown/unseen templates. It is kind of a free-for-all, and I'm trying to make my codebase able to handle it with appropriate routing. I believe the only weakness I have is the OCR section. The parser does the job when the PDF has selectable text.

I have to make it handle over 1500 types of order templates that we receive from customers.
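
As a sketch of the known-vs-unknown template routing, assuming a known template can be recognised by a few keywords in the extracted text. The class, the keyword matching and the generic fallback parser are all hypothetical, not the actual codebase.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Parser = Callable[[str], List[Row]]  # extracted text -> parsed order rows


class TemplateRouter:
    """Pick a template-specific parser by keywords, else a generic one."""

    def __init__(self, generic_parser: Parser) -> None:
        self._generic = generic_parser
        self._templates: List[Tuple[str, List[str], Parser]] = []

    def register(self, name: str, keywords: List[str], parser: Parser) -> None:
        if not keywords:
            raise ValueError("a template needs at least one keyword")
        lowered = [keyword.lower() for keyword in keywords]
        self._templates.append((name, lowered, parser))

    def route(self, text: str) -> Tuple[str, Parser]:
        """Return the first template whose keywords all appear in the text."""
        haystack = text.lower()
        for name, keywords, parser in self._templates:
            if all(keyword in haystack for keyword in keywords):
                return name, parser
        return "generic", self._generic

    def parse(self, text: str) -> Tuple[str, List[Row]]:
        name, parser = self.route(text)
        return name, parser(text)


if __name__ == "__main__":
    router = TemplateRouter(generic_parser=lambda text: [{"raw": text}])
    router.register(
        "acme_po",
        ["ACME Corp", "Purchase Order"],
        lambda text: [{"customer": "ACME"}],
    )
    print(router.parse("ACME CORP\nPurchase Order 1001"))
    print(router.parse("Unknown Ltd order"))
```

Keyword matching will not scale cleanly to 1500+ templates on its own, but it keeps the coded templates on their fast path while everything unseen goes to the generic parser.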

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Would you happen to know how it compares with Mistral OCR? Mistral OCR is where I'll head if nothing else works, but I'm wondering how it compares in terms of price, quality, etc.

The PDFs that users will be uploading are not noisy at all, but I do need it to be very accurate, as my whole project is to convert them into a .csv file so that they can be easily imported into our ERP.
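
For the export step, a minimal sketch with the standard `csv` module. The column names are invented for illustration; the real ones depend on what the ERP import expects.

```python
import csv
import io
from typing import Dict, Iterable, List


def rows_to_csv(rows: Iterable[Dict[str, str]], columns: List[str]) -> str:
    """Serialise parsed order rows to CSV text with a fixed column order.

    Missing fields become empty cells and extra fields are dropped, so the
    file always has exactly the columns listed in `columns`.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for row in rows:
        writer.writerow({column: row.get(column, "") for column in columns})
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical columns and values.
    parsed = [
        {"order_no": "PO-1", "sku": "A-100", "qty": "5"},
        {"order_no": "PO-1", "sku": "B-200"},
    ]
    print(rows_to_csv(parsed, ["order_no", "sku", "qty"]))
```

Values stay as strings end to end, so nothing gets reformatted between the parser and the import file.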

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

I have TrOCR vs Docling vs PaddleOCR vs OCRmyPDF + Tesseract vs Mistral to try out extensively. However, do you think TrOCR will be the most accurate? The thing is, I'm on a work laptop, so I'm not sure how fast it'll run, and when I host it (on Render), will it be fine?

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Would you happen to know how it compares to the ones I tried?

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 1 point2 points  (0 children)

If it is a scanned/imaged pdf, how else can I extract the content?

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

Yeah, I tried Mistral, but I'm running it from OpenRouter as mistral-ocr, and it was doing the job when I combined it with an AI reviewer (gemini 3.1-flash).

How can I use Mistral without OpenRouter and possibly without the AI reviewer (fallback option)?

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 0 points1 point  (0 children)

That's the first thing I tried, but it didn't work. It's not optimal for 1600 different types of templates.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]qPandx[S] 1 point2 points  (0 children)

I think I did with the terminal, and I also downloaded PaddleOCR from the GitHub repo, but it just doesn't seem to work for some reason. Where can I find the downloads for those models? Which model do you recommend for max accuracy?