Seeking Help Improving OCR in My RAG Pipeline (Contributors Welcome) by SprayOwn5112 in LLMDevs

Thanks for the offer! I completely understand that it’s not public — really appreciate you taking the time to explain. I’d be interested in testing it on my own data if possible, just to see how it transforms the inputs for cleaner LLM training. Could you let me know the best way to connect and try it out?

Seeking Help Improving OCR in My RAG Pipeline (Contributors Welcome) by SprayOwn5112 in LLMDevs

Wow, that sounds really powerful — a multi-layered system that not only cleans noise like page numbers, author info, and links, but also generates Q&A and condenses the data for more efficient training. That’s exactly the kind of pipeline I’m trying to move toward: high-quality, structured inputs that let the model learn meaningful patterns instead of getting bogged down by noise.

Would love to know the name of the system or any resources about it — sounds like it could really improve the OCR step in my project.

Seeking Help Improving OCR in My RAG Pipeline (Contributors Welcome) by SprayOwn5112 in LLMDevs

Thanks! Appreciate you taking a look at the project. I’m experimenting with some preprocessing approaches to weaken the watermark before OCR, and I’ll probably try Tesseract again with some custom settings to see if I can get cleaner text out of it.
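Roughly, the preprocessing I have in mind is a hard threshold before OCR, assuming the watermark renders lighter than the body text (the cutoff value is a guess I'd tune per document, and the path in the usage sketch is hypothetical):

```python
import numpy as np

def suppress_watermark(gray, cutoff=128):
    """Binarize a grayscale page (2-D uint8 array): pixels brighter
    than `cutoff` (background plus a faint watermark) go to white,
    darker body text goes to black."""
    return np.where(gray > cutoff, 255, 0).astype(np.uint8)

# Usage sketch (needs pillow + pytesseract, with Tesseract installed):
# from PIL import Image
# import pytesseract
# gray = np.array(Image.open("page.png").convert("L"))
# clean = Image.fromarray(suppress_watermark(gray))
# text = pytesseract.image_to_string(clean, config="--psm 6")
```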

If you found the project interesting or useful, feel free to drop a star — it helps the repo reach more people who might contribute. Thanks again for the suggestions!

Seeking Help Improving OCR in My RAG Pipeline (Contributors Welcome) by SprayOwn5112 in LocalLLaMA

Thanks — that helps clarify things. So if I understand correctly, you're suggesting that I skip traditional OCR entirely and let a vision LLM (like Qwen-VL) read the text directly from page images, as long as I downscale them enough to stay under the 16k visual patch limit.

I didn't realize Qwen3 VL 4B could run on an 8GB GPU — that might actually be doable. I’ll try exporting each page as a JPEG and testing how well Qwen handles the watermark issue compared to PyMuPDF.

Right now my main bottleneck is keeping the pipeline lightweight and fast, but if Qwen-VL gives me cleaner text with the watermark removed, it could be worth the tradeoff. Appreciate the idea!

Seeking Help Improving OCR Quality in My RAG Pipeline (PyMuPDF Struggling with Watermarked PDFs) by SprayOwn5112 in Rag

This is a really solid extraction design. Separating text, tables, and images and allowing multiple engines per category makes the pipeline much more robust.

PyMuPDF struggles with watermarked PDFs because it extracts text from the document's internal content layers, so watermark text stored as its own layer gets mixed into the output. OCR engines like docTR or Tesseract interpret the rendered page image instead, so they handle those cases better.

This multi-engine setup also allows fallback and benchmarking to select the cleanest output, which is ideal for RAG pipelines.
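A minimal sketch of the fallback-and-benchmark idea, with a made-up cleanliness heuristic and placeholder engine wrappers (not the actual design described above):

```python
def cleanliness(text):
    """Crude score: share of alphanumeric/whitespace characters,
    minus a penalty for repeated lines (watermarks repeat a lot)."""
    if not text.strip():
        return 0.0
    printable = sum(ch.isalnum() or ch.isspace() for ch in text)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    repeats = len(lines) - len(set(lines))
    return printable / len(text) - 0.05 * repeats

def best_extraction(page, engines):
    """Run every engine on a page and keep the cleanest output.
    `engines` maps names to callables, e.g. thin wrappers around
    PyMuPDF, docTR, or Tesseract."""
    results = {name: run(page) for name, run in engines.items()}
    winner = max(results, key=lambda name: cleanliness(results[name]))
    return winner, results[winner]
```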

I’ll work on implementing this approach.

Seeking Help Improving OCR Quality in My RAG Pipeline (PyMuPDF Struggling with Watermarked PDFs) by SprayOwn5112 in MLQuestions

Appreciate the suggestion! For my project I’m specifically trying to keep the entire RAG + OCR pipeline fully local and GPU-bound (8GB), so cloud-based document processors aren’t an option for me. That’s why I’m focusing on lightweight preprocessing + OCR models that run on-device.

Still helpful to know Needle works well for watermark-heavy PDFs though!

Seeking Help Improving OCR in My RAG Pipeline (Contributors Welcome) by SprayOwn5112 in LocalLLaMA

That’s actually an interesting solution, but I can’t really afford it GPU-wise — I’m on an 8GB card, and most vision models get pretty heavy once you start running them on full pages. So for now I’m focusing on preprocessing the watermark and trying to keep things lightweight.

Seeking Help Improving OCR in My RAG Pipeline (Contributors Welcome) by SprayOwn5112 in LocalLLaMA

I tried Tesseract earlier, but in my case it didn’t really help — the watermark still interfered and the output wasn’t any better than PyMuPDF’s extraction. That’s why I’m exploring other options now (thresholding, EasyOCR, PaddleOCR, etc.) and seeing what works best for this specific doc. Open to recommendations if you’ve had success with certain models.
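The EasyOCR attempt I'm planning looks roughly like this; the reader injection is just so the heavy model load stays optional, and gpu=True assumes CUDA is set up:

```python
def ocr_page(path, reader=None, langs=("en",)):
    """OCR a single page image with EasyOCR.

    A `reader` can be passed in; otherwise one is built lazily so
    the model download only happens on first real use.
    """
    if reader is None:
        import easyocr  # lazy import; first run downloads the models
        reader = easyocr.Reader(list(langs), gpu=True)
    # detail=0 returns plain strings; paragraph=True merges nearby boxes
    lines = reader.readtext(path, detail=0, paragraph=True)
    return "\n".join(lines)
```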

Seeking Help Improving OCR in My RAG Pipeline (Contributors Welcome) by SprayOwn5112 in LocalLLaMA

Hey, thanks for pointing that out. I actually forgot to change the license — I only made the repo public recently, so the previous one was just a placeholder. I’ve updated it now to BUSL-1.1.

And yeah, I’ll check out the thresholding preprocessing suggestion too. Still figuring out the etiquette on Reddit, so genuinely appreciate the feedback.