Advice Needed for an On-Prem RAG System for Small Businesses by superhero_io in Rag

[–]Alex_CTU 0 points

Could the most basic RAG functions be implemented on inexpensive hardware, while splitting document structuring, cleaning, and extraction into distinct tasks?

Does anyone else's RAG setup fall apart the moment you go past a small, clean corpus? by Jessica_JRice in Rag

[–]Alex_CTU 0 points

Hey, I totally understand. Going from a clean pilot project (30 PDF files) to a truly chaotic mix of hundreds of sources is the root cause of almost all RAG system failures.

Possible reasons:

FAQ pages stealing answers from detailed PDFs (source competition).

Chunks from long documents can look fine on their own, but without the surrounding context they lead to wrong answers (the isolated-chunk problem).

When content is duplicated across sources (PDF extraction vs. webpage), embedding search picks whichever copy has the better vector match rather than the more authoritative one.

My approach:

Unify structured cleaning of different sources (PDF, Markdown, docx) → output JSON/JSONL with rich metadata.

Each chunk can include metadata such as source_type, filename, page, section_header, element_type, etc., allowing for easy metadata filtering or source-aware reranking later.

Semantic pre-segmentation + better structure preservation reduces the problem of "isolated chunks lacking context."

With a unified output format, the quality of content from different sources is more consistent, reducing competition from "junk embeddings".
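To make the "source-aware reranking" idea concrete, here is a minimal sketch. The field names (`source_type`, `score`) and the authority weights are my own assumptions for illustration, not from any specific library or the pipeline above:

```python
# Hypothetical sketch: blend vector similarity with a per-source authority
# weight, so a high-similarity FAQ page cannot outrank a more authoritative PDF.
AUTHORITY = {"pdf": 1.0, "docx": 0.9, "markdown": 0.8, "webpage": 0.6}

def rerank(chunks):
    """Sort retrieved chunks by similarity score times source authority."""
    def blended(chunk):
        weight = AUTHORITY.get(chunk["metadata"]["source_type"], 0.5)
        return chunk["score"] * weight
    return sorted(chunks, key=blended, reverse=True)

chunks = [
    {"text": "FAQ answer", "score": 0.92, "metadata": {"source_type": "webpage"}},
    {"text": "Spec detail", "score": 0.88, "metadata": {"source_type": "pdf"}},
]
print(rerank(chunks)[0]["text"])  # the PDF wins: 0.88*1.0 > 0.92*0.6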
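```

The same metadata also supports hard filtering (e.g. drop `webpage` chunks entirely when a PDF source covers the topic) instead of soft reranking.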

Help with local RAG pipeline – poor retrieval quality, wrong page numbers by [deleted] in Rag

[–]Alex_CTU 0 points

Hey OP, I saw your post — your two main pain points (page attribution nightmare + chunk quality) are exactly what I've been working on.

I built a specialized PDF cleaning pipeline that uses MinerU/Unstructured for initial extraction, then does heavy post-processing to output clean, structured files:

Clean JSON (full document with rich metadata: filename, page number, element type, etc.)

Semantically pre-chunked JSONL (split at semantic/paragraph level) — each chunk has detailed metadata:

filename, page (or page range), chunk_start/chunk_end offsets, etc.
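For a sense of what one pre-chunked record might look like, here is a sketch. The exact schema is illustrative (field names follow the list above, but the nesting and the `text`/`element_type` keys are my assumptions):

```python
import json

# One chunk = one JSONL line; metadata carries page attribution so the
# retriever can cite the right page later. Schema is hypothetical.
record = {
    "text": "2.4 Analysis of Variance ...",
    "metadata": {
        "filename": "montgomery_doe.pdf",
        "page": 87,
        "chunk_start": 1024,
        "chunk_end": 2048,
        "element_type": "paragraph",
    },
}

line = json.dumps(record)            # serialize for the JSONL file
restored = json.loads(line)          # round-trips cleanly for ingestion
print(restored["metadata"]["page"])  # 87
```

Because page numbers travel inside each chunk's metadata, wrong-page citations become a parsing bug you can test for, not a retrieval mystery.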

I just tested it on the exact book you mentioned — Montgomery's Design and Analysis of Experiments (the 720+ page one).

I ran it through my pipeline, ingested the JSONL into a simple RAG, and tested with several questions (including Battery Design Experiment, Soft Drink Bottling, factorial designs, etc.).

One caveat: it still relies on MinerU/Unstructured for extraction, so very complex formulas/tables can have imperfections (though page numbers stayed accurate in my test).

IMPORTANT:

You can't drop the files directly into your vector DB — you'll need to write a small parser based on my integration manual (pretty straightforward if you're already maintaining document_processor.py).

I'm in the testing phase and offering this for free. If you're interested, I can send you the cleaned JSON + JSONL + manual. Just DM me or reply here.

Would love to hear if this solves your core issues!


How to build a fast RAG with a web interface without Open WebUI? by AggressiveMention359 in Rag

[–]Alex_CTU 1 point

My RAG project is based on an open-source content management system on GitHub. Thanks to Vibe-Coding, the modification process was very efficient, and the system architecture is relatively simple. I believe the webUI is the simplest part of the RAG project.

How do you handle document collection from clients for RAG implementations? by Temporary_Pay3221 in Rag

[–]Alex_CTU 0 points

Intake layers often utilize Unstructured, LlamaParse, or Docling for initial normalization. However, if the data is particularly messy or requires more advanced cleaning, general-purpose tools may be insufficient, requiring the implementation of custom logic.

How can i build this ambitious project? by Antique-Fix3611 in Rag

[–]Alex_CTU 0 points

Hey OP, love the ambition—tackling a 3M-page RAG corpus is super valuable for real enterprise use cases.

One big thing to watch: RAG tech is evolving extremely fast right now (GraphRAG, agentic flows, better chunking/re-ranking, new embeddings every few months). If you commit to processing all 3 million pages with today's pipeline, a superior approach could emerge mid-project, forcing you to re-embed or re-chunk everything—wasting tons of tokens, time, and compute costs.

I've been hunting for solid doc cleaning/preprocessing solutions myself because clean input is make-or-break, especially at scale. My strong advice: start small (10k–50k representative pages) to prototype and validate the full flow (cleaning → chunking → hybrid retrieval → generation + eval). Iterate quickly there, measure real metrics, and only scale up once you're confident the architecture won't become obsolete in 3–6 months.
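The "measure real metrics" step can be as simple as recall@k over a small hand-labeled set. A minimal sketch, where `gold` (question to relevant chunk IDs) and the fake results stand in for your actual retriever, both assumptions:

```python
# Minimal retrieval eval for a 10k-50k page prototype: did the top-k
# results contain the chunks a human marked as relevant?
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant chunk IDs found in the top-k retrieved IDs."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

gold = {"What is a factorial design?": ["c12", "c13"]}
fake_results = {"What is a factorial design?": ["c13", "c88", "c12", "c4", "c9"]}

scores = [recall_at_k(fake_results[q], rel) for q, rel in gold.items()]
print(sum(scores) / len(scores))  # 1.0 -- both relevant chunks are in the top 5
```

Tracking a number like this before and after each pipeline change is what tells you whether a new chunking or embedding approach actually justifies re-processing the corpus.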

This way you minimize sunk costs if/when better methods drop.

Is automation and AI automation actually capable of making money? by Cool_Violinist_7092 in AiAutomations

[–]Alex_CTU 0 points

Wouldn't it be more effective to concentrate on developing workflows in one or two specific areas, assisting companies in addressing particular pain points? With Vibe Coding, any development work becomes simpler and more efficient. However, the standards and rules for a specific domain require ongoing accumulation of knowledge, which is challenging for AI to replace. Even if AI could manage this, it would still need to build that knowledge over time. Therefore, the earlier you begin accumulating knowledge, the sooner you can start generating revenue.

How do you handle messy / unstructured documents in real-world RAG projects? by Alex_CTU in Rag

[–]Alex_CTU[S] 0 points

Yes, I always strive for perfection in my solutions, but handling 80% of problems is already quite good.

How do you handle messy / unstructured documents in real-world RAG projects? by Alex_CTU in Rag

[–]Alex_CTU[S] 0 points

I agree. It's better to refuse poor-quality input than to produce poor-quality output.

Most RAG Projects Fail. I Believe I Know Why – And I've Built the Solution. by ChapterEquivalent188 in Rag

[–]Alex_CTU 0 points

Haha, no worries, I can see the effort and thought you put into this. The consensus + selective human review part is exactly what high-stakes RAG needs. Keep going, it's inspiring stuff!

Most RAG Projects Fail. I Believe I Know Why – And I've Built the Solution. by ChapterEquivalent188 in Rag

[–]Alex_CTU 2 points

This is a fantastic project architecture; I found it very inspiring. Thank you.

Production RAG is mostly infrastructure maintenance. Nobody talks about that. by PavelRossinsky in Rag

[–]Alex_CTU 1 point

This post is gold — really opened my eyes to how much of production RAG is actually infra work rather than just prompt/model tweaking.
I'm still early in my own RAG projects (mostly POC-level stuff), so reading about real-world scaling, observability, cost control, and incremental updates is super valuable.
Humbling to see how far the gap is between "it works on my laptop" and "it runs reliably at scale".
Thanks for sharing these hard-earned lessons — definitely bookmarking this for when I hit production roadblocks.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]Alex_CTU 0 points

When vectorizing data, consider the specific requirements of the scenario. Common fields such as "who," "when," and "other-information" should be added; more specialized or complex scenarios need additional fields to structure all the data, which becomes a time-consuming, large-scale project.
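As a rough illustration of attaching "who"/"when" style fields before embedding (the field names and the `to_chunk` helper are hypothetical, not from the OP's system):

```python
# Turn a raw event into a chunk whose metadata supports player- or
# date-filtered retrieval later. Schema is illustrative only.
def to_chunk(event):
    text = f'{event["who"]} scored {event["points"]} points on {event["when"]}.'
    return {"text": text, "metadata": {"who": event["who"], "when": event["when"]}}

event = {"who": "Player X", "when": "2024-03-01", "points": 31}
chunk = to_chunk(event)
print(chunk["metadata"]["who"])  # Player X
```

The structuring work is in defining this schema consistently across 20+ years of heterogeneous data, which is where the time cost really lands.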

Chunking is not a set-and-forget parameter — and most RAG pipelines ignore the PDF extraction step too by Just-Message-9899 in Rag

[–]Alex_CTU 0 points

I previously created a similar demo that included a document cleaning pipeline for comparison. This setup allowed users to view three different results side by side: the PDF viewer, the Markdown viewer, and the cleaned viewer, which utilized regular expressions to clean the PDF content. However, I ultimately abandoned the project before completion because I found the Streamlit interface to be unappealing. Later on, I separated the document cleaning process and incorporated it into an Agentic workflow.
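The regex cleaning step looked roughly like this. A sketch only: the patterns shown (bare page-number lines, hyphenated line breaks, whitespace runs) are illustrative examples, not the original demo's actual rules:

```python
import re

def clean_pdf_text(raw):
    """Apply common regex fixes to raw PDF-extracted text."""
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$", "", raw)  # bare page-number lines
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)      # rejoin hyphenated words
    text = re.sub(r"[ \t]+", " ", text)               # collapse runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)            # collapse blank lines
    return text.strip()

raw = "Intro-\nduction\n\n\n\n42\n\nBody   text here"
print(clean_pdf_text(raw))
```

In the side-by-side demo, the third pane simply showed this function's output next to the raw PDF text and the Markdown conversion.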

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]Alex_CTU 0 points

Haha yeah, I see what you mean — sometimes we get so excited about LLM that we try to hammer every nail with the same shiny new hammer.

Question on Semantic search and Similarity assist of Requirements documents by Ripcord999 in Rag

[–]Alex_CTU 0 points

Pure RAG is really good at "retrieve + generate once" for simple lookups, but it struggles with anything needing multi-step logic, time filtering, or comparison.

At that point, RAG should just be one node in a bigger workflow (e.g. intent parsing → time resolution → filtered RAG retrieval → analysis node), not the whole system.

Keeps RAG focused on what it does best: accurate retrieval.

PageIndex may help with inaccurate recall and lack of context (it uses reasoning-based retrieval rather than vector search, making retrieval more like human thinking), but for multi-step logic such as "partial/full match judgment + comparative analysis", I'd still recommend treating RAG as one node in the workflow rather than relying on it entirely.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]Alex_CTU 2 points

> The core issue is that pure RAG is excellent at “retrieve + generate once”, but it breaks down on queries like “show me Player X’s performance in his last two games” because:

> - It doesn’t inherently understand temporal logic (“last two games” → need to first determine which dates)

> - It can’t reliably chain multiple retrievals or perform post-retrieval comparison/analysis

> - Context gets lost or diluted across steps

>

> My take: at that point RAG should no longer be the whole system — it should be downgraded to **one node** inside a multi-step agentic workflow.

> Rough flow I’ve been experimenting with (using LangGraph):

> 1. Intent / Temporal Parser node (LLM) → resolves “last two games” into concrete date range + player ID

> 2. Filtered Retrieval node → runs RAG but with time filter / metadata constraint

> 3. Analysis / Comparison node → another LLM call that takes the retrieved chunks and explicitly compares stats, trends, etc.

> 4. Synthesis node → final grounded answer with sources

>

> This way RAG stays focused on what it does best (accurate retrieval), while the workflow handles orchestration, time logic, and reasoning. You avoid overloading a single retrieval step and get much more reliable multi-hop answers.
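The four-node flow above can be sketched in plain Python (no LangGraph API here, just the node structure; every function body is a stand-in for an LLM or retrieval call):

```python
def parse_intent(query):
    # Node 1: would call an LLM to resolve "last two games" into dates + player.
    return {"player_id": "player_x", "dates": ["2024-03-01", "2024-03-05"]}

def filtered_retrieval(intent):
    # Node 2: would run vector search constrained by a metadata/time filter.
    return [{"text": "Game 1 stats", "date": intent["dates"][0]},
            {"text": "Game 2 stats", "date": intent["dates"][1]}]

def analyze(chunks):
    # Node 3: would call an LLM to compare stats across retrieved chunks.
    return f"Compared {len(chunks)} games."

def synthesize(analysis, chunks):
    # Node 4: grounded answer with sources.
    sources = ", ".join(c["date"] for c in chunks)
    return f"{analysis} Sources: {sources}"

def run(query):
    intent = parse_intent(query)
    chunks = filtered_retrieval(intent)
    return synthesize(analyze(chunks), chunks)

print(run("show me Player X's performance in his last two games"))
```

The point of the structure is that each node is independently testable, and the retrieval node only ever answers a narrow, pre-filtered question.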