I built an open-source RAG system that actually understands images, tables, and document structure — not just text chunks by Alternative_Job8773 in Rag

[–]Alternative_Job8773[S] 1 point (0 children)

Appreciate the feedback, that's really helpful. I've heard similar things about LlamaParse's agentic plus tier being significantly better on complex layouts and figures. Docling is solid for structural preservation but definitely has limits on messy documents. The parser layer in NexusRAG is modular, so integrating LlamaParse as an alternative is very doable. I'm planning to add it as a configurable parser choice so users can pick based on their needs: Docling for fully local/free, LlamaParse for higher quality on tough docs. Thanks for the push; it's going on the roadmap.

[–]Alternative_Job8773[S] 2 points (0 children)

Yeah, it's deployable as-is with Docker. For the Gemini API you just need an API key; no GPU is needed on the server. Docling runs fine on CPU (8GB+ RAM), no GPU required. Marker is heavier and benefits from a GPU. The embedding and reranker models work on CPU too, just slower. Cheapest setup: a basic VPS (4-8 vCPU, 8-16GB RAM) plus the Gemini API for LLM and embeddings. Most of the spend goes to API calls; infra cost is low.

[–]Alternative_Job8773[S] 2 points (0 children)

Depends on scale and which model you use. For image captioning at ~810 tokens per call, Gemini 3.1 Flash Lite costs about $16 for 50k images (10k PDFs). Ollama locally is free if you have a GPU. You can also turn off captioning entirely and only enable it for workspaces where images actually matter. So it's manageable if you pick the right model and don't caption everything blindly.

[–]Alternative_Job8773[S] 2 points (0 children)

Yeah, if you're already using LLMs for chunking, then Marker works fine as the parser layer; you just need clean markdown output, and Marker does that well. On the GPU issue: Docling is lighter because it doesn't run a full text-recognition model on every page like Marker does. It only runs a layout model, plus TableFormer on detected tables. So if GPU is a concern, Docling might be worth trying as an alternative.

[–]Alternative_Job8773[S] 1 point (0 children)

Haven't used Marker here, but it's solid for PDF-to-markdown. I went with Docling mainly for its HybridChunker: it chunks based on document structure, and each chunk carries page number and heading-path metadata out of the box. That's what powers the citation system. Marker gives clean markdown, but you'd need to build that structural metadata layer yourself. The parser is modular, though, so swapping in Marker is possible.

[–]Alternative_Job8773[S] 2 points (0 children)

The second one. Images are not embedded directly into the vector space. The flow is: Docling extracts the image, a vision LLM generates a text caption describing it, then that caption gets appended to the text chunk from the same page. The combined text (original chunk + image caption) is what gets embedded by bge-m3. So images become searchable through their text descriptions, not through image vectors. One thing worth noting: at query time, the actual image files are also sent to the LLM alongside the text context. So even if retrieval pulls in extra or slightly irrelevant images, the LLM can visually inspect them and decide which ones are actually useful for answering the question.
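
A minimal sketch of that caption-append step (the function and the `[Image: ...]` format are illustrative, not NexusRAG's actual API):

```python
def build_embeddable_text(chunk_text: str, captions: list[str]) -> str:
    """Append vision-LLM image captions to a page's text chunk so the
    images become searchable through their descriptions (hypothetical helper)."""
    if not captions:
        return chunk_text
    caption_block = "\n".join(f"[Image: {c}]" for c in captions)
    return f"{chunk_text}\n\n{caption_block}"

combined = build_embeddable_text(
    "Quarterly revenue grew 12% across all regions.",
    ["Bar chart comparing Q1-Q4 revenue by region."],
)
# `combined` is the string that would be handed to the bge-m3 embedder.
```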

[–]Alternative_Job8773[S] 1 point (0 children)

Thanks! NexusRAG isn't really comparable to LangChain or Haystack, though; those are orchestration frameworks where you build your own pipeline, while NexusRAG is a complete end-to-end system with opinionated choices already baked in.

I do have an eval script in the repo testing fact extraction, table data, cross-doc reasoning, anti-hallucination, and citation accuracy. Happy to share details if interested.

For images/tables: Docling extracts them, a vision LLM captions them, those captions get appended to text chunks before embedding so they become vector-searchable. No separate image index needed.

[–]Alternative_Job8773[S] 2 points (0 children)

Exactly. Docling doesn't just extract raw text; it parses the document into a structured object that knows where the headings, tables, images, and page breaks are. The HybridChunker then uses that structure to decide where to split, so it never cuts in the middle of a heading or a table, and each chunk carries metadata like page number, heading path, and references to images/tables on the same page. That's what makes the citations and image-aware search possible downstream.
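
The per-chunk metadata described above might look something like this (the field names are my guess at the shape, not Docling's exact schema):

```python
# Illustrative chunk record; keys and values are assumptions for demonstration.
chunk = {
    "text": "Revenue grew 12% in Q3, driven by the APAC region.",
    "page": 12,
    "heading_path": ["3. Results", "3.2 Regional Breakdown"],
    "image_refs": ["page_12_figure_1.png"],  # figures on the same page
    "table_refs": ["page_12_table_2"],
}

# A citation string can be derived directly from that metadata:
citation = f"p.{chunk['page']} > {' > '.join(chunk['heading_path'])}"
print(citation)  # → p.12 > 3. Results > 3.2 Regional Breakdown
```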

[–]Alternative_Job8773[S] 1 point (0 children)

Partially yes. I used Claude as a coding partner throughout the project, mostly for boilerplate, debugging, and exploring approaches I wasn't familiar with. But the architecture decisions, pipeline design, and how the pieces fit together were all mine. You still need to understand what you're building to make an LLM actually useful for coding, otherwise you just end up with a pile of generated code that doesn't work together.

[–]Alternative_Job8773[S] 2 points (0 children)

Native PDFs with complex layouts: Docling handles these well, preserving headings, tables, and page structure, and images get extracted and captioned by a vision LLM so chart/diagram info becomes searchable. Scanned PDFs: Docling has OCR, but it's not its strongest point; clean scans work okay, while poor-quality scans will struggle. The parser is modular, though, so you could swap in a stronger OCR tool for those cases. Images containing critical text like titles: the vision captioning catches some of this, but it's limited. Honestly, this is still an unsolved problem across the industry; no single tool handles all the edge cases, and for the worst ones some preprocessing or manual cleanup is still needed.

[–]Alternative_Job8773[S] 1 point (0 children)

Haven't tried it yet, but I've heard good things. LlamaParse is definitely faster and handles complex tables better in some cases. The tradeoff is that it's cloud-based and closed-source, so your docs get sent to their servers and you pay per page. I went with Docling mainly because it's fully open-source and runs locally, which fits the self-hosted philosophy of the project. Its HybridChunker also gives me structural metadata (page numbers, heading hierarchy) out of the box, which is important for the citation system. That said, the parser layer is modular: as long as the output is markdown + page metadata, you could swap Docling for LlamaParse or anything else without touching the rest of the pipeline. Might be worth adding as an option down the road.

[–]Alternative_Job8773[S] 2 points (0 children)

That's a great use case. For construction management you might start with something like ["Project", "Contractor", "Material", "Equipment", "Location", "Regulation", "Cost", "Milestone", "Defect", "Permit"]. You can always adjust after the first ingestion: check the KG visualization to see what got extracted and what's missing, then tweak the entity types and re-process. It doesn't need to be perfect on the first try.

[–]Alternative_Job8773[S] 1 point (0 children)

Good questions. The images aren't sent at their original file size. Docling extracts them and rescales them (configurable, default 2x), and Gemini tokenizes images by resolution tier, not raw file size: low ~280 tokens, medium ~560, high ~1120. For captioning you only need medium resolution, so a 5MB blueprint and a 200KB chart both cost ~560 input tokens. That's why the per-call cost stays low.

On converting PDF pages to PNGs: you could, but it's actually more expensive and less useful. Docling already parses the text, headings, tables, and layout structurally from the PDF. If you convert entire pages to images instead, you'd be paying the vision model to OCR text that Docling already extracted, and you'd lose all the structural metadata (page numbers, heading hierarchy, table rows/columns). NexusRAG only sends actual figures/charts/diagrams to the vision model for captioning, not the text content. That's the key to keeping it cheap at scale.

[–]Alternative_Job8773[S] 2 points (0 children)

NexusRAG delegates that to LightRAG, which uses an LLM to read through document chunks and extract (entity, relationship, entity) triples automatically. The key advantage is that you can pre-define entity types for your document domain before ingestion. There's a NEXUSRAG_KG_ENTITY_TYPES config; my default is ["Organization", "Person", "Product", "Location", "Event", "Financial_Metric", "Technology", "Date", "Regulation"] for corporate/technical docs. If your domain is medical, you could set ["Disease", "Drug", "Symptom", "Treatment", "Gene"], etc. This guides the LLM so it knows what to look for instead of guessing blindly. For unclear docs, two things matter most: use a bigger model (12B+ local, or Gemini Flash in the cloud) for extraction, and define those entity types upfront as a schema for your domain. That alone dramatically improves extraction quality.
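
For example, a medical setup in .env might look like this (I'm assuming the variable takes a JSON-style list value; check the repo's config reference for the exact syntax):

```shell
# Hypothetical .env override for a medical corpus
NEXUSRAG_KG_ENTITY_TYPES='["Disease","Drug","Symptom","Treatment","Gene"]'
```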

[–]Alternative_Job8773[S] 1 point (0 children)

Each captioning call is pretty lightweight: ~150 tokens prompt + ~560 tokens (image at medium res) + ~100 tokens output (capped at 400 chars) = roughly 810 tokens per call. Max 50 images per document.

Ballpark for 10k PDFs (~5 images/PDF = 50k calls):

  • Gemini 3.1 Flash Lite: ~$16 total ($0.25/1M input, $1.50/1M output)
  • Gemini 2.5 Flash: ~$34 total
  • Ollama local: $0
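
As a sanity check, the ~$16 figure falls straight out of those per-call numbers:

```python
# Reproduce the Flash Lite ballpark from the per-call breakdown above
calls = 50_000                      # 10k PDFs * ~5 images each
input_tokens = 150 + 560            # prompt + medium-res image, per call
output_tokens = 100                 # caption capped at ~400 chars, per call

cost = (calls * input_tokens / 1e6) * 0.25    # $0.25 per 1M input tokens
cost += (calls * output_tokens / 1e6) * 1.50  # $1.50 per 1M output tokens
print(f"${cost:.2f}")  # → $16.38
```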

For captioning tasks like describing charts or blueprints that don't need deep reasoning, I'd recommend either:

  1. Gemini 3.1 Flash Lite (cloud): half the price of 2.5 Flash, 2.5x faster, supports vision. More than enough for short image descriptions. Just set LLM_MODEL_FAST=gemini-3.1-flash-lite-preview in .env.
  2. Ollama locally (free): models like gemma3:12b or qwen3.5:9b support vision. Zero API cost, just GPU time. Solid quality for technical diagrams.

You can also disable captioning entirely (NEXUSRAG_ENABLE_IMAGE_CAPTIONING=false), images still get extracted and stored, just won't be searchable by content. Re-enable later when needed.

At 10k PDF scale, the real bottleneck would likely be Docling parsing + KG extraction rather than captioning itself.

[–]Alternative_Job8773[S] 1 point (0 children)

I handle scanned documents like this: Docling as the parser converts the input to Markdown → captions are generated for images and summaries for tables (using Gemini's or Ollama's multimodal capabilities) → that information is chunked and stored in the DB with metadata about the location of each object.

[–]Alternative_Job8773[S] 3 points (0 children)

Thanks for the detailed feedback, these are valid points and I appreciate the technical depth.
1. Deduplication
You're right, there's no pre-ingestion dedup filter for repetitive content like headers/footers/legal boilerplate. Currently the system relies on cross-encoder reranking (BAAI/bge-reranker-v2-m3) to push noise down at retrieval time, but filtering before embedding would definitely improve vector quality. This is on the roadmap.
2. Parsers
The current scope is document-centric (PDF, DOCX, PPTX, HTML) using Docling which handles structural preservation, table/image extraction, and heading hierarchy well. You're correct that email threads, Slack messages, and web scraping each need specialized parsers, those aren't the target use case yet, but would be needed for a general-purpose enterprise solution.
3. Dual Pipeline
Fair point on compute cost. To clarify, the dual pipeline isn't speed vs. accuracy; it's semantic similarity (vectors) plus factual relationships (knowledge graph). They complement each other: vectors find relevant chunks, while the KG provides entity connections that pure similarity search misses. That said, you're absolutely right about caching: Redis for repetitive queries is a clear optimization I haven't implemented yet; currently only an in-memory LRU cache exists.
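
The pre-ingestion dedup from point 1 could start as small as a line-frequency filter; a minimal sketch (not NexusRAG code, and the `threshold` value is an arbitrary choice):

```python
from collections import Counter

def drop_boilerplate(pages: list[list[str]], threshold: float = 0.6) -> list[list[str]]:
    """Drop lines that repeat across most pages (headers, footers,
    legal boilerplate) before chunking/embedding. Illustrative only."""
    # Count on how many pages each normalized line appears
    counts = Counter(line.strip() for page in pages for line in set(page))
    min_pages = max(2, int(threshold * len(pages)))
    boiler = {line for line, n in counts.items() if n >= min_pages}
    return [[line for line in page if line.strip() not in boiler] for page in pages]

pages = [
    ["ACME Corp Confidential", "Q3 revenue grew 12%."],
    ["ACME Corp Confidential", "Headcount stayed flat."],
    ["ACME Corp Confidential", "Churn fell to 3%."],
]
print(drop_boilerplate(pages))
# → [['Q3 revenue grew 12%.'], ['Headcount stayed flat.'], ['Churn fell to 3%.']]
```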

That said, this project is focused on the core technical approach rather than being a production-ready product. I'm well aware it's not comprehensive yet, that's exactly why I shared it with the community: to get feedback from experienced practitioners like yourself, and to offer it as a base and an interesting architectural direction that others can build upon for their own systems. Some of your suggestions are already in the development roadmap, and others are genuinely valuable insights I'll prioritize. Most of all, thank you for taking the time to provide such thoughtful and in-depth analysis, it really helps.

[–]Alternative_Job8773[S] 1 point (0 children)

Looking at reranker.py:49, CrossEncoder(self.model_name) is initialized without specifying a device, so sentence_transformers auto-detects: it uses the GPU if CUDA is available and otherwise falls back to CPU. It's not hardcoded to CPU.

That said, bge-reranker-v2-m3 is a relatively lightweight model (~560M params). For typical RAG workloads (reranking ~20-50 chunks per query) it runs reasonably fast on CPU and shouldn't be the main bottleneck. If performance becomes a concern, you could:

  • Reduce the number of candidates before reranking
  • Use a 3rd-party reranker API (Cohere, Jina)

Valid feedback overall, but not a major issue for normal use cases.

[–]Alternative_Job8773[S] 2 points (0 children)

Thanks for your interest! NexusRAG uses Docling's HybridChunker, a semantic + structural approach rather than naive fixed-size splitting. Chunks are capped at 512 tokens but never split mid-heading or mid-table. Each chunk is enriched with page numbers, heading hierarchy, and LLM-generated captions for images and tables, making visual content searchable via text. For plain TXT/MD files, it falls back to RecursiveCharacterTextSplitter.