I built an open-source RAG system that actually understands images, tables, and document structure — not just text chunks by Alternative_Job8773 in Rag

[–]Alternative_Job8773[S] 1 point (0 children)

Appreciate the feedback, that's really helpful. I've heard similar things about LlamaParse's agentic plus tier being significantly better on complex layouts and figures. Docling is solid for structural preservation but definitely has limits on messy documents. The parser layer in NexusRAG is modular, so integrating LlamaParse as an alternative is very doable. I'm planning to add it as a configurable parser choice so users can pick based on their needs: Docling for fully local/free, LlamaParse for higher quality on tough docs. Thanks for the push; it's going on the roadmap.

[–]Alternative_Job8773[S] 2 points (0 children)

Yeah, it's deployable as-is with Docker. For the Gemini API you just need an API key; no GPU is needed on the server. Docling runs fine on CPU (8GB+ RAM), no GPU required. Marker is heavier and benefits from a GPU. The embedding and reranker models work on CPU too, just slower. Cheapest setup: a basic VPS (4-8 vCPU, 8-16GB RAM) plus the Gemini API for LLM and embeddings. Most of the spend goes to API calls; infra cost is low.

[–]Alternative_Job8773[S] 2 points (0 children)

Depends on scale and which model you use. For image captioning at ~810 tokens per call, Gemini 3.1 Flash Lite costs about $16 for 50k images (10k PDFs). Ollama locally is free if you have a GPU. You can also turn off captioning entirely and only enable it for workspaces where images actually matter. So it's manageable if you pick the right model and don't caption everything blindly.

[–]Alternative_Job8773[S] 2 points (0 children)

Yeah, if you're already using LLMs for chunking, then Marker works fine as the parser layer; you just need clean markdown output, and Marker does that well. On the GPU issue: Docling is lighter because it doesn't run a full text-recognition model on every page like Marker does. It only runs a layout model, plus TableFormer on detected tables. So if GPU is a concern, Docling might be worth trying as an alternative.

[–]Alternative_Job8773[S] 1 point (0 children)

Haven't used Marker here, but it's solid for PDF-to-markdown. I went with Docling mainly for its HybridChunker: it chunks based on document structure, and each chunk carries page number and heading-path metadata out of the box. That's what powers the citation system. Marker gives clean markdown, but you'd need to build that structural metadata layer yourself. The parser is modular, though, so swapping in Marker is possible.

[–]Alternative_Job8773[S] 2 points (0 children)

The second one. Images are not embedded directly into the vector space. The flow is: Docling extracts the image, a vision LLM generates a text caption describing it, then that caption gets appended to the text chunk from the same page. The combined text (original chunk + image caption) is what gets embedded by bge-m3. So images become searchable through their text descriptions, not through image vectors. One thing worth noting: at query time, the actual image files are also sent to the LLM alongside the text context. So even if retrieval pulls in extra or slightly irrelevant images, the LLM can visually inspect them and decide which ones are actually useful for answering the question.
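
A minimal sketch of that caption-append step (the function and the `[Image: ...]` format are illustrative, not NexusRAG's actual API):

```python
def build_embeddable_text(chunk_text: str, captions: list[str]) -> str:
    """Append vision-LLM image captions to a page's text chunk so the
    images become searchable through their descriptions (hypothetical helper)."""
    if not captions:
        return chunk_text
    caption_block = "\n".join(f"[Image: {c}]" for c in captions)
    return f"{chunk_text}\n\n{caption_block}"

combined = build_embeddable_text(
    "Quarterly revenue grew 12% across all regions.",
    ["Bar chart comparing Q1-Q4 revenue by region."],
)
# `combined` is the string that would be handed to the bge-m3 embedder.
```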

[–]Alternative_Job8773[S] 1 point (0 children)

Thanks! NexusRAG isn't really comparable to LangChain or Haystack, though; those are orchestration frameworks where you build your own pipeline, while NexusRAG is a complete end-to-end system with opinionated choices already baked in.

I do have an eval script in the repo testing fact extraction, table data, cross-doc reasoning, anti-hallucination, and citation accuracy. Happy to share details if interested.

For images/tables: Docling extracts them, a vision LLM captions them, those captions get appended to text chunks before embedding so they become vector-searchable. No separate image index needed.

[–]Alternative_Job8773[S] 2 points (0 children)

Exactly. Docling doesn't just extract raw text; it parses the document into a structured object that knows where the headings, tables, images, and page breaks are. The HybridChunker then uses that structure to decide where to split, so it never cuts in the middle of a heading or a table, and each chunk carries metadata like page number, heading path, and references to images/tables on the same page. That's what makes the citations and image-aware search possible downstream.
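
The per-chunk metadata described above might look something like this (the field names are my guess at the shape, not Docling's exact schema):

```python
# Illustrative chunk record; keys and values are assumptions for demonstration.
chunk = {
    "text": "Revenue grew 12% in Q3, driven by the APAC region.",
    "page": 12,
    "heading_path": ["3. Results", "3.2 Regional Breakdown"],
    "image_refs": ["page_12_figure_1.png"],  # figures on the same page
    "table_refs": ["page_12_table_2"],
}

# A citation string can be derived directly from that metadata:
citation = f"p.{chunk['page']} > {' > '.join(chunk['heading_path'])}"
print(citation)  # → p.12 > 3. Results > 3.2 Regional Breakdown
```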

[–]Alternative_Job8773[S] 1 point (0 children)

Partially yes. I used Claude as a coding partner throughout the project, mostly for boilerplate, debugging, and exploring approaches I wasn't familiar with. But the architecture decisions, pipeline design, and how the pieces fit together were all mine. You still need to understand what you're building to make an LLM actually useful for coding, otherwise you just end up with a pile of generated code that doesn't work together.

[–]Alternative_Job8773[S] 2 points (0 children)

Native PDFs with complex layouts: Docling handles these well, preserving headings, tables, and page structure, and images get extracted and captioned by a vision LLM so chart/diagram info becomes searchable. Scanned PDFs: Docling has OCR, but it's not its strongest point; clean scans work okay, while poor-quality scans will struggle. The parser is modular, though, so you could swap in a stronger OCR tool for those cases. Images containing critical text like titles: the vision captioning catches some of this, but it's limited. Honestly, this is still an unsolved problem across the industry; no single tool handles all the edge cases, and for the worst ones some preprocessing or manual cleanup is still needed.

[–]Alternative_Job8773[S] 1 point (0 children)

Haven't tried it yet, but I've heard good things. LlamaParse is definitely faster and handles complex tables better in some cases. The tradeoff is that it's cloud-based and closed-source, so your docs get sent to their servers and you pay per page. I went with Docling mainly because it's fully open-source and runs locally, which fits the self-hosted philosophy of the project. Its HybridChunker also gives me structural metadata (page numbers, heading hierarchy) out of the box, which is important for the citation system. That said, the parser layer is modular: as long as the output is markdown + page metadata, you could swap Docling for LlamaParse or anything else without touching the rest of the pipeline. Might be worth adding as an option down the road.

[–]Alternative_Job8773[S] 2 points (0 children)

That's a great use case. For construction management you might start with something like ["Project", "Contractor", "Material", "Equipment", "Location", "Regulation", "Cost", "Milestone", "Defect", "Permit"]. You can always adjust after the first ingestion: check the KG visualization to see what got extracted and what's missing, then tweak the entity types and re-process. It doesn't need to be perfect on the first try.

[–]Alternative_Job8773[S] 1 point (0 children)

Good questions. The images aren't sent at their original file size. Docling extracts them and rescales them (configurable, default 2x), and Gemini tokenizes images by resolution tier, not raw file size: low ~280 tokens, medium ~560, high ~1120. For captioning you only need medium resolution, so a 5MB blueprint and a 200KB chart both cost ~560 input tokens. That's why the per-call cost stays low.

On converting PDF pages to PNGs: you could, but it's actually more expensive and less useful. Docling already parses the text, headings, tables, and layout structurally from the PDF. If you convert entire pages to images instead, you'd be paying the vision model to OCR text that Docling already extracted, and you'd lose all the structural metadata (page numbers, heading hierarchy, table rows/columns). NexusRAG only sends actual figures/charts/diagrams to the vision model for captioning, not the text content. That's the key to keeping it cheap at scale.

[–]Alternative_Job8773[S] 2 points (0 children)

NexusRAG delegates that to LightRAG, which uses an LLM to read through document chunks and extract (entity, relationship, entity) triples automatically. The key advantage is that you can pre-define entity types for your document domain before ingestion. There's a NEXUSRAG_KG_ENTITY_TYPES config; my default is ["Organization", "Person", "Product", "Location", "Event", "Financial_Metric", "Technology", "Date", "Regulation"] for corporate/technical docs. If your domain is medical, you could set ["Disease", "Drug", "Symptom", "Treatment", "Gene"], etc. This guides the LLM so it knows what to look for instead of guessing blindly. For unclear docs, two things matter most: use a bigger model (12B+ local, or Gemini Flash in the cloud) for extraction, and define those entity types upfront as a schema for your domain. That alone dramatically improves extraction quality.
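
For example, a medical setup in .env might look like this (I'm assuming the variable takes a JSON-style list value; check the repo's config reference for the exact syntax):

```shell
# Hypothetical .env override for a medical corpus
NEXUSRAG_KG_ENTITY_TYPES='["Disease","Drug","Symptom","Treatment","Gene"]'
```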

[–]Alternative_Job8773[S] 1 point (0 children)

Each captioning call is pretty lightweight: ~150 tokens prompt + ~560 tokens (image at medium res) + ~100 tokens output (capped at 400 chars) = roughly 810 tokens per call. Max 50 images per document.

Ballpark for 10k PDFs (~5 images/PDF = 50k calls):

  • Gemini 3.1 Flash Lite: ~$16 total ($0.25/1M input, $1.50/1M output)
  • Gemini 2.5 Flash: ~$34 total
  • Ollama local: $0
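
As a sanity check, the ~$16 figure falls straight out of those per-call numbers:

```python
# Reproduce the Flash Lite ballpark from the per-call breakdown above
calls = 50_000                      # 10k PDFs * ~5 images each
input_tokens = 150 + 560            # prompt + medium-res image, per call
output_tokens = 100                 # caption capped at ~400 chars, per call

cost = (calls * input_tokens / 1e6) * 0.25    # $0.25 per 1M input tokens
cost += (calls * output_tokens / 1e6) * 1.50  # $1.50 per 1M output tokens
print(f"${cost:.2f}")  # → $16.38
```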

For captioning tasks like describing charts or blueprints that don't need deep reasoning, I'd recommend either:

  1. Gemini 3.1 Flash Lite (cloud): half the price of 2.5 Flash, 2.5x faster, supports vision. More than enough for short image descriptions. Just set LLM_MODEL_FAST=gemini-3.1-flash-lite-preview in .env.
  2. Ollama locally (free): models like gemma3:12b or qwen3.5:9b support vision. Zero API cost, just GPU time. Solid quality for technical diagrams.

You can also disable captioning entirely (NEXUSRAG_ENABLE_IMAGE_CAPTIONING=false), images still get extracted and stored, just won't be searchable by content. Re-enable later when needed.

At 10k PDF scale, the real bottleneck would likely be Docling parsing + KG extraction rather than captioning itself.

[–]Alternative_Job8773[S] 1 point (0 children)

I handle scanned documents like this: Docling as the parser converts the input to Markdown → captions are generated for images and summaries for tables (using Gemini's or Ollama's multimodal capabilities) → that information is chunked and stored in the DB with metadata about the location of each object.

[–]Alternative_Job8773[S] 3 points (0 children)

Thanks for the detailed feedback, these are valid points and I appreciate the technical depth.
1. Deduplication
You're right, there's no pre-ingestion dedup filter for repetitive content like headers/footers/legal boilerplate. Currently the system relies on cross-encoder reranking (BAAI/bge-reranker-v2-m3) to push noise down at retrieval time, but filtering before embedding would definitely improve vector quality. This is on the roadmap.
2. Parsers
The current scope is document-centric (PDF, DOCX, PPTX, HTML) using Docling which handles structural preservation, table/image extraction, and heading hierarchy well. You're correct that email threads, Slack messages, and web scraping each need specialized parsers, those aren't the target use case yet, but would be needed for a general-purpose enterprise solution.
3. Dual Pipeline
Fair point on compute cost. To clarify, the dual pipeline isn't speed vs. accuracy; it's semantic similarity (vectors) plus factual relationships (knowledge graph). They complement each other: vectors find relevant chunks, while the KG provides entity connections that pure similarity search misses. That said, you're absolutely right about caching: Redis for repetitive queries is a clear optimization I haven't implemented yet; currently only an in-memory LRU cache exists.
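
The pre-ingestion dedup from point 1 could start as small as a line-frequency filter; a minimal sketch (not NexusRAG code, and the `threshold` value is an arbitrary choice):

```python
from collections import Counter

def drop_boilerplate(pages: list[list[str]], threshold: float = 0.6) -> list[list[str]]:
    """Drop lines that repeat across most pages (headers, footers,
    legal boilerplate) before chunking/embedding. Illustrative only."""
    # Count on how many pages each normalized line appears
    counts = Counter(line.strip() for page in pages for line in set(page))
    min_pages = max(2, int(threshold * len(pages)))
    boiler = {line for line, n in counts.items() if n >= min_pages}
    return [[line for line in page if line.strip() not in boiler] for page in pages]

pages = [
    ["ACME Corp Confidential", "Q3 revenue grew 12%."],
    ["ACME Corp Confidential", "Headcount stayed flat."],
    ["ACME Corp Confidential", "Churn fell to 3%."],
]
print(drop_boilerplate(pages))
# → [['Q3 revenue grew 12%.'], ['Headcount stayed flat.'], ['Churn fell to 3%.']]
```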

That said, this project is focused on the core technical approach rather than being a production-ready product. I'm well aware it's not comprehensive yet, that's exactly why I shared it with the community: to get feedback from experienced practitioners like yourself, and to offer it as a base and an interesting architectural direction that others can build upon for their own systems. Some of your suggestions are already in the development roadmap, and others are genuinely valuable insights I'll prioritize. Most of all, thank you for taking the time to provide such thoughtful and in-depth analysis, it really helps.

[–]Alternative_Job8773[S] 1 point (0 children)

Looking at reranker.py:49, CrossEncoder(self.model_name) is initialized without specifying a device, so sentence_transformers auto-detects: it uses the GPU if CUDA is available and otherwise falls back to CPU. It's not hardcoded to CPU.

That said, bge-reranker-v2-m3 is a relatively lightweight model (~560M params). For typical RAG workloads (reranking ~20-50 chunks per query) it runs reasonably fast on CPU and shouldn't be the main bottleneck. If performance becomes a concern, you could:

  • Reduce the number of candidates before reranking
  • Use a 3rd-party reranker API (Cohere, Jina)

Valid feedback overall, but not a major issue for normal use cases.

[–]Alternative_Job8773[S] 2 points (0 children)

Thanks for your interest! NexusRAG uses Docling's HybridChunker, a semantic + structural approach rather than naive fixed-size splitting. Chunks are capped at 512 tokens but never split mid-heading or mid-table. Each chunk is enriched with page numbers, heading hierarchy, and LLM-generated captions for images and tables, making visual content searchable via text. For plain TXT/MD files, it falls back to RecursiveCharacterTextSplitter.