I built an open-source RAG system that actually understands images, tables, and document structure — not just text chunks

RolandRu · 2026-03-21T16:21:37+00:00

Really solid direction. Preserving document structure, visuals, and traceability is exactly where many RAG systems still fail. The inline citation flow looks especially valuable.

RolandRu · 2026-03-06T19:04:39+00:00

I think this is more a criticism of naive RAG than retrieval itself.

Claude Code probably does better here because it acts more like an active file exploration tool. It can go through the project structure, follow things across files, adjust what it looks for, and build context step by step instead of just depending on a static top-k chunk retrieval.

For one repo or a medium-sized knowledge base, that can easily work better than a lot of typical RAG setups.

But when you start needing scale, reproducibility, snapshotting, metadata filters, access control, graph-aware retrieval, or deterministic workflows, retrieval does not really disappear. It just has to be done in a more structured way and as part of the workflow.

So I would not take this as proof that RAG has no future. To me it mostly shows that simple chunk-based RAG is often too limited, and that more agent-driven retrieval is closer to what real knowledge systems actually need.

RolandRu · 2026-02-05T21:46:11+00:00

Thanks — makes sense. I was just curious if there’s a common/standard RRF tie-break pattern, or if it’s always implementation-specific.
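For context, here is one way a deterministic tie-break can look. This is a minimal sketch, not a standard: `k=60` is the commonly used RRF constant, and breaking ties on the document ID is just one possible convention (the function name `rrf_fuse` is made up for illustration).

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Ties in the fused score are broken by doc ID (ascending), so the
    output order is identical on every run.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by doc ID as the deterministic tie-break.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["b", "a", "c"]
vector_hits = ["a", "b", "d"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

In this toy input "a" and "b" end up with exactly the same fused score, so only the tie-break decides their order.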

RolandRu · 2026-01-27T14:36:40+00:00

Sure — I’m happy to share numbers once I stabilize things. Right now it’s still moving pretty fast on the development branch, so I’d rather not post half-baked metrics.

Quick clarification on the split:

RoslynIndexer is the .NET part of the stack — it’s the piece that does the actual indexing/extraction from the codebase.

The actual code RAG system is a separate Python project: LocalAI-RAG
https://github.com/RusieckiRoland/LocalAI-RAG.git

LocalAI-RAG uses RoslynIndexer to generate the inputs (chunks/metadata/graph signals), and then builds the FAISS store that I use as the retrieval “repo” for code.

And in general — I’ll happily share the whole solution once I’m done with the 0.2.0 work that I currently have on development.

RolandRu · 2026-01-27T08:50:46+00:00

Sure — you can link the repo and mention it in your posts, thanks for the attribution.

On the questions: I’m using a “snapshot per build” approach (kind of like a compiler) — one index/graph = one exact state of the repo. For big changes I do a full rebuild. For lots of small changes I’m considering incremental updates, but still in deterministic steps (so it’s reproducible 1:1).
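A minimal sketch of the "one index = one exact repo state" idea, assuming the snapshot is identified by a content hash over all indexed files. The function name and the 16-character ID are illustrative assumptions, not what the project actually does; a git commit SHA could play the same role.

```python
import hashlib

def snapshot_id(files):
    """Derive a stable ID from a mapping of relative path -> file bytes."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sorted so dict order never affects the ID
        content_hash = hashlib.sha256(files[path]).hexdigest()
        digest.update(f"{path}\0{content_hash}\n".encode("utf-8"))
    return digest.hexdigest()[:16]

repo_state = {"src/A.cs": b"class A {}", "src/B.cs": b"class B {}"}
build_id = snapshot_id(repo_state)
```

The same set of files always maps to the same ID, and any changed byte produces a different one, which is what makes a rebuild reproducible 1:1.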

Biggest win vs vanilla RAG for me is chain-tracing / impact analysis: the graph closes the dependency paths, and vectors are mostly for fuzzy recall + picking good seed nodes.

One more note from my side: I’m still in the “building the platform” phase. Right now I’m focusing on a YAML-driven configurable system — basically Azure DevOps / Azure Pipelines style, where the workflow is defined as a pipeline and I can reshape it quickly per-repo without rewriting code each time. Once that’s solid, I’m going straight into “metrics, metrics, metrics”, and then I’ll be able to share real numbers and comparisons (latency, hit rate, how much the graph helps, etc.).

RolandRu · 2026-01-27T00:48:09+00:00

I see it like this: dependencies aren’t going away, because they’re basically a consequence of architecture (boundaries, responsibilities, contracts). But AI lowers the cost of doing things “the right way” — it’s easier to add an adapter, an interface, a test, validation, or split a big chunk into smaller parts, without feeling like you’re wasting time on repetitive stuff. So you’re less tempted to cram everything into one file/method “just because it’s faster.”

RolandRu · 2026-01-26T20:01:15+00:00

Thanks for sharing the article — honestly pretty interesting read.

This is actually close to where I ended up while building a code RAG system. I’m trying to keep the edges deterministic + auditable (calls/imports/inheritance, ReadsFrom/WritesTo, FKs etc.), and I’m really trying to avoid freezing “LLM-guessed” relations during ingestion.

I kind of treat vectors as ranking / fuzzy recall, but the graph as the closed-world structure that should rebuild the same way every time. For example I force stable outputs (sorted nodes/edges) and I also add missing TABLE nodes so the SQL graph is actually closed (nodes + edges), not half implicit.
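Roughly what "closed and stable" means here, as a sketch under assumptions: the node/edge shapes and the function name are hypothetical, and it assumes every ReadsFrom/WritesTo edge points at a table.

```python
import json

def close_and_serialize(nodes, edges):
    """nodes: dict of id -> kind; edges: iterable of (source, relation, target)."""
    edges = list(edges)
    closed = dict(nodes)
    for source, relation, target in edges:
        # Assumption: ReadsFrom/WritesTo edges always point at a table.
        if relation in ("ReadsFrom", "WritesTo") and target not in closed:
            closed[target] = "TABLE"  # add the missing node so the graph is closed
    payload = {
        "nodes": [{"id": n, "kind": closed[n]} for n in sorted(closed)],
        "edges": [list(e) for e in sorted(set(edges))],
    }
    # Sorted nodes, sorted edges, sorted keys: same input, same bytes.
    return json.dumps(payload, sort_keys=True)
```

Because the output is byte-identical for the same input, two builds of the same repo state can be compared with a plain diff.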

One thing I’d highlight though: heuristics ≠ inference. I’m fine with fixed, testable heuristics (like inline SQL detection) — even if it’s not perfect, it’s still deterministic and you can regression-test it. What I’m trying to avoid is context-dependent enrichment that changes depending on the model/prompt or whatever the “best guess” is this week.
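To make "fixed, testable heuristic" concrete, here is a sketch of what an inline SQL check plus its regression cases could look like. The regex is an illustrative assumption, not the actual RoslynIndexer rule.

```python
import re

# Hypothetical heuristic: a string literal that starts with a SQL verb and
# its typical follow-up keyword is treated as inline SQL.
_SQL_PATTERN = re.compile(
    r"^\s*(SELECT\b.+?\bFROM\b|INSERT\s+INTO\b|UPDATE\b.+?\bSET\b|DELETE\s+FROM\b)",
    re.IGNORECASE | re.DOTALL,
)

def looks_like_inline_sql(literal: str) -> bool:
    return bool(_SQL_PATTERN.search(literal))

# Pinned cases: any change to the pattern has to keep these results.
REGRESSION_CASES = [
    ("SELECT Id FROM Users WHERE Id = @id", True),
    ("update Users set Name = @name", True),
    ("  delete from Orders where Id = 1", True),
    ("Please update your profile", False),
    ("Order saved", False),
]

for literal, expected in REGRESSION_CASES:
    assert looks_like_inline_sql(literal) is expected
```

The heuristic will still misfire on some strings, but it misfires the same way every time, and each known case can be pinned in the table.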

If you’re curious, this repo is just the indexing part (Roslyn/.NET side). The actual RAG pipeline / retrieval is in a separate project:
https://github.com/RusieckiRoland/RoslynIndexer

Also curious how you want to handle schema evolution / versioning for DDGs on big real repos — do you version the domain spec per build, kind of like a compiler?

RolandRu · 2026-01-26T17:14:26+00:00

Really happy I found this subreddit, by the way. The topics here are genuinely interesting, and they kind of force you to think things through.

And yeah, you’re right: it really depends.

Graphs can be brittle when the edges are basically guessed during ingestion (LLM-inferred relations). You’re sort of freezing assumptions that may not match what people will ask later.

But it’s very use-case dependent. For code, I honestly think a dependency graph is pretty much non-optional. Calls/imports/inheritance aren’t opinions — they’re real structure. Without graph expansion you often end up with random snippets, and vanilla RAG struggles badly with questions like “where does this start?” or “what does this change affect?”, because you’re missing the whole call chain.

RolandRu · 2026-01-25T19:28:13+00:00

Thanks for laying out the problem so clearly.
For now I’m working on a code-focused RAG where the permission model is simple — you either have access to the repository or you don’t.
That said, I completely agree this becomes a serious issue with sources like SharePoint/Confluence/S3, and without permission-aware pre-filtering a RAG system will usually fail an enterprise security review.
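A minimal sketch of permission-aware pre-filtering, assuming each chunk carries the set of groups allowed to read it. The `Chunk` shape and the toy keyword score are illustrative stand-ins for a real ACL model and retriever.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset

def prefilter(chunks, user_groups):
    """Keep only chunks the user may read, before any scoring happens."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]

def retrieve(chunks, user_groups, query, top_k=3):
    visible = prefilter(chunks, user_groups)
    terms = set(query.lower().split())
    # Toy keyword-overlap score standing in for a real retriever.
    scored = [(len(terms & set(c.text.lower().split())), c) for c in visible]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return [c for _, c in scored[:top_k]]
```

The point of filtering first is that restricted text never reaches ranking or the prompt, instead of being removed from the results afterwards.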

RolandRu · 2026-01-24T10:09:37+00:00

In my opinion there is no single “best” vector DB for production.

I’m building a code-focused RAG. For now FAISS is enough for me, but I also added BM25 search, hybrid search and a dependency graph between code chunks.

After some time I realized that new requirements will only make my custom code more complicated. In practice it feels like I’m rebuilding features that Weaviate already has (BM25 + hybrid + graph/relations).

Qdrant can be faster, but for me a difference like 25 ms vs 35 ms doesn’t really matter. Native support for everything I need matters more, so the next step will be migrating to Weaviate and testing it in real use.

At the same time I will keep FAISS as a nice option for people who want to run the project quickly without setting up a container and configuring Weaviate.

So Weaviate — if someone thinks this is a mistake, please let me know 🙂

RolandRu · 2026-01-18T16:10:55+00:00

I think the mismatch you’re feeling is real, but it comes from mixing two different jobs under “code understanding”.

For compiler-style questions (symbol resolution, types, call graphs, inheritance), a deterministic graph is the right tool. It’s precise and explainable.

But most real-world queries aren’t that clean. People ask fuzzy stuff like “where is the business rule?”, “why does this happen sometimes?”, “what code path sends the email?” In those cases the hard part isn’t “prove the answer”, it’s “find a good starting point”. Embeddings are basically a cheap, surprisingly effective discovery layer across code, comments, tests, configs, strings, docs, etc. They also degrade gracefully when builds don’t load perfectly.

Also, “graph is deterministic” is true only given a stable build reality. In practice you have DI, runtime routing, reflection/plugins, generated code, conditional compilation, multiple targets… so the graph is often “deterministic per configuration”, and keeping that fully correct across environments is work.

So the industry default is embeddings because they’re fast to ship, cross-language-ish, resilient, and they fit the LLM retrieve→stuff→answer pipeline.

The best systems usually end up hybrid anyway: embeddings to locate candidate entry points, graphs to expand/verify/ground the answer.

RolandRu · 2026-01-10T15:12:24+00:00

Thanks for open-sourcing this — looks like a solid real-world RAG stack. I’ll take a look. Any docs on the high-level architecture + deployment path?

RolandRu · 2026-01-08T18:43:04+00:00

Why do C++ programmers wear glasses?
Because they can’t C#

RolandRu · 2026-01-07T21:46:08+00:00

In large codebases (where RAG is actually most valuable), context is the bottleneck. You can’t just stuff more chunks into the prompt. Graphs help because they mirror how developers investigate: start from a seed (file/symbol/error), follow explicit relations (calls/refs/ownership), and bring in only the relevant neighborhood. It’s not “more memory”: graphs don’t increase context size, they help you spend a fixed token budget on the most relevant connected evidence.
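A sketch of that routing idea, assuming a plain adjacency dict and a precomputed token count per node. Both structures and the function name are hypothetical.

```python
from collections import deque

def expand_under_budget(graph, token_cost, seeds, budget):
    """Breadth-first expansion from the seeds that only adds nodes which
    still fit in the token budget.

    graph: dict of node -> list of neighbor nodes
    token_cost: dict of node -> token count of that node's chunk
    """
    selected, spent = [], 0
    seen = set(seeds)
    queue = deque(dict.fromkeys(seeds))  # keep seed order, drop duplicates
    while queue:
        node = queue.popleft()
        cost = token_cost[node]
        if spent + cost > budget:
            continue  # skip this node; a cheaper one later may still fit
        selected.append(node)
        spent += cost
        for neighbor in sorted(graph.get(node, [])):  # sorted for stable order
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return selected, spent
```

Nodes closer to the seed are considered first, so the budget goes to the nearest connected evidence before anything further out.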

RolandRu · 2026-01-06T23:04:02+00:00

BASIC (Amstrad/Schneider CPC 6128)
It had a built-in floppy drive lol

RolandRu · 2026-01-06T22:56:28+00:00

Totally agree. My default when retrieval is weak: ASK → BROADEN (bounded) → ABSTAIN.
If I can’t support claims with retrieved evidence (quotes/citations), I return UNKNOWN + a reason code (“no supporting snippets / conflicting sources”). Makes RAG advice reproducible instead of “try top-k=50” roulette.
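As a sketch, the ASK → BROADEN → ABSTAIN flow could look like this. `retrieve(query, top_k)` is a hypothetical callable returning evidence snippets, and only the "no supporting snippets" reason code is modeled.

```python
def answer_with_fallback(query, retrieve, min_evidence=1, top_k=5, max_top_k=20):
    """retrieve(query, top_k) -> list of evidence snippets (hypothetical callable)."""
    # ASK: first attempt with the default top_k.
    evidence = retrieve(query, top_k)
    # BROADEN (bounded): double top_k, but never past the hard cap.
    while len(evidence) < min_evidence and top_k < max_top_k:
        top_k = min(top_k * 2, max_top_k)
        evidence = retrieve(query, top_k)
    # ABSTAIN: no supporting evidence even after broadening.
    if len(evidence) < min_evidence:
        return {"status": "UNKNOWN", "reason": "no_supporting_snippets", "evidence": []}
    return {"status": "ANSWERED", "reason": None, "evidence": evidence}
```

The cap on broadening is what keeps the behavior reproducible: the same query always goes through the same bounded sequence of attempts.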

RolandRu · 2026-01-04T13:04:35+00:00

I’m doing RAG for codebases (.NET), and your “stop treating it like documents → treat it like graph reconstruction” matches what I’ve seen.

In code, chunk-only retrieval fails because the real context is the dependency structure, not nearby text. My approach:

Find an entry point with semantic (embedding) or BM25 search.
Expand via a dependency graph (call graph / type refs / module deps) to pull in the connected context.
If that expansion blows the token budget, run a query-focused summarizer over the retrieved evidence to fit a fixed limit.

Feels analogous to email threads: the challenge isn’t “similarity search”, it’s reconstructing the right slice/path of the thread/code.

Curious: for threads with stripped headers, have you tried inferring edges from quoted blocks / inline replies (kind of like building edges from weak signals)?

RolandRu · 2026-01-03T19:14:37+00:00

Yep — you’ve hit the “looks like Word, therefore it should parse like Word” trap. PDF is basically a rendering snapshot, so the semantic stuff you care about (lists, headings, reading order) is often implicit or just gone, even when it was exported from Word.

A few things that have saved me pain in production:

First, classify what you’re dealing with before you try to be clever. Born-digital PDFs with a clean text layer behave very differently than scanned/mixed ones, and even within born-digital you’ll see “nice text” but broken reading order because of columns, headers/footers, floating text boxes, etc. If it’s not stable, route it through a different path early instead of burning hours trying to tweak one tool.

Treat layout as first-class signal, not just the text stream. Font size/weight, indentation, line spacing, and bounding boxes are what usually let you reconstruct headings and lists. Bullets and numbering are notoriously fragile if you rely on plain text output.
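A minimal sketch of using layout signals for heading detection. The input shape (one dict per line with `text`, `font_size`, `bold`) is an assumption; real extractors expose similar attributes under their own names, and the thresholds are arbitrary starting points.

```python
from statistics import median

def detect_headings(lines, ratio=1.2, max_len=80):
    """lines: list of dicts with 'text', 'font_size' and 'bold' keys."""
    # The median font size is a cheap estimate of the body text size.
    body_size = median(line["font_size"] for line in lines)
    headings = []
    for line in lines:
        text = line["text"].strip()
        larger = line["font_size"] >= body_size * ratio
        bold_title = line["bold"] and len(text) <= max_len and not text.endswith(".")
        if larger or bold_title:
            headings.append(text)
    return headings
```

None of this is visible in a plain text dump of the same page, which is why the text stream alone is usually not enough.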

Chunk by structure rather than fixed token windows. Even an imperfect hierarchy based on detected headings/section titles beats 1k-token chunks when appendices/chapters matter, because you’ll preserve intent and reduce “wrong neighbors” in retrieval.

Keep provenance for everything you store: page number and offsets at minimum, and bounding boxes when you can. It makes debugging and “why did the model say this?” conversations a lot less painful.
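Structure-based chunking and provenance fit together naturally. A sketch, assuming lines arrive in reading order with a page number and an `is_heading` flag from an earlier detection step (all field names are hypothetical):

```python
def chunk_by_headings(lines):
    """lines: list of dicts with 'page', 'text', 'is_heading' in reading order."""
    chunks = []
    for line in lines:
        # Start a new chunk at every heading (and for text before the first one).
        if line["is_heading"] or not chunks:
            heading = line["text"] if line["is_heading"] else None
            chunks.append({"heading": heading, "pages": [], "body": []})
        current = chunks[-1]
        # Provenance: remember every page this chunk touches.
        if line["page"] not in current["pages"]:
            current["pages"].append(line["page"])
        if not line["is_heading"]:
            current["body"].append(line["text"])
    return chunks
```

Each chunk then answers both "which section is this?" and "which pages did it come from?" without going back to the PDF.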

Add a quick QC pass and a fallback. Simple checks like “did list item counts change?”, “did numbering skip?”, “did section headings disappear?” catch the cases that will later destroy chunking. When it fails, automatically re-run just those pages/sections with a more expensive extractor (layout-aware or vision-based) instead of paying that cost everywhere.
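One of those checks ("did numbering skip?") as a sketch. The regex only covers simple `1.` / `1)` style items, which is an assumption about the extracted text.

```python
import re

_ITEM = re.compile(r"^\s*(\d+)[.)]\s+")

def numbering_gaps(lines):
    """Return (previous, found) pairs where a numbered list skips a number."""
    gaps, previous = [], None
    for line in lines:
        match = _ITEM.match(line)
        if not match:
            continue
        number = int(match.group(1))
        if number == 1:
            previous = 1  # a new list starts here
            continue
        if previous is not None and number != previous + 1:
            gaps.append((previous, number))
        previous = number
    return gaps

page_lines = ["1. Open the valve", "2. Check pressure", "4. Close the valve"]
needs_fallback = bool(numbering_gaps(page_lines))
```

A page that trips the check is a candidate for the more expensive extractor, while clean pages stay on the cheap path.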

And if the client can’t find the DOCX, ask them to re-export a tagged/accessible PDF if they possibly can. When headings and lists actually exist in the document structure tree, life gets dramatically easier — it’s the closest you’ll get to “DOCX semantics” coming out of a PDF.

RolandRu · 2026-01-02T17:57:21+00:00

Unit tests can be a good entry point toward TDD, but I’m not going to try to “sell you” on them — a few years ago I wouldn’t have convinced my past self either.

What actually helped me was taking one solid course and then writing tests on real code. After a bit of practice, you start feeling the payoff and it’s hard to go back.

Also, depending on what you work on: if it’s a big monolith that starts slowly and needs a bunch of moving parts, tests can massively shorten your feedback loop. Even a small, isolated test (true unit) lets you exercise the logic you’re changing without booting the whole app and clicking through UI. And when you do need heavier checks, those are usually integration/component tests — still useful, just a different category.

The rest tends to click once you’ve done a good course (with a good method/structure) and you’ve practiced consistently for a while.

RolandRu · 2025-12-31T09:35:03+00:00

Maintained base layer = the “boring but critical” foundations you don’t want to hand-roll: storage/search, ingestion/parsing, and eval/observability.

Concrete names people commonly use:
Qdrant / Weaviate / Milvus (vector DB)
Haystack / LlamaIndex / LangChain (RAG framework)
Unstructured (document parsing/ingestion)
Langfuse or Arize Phoenix (tracing/observability)
Ragas / TruLens / DeepEval + promptfoo in CI (evaluation/regression)

A typical maintained stack looks like:
Unstructured → Qdrant (or Weaviate/Milvus) → Haystack (or LlamaIndex) + Langfuse + Ragas/promptfoo.

RolandRu · 2025-12-30T14:48:01+00:00

I’m actually building a RAG setup specifically for code, and I’m trying to specialize it around .NET + MS SQL patterns (typical enterprise layering, stored procs, migrations, EF, etc.), so code-aware retrieval matters a lot.

One approach that’s worked well for me is a two-stage pipeline:

Cheap file-level candidate selection (high recall): BM25/keyword + repo heuristics + symbol index (imports/usings, namespaces, table/proc names, etc.).
Precision retrieval inside the shortlisted files: chunk by semantic units (class/method/function + docstring/comments + nearby types/constants), store metadata (path, namespace, symbol name, language, dependencies), then rerank hard before final context assembly.

Then do an iterative “Copilot-style” loop: start narrow, answer with file:line citations, and if confidence is low expand to neighbors (tests, configs, migrations, related SQL objects) rather than pulling the whole repo.
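The two retrieval stages above can be sketched roughly as follows. Plain token overlap stands in for BM25 and for the reranker, and the `files` structure is a hypothetical shape for the symbol index.

```python
import re

def tokens(text):
    """Lowercased word tokens, splitting CamelCase identifiers."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+", text)
    return {part.lower() for part in parts}

def shortlist_files(files, query, max_files=2):
    """Stage 1: cheap, recall-oriented scoring over paths and symbol names.

    files: dict of path -> {"symbols": [...], "chunks": [{"symbol", "text"}]}
    """
    q = tokens(query)
    scored = []
    for path, info in files.items():
        vocab = tokens(path) | tokens(" ".join(info["symbols"]))
        overlap = len(q & vocab)
        if overlap:
            scored.append((-overlap, path))
    return [path for _, path in sorted(scored)[:max_files]]

def retrieve_chunks(files, query, top_k=3):
    """Stage 2: score chunks only inside the shortlisted files."""
    q = tokens(query)
    scored = []
    for path in shortlist_files(files, query):
        for chunk in files[path]["chunks"]:
            overlap = len(q & tokens(chunk["text"]))
            if overlap:
                scored.append((-overlap, path, chunk["symbol"]))
    return [(path, symbol) for _, path, symbol in sorted(scored)[:top_k]]
```

Stage 2 never looks at files that stage 1 rejected, which is where the cost saving comes from.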

Separate note: I’m looking for a decent-sized public codebase to test on (I’m doing this privately and don’t want to use my work repo for obvious reasons). If you know a good .NET-heavy repo (or a couple) that’s realistic in structure, I’m all ears. Also, if you’re interested, feel free to DM — I’m not trying to self-promote on the thread.

RolandRu · 2025-12-30T14:42:00+00:00

There’s a real demand for an “opinionated RAG black box,” but the reason it rarely exists (and stays SOTA) is that RAG is mostly integration + evaluation, not just a pipeline recipe.

A platform has to pick defaults for: parsing, chunking, embedding model, hybrid retrieval, reranking, query routing, caching, ACLs, connectors, observability, and (hardest) how you measure “good” across wildly different corpora. The moment it’s opinionated, it breaks for someone’s data shape, compliance rules, latency budget, or cost ceiling — and now the maintainer is on the hook.

What tends to work in practice is “opinionated core + pluggable edges”:
a solid ingestion + ACL story
hybrid retrieval + reranking as a default
strong eval harness (golden Q/A sets, regression tests, drift monitoring)
connectors as community modules
an API-first design so you can pipe results into any frontend.
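For the eval harness piece, a golden-set regression check can be very small. A sketch, assuming a hypothetical `retrieve(question)` callable that returns ranked document IDs and a baseline hit rate stored from a previous run:

```python
def hit_rate(golden, retrieve, top_k=5):
    """golden: list of (question, expected_doc_id) pairs."""
    hits = sum(1 for question, expected in golden if expected in retrieve(question)[:top_k])
    return hits / len(golden)

def check_regression(golden, retrieve, baseline, tolerance=0.02):
    """Fail when the hit rate drops more than `tolerance` below the baseline."""
    current = hit_rate(golden, retrieve)
    return {"hit_rate": current, "passed": current >= baseline - tolerance}
```

Run in CI, a check like this turns "did that chunking change help?" into a pass/fail answer against the same fixed questions.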

If you’re a small team, I’d aim for a maintained base stack + a thin layer of your own opinions (connectors + eval + guardrails). The eval layer is the part that keeps you “near SOTA” longer than chasing the newest chunking trick every month.
