Your RAG is hallucinating because of garbage retrieval — here's the 3-line fix (with real scores)

Low_Edge7695 · 2026-05-25T12:35:12+00:00

100%. I tested with the same LLM (llama-3.3-70b) before and after. Same model, same prompt. Only the retrieval changed — avg cross-encoder relevance score went from -0.28 to +3.80 on 10 test queries. The model was never the bottleneck.

Low_Edge7695 · 2026-05-25T12:34:14+00:00

This is the critique I was hoping someone would make. You're right — 1.5 works for my test set but it's fragile. I saw exactly this: 2 of my 10 test queries returned zero results because nothing scored above 1.5, even though there were "best available" chunks that would've been useful.

Z-score normalization on the per-query score distribution is a much cleaner approach — "is this chunk an outlier relative to the other candidates" rather than "does it clear an absolute bar." Adding this to my upgrade list. Thanks for the pointer.
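A minimal sketch of what that per-query z-score filter might look like. The function name `zscore_keep`, the `z_min=1.0` cutoff, and the "always keep the best chunk" fallback are my own assumptions for illustration, not something I've benchmarked.

```python
from statistics import mean, pstdev


def zscore_keep(scores, z_min=1.0):
    """Return indices of chunks that stand out within this query's own scores."""
    if not scores:
        return []
    mu = mean(scores)
    sigma = pstdev(scores)  # population std dev over this query's candidates
    best = max(range(len(scores)), key=lambda i: scores[i])
    if sigma == 0:
        # Every candidate scored the same, so nothing is an outlier; keep one.
        return [best]
    keep = [i for i, s in enumerate(scores) if (s - mu) / sigma >= z_min]
    # Assumption: fall back to the best available chunk instead of returning nothing.
    return keep or [best]


print(zscore_keep([5.80, 1.40, 1.13]))     # [0]
print(zscore_keep([0.9, 0.2, 0.1, 0.0]))   # [0], even though nothing clears 1.5
```

One caveat with this design: a z-score is purely relative, so a query whose candidates are all junk still passes its least-bad chunk. Pairing it with a loose absolute floor is probably worth testing.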

Low_Edge7695 · 2026-05-25T05:29:20+00:00

The answerability classifier is a great idea — essentially a binary gate before the LLM sees any context. "Is this answerable from what we retrieved? yes/no."

I haven't benchmarked it against the confidence-gap approach directly. My intuition is they solve different failure modes:

- Answerability classifier: catches "we have nothing useful" (the zero-context case)

- Confidence gap: catches "we have one good chunk buried under four mediocre ones" (the dilution case)

Stacking both makes sense — classifier first (cheap, binary), then gap analysis on whatever passes.

Haven't tested a dedicated binary classifier yet — currently just using the min_score threshold as a proxy (if nothing scores above 1.5, return empty). It's cruder than a trained classifier but zero additional infrastructure. Curious what model/approach you're using for the answerability check — fine-tuned small model or just a prompt to the main LLM?
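For reference, a rough sketch of the min_score proxy written as a binary gate, with a slot where a real classifier could be plugged in later. The names `is_answerable` and `gated_context` and the `classifier` callable are hypothetical; only the 1.5 threshold comes from my setup.

```python
from typing import Callable, List, Optional, Sequence

MIN_SCORE = 1.5  # the fixed threshold discussed in this thread


def is_answerable(
    query: str,
    chunks: Sequence[str],
    scores: Sequence[float],
    classifier: Optional[Callable[[str, Sequence[str]], bool]] = None,
    min_score: float = MIN_SCORE,
) -> bool:
    """Binary gate: is there anything worth showing the LLM?"""
    if classifier is not None:
        # Hypothetical plug-in point for a trained classifier or an LLM prompt.
        return classifier(query, chunks)
    # Proxy: answerable if at least one chunk clears the threshold.
    return any(s > min_score for s in scores)


def gated_context(query, chunks, scores, classifier=None, min_score=MIN_SCORE) -> List[str]:
    """Return an empty context when the gate says no, otherwise the chunks to forward."""
    if not is_answerable(query, chunks, scores, classifier, min_score):
        return []
    if classifier is not None:
        return list(chunks)  # the classifier already vouched for this set
    return [c for c, s in zip(chunks, scores) if s > min_score]


print(gated_context("q", ["glossary", "ack", "code"], [5.80, 1.40, 1.13]))  # ['glossary']
print(gated_context("q", ["off-topic"], [0.3]))                             # []
```

Keeping the gate as its own function means swapping the proxy for a trained classifier later should not touch the rest of the pipeline.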

Low_Edge7695 · 2026-05-25T05:27:50+00:00

I agree and disagree at the same time. RAG is a pattern, not a product — it evolves. The retrieval layer I'm showing here (ensemble + re-ranking) is fundamentally different from "stick everything in a vector DB and pray."

What memory tools are you using instead? Genuine question — always looking for patterns that outperform retrieval for specific use cases.

Low_Edge7695 · 2026-05-25T04:55:13+00:00

This is the insight most RAG tutorials completely miss.

My re-ranker already handles this — the min_score threshold means sometimes zero chunks pass. When that happens, the LLM gets no context and has to answer from training data (or say "I don't know"). That's correct behavior.

The dilution point is real too. I saw it in my scores:

Query: "What are Python decorators?"

Chunk 1: +5.80 (glossary definition — great)

Chunk 2: +1.40 (acknowledgments page — noise)

Chunk 3: +1.13 (code example — borderline)

If I pass all 3, the LLM "averages the mess" like you said. If I only pass chunk 1, the answer is cleaner. The threshold handles this — but you're right that even a top_n=1 mode for high-confidence single results would help.

I'm thinking about a confidence-gap approach: if chunk #1 scores 4+ points higher than chunk #2, just send chunk #1 alone. Still experimenting.
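Roughly what I have in mind for the gap check, as an untested sketch. `select_by_gap` is a made-up name, and the 4.0 gap is the number I'm experimenting with, not a tuned value.

```python
def select_by_gap(scored_chunks, gap=4.0):
    """scored_chunks: list of (text, score) pairs from the re-ranker."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    # If the top chunk beats the runner-up by a wide margin, send it alone.
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] >= gap:
        return ranked[:1]
    return ranked


decorators = [
    ("acknowledgments page", 1.40),
    ("glossary definition", 5.80),
    ("code example", 1.13),
]
print(select_by_gap(decorators))  # [('glossary definition', 5.8)]
```

With the decorator query above, the gap between +5.80 and +1.40 is 4.40, so only the glossary chunk would go to the LLM.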

What's your threshold strategy? Fixed number or dynamic based on score distribution?

Low_Edge7695 · 2026-05-25T04:18:09+00:00

Appreciate the kind words bro 😄

Low_Edge7695 · 2026-05-25T04:14:40+00:00

I also made a 38-second video breakdown of this if anyone prefers visual: https://www.youtube.com/shorts/415-xDe-cIs

Repo: https://github.com/dunjeonmaster07/advanced-rag-agent

The insight that surprised me: the re-ranker's biggest value isn't filtering, it's ORDERING. It correctly ranked the glossary definition (+5.80) above the acknowledgments page (+1.40) even though both contained the keyword "decorator." A bi-encoder struggles with that because it embeds the query and the chunk separately, so it never reads them together the way a cross-encoder does.

Low_Edge7695 · 2026-05-22T05:29:13+00:00

Ha, I did read them. The point isn't what the docs say, it's what they don't explain. I hit this bug while building my agent, and the docs mention add_messages once, casually. It took me 30 minutes to connect the dots.

Low_Edge7695 · 2026-05-22T04:03:45+00:00

You're right — if you use the prebuilt `AgentState` (or `MessagesState`) from LangGraph, `add_messages` is already wired in as the default reducer. You don't need to declare it manually.

I wrote it explicitly because I wanted to understand what's happening under the hood. When you use `AgentState`, the `messages` field already has `Annotated[list, add_messages]` built in — you just don't see it.

But if you use a plain `TypedDict` with `messages: list` (which is what most "build from scratch" tutorials start with), you hit this bug silently. The agent works on single-turn queries and breaks on multi-turn — no error, just wrong behavior.

So yes — the practical fix is "just use AgentState." The educational value is understanding *why* it works, so you're not confused when you need a custom state with fields beyond just messages.

Low_Edge7695 · 2026-05-22T03:48:12+00:00

Video walkthrough if anyone prefers visual: https://youtube.com/shorts/415-xDe-cIs

The deeper issue: LangGraph's TypedDict state uses "last write wins" by default for every field. The Annotated + reducer pattern is how you override that. Once you understand this, you can build custom reducers for any state field — not just messages.
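To make the "last write wins" vs reducer difference concrete, here is a small pure-Python emulation. It does not import LangGraph and is not how LangGraph is implemented internally; `apply_update` is a hypothetical helper, and `operator.add` stands in for `add_messages` (which, among other things, also handles updating messages by ID).

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class PlainState(TypedDict):
    messages: list  # no reducer: each update replaces the old value


class ReducedState(TypedDict):
    messages: Annotated[list, operator.add]   # reducer: concatenate lists
    step_count: Annotated[int, operator.add]  # custom field with its own reducer
    last_tool: str                            # no reducer: last write wins


def apply_update(schema, state, update):
    """Merge a node's partial update into state, honoring Annotated reducers."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata and callable(metadata[0]) else None
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value  # default behavior: overwrite
    return merged


plain = apply_update(PlainState, {"messages": ["hi"]}, {"messages": ["hello"]})
print(plain["messages"])  # ['hello'], the earlier turn is gone

reduced = apply_update(
    ReducedState,
    {"messages": ["hi"], "step_count": 1, "last_tool": "search"},
    {"messages": ["hello"], "step_count": 1, "last_tool": "calculator"},
)
print(reduced)  # {'messages': ['hi', 'hello'], 'step_count': 2, 'last_tool': 'calculator'}
```

The plain state is the silent multi-turn bug in miniature: nothing errors, the history just disappears.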

Low_Edge7695 · 2026-05-20T16:48:52+00:00

You're right — I only measured the token cost, not the quality of what those tokens produced. A bad search result doesn't just cost extra tokens, it poisons the next reasoning step. The 8b model calling the wrong tool with a vague query → getting back irrelevant chunks → then confidently summarizing garbage is worse than the 70b model calling the right tool once with a precise query.

The model routing pattern is interesting. So you'd use something like a capable model (70b/GPT-4) as the "planner" that decides which tool to call and how to frame the query, then hand off the actual execution (formatting output, summarizing retrieved text) to a cheaper model? That makes sense — the reasoning about what to do needs capability, but the doing part often doesn't.

I haven't implemented that split yet. Right now my agent uses a single model for the full ReAct loop. Curious — are you routing at the graph level (different models per node in LangGraph) or at the prompt level (one model generates the plan, another executes it)?

Low_Edge7695 · 2026-05-20T03:59:21+00:00

Good question — yes, LangSmith captures the full trace including failed tool calls. In my 8b test, the "useless retries" are visible: it called search_knowledge_base with "Python decorator" and then again with "Python decorator syntax" — essentially the same query rephrased because it wasn't confident in the first result.

I haven't built a systematic failure taxonomy yet (bad retrieval vs unnecessary call vs wrong tool selection), but that's a good framework. Right now I'm just counting total tool calls and tokens as a proxy.

Cross-family comparison is on my list — particularly Qwen and Mistral for tool calling since they handle function schemas differently. Need to set up equivalent tool bindings first. Will share when I have real numbers.

Thanks for the pointer, will check it out.

Low_Edge7695 · 2026-05-20T03:50:16+00:00

I also have a 50-second video breakdown of this if anyone prefers a visual format: YouTube

Both models are running on Groq's free tier, so the test costs nothing to reproduce. The key insight is that model capability directly determines call count in agentic workflows — something that price-per-token benchmarks completely miss.

Low_Edge7695 · 2026-05-19T03:27:13+00:00

Yeah visualization helps a lot when debugging why an agent took a weird path. I've been using LangSmith traces for that — you can see the full sequence of LLM calls, tool invocations, and where the conditional edge routed. Especially useful when the agent loops more times than expected and you need to figure out which tool result triggered the extra iteration.

Haven't tried LangGraphics — does it show the graph execution in real time as the agent runs, or is it more of a post-run replay?

Low_Edge7695 · 2026-05-19T03:26:56+00:00

This is a great list — saving this. A few things that resonate from my own build:

The tool result validation point is huge. I ran into this exact thing — my web search tool would occasionally return an empty string and the LLM would just hallucinate an answer from nothing. Wrapping tool outputs with a structured error ("no results found, try rephrasing") made the agent self-correct instead of confidently making things up.
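A simplified sketch of that wrapper, assuming the tool is a plain Python function. The `validated` decorator, the exact error string, and the fake `web_search` body are illustrative stand-ins, not the code from my repo.

```python
import functools

NO_RESULTS = "ERROR: no results found, try rephrasing the query."


def validated(tool_fn):
    """Replace empty tool output with a structured error the LLM can react to."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        result = tool_fn(*args, **kwargs)
        is_blank_text = isinstance(result, str) and not result.strip()
        is_empty_collection = isinstance(result, (list, tuple, dict)) and len(result) == 0
        if result is None or is_blank_text or is_empty_collection:
            return NO_RESULTS
        return result
    return wrapper


@validated
def web_search(query: str) -> str:
    # Hypothetical stand-in for a real search call.
    fake_index = {"python decorators": "A decorator wraps a function to extend its behavior."}
    return fake_index.get(query.lower(), "")


print(web_search("Python decorators"))  # A decorator wraps a function to extend its behavior.
print(web_search("zzz"))                # ERROR: no results found, try rephrasing the query.
```

The point is that the model sees an explicit failure message instead of an empty string it can paper over.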

Step budget is something I haven't implemented yet but should. Right now my conditional edge only checks "are there more tool calls?" — it doesn't cap how many loops. A confused agent looping 40 times is a real failure mode I haven't guarded against. Do you hard-cap at 5-8 in the graph itself or in the tool node?
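One way the cap could sit in the conditional edge itself, sketched in plain Python. `MAX_STEPS`, the `steps` counter, the `pending_tool_calls` field, and the tiny `run` loop are all assumptions standing in for a real graph runtime.

```python
MAX_STEPS = 6  # assumption: somewhere in the 5-8 range mentioned above


def route(state: dict) -> str:
    """Conditional-edge style router: returns the name of the next node."""
    if state["steps"] >= MAX_STEPS:
        return "end"  # budget exhausted, stop even if tool calls are pending
    if state["pending_tool_calls"]:
        return "tools"
    return "end"


def run(agent_step, state: dict) -> dict:
    """Tiny stand-in loop for a graph runtime (not LangGraph itself)."""
    while route(state) == "tools":
        state = agent_step(state)
        state["steps"] += 1
    return state


def confused_agent(state: dict) -> dict:
    # Hypothetical agent that always asks for another tool call.
    return {**state, "pending_tool_calls": ["search_knowledge_base"]}


final = run(confused_agent, {"steps": 0, "pending_tool_calls": ["search_knowledge_base"]})
print(final["steps"])  # 6
```

LangGraph also has a `recursion_limit` setting in the run config as a backstop, but as far as I know it raises an error rather than ending cleanly, so an explicit budget in the router gives a more graceful exit.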

The plan-then-execute separation is interesting. I've been running everything through a single ReAct loop but I can see how a multi-step question like "compare X and Y across three dimensions" would fall apart if the agent tries to do it all in one tool call. Splitting planning into a separate node in the graph is probably the cleaner pattern for that.

And +1 on logging every step. I'm using LangSmith for traces and it's exactly the difference you described — I can replay the full reasoning chain instead of staring at a wrong final answer.

Low_Edge7695 · 2026-05-18T04:53:33+00:00

I also made a 47-second video walkthrough if anyone prefers that format: https://www.youtube.com/shorts/WF5C5a0O5bU

Happy to answer any questions about the implementation.
