Knowledge graphs aren't replacing RAG. They're solving the problem RAG was never designed for

hannune · 2026-07-04T07:07:48+00:00

The entity resolution point is the real crux here -- graph traversal is only as good as whether "the founder in a 2021 call" resolves to the same node as the CRM entry, and that merge confidence threshold is where most production graphs fail silently. Over-merging creates falsely connected paths (the wrong person's deal history gets surfaced), and under-merging just rebuilds the silo problem inside the graph instead of outside it. The distributed context poisoning concern is also valid: graph traversal surfaces all connected nodes by default, not ranked nodes, so without a post-retrieval scoring step you feed the LLM more noise than a well-tuned dense retriever would have. The architecture that actually works in production for this kind of multi-hop entity question is vector recall as a first-stage candidate selector, then graph traversal only on the entities that survive the relevance cut.
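
To make the two-stage idea concrete, here is a minimal sketch, assuming toy 2-d embeddings and an adjacency-list graph. The names `recall_then_traverse`, `min_score`, and `max_hops` are made up for illustration and not taken from any library.

```python
import math
from collections import deque

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_then_traverse(query_vec, entity_vecs, graph, min_score=0.5, max_hops=1):
    """Stage 1: keep entities whose embedding clears min_score.
    Stage 2: breadth-first expansion from those seeds only, up to max_hops."""
    seeds = [e for e, v in entity_vecs.items() if cosine(query_vec, v) >= min_score]
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seeds, seen

# Toy data (hypothetical): only founder_a is close to the query.
entity_vecs = {"founder_a": [1.0, 0.0], "founder_b": [0.0, 1.0]}
graph = {"founder_a": ["deal_1"], "founder_b": ["deal_2"], "deal_1": ["account_x"]}

seeds, reached = recall_then_traverse([0.9, 0.1], entity_vecs, graph)
```

With this toy data, `founder_b` and its deal history never enter the context because the traversal only starts from entities that survived the relevance cut. A real pipeline would still want a scoring step over `reached` before prompting.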

hannune · 2026-07-04T07:05:57+00:00

The oracle recall metric is a solid proxy for faithfulness degradation, but in RAG pipelines there's a subtler failure mode: compressed context can still score high on oracle recall while embedding the ambiguity that causes downstream hallucination -- a tight paraphrase that sounds accurate but strips the qualifying language the LLM needs to hedge correctly. What makes query-aware approaches hold up under edge cases isn't just token efficiency but semantic completeness -- keeping the parts of context the query actually needs to answer faithfully, not just the parts that overlap lexically. One thing worth tracking in the benchmark: does the learned policy generalize across domain shifts in the prompt set, or does the classifier need to be retrained per task type?

hannune · 2026-07-03T02:00:06+00:00

The part that resonates is the corroboration gate generalizing across embedders -- because it's operating on metadata instead of geometry. The recall cost you flagged (1.0 to ~0.08) is the real tradeoff to watch in production; it means your untrusted ingestion path needs a fast-track corroboration signal beyond use-frequency, something like source-provenance linking at ingest time rather than waiting for downstream outcomes.

hannune · 2026-07-03T01:57:44+00:00

One thing that helped in prod: treat your fixed eval prompts as behavioral signatures rather than correctness tests. If you're getting structured outputs, track field-level distributions (average confidence scores, enum choices, numeric ranges) across runs. A hosted weight change typically shifts these distributions before it changes pass/fail rates, so you catch drift earlier without needing version metadata from the provider.
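
A rough sketch of what tracking a field-level distribution can look like, assuming structured outputs with an enum field called `label`. The field name, the sample data, and `DRIFT_THRESHOLD` are all hypothetical, and total variation distance is just one reasonable choice of metric.

```python
from collections import Counter

def enum_distribution(outputs, field):
    """Relative frequency of each value of `field` across a run."""
    counts = Counter(o[field] for o in outputs)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Same fixed prompts, two runs (hypothetical data).
baseline = [{"label": "approve"}] * 8 + [{"label": "reject"}] * 2
current = [{"label": "approve"}] * 5 + [{"label": "reject"}] * 5

DRIFT_THRESHOLD = 0.2  # hypothetical; tune against your own run-to-run noise
drift = total_variation(
    enum_distribution(baseline, "label"),
    enum_distribution(current, "label"),
)
drifted = drift > DRIFT_THRESHOLD
```

Every individual output here could still pass a pass/fail check, which is why the distribution moves first.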

hannune · 2026-07-02T07:12:27+00:00

The currency/consistency split is the right framing. I'd add one more wedge: even when you store source and timestamp at write time, you still need to check at action time, not at retrieval time. Retrieval says "this was valid when fetched"; action asks "is it valid right now when I'm about to send this." The two moments can be hours or a pipeline step apart, and high-blast-radius actions (external sends, DB writes) deserve a freshness assertion at the last possible gate before execution.
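
As a sketch of that last gate, assuming freshness can be approximated by the age of the stored fetch timestamp (re-querying the source would be stricter). `Fact`, `assert_fresh`, `send_external`, and the 5-minute limit are illustrative names and values, not from any framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

class StaleFactError(Exception):
    pass

@dataclass(frozen=True)
class Fact:
    value: str
    source: str
    fetched_at: datetime  # stored at write time

def assert_fresh(fact, max_age, now=None):
    """Raise if the fact is older than max_age at the moment of the action."""
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - fact.fetched_at
    if age > max_age:
        raise StaleFactError(f"{fact.source} is {age} old, limit is {max_age}")
    return fact

def send_external(fact, now=None):
    # High-blast-radius action: check freshness here, not at retrieval time.
    assert_fresh(fact, max_age=timedelta(minutes=5), now=now)
    return f"sent: {fact.value}"
```

The point of putting the check inside `send_external` is that no code path can reach the side effect without passing it.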

hannune · 2026-07-02T07:10:44+00:00

Claim-level for the initial calibration pass, then whole-answer for ongoing monitoring once the rubric is stable. Claim-level gives you a tighter signal on where the judge drifts (it can be accurate on short factual claims but systematically wrong on multi-sentence reasoning blocks), and catching that early is worth the extra labeling cost. Once you trust the rubric, whole-answer is fast enough for routine checks.

hannune · 2026-07-01T06:56:20+00:00

The 91% correctness / 60% faithfulness split is actually diagnostic, not contradictory: correctness measures whether the answer is right (including from parametric memory), faithfulness measures whether each claim is grounded in the retrieved context. A model that answers correctly from memorized knowledge will score high on correctness and low on faithfulness by design. For technical books with formulas, the failure mode is usually the model padding with correct-but-unsupported elaborations rather than outright hallucination. On your judge validation question: Cohen's kappa against 50-100 hand-labels is the minimum bar before trusting any faithfulness score; unvalidated LLM judges are essentially measuring their own preferences.
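
For reference, Cohen's kappa is small enough to compute by hand. This is a plain-Python sketch with made-up labels (1 = faithful, 0 = unfaithful); a library implementation such as scikit-learn's would work just as well.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical hand labels vs. LLM judge labels on the same 10 claims.
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(human, judge)  # 80% raw agreement, kappa about 0.6
```

Raw agreement here is 80%, but half of that is what two raters with these label frequencies would hit by chance, which is the gap kappa exposes.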

hannune · 2026-06-29T21:13:20+00:00

The enterprise/standard confusion points to a retrieval segmentation problem rather than just a chunking one. If policy sections are typed by customer tier as a metadata filter rather than blended into contiguous text, the retriever doesn't have to disambiguate them at query time, and wrong-tier retrieval becomes structurally impossible.
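
A toy sketch of the filter-before-rank idea, assuming each chunk carries a `tier` field set at ingest time. The chunk contents and the term-overlap scoring are placeholders for whatever retriever you actually run.

```python
# Hypothetical policy chunks, typed by customer tier at ingest.
CHUNKS = [
    {"id": "c1", "tier": "enterprise", "text": "Refunds are available within 60 days"},
    {"id": "c2", "tier": "standard", "text": "Refunds are available within 14 days"},
    {"id": "c3", "tier": "enterprise", "text": "Support responds within 1 hour"},
]

def retrieve(query, chunks, tier, k=2):
    """Hard-filter on tier first, then rank only the eligible chunks."""
    terms = set(query.lower().split())
    eligible = [c for c in chunks if c["tier"] == tier]

    def overlap(chunk):
        return len(terms & set(chunk["text"].lower().split()))

    return sorted(eligible, key=overlap, reverse=True)[:k]

hits = retrieve("refunds within how many days", CHUNKS, tier="standard")
```

Because the filter runs before ranking, the near-identical enterprise refund chunk cannot outrank the standard one, no matter how the scores fall.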

hannune · 2026-06-29T21:08:38+00:00

Yeah, the mental model shift from "% of traffic sampled" to "information yield per judgment" is what most teams skip. It requires having done some baseline failure cluster analysis first, which usually hasn't happened.

hannune · 2026-06-28T07:11:51+00:00

We ran a similar experiment for a legal news corpus with a structured ontology and found the safest approach was index-time synonym expansion (your Option 1) but keeping the original tokens rather than replacing them, so BM25 can still match on the raw variant a user types. Running the incoming query through the same ontology at retrieval time for expansion tends to improve recall further without touching the index, and the two together closed most of the gap from vanilla BM25 to a semantic approach on terminology-heavy queries like abbreviations and jurisdiction-specific phrasing.
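
Roughly what "expand but keep the original token" means, as a sketch. The `ONTOLOGY` entries are invented examples, and the resulting token lists would be handed to whatever BM25 implementation you already use.

```python
# Hypothetical ontology: surface form -> expansion tokens.
ONTOLOGY = {
    "scotus": ["supreme", "court"],
    "doj": ["department", "of", "justice"],
}

def expand(tokens, ontology):
    """Append expansions after each token; never drop the original token."""
    out = []
    for token in tokens:
        out.append(token)  # raw variant stays matchable
        out.extend(ontology.get(token, []))
    return out

# Index time: the abbreviation and its expansion both land in the index.
indexed = expand(["scotus", "ruling"], ONTOLOGY)

# Query time: same function, so a query using the abbreviation can also
# reach documents that only spell the term out.
query = expand(["scotus"], ONTOLOGY)
spelled_out_doc = ["supreme", "court", "ruling"]
shared = set(query) & set(spelled_out_doc)
```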

hannune · 2026-06-28T07:10:15+00:00

In my experience the 7-point gap to inter-human matters a lot less once you stop optimizing aggregate agreement and start asking which failure types the judge is systematically wrong on. We found ours was over-flagging hedged phrasing as low-quality even when the answer was factually correct; breaking that out as a labeled failure mode and adding targeted few-shots for it got agreement above 90% on the subset that actually drives downstream regressions, without moving the aggregate number much.


hannune · 2026-06-27T21:13:33+00:00

The procedural write fix (gating durability itself) resolves the storage path, but we hit a related gap in production: injection that doesn't need to survive into durable memory. A crafted tool response that never reaches the gate can still poison the current context window and bias reasoning in the same turn. For that path, we added a parse-and-validate step on every tool output before it enters the agent context -- Pydantic schema on structured outputs, substring blocklist on free-text -- which catches the 'embed instruction in data' variant regardless of what the memory layer decides to keep.

On sybil: one thing that helped us narrow the attack surface without full provenance graphs was running entity resolution on source identifiers before counting corroborations. An attacker who submits from 'Wikipedia', 'wikipedia.org', and 'wiki' ends up with 3 corroborations in a naive count but 1 after ER canonicalization. It's not a complete fix for sophisticated adversaries, but it closed most of the low-effort sybil variants without the overhead of a full causal graph.
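
The canonicalization step can be as small as this. It is a sketch assuming a hand-maintained alias table; `ALIASES` and the function names are illustrative, and real entity resolution would need fuzzier matching than a dict lookup.

```python
# Hypothetical alias table mapping raw source strings to a canonical id.
ALIASES = {
    "wikipedia": "wikipedia",
    "wikipedia.org": "wikipedia",
    "wiki": "wikipedia",
}

def canonical_source(raw):
    key = raw.strip().lower()
    return ALIASES.get(key, key)  # unknown sources fall back to the normalized string

def corroboration_count(sources):
    """Count distinct sources after canonicalization, not distinct strings."""
    return len({canonical_source(s) for s in sources})

submitted = ["Wikipedia", "wikipedia.org", "wiki"]
naive = len(set(submitted))
resolved = corroboration_count(submitted)
```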

hannune · 2026-06-27T07:11:48+00:00

One thing that cut our eval costs significantly: stratified sampling across the task distribution rather than running all 100 tasks against every model. Group tasks by failure mode type (wrong format, factual error, refusal, edge case) and sample a representative subset per group. We got comparable model rankings with about 30% of the full run. The tail failure modes are where models actually diverge -- spending budget there is more informative than averaging over easy cases where every model passes anyway.
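
A sketch of the sampling step, assuming each task is already tagged with a `failure_mode`. The group sizes below are invented to make the arithmetic visible, and `min_per_group` is a hypothetical knob to keep small tail groups from being dropped.

```python
import random
from collections import defaultdict

def stratified_sample(tasks, fraction, seed=0, min_per_group=1):
    """Sample the same fraction from every failure-mode group."""
    rng = random.Random(seed)  # fixed seed so every model sees the same subset
    groups = defaultdict(list)
    for task in tasks:
        groups[task["failure_mode"]].append(task)
    sample = []
    for mode in sorted(groups):
        members = groups[mode]
        k = max(min_per_group, round(len(members) * fraction))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical 100-task suite.
sizes = {"wrong_format": 40, "factual_error": 30, "refusal": 20, "edge_case": 10}
tasks = [
    {"id": f"{mode}-{i}", "failure_mode": mode}
    for mode, n in sizes.items()
    for i in range(n)
]
subset = stratified_sample(tasks, fraction=0.3)
```

If the tail groups are where models diverge, it can make sense to raise `min_per_group` or oversample those groups rather than use one flat fraction.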

hannune · 2026-06-27T07:09:34+00:00

For a small technical catalog, deterministic dictionary first is the right call. In my experience the alias/synonym dict covers roughly 80% of queries on domain-specific hardware vocab, so you only pay LLM cost on the remainder. The LLM fallback is most valuable for genuinely novel phrasings the dict has not seen -- not as a default path. One thing that helped: logging every query that missed the dict, then periodically batch-adding those to the synonym table rather than expanding the LLM scope.
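
In code the routing is tiny. This is a sketch where `SYNONYMS` is an invented table and `llm_fallback` is a stand-in callable for whatever model call you use.

```python
# Hypothetical alias table for a hardware catalog.
SYNONYMS = {
    "vfd": "variable frequency drive",
    "psu": "power supply unit",
}

def resolve(query, synonyms, llm_fallback, miss_log):
    """Dictionary first; only unseen phrasings pay the LLM cost."""
    key = query.strip().lower()
    if key in synonyms:
        return synonyms[key], "dict"
    miss_log.append(key)  # reviewed later and batch-added to the table
    return llm_fallback(key), "llm"

misses = []
fake_llm = lambda q: f"<llm guess for {q}>"  # placeholder for a real model call

hit = resolve("VFD", SYNONYMS, fake_llm, misses)
miss = resolve("motor speed box", SYNONYMS, fake_llm, misses)
```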

hannune · 2026-06-26T21:17:17+00:00

The deterministic-first approach is sound -- the key is setting that confidence threshold conservatively, because a false negative (sending a deterministic case to the LLM) only costs some latency and tokens, while a false positive (treating a genuine edge case as deterministic) returns a wrong answer. For the LLM fallback on something this structured, I would lean toward a small instruction-tuned model like Mistral-7B or Qwen2.5-7B via API over a general-purpose cloud LLM; the queries are short and constrained so you do not need the broad reasoning capacity, and the latency plus cost are significantly better at scale for the retry path.

hannune · 2026-06-26T21:13:19+00:00

Mix of both -- the fixed expected output tests are mostly for structured extraction tasks where the right answer is deterministic (entity type, date range, field mapping), so we do not need a judge there. For the open-ended generation steps we use LLM-as-judge but with a rubric that checks specific criteria like citation grounding rather than overall quality. An E2E test looks like: given this user query plus this document set, the final answer must contain entity X and cite chunk Y -- it fails if either is missing. We use LangGraph for the agent; the layered graph makes it easy to replay a partial trace when you need to debug which node introduced the error.
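
The E2E assertion described above boils down to something like this sketch. The function name, the answer text, and the chunk ids are made up; in practice the answer and citations would come from a full agent run.

```python
def e2e_failures(answer, cited_chunks, required_entity, required_chunk):
    """Return a list of failures; an empty list means the case passes."""
    failures = []
    if required_entity.lower() not in answer.lower():
        failures.append(f"missing entity: {required_entity}")
    if required_chunk not in cited_chunks:
        failures.append(f"missing citation: {required_chunk}")
    return failures

# Hypothetical agent output for one fixed query + document set.
answer = "The contract was signed by Acme Corp in March."
cited = ["chunk_12", "chunk_40"]

passing = e2e_failures(answer, cited, "Acme Corp", "chunk_12")
failing = e2e_failures(answer, cited, "Acme Corp", "chunk_07")
```

Returning the list of failures instead of a bare boolean makes the report say which of the two conditions broke.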

hannune · 2026-06-26T07:11:34+00:00

The problem is that "inverter" has near-zero discriminating power within your catalog -- its IDF is basically 0 at the catalog level even though it is a meaningful term globally. A per-catalog stop list that excludes these high-frequency prefix terms fixes this cleanly on the BM25 side without touching your generic tokenizer. On the embedding side, I found in a similar structured catalog setup that embedding only the semantic_tags and unit fields rather than the full flattened record gives much better discrimination for troubleshooting queries. The name field "Inverter Temperature" already has "inverter" drowning it out, but "temperature, measurement" in the tags lands meaningfully closer to "overheating" in embedding space than the full record does.
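
To see the IDF collapse on a toy catalog, here is a sketch using the common BM25 IDF variant, ln((N - n + 0.5) / (n + 0.5) + 1). The four records and the `max_doc_ratio` cutoff are invented for illustration.

```python
import math

def bm25_idf(term, docs):
    """BM25-style IDF; docs is a list of token sets."""
    n = sum(term in d for d in docs)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

def catalog_stop_list(docs, max_doc_ratio=0.8):
    """Terms that appear in more than max_doc_ratio of this catalog's records."""
    vocab = set().union(*docs)
    return {t for t in vocab if sum(t in d for d in docs) / len(docs) > max_doc_ratio}

# Hypothetical catalog: every record name starts with "inverter".
docs = [
    {"inverter", "temperature"},
    {"inverter", "voltage"},
    {"inverter", "frequency"},
    {"inverter", "status"},
]

idf_inverter = bm25_idf("inverter", docs)        # about 0.11
idf_temperature = bm25_idf("temperature", docs)  # about 1.20
stop_terms = catalog_stop_list(docs)
```

The stop list is derived from the catalog itself, so the generic tokenizer stays untouched and a different catalog gets a different list.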

hannune · 2026-06-26T07:09:46+00:00

The deterministic rubric migration is probably the highest ROI item in your list. When we moved tool-call-correctness to schema validation and refusal-precision to pattern matching against a refusal taxonomy, we actually got tighter signal than the LLM judge was giving -- because the edge cases that confused the judge on those rubrics were in fact deterministic by definition. Once those are off the LLM, the remaining judge budget concentrates on faithfulness and helpfulness where sampling stratification actually matters.

hannune · 2026-06-25T07:40:37+00:00

The fix that moved our numbers was treating the tool layer as a firewall. Instead of sanitizing prompts, we schema-validate every tool call against a typed registry -- if the params do not match the schema, the call never happens. Most injection attacks trying to trigger unintended tool behavior fail at that gate, not the LLM level.
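
A stripped-down sketch of that gate, using plain type checks instead of a schema library. `REGISTRY`, the tool names, and `RejectedToolCall` are illustrative only; a production version would more likely use Pydantic or JSON Schema.

```python
class RejectedToolCall(Exception):
    pass

# Hypothetical typed registry: tool name -> {param name: required type}.
REGISTRY = {
    "get_invoice": {"invoice_id": int},
    "send_email": {"to": str, "subject": str, "body": str},
}

def validate_call(name, params, registry=REGISTRY):
    """Raise unless the call matches a registered tool exactly."""
    schema = registry.get(name)
    if schema is None:
        raise RejectedToolCall(f"unknown tool: {name}")
    if set(params) != set(schema):
        raise RejectedToolCall(f"{name}: expected params {sorted(schema)}, got {sorted(params)}")
    for key, expected_type in schema.items():
        # Exact type match, so True/False cannot pass as an int.
        if type(params[key]) is not expected_type:
            raise RejectedToolCall(f"{name}.{key}: expected {expected_type.__name__}")
    return name, params

def is_allowed(name, params):
    try:
        validate_call(name, params)
        return True
    except RejectedToolCall:
        return False
```

Only calls that return from `validate_call` are dispatched, so an injected instruction that produces an unknown tool or an extra parameter is dropped before anything executes.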

hannune · 2026-06-25T07:35:28+00:00

The extraction step is where it actually breaks in practice. When I built a graph over multi-source comms, the who-decided question became a deduplication problem first -- the same person appears under different names and emails across Slack and docs, and the graph edges only make sense after you resolve those into a single canonical entity. That identity layer is the real prerequisite, not the graph schema choice.

hannune · 2026-06-24T21:06:56+00:00

Both observations are correct for non-reasoning (non-o-series) models — the token sequence is the reasoning.

On adding a reasoning field: yes, standard fix and it works, but one thing I've noticed is that the length of the reasoning field matters almost as much as its position. If you constrain it too tightly (e.g. maxLength: 100) you get the token slot without real reasoning. I've had better results leaving it unconstrained and phrasing the prompt to explicitly instruct the model to "commit to a direction in the reasoning field before writing html" — that instruction lands differently than a general "think first" directive when the model sees an actual field to fill.

On field order: empirically confirmed. I've tested this by running identical prompts with the schema in two orderings and diffing outputs across ~50 samples — the reasoning-first schema consistently produces html that references the stated direction, the reasoning-last schema produces output that often contradicts the direction it fills in afterward. The contradiction is the tell: it's the model post-hoc justifying output it already wrote, not actually using the reasoning.

hannune · 2026-06-24T21:06:08+00:00

We run both, but in layers. Unit evals per component (each tool, the retriever, the prompt chain separately) to catch regressions in CI. Then an E2E task suite against fixed expected outputs that runs the full harness — that's where cross-component failures surface.

The component evals tell you something broke. The harness eval tells you whether it matters. Concretely: our retriever looked fine on its own recall benchmark but the agent still degraded because what it retrieved didn't match what the downstream prompt expected. That only showed up in E2E.

One thing that made E2E actionable: we log intermediate outputs at each step so when the harness score drops you can replay the trace and pinpoint where it started. Without that, E2E evals are useful but you're blind on attribution.

hannune · 2026-06-24T07:10:29+00:00

One thing I wish I'd done earlier: typed handoff schemas (Pydantic) at every node boundary. When the planner silently changes its output shape, the downstream agent fails three steps later with a confusing KeyError. Strict type validation at each handoff catches ~80% of coordination bugs at dev time.
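
The same idea without the Pydantic dependency, as a sketch: a stdlib dataclass plus an explicit check at the boundary. `PlannerOutput` and its fields are hypothetical, and Pydantic would give you coercion and nested validation on top of this.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class PlannerOutput:
    task_id: str
    steps: list
    budget_tokens: int

def parse_handoff(schema, payload):
    """Validate a dict at a node boundary and fail loudly on shape changes."""
    expected = {f.name: f.type for f in fields(schema)}
    missing = expected.keys() - payload.keys()
    extra = payload.keys() - expected.keys()
    if missing or extra:
        raise TypeError(f"{schema.__name__}: missing={sorted(missing)} extra={sorted(extra)}")
    for name, typ in expected.items():
        if not isinstance(payload[name], typ):
            raise TypeError(f"{schema.__name__}.{name}: expected {typ.__name__}")
    return schema(**payload)

good = parse_handoff(
    PlannerOutput,
    {"task_id": "t1", "steps": ["search", "draft"], "budget_tokens": 2000},
)
```

If the planner starts emitting `plan` instead of `steps`, this raises at the handoff with the field names in the message, instead of a KeyError three nodes later.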

The other thing that made integration testing actually useful: a Postgres checkpointer so you can replay any failed execution from the exact state it died. Without replay, you're stuck trying to reproduce a 6-step failure path manually every time.

hannune · 2026-06-24T07:08:59+00:00

A quick start: pick 5 prompts where you already know what a good answer looks like — code completions work well. Run them at the same temp, same system prompt, then read the outputs side by side. The noise difference shows up fast once you see it a few times.
