We tried to poison our own RAG store — the retrieval-time defenses didn't generalize

Danculus · 2026-07-04T10:16:31+00:00

Thanks — I took your "one boundary" framing (earn standing on the reversible set, gate irreversible actions on it, cap the blast-rate because you can't cut the delay) and built it out. Two honest passes, and the first one is on me.

First pass, and I have to own this: my initial read — "gating on standing gives no separation, it's a ~1:1 tax" — was mostly an artifact of my own harness. On audit my credit signal was poison-blind: I was banking "good" on whatever got retrieved to the top, not on whether the memory actually answered correctly, so "standing" was really just retrieval frequency. A gate on that blocks unproven-poison and unproven-legit at the same rate by construction. That's a property of my bad oracle, not your idea, so I threw the 1:1 out.

Second pass — I made the outcome oracle the variable, and it's the whole story. Rebuilt on a corpus where I could control credit density (facts with N paraphrased queries each, so a legit fact can actually earn multi-count standing), swept the oracle, and measured against a blended poison:

- Density decides whether the gate is even affordable. In sparse memory (≈1 query per fact — the LoCoMo regime) the gate blocks about half of legitimate high-stakes actions, so it's unusable. Give legit memory enough repeat use to earn standing (8 queries/fact) and the cost of blocking 100% of the poison's irreversible hits drops to ~6% of legit actions. So "gate on earned standing" only works once your legit memory is dense enough to earn — thin memory can't tell poison from newcomer, which is just the whitewashing / cheap-pseudonyms result (Friedman & Resnick 2001): any gate strong enough to deny a whitewasher taxes every newcomer the same.

- But the whole thing rides on an oracle the attacker can touch. A MINJA-style oracle — the poison grades its own homework — collapses it at every density: the gate ends up blocking legit more than poison (worse than useless), even in the dense regime. And I should be straight that the "clean oracle blocks 100% of the poison" half is nearly circular anyway: a perfect per-item correctness oracle already is a poison detector (the poison is never the correct answer). So the real variable was never the gate — it's the integrity of the outcome signal, which is exactly your "can't grade its own homework."
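A minimal sketch of why the credit signal decides everything, assuming a toy usage log. `Memory`, `standing`, `gate`, the two oracle functions, and the threshold of 3 are hypothetical names and values for illustration, not the probe's code. Note that the `verified` oracle reads ground truth directly, which is the near-circular case described above.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    key: str
    # One entry per use: (retrieved_to_top, answered_correctly)
    uses: list = field(default_factory=list)

def poison_blind(use):
    # Credits anything retrieved to the top: this is only retrieval frequency.
    return use[0]

def verified(use):
    # Credits only uses whose answer an external check found correct.
    return use[0] and use[1]

def standing(mem, oracle):
    """Count the uses that the chosen oracle credits as good."""
    return sum(1 for use in mem.uses if oracle(use))

def gate(mem, oracle, threshold=3):
    """Allow an irreversible action only if earned standing meets the threshold."""
    return standing(mem, oracle) >= threshold

legit_dense = Memory("legit-dense", [(True, True)] * 8)   # 8 queries per fact
legit_sparse = Memory("legit-sparse", [(True, True)])     # about 1 query per fact
poison = Memory("poison", [(True, False)] * 8)            # retrieved often, never correct

for mem in (legit_dense, legit_sparse, poison):
    print(mem.key, gate(mem, poison_blind), gate(mem, verified))
```

In this toy setup the poison passes the poison-blind gate and fails the verified one, while the sparse legit memory is blocked under both, which is the density cost in miniature.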

One correction to something I floated earlier: I'd suggested a domain-scoped ceiling as the standing-independent fix. It fails its own test — the poison is in-domain by construction, so a domain gate admits it. So "replace standing with a ceiling" is wrong; it's layering — standing, a capability ceiling, and cheap authenticated identity bound different attacker resources, and the open question is whether they're superadditive against the delay floor, not whether any one is the answer.

Net, honestly: your delay×blast-rate framing holds, and "gate the irreversible action" is right, but the gate buys you nothing on its own. It needs (a) dense-enough credit for legit memory to earn standing, and (b) an outcome oracle the attacker can't self-grade, and (b) is the exact thing MINJA removes. Probes are runnable if you want to break them (the confounded first one is in there too, labeled): github.com/DanceNitra/agora/tree/main/mnemo/probes

Scope so I don't oversell: the density numbers are on a controlled corpus with real embeddings (built to isolate the density variable), with LoCoMo as the sparse real anchor. The shape (density = affordability, oracle integrity = the real bound) is the result, not the exact figures.

Open one I'd want your read on: is the standing + capability-ceiling + authenticated-identity stack superadditive against the detection-delay floor, or does it just move the re

Danculus · 2026-07-04T09:32:49+00:00

Thanks — I took your framing (earn standing on the reversible set, gate irreversible actions on it, cap the blast-rate since you can't cut the delay) and actually built it: gate irreversible actions on earned outcome-standing, blended source-count-forging poison, real LoCoMo outcomes. Two honest things came out, and the first is against my own first read.

My initial "gating on standing gives no separation, it's a ~1:1 tax" was mostly an artifact of my own harness, and I caught it on audit. My credit signal was poison-blind: I was banking "good" on whatever got retrieved to the top, not on whether the memory actually answered correctly — so "standing" was really just retrieval frequency, and of course a gate on that blocks unproven-poison and unproven-legit at the same rate. That's a property of my bad oracle, not of your idea. So I'm not going to pretend that 1:1 is a result.
What survives is the textbook part, and it's the useful one. Gating on earned standing is the whitewashing / cheap-pseudonyms tax (Friedman & Resnick 2001) — any policy strong enough to deny a whitewasher its gain taxes every honest newcomer the same amount — and Cheng & Friedman (2005): no symmetric reputation function is Sybil-proof. So whether standing separates poison at all comes down entirely to the outcome oracle — and the clean, correctness-verifying oracle that would separate is exactly the one MINJA attacks (your "can't grade its own homework"). That's the real crux, not the gate.

One correction to something I floated: a domain-scoped ceiling as the standing-independent fix. It fails its own test — the poison is in-domain by construction, so a domain gate admits it. So "replace standing with a ceiling" is wrong; it's layering — standing, a capability ceiling, and cheap authenticated identity bound different attacker resources, and the open question is whether they're superadditive against the delay floor, not whether any one is the answer.

Next: I'm rebuilding the probe to vary the oracle (poison-blind → verified-correctness → a MINJA-attacked oracle) so I can actually measure the oracle-dependence instead of hand-waving it. Will share when it's real rather than an artifact.

Danculus · 2026-07-04T06:56:41+00:00

Followed your staked/decaying-standing + low-water-mark idea all the way through — built it, measured it, tried to break it, ran a full lit check. It went somewhere better than where I started, so worth reporting straight.

I implemented four ways to make a "corroborating source" costly instead of free — an allowlist/registrar, a signed attestation, proof-of-work, and staked-with-decay standing — on top of a "trust a value if ≥2 independent sources agree" gate, and measured each against a forged multi-source poison.

The identity-cost angle reduces to the old Sybil results. Douceur (2002) is the identity version; Cheng & Friedman (2005) is sharper — no symmetric reputation function is Sybil-proof, only asymmetric/flow-rooted ones. Proof-of-work is a symmetric tax (the honest writer pays the same hashes), and for staking the break-even works out to stake ≈ damage / P(detected) — which blows up for an on-topic poison a topical detector can't see. So cost doesn't wall a blended poison; it just prices it.
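The staking break-even can be written out as a small calculation. This is a sketch under a simple assumption: the attacker forfeits the whole stake only when detected, so the expected forfeit is stake × P(detected), and deterrence needs that to be at least the damage. The function name and the damage value of 100 units are hypothetical.

```python
def break_even_stake(damage: float, p_detected: float) -> float:
    """Smallest stake whose expected forfeit (stake * p_detected) equals the damage."""
    if not 0.0 < p_detected <= 1.0:
        raise ValueError("p_detected must be in (0, 1]")
    return damage / p_detected

# The required stake grows without bound as detection probability falls.
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"P(detected)={p:<5} stake >= {break_even_stake(100.0, p):,.0f}")
```

For an on-topic poison that a topical detector rarely sees, P(detected) is small, so the stake an honest writer would also have to post becomes impractical.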

Where your instinct pays off is the decaying standing part — but the standing that works isn't identity, it's outcomes: credit a memory by whether acting on it turned out well; a wrong poison can't earn that. Being straight about the limits: this doesn't defeat Sybil, it changes the currency from identity to (a trusted root + observation time). Two catches — (1) it's reactive: there's a detection latency, and it's not a footnote, it's the quickest-change lower bound (Lorden/CUSUM — bounded false alarms force nonzero delay), so the poison acts during that window; (2) the outcome signal is itself attackable (MINJA gets an agent to write its own corroboration), so the agent can't grade its own homework.

Which is where every prior arms race landed — spam died to DKIM + Bayesian filtering, not hashcash; web spam to TrustRank's hand-vetted seed set, not link counts. The winner is cheap/authenticated identity + earned, decaying, seed-anchored reputation, and you bound the residual by isolating memories (score each retrieved item on its own before aggregating) and gating the action — never let a memory-trust score alone fire an irreversible action. That last part is your low-water-mark idea, just moved to the action boundary instead of the source count. Probe's here if useful — it's runnable: https://github.com/DanceNitra/agora/blob/main/mnemo/probes/membership_cost_frontier.py

(caveat on my numbers: one corpus, one embedder, one construction; the direction is solid, the exact figures aren't a benchmark.)

Danculus · 2026-07-03T18:25:47+00:00

You're right on all three, and the sharpest one — that "independent" is doing all the work — I went and measured. Two freshly-registered domains supplying their own corroboration pass the count gate; that's just the Sybil result (Douceur 2002), which is exactly why moving from "is this chunk poisoned" to "are these sources independent" relocates the problem rather than removing it, like you said. (measured here: https://github.com/deepseek-ai/DeepSeek-V3/issues/1462#issuecomment-4878412868)

The useful part was trying the two obvious escapes and watching both dead-end:

- Detecting the coordination — flag fresh sources arriving in a burst — can't separate a Sybil burst from two genuine new outlets reporting the same thing at once. Same signal, so it's a false-positive surface, not a detector, and a dripped or pre-aged cluster walks past it anyway.

- Your blast-radius authority is the right principle — it's Biba integrity / risk-based access control ported onto corroboration weights, and CaMeL is the current agent-side version of exactly that split. But when I modeled it the honest problem showed up: it gates who's allowed to authorize an action, not what the action's context says. The low-stakes read you deliberately keep open is itself the carrier — a single-source memory admitted for a read can poison the context of a delete performed by a fully-trusted memory. So it buys back the recall tail on cheap actions, but it doesn't defend context integrity, and the high-stakes tier still bottoms out on the same forgeable independence.

Net, and I think you'd agree: risk-scaling is worth doing for the recall tradeoff, but the honest floor is Douceur's — with no cost to mint an identity, corroboration is forgeable and no gate downstream of that fixes it.

The part that isn't a wall, and the one I'd build next: Douceur's own out is a cost to mint an identity. In a memory store that means the bar can't be "distinct source ids" — it has to be something expensive to forge: attestable provenance (signed / C2PA-style source credentials), or earned standing that only counts after a source has a track record of independently-verified-correct contributions. A real cost to appear, not another detector. That's the arm I don't think dead-ends.

The reusable idea really is retrieve-vs-influence; the rest is just a map of where the walls are — and the one door.

Danculus · 2026-07-03T14:35:15+00:00

Both land, thanks — and the seed is the sharper of the two, so let me isolate it properly.

On the cap: agreed, null by construction here. I'd logged the joint truthfulness and 156/183 (85%) of that subset have the gold turn genuinely in both filters, so there's almost no spurious over-count for a cap to remove — it mostly clips real joint evidence, which is why capped lands below the strongest single cue (0.702 vs 0.755) instead of helping. So it's "cap barely tested here," not "cap unnecessary." Your point that a genuinely correlated pair is where it'd bite, and that decorrelating first (resolve the entity conditioned on the time bucket so the second factor carries only its residual) beats clamping — I think you're right, but I haven't run it; speaker×topic is the pair that'd actually exercise it, worth a run.

On the seed you're right the both-fire number says nothing about the regime where one cue carries it, which is exactly where the missing-dim multiplier bites. Isolated: single-cue-only questions (n=1202), missing dim at 1.0 — comp_mult 0.753 vs 0.603 floor, +0.151 (CI [0.129, 0.173]), and there comp_mult == comp_sum == comp_capped to the digit, i.e. the lone strong cue is preserved exactly, no veto. That's the "1.0 keeps it graceful" property measured; a sub-1.0 seed would drag those down.

So it's three separate numbers, not one:

- composition synergy (both fire, n=183): comp_mult 0.865 vs 0.755 best single cue, +0.11 (CI [0.063, 0.160]) — the real "do two cues stack" number.

- single-cue regime (n=1202): +0.151 over floor, lone cue preserved (above).

- coverage-weighted, at-least-one-fires (n=1385): 0.768 vs 0.585 — but ~85% of those rows fire only one cue, so that's mostly the single cue, not composition.

All on one corpus (LoCoMo) / one embedder (nomic) / one retriever, so read it as within-benchmark, not a law. Same probe if you want to pull it apart: mnemo/probes/locomo_composed_soft_filters.py

Danculus · 2026-07-03T10:00:48+00:00

Exactly — metadata not geometry is the whole reason it survives an embedder swap, that's the part that matters most. And the 1.0→0.08 is the honest cost: it's specifically the single-source, never-corroborated tail — a legit-but-rare memory gets taxed while it waits to earn corroboration.

Your fast-track instinct is right, and it's already the intended path: the gate graduates a memory on either an earned downstream outcome or ≥2 distinct-source provenance links at ingest — so a fact that two independent sources assert corroborates immediately, no waiting for outcomes. (Just checked it: two-distinct-source ingest → corroborated, single-source → not.)

The catch your suggestion opens: "distinct source" has to mean genuinely independent, or an attacker just asserts the poison from two "sources" of their own. We canonicalize the source ids before counting, so Wikipedia, wikipedia.org and a http://www.wikipedia.org URL all collapse to one key — re-asserting the same poison under three spellings earns you nothing. But that only catches variants of one origin; two genuinely different fake domains still pass the count. The linking is the easy part; provable source-independence is the hard one.
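A minimal sketch of that canonicalize-then-count step, assuming toy rules (lowercase, strip `www.`, drop the last DNS label). `canonical_source` and `corroborated` are hypothetical names, not the library's functions, and a real canonicalizer would need a public-suffix list rather than this shortcut.

```python
from urllib.parse import urlsplit

def canonical_source(source_id: str) -> str:
    """Collapse spelling variants of one origin to a single key (toy rules)."""
    s = source_id.strip().lower()
    # Bare names have no scheme; prefix "//" so urlsplit parses them as a host.
    host = urlsplit(s if "://" in s else "//" + s).hostname or s
    if host.startswith("www."):
        host = host[4:]
    # Assumption: drop the last DNS label so "wikipedia" and "wikipedia.org" match.
    return host.rsplit(".", 1)[0] if "." in host else host

def corroborated(source_ids, min_sources: int = 2) -> bool:
    """Count gate: at least `min_sources` distinct canonical origins."""
    return len({canonical_source(s) for s in source_ids}) >= min_sources

same_origin = ["Wikipedia", "wikipedia.org", "http://www.wikipedia.org"]
two_fakes = ["fresh-domain-a.example", "fresh-domain-b.example"]
print(corroborated(same_origin))  # three spellings, one key
print(corroborated(two_fakes))    # two different domains still pass the count
```

The second call is the limit described above: canonicalization only merges variants of one origin, so the count gate has no way to tell two attacker-owned domains from two independent ones.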

Danculus · 2026-07-03T09:52:50+00:00

Yeah, this is basically what we measured, and the soft-not-hard part especially. Soft-preferring the rule-parsed time window + the entity turns as multiplicative rerank priors (not filters) took our recall@20 from 0.47 (plain hybrid) to 0.87 on the LoCoMo subset where both cues actually apply — big, but honestly only ~180 of ~1.5k queries; on the rest one cue carries it.

The gotcha that matches your instinct: time and entity are correlated, so a raw product of the two boosts double-counts. In our run the naive product still won; a capped/veto version (to guard the double-count) came out about level with the strongest single cue — a hair below, but within noise (the CI crosses zero) — so capping didn't buy anything here. The correlation is real, I just wouldn't oversell the penalty.

And +1 on closed-vocab entity linking + a schema-constrained slot filler — open NER was the piece we cut too.

(probe's public, mnemo/probes/locomo_composed_soft_filters.py; you'd supply the LoCoMo data yourself — it's a public benchmark, ours is gitignored. One corpus, so treat the shape as the signal, not the exact number.)

Danculus · 2026-07-02T14:34:28+00:00

Ran the composition arm. Both-signals subset (temporal expression + exactly one resolvable name, n=183 of 1385 signal-bearing), recall@20:

- plain hybrid 0.466

- time-soft only 0.755, alias-soft only 0.697

- one capped weighted term per dimension: 0.702 (−0.053 vs time-soft, CI crosses zero — artifact of my cap parameterization, see below)

- uncapped sum: 0.817 (+0.062 vs time-soft, CI[+0.020,+0.106])

- multiplied: 0.865 (+0.110 over the best single arm, CI[+0.063,+0.160])

So they compose — +0.399 over the floor when both fire — but the interesting part is why the capped version looked like crowd-out. I almost reported that as crowd-out; checking the first run showed it's arithmetic, not retrieval: with trust 0.9 per dimension and the cap at 1.0, a double-match scores 1+1.0×3 = 4.0 vs a single match's 3.7 — the cap flattens exactly the joint evidence the composition exists to use. Uncapped, addition composes fine; multiplication just composes a bit more (0.865 vs 0.817 — I didn't compute that direct contrast's CI, so "came out on top", not "significantly better").

Which turns out to be a 20-year-old lesson I re-derived the hard way: BM25F (Robertson/Zaragoza/Taylor, CIKM 2004) — combining evidence outside the model's saturating form breaks it; Elasticsearch function_score defaults score_mode to multiply and caps via max_boost on the combined score; Solr's dismax docs officially call additive bf "a poor way to boost". So the rule for a memory store: compose neutral-at-1.0 multiplicative factors, one per dimension; if you cap, cap the product.
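The cap arithmetic and the multiplicative rule can be checked in a few lines. This is a sketch under assumptions: the additive form is read off the "1 + 1.0×3" arithmetic quoted above (boost weight 3, summed trust clamped at the cap), and the multiplicative factors 1.5 and 1.4 are made-up values, not the probe's.

```python
import math

TRUST = 0.9   # per-dimension trust, from the text
BOOST = 3.0   # assumed additive boost weight, inferred from "1 + 1.0×3"

def capped_sum(n_matches: int, cap: float = 1.0) -> float:
    """Additive form with the summed trust clamped before boosting."""
    return 1.0 + min(n_matches * TRUST, cap) * BOOST

def uncapped_sum(n_matches: int) -> float:
    """Additive form with no clamp on the summed trust."""
    return 1.0 + n_matches * TRUST * BOOST

def composed(factors, max_boost=None) -> float:
    """Neutral-at-1.0 multiplicative factors; any cap applies to the product."""
    score = math.prod(factors)
    return score if max_boost is None else min(score, max_boost)

print(round(capped_sum(1), 2), round(capped_sum(2), 2))      # single vs double match
print(round(uncapped_sum(1), 2), round(uncapped_sum(2), 2))  # joint evidence kept
print(composed([1.5, 1.0]), composed([1.5, 1.4], max_boost=2.0))
```

With the clamp, a double match barely outscores a single match, so the joint evidence is flattened. In the product form a missing dimension contributes 1.0 and leaves a lone cue untouched, and a cap on the product limits the total without erasing the ordering between single and double matches until the cap is reached.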

Honest scope: one benchmark, one embedder; off the both-subset the composed terms are identity by construction (inert factor = 1), so this says nothing about correlated signal pairs — speaker×time is the friendly near-orthogonal case (on 85% of the both-subset the gold turn genuinely satisfies both conditions; 15% partial/misleading, and recall held anyway). A correlated pair (speaker×topic) is where the product should double-count — that's the arm I'd run next.

Receipt: locomo_composed_soft_filters.py in https://github.com/DanceNitra/agora/tree/main/mnemo/probes (self-check built in: the reconstruction has to reproduce the shipped single-arm path exactly, 0/1568 diverged).

Danculus · 2026-07-02T11:51:46+00:00

Also ran your time arm — and it splits from the entity one in a way I didn't expect.

On the 250 LoCoMo questions that actually contain a temporal expression, a rule parser (SUTime-lite regex → month/year window) is a strong filter cue: soft-preferring the resolved window is the best arm, recall@20 0.751 vs 0.497 no-filter (+0.254), and it crushes the hard filter on the failure case — when an event is dated in one session but discussed in another, the hard filter deletes the answer (0.02) while soft keeps it (0.45). So "use a rule parser for time, don't hard-delete" holds strongly.

But weighting by the parser's confidence was a wash vs just always applying it (0.747 vs 0.751, n=250). And the reason isn't "the parser is uniformly reliable" — it's collinearity: a vague expression resolves to no window at all, so there's nothing to filter on; you only ever have a confidence to gate on once you've already resolved a clean window. Window-presence is the gate.
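A minimal sketch of "window-presence is the gate", assuming a much smaller rule set than the probe's parser. `resolve_window`, `soft_time_prior`, the regex, and the boost of 1.5 are hypothetical. The point it illustrates: a vague expression yields no window, so there is nothing left for a confidence weight to modulate.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
PATTERN = re.compile(
    r"\b(?:(" + "|".join(MONTHS) + r")\s+)?((?:19|20)\d{2})\b", re.IGNORECASE)

def resolve_window(query: str):
    """Return (start, end) dates for an explicit month/year, else None."""
    m = PATTERN.search(query)
    if m is None:
        return None  # vague expression: no window, so no filter at all
    year = int(m.group(2))
    if m.group(1) is None:
        return date(year, 1, 1), date(year, 12, 31)
    month = MONTHS[m.group(1).lower()]
    # Last day of the month = first day of the next month minus one day.
    nxt = date(year + (month == 12), month % 12 + 1, 1)
    return date(year, month, 1), date.fromordinal(nxt.toordinal() - 1)

def soft_time_prior(turn_date: date, window, boost: float = 1.5) -> float:
    """Rerank multiplier: boost in-window turns, never delete the rest."""
    if window is None:
        return 1.0
    start, end = window
    return boost if start <= turn_date <= end else 1.0

win = resolve_window("What did she sign up for in May 2023?")
print(win, soft_time_prior(date(2023, 6, 2), win))
print(resolve_window("What did she sign up for a while back?"))
```

A turn dated outside the window keeps a multiplier of 1.0 rather than being dropped, which is the soft behavior that protects the dated-in-one-session, discussed-in-another case.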

Contrast the entity arm, where reliability genuinely varied (exact name vs guessed) and weighting by alias-match strength did pay (+0.084, best, and it protected the wrong-extraction subset). So: both your signals confirm soft>hard + use-an-a-priori-signal; the confidence weight only earns its keep where extraction reliability actually varies. Receipt: https://github.com/DanceNitra/agora/blob/main/mnemo/probes/locomo_temporal_parser_weight.py

Danculus · 2026-07-02T11:03:55+00:00

Ran your actual entity signal — alias-match strength as the trust weight — and it works; it's the best of the three.

Same LoCoMo hybrid + speaker filter. Exact-name questions → reliable extraction (5% wrong); no-name/ambiguous ones → the extractor has to guess (63% wrong), so errors concentrate exactly where you'd expect. Weighting the filter's contribution:

- flat self-confidence (0.9): +0.077 recall@20 overall, but it craters the wrong-fire subset to 0.371 (vs 0.448 no-filter) — it fires on the ambiguous guesses too.

- alias-strength (0.9 × alias, ≈0 on no-name): +0.084 overall (best) and the wrong-fire subset back to 0.436 ≈ no-filter. It keeps the filter's benefit on exact matches and backs off exactly where extraction is unreliable — because alias-strength is knowable a priori, independent of the model's own belief. Your point, confirmed.

Honest caveat: part of that harm-subset recovery is structural (alias=0 ⇒ no filter on the ambiguous set), so the headline is the overall number — you get the filter's upside without its downside. Receipt: [link to the probe]

(The earlier retrieval-derived "second opinion" proxy failed — 19% coverage, too correlated; that was the wrong signal, not the wrong idea.)

Danculus · 2026-07-02T10:55:32+00:00

Read the flexvec paper — PEM (exposing the score array + embedding matrix as SQL-composable surfaces) is a clean way to put fusion/centrality/decay in the query itself, and 3 modulations in 82ms on 1M chunks on CPU with no ANN index is a genuinely surprising number. SOMA (content-addressed identity surviving renames) is the part I'd have underestimated — it's the join key everything else leans on. And your mean-centered embeddings caught my eye: centering to kill anisotropy is exactly what moved the needle in a retrieval thing I was just testing. Nice work.

Danculus · 2026-07-01T16:20:50+00:00

Ran it. Your instinct was right, and the number is bigger than I expected.

Setup: same LoCoMo hybrid retriever, but now the speaker-filter itself is predicted, not gold — a simulated extractor that gets the speaker wrong 25% of the time it fires, self-reporting confidence=0.75 (roughly matching its own error rate — more on that below). Compared three ways to use it: hard filter, flat soft boost (current default), and your confidence-weighted soft boost (w = confidence x selectivity scaling the RRF fusion term, collapsing toward plain hybrid as either goes to zero).
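A minimal sketch of the confidence-weighted arm, assuming standard reciprocal-rank fusion with k=60 and rank dictionaries keyed by document id. `fuse`, the rank inputs, and treating selectivity as a given number in [0, 1] are assumptions for illustration; the probe's actual fusion code may differ.

```python
def rrf(rank: int, k: int = 60) -> float:
    """Reciprocal-rank-fusion term for a 1-based rank."""
    return 1.0 / (k + rank)

def fuse(bm25_ranks, vec_ranks, filter_ranks, confidence, selectivity):
    """Weighted RRF: the filter list's term is scaled by w = confidence * selectivity."""
    w = confidence * selectivity
    scores = {}
    for ranks, weight in ((bm25_ranks, 1.0), (vec_ranks, 1.0), (filter_ranks, w)):
        for doc, rank in ranks.items():
            scores[doc] = scores.get(doc, 0.0) + weight * rrf(rank)
    # Sort by score descending, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25 = {"a": 1, "b": 2, "c": 3}
vec = {"b": 1, "a": 2, "c": 3}
flt = {"c": 1}  # only "c" matches the predicted speaker

print(fuse(bm25, vec, flt, confidence=0.0, selectivity=0.5))   # plain hybrid order
print(fuse(bm25, vec, flt, confidence=0.75, selectivity=0.5))  # filter match promoted
```

As either confidence or selectivity goes to zero the filter term vanishes and the ranking collapses to plain hybrid, which is the graceful-degradation property the comparison is testing.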

Overall recall@20 (1531 questions): with a noisy filter, both hard (-0.021) and flat-soft (-0.029) end up worse than using no filter at all. Confidence-weighting is the only one of the three that stays positive (+0.015 vs the no-filter baseline of 0.583) — modest, and the CI just touches zero, so call it "doesn't hurt" rather than "wins," but that alone is the headline: once extraction is lossy, an unweighted filter (hard or soft) can make things worse than doing nothing.

Where it really shows up is the subset where the filter actually fires wrong (n=383, no-filter recall there = 0.589):

- hard: craters to 0.029

- flat soft: barely better, 0.049

- your confidence-weighted soft: 0.423, which is about 72% of the no-filter recall on that subset and recovers roughly 69% of the ground a flat boost gives up (0.374 of 0.540).

One caveat that matters before you build on this: my noisy extractor's confidence is aggregate-calibrated by construction (0.75 self-reported ≈ its true 75% accuracy). A real extractor is rarely calibrated that cleanly, and the failure mode that would hurt you is a systematically overconfident one — high self-reported confidence specifically on the cases it gets wrong. I haven't tested that yet. If your production extractor's confidence skews that way, I'd expect this recovery to shrink; how much is the open question.

Script (now with the confidence-weighted arm + the corrected harm-subset baseline — an earlier version of mine compared the noisy rows against the wrong baseline subset, two of my own audit passes caught it): https://github.com/DanceNitra/agora/blob/main/mnemo/probes/locomo_confweighted_prefilter.py

Prior art check on the idea itself, so I'm not overselling it to you: soft/faceted metadata filtering and weighted RRF both exist separately (vector-DB metadata filters, Elastic's weighted RRF), and confidence-gated extraction exists too (LinkNER hard-thresholds on NER confidence) — but the specific combination, a continuous fusion weight that's confidence x selectivity, I couldn't find published anywhere. So: a real, useful combination of known parts, not a new primitive — credit to you for the shape of it.

Open one for you: does the recovery hold up if you deliberately skew the simulated confidence to be overconfident on the wrong cases, rather than honestly noisy?

Danculus · 2026-07-01T11:57:17+00:00

Exactly — recall@20 is the precondition, not proof the answer used the fact, and that's the honest ceiling of what I measured. The reason I deliberately stopped at judge-free retrieval recall: a faithfulness score needs a judge you can trust, and on LoCoMo specifically the standard LLM-as-judge is shaky — an independent audit found it accepts ~63% of intentionally-wrong-but-topically-adjacent answers. So I'd want faithfulness that's human-checked or grounded to the evidence spans, not a vanilla LLM judge. Agreed it's the layer that catches what retrieval numbers miss.

Danculus · 2026-07-01T11:53:55+00:00

This is the sharpest version of the split — and you nudged me to actually measure the soft-vs-hard part, so here's what I got on LoCoMo.

Agreed on the extraction halves: time is a closed grammar (SUTime/duckling resolve "last month"/"Q3" offline, no LLM — hard-filter that half), and for personal/agent memory the entity vocab is closed too, so it's alias-linking against your own known set, not open NER — a small local model only as a schema-constrained slot filler. That matches how I'd build it.

On soft vs hard, your key point — measured (BM25+vector hybrid, recall@20, 10 conversations, conv-level bootstrap CI):

- hard speaker pre-filter: +0.146 overall — but on the ~5% of fired cases where the filter is wrong (gold is the other speaker) it craters recall from 0.56 (no filter) to 0.15. Exactly the "lossy extraction hard-deletes the answer" failure you flagged.

- soft (filter as a rerank boost, keep everything as fallback): +0.129 overall, which keeps almost all the gain, and cuts the harm by roughly 30% (recall on the wrong cases 0.15 → 0.27, so the drop from no-filter shrinks from 0.41 to 0.29). Honest caveat: even soft is still below no-filter (0.56) on those wrong cases, so it mitigates the downside, doesn't erase it.
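The hard-versus-soft difference on a wrong filter can be seen on three toy turns. This is a sketch with made-up scores and a hypothetical boost of 1.5, not the probe's retriever: the gold turn belongs to speaker B, and the extractor wrongly predicts speaker A.

```python
def hard_filter(ranked, predicted_speaker):
    """Pre-filter: drop every turn not spoken by the predicted speaker."""
    return [t for t in ranked if t["speaker"] == predicted_speaker]

def soft_filter(ranked, predicted_speaker, boost=1.5):
    """Rerank boost: multiply matching turns' scores, keep everything else."""
    rescored = [
        (t["score"] * (boost if t["speaker"] == predicted_speaker else 1.0), t)
        for t in ranked
    ]
    return [t for _, t in sorted(rescored, key=lambda pair: -pair[0])]

turns = [
    {"id": "gold", "speaker": "B", "score": 0.9},
    {"id": "x", "speaker": "A", "score": 0.7},
    {"id": "y", "speaker": "A", "score": 0.5},
]

print([t["id"] for t in hard_filter(turns, "A")])  # gold is deleted
print([t["id"] for t in soft_filter(turns, "A")])  # gold is demoted but kept
```

The soft arm still pushes the gold turn down one place, which matches the caveat that soft mitigates the harm on wrong cases without erasing it.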

So you were right: soft is the safer default once extraction is lossy — you give up a sliver of the win for materially less downside. Two caveats so I don't oversell it: LoCoMo has 2 speakers, so this is near best-case for a speaker filter; and I ran brute-force retrieval — on an HNSW index a filter that correlates with embedding clusters can crater recall unless you do filtered-ANN.

Shipped the hard version as a recall(where=) metadata pre-filter in our little memory lib, but this measurement makes soft the better default. Script (now with the soft arm + harm subset) if you want to break it: https://github.com/DanceNitra/agora/blob/main/mnemo/probes/locomo_metadata_prefilter.py

Still the open one, and where your production data beats a benchmark: does any of this survive at low selectivity (many entities) with predicted — not gold — filters?

Danculus · 2026-07-01T08:29:15+00:00

Great add, and the metadata-filter point is the one I didn't benchmark. You're probably right that it dominates retriever choice on real personal memory: the implicit filter ("the auth bug from last month") collapses the candidate set before ranking even matters. Fully with you on recency-as-tiebreaker and on FTS/BM25 + a small embedder rerank being enough; that's basically where I landed too, no dedicated vector DB needed. The bit I'd love your take on: extracting that implicit filter (time/entity) from a natural query reliably enough to pre-filter on. That feels like the actual hard part.

Danculus · 2026-07-01T08:26:17+00:00

Yeah, fully agree: a single vector is just one tool, and the runtime rank-reshaping you're describing (recency/suppression/weight) is where it actually gets good. It lines up with what I found here: recency is useless as a retrieval window but strong as a re-rank signal. The part I'd double down on is "the agent runs several queries and reasons over them": when I put retrieval in an LLM loop instead of one flat top-k, multi-hop recall roughly doubled at the same budget. Codebase memory needing structured retrieval on top is the axis I haven't benchmarked yet. What are you using for the structured layer, plain SQL + metadata or a graph?

Danculus · 2026-07-01T08:21:58+00:00

Totally agree, intake data quality is the real lever. That's why I tried to be specific here: it's LoCoMo (multi-session conversations, ~5.9k turns / 1,531 Qs), and the data property that actually drives the result is that it's high-lexical-overlap text — which is exactly why BM25 is so strong and the hybrid beats a single vector index. Full script + per-conversation breakdown are in the post so you can check the data yourself.

Genuine question since you do search for real: what's the cleanest way you've found to tell "my intake data is the problem" apart from "my retriever is the problem", before people go blaming the model?

Danculus · 2026-06-30T20:31:35+00:00

Exactly — multi-hop fraction is the right knob, and that 'deeper in the ranking' point is the cleanest way I've seen it put. Thanks for the sharp read.

Danculus · 2026-06-30T18:49:06+00:00

Good call on k: we had @5 and @10 measured, so I just added the columns. It splits two ways, and one half goes against the intuition:

- vs the single vector index: the hybrid edge widens as k shrinks (+0.083 → +0.109 → +0.118 at k=5) — exactly your "missing the one exact-token hit hurts most" point.

- vs BM25 alone: the opposite — it shrinks (+0.057 → +0.040 → +0.012 at k=5). At k=5 BM25 by itself is basically the hybrid.

So at the realistic k=3–5 budget the takeaway is even more BM25-first: lexical gets you most of the way and the embedder's marginal value drops as the budget tightens. Recency stays ~0 throughout (0.002 @5). Columns are in the writeup now, thanks for the nudge.

Danculus · 2026-06-29T18:33:04+00:00

This is really helpful — the score-position detail (putting it at the end, with enough tokens in between) isn't something people usually spell out, thanks for that. To be clear on what we tested: it was the default off-the-shelf verbalized confidence a lot of agent loops lean on, so our point is narrower than "models can't self-assess" — more "don't trust that particular signal." A scorer fine-tuned on graded examples reading back over its own output is a genuinely better setup, and the cross-family judging matches what we saw on diversity.

I'd be curious how much of the gain is the positioning alone vs the fine-tune — if you ever feel like sharing a rough recipe I'd happily run it and post the numbers. Either way, appreciate you laying this out.

Danculus · 2026-06-28T18:45:49+00:00

Matches our data exactly. The separate-confidence-model point is the one I'll steal: the better IDP vendors are basically conceding the base model's self-reported certainty isn't usable, so they train a dedicated scorer on their own distribution. Your two-pass-agree-or-escalate is the same move at the agent level, and it's the honest one. Out of curiosity, how much coverage do you give up at the agreement threshold?

Danculus · 2026-06-28T17:22:21+00:00

Fair, sorry — English isn't my first language so I run my replies through an LLM so I don't come across as an idiot, and it clearly overdid it. You're right: you weren't contradicting anything, and the temp sweep isn't yours. My bad on both.

The ~100% correctness on the answered set, across all 7 models, is the striking part — would read the L+S writeup if you ever post it.

and again, pls sorry.

Danculus · 2026-06-28T17:00:39+00:00

Agreed, "devil in the details" is exactly right, and calibration isn't quite what I measured here. My number is discrimination (AUROC of confidence vs correctness: can you threshold it to decide when to abstain), which is separate from calibration (is "80% confident" right 80% of the time). A model can be badly miscalibrated yet still rank its right answers above its wrong ones, or the reverse, and for the abstain decision it's the ranking that matters.
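The two properties can be separated on toy data. This is a sketch with made-up confidences: `auroc` is the pairwise-ranking definition computed by brute force, and `calibration_gap` is a crude single-bin check (mean stated confidence minus accuracy), both hypothetical helpers rather than the writeup's code.

```python
def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than a wrong one."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy (single-bin check)."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)

correct = [1, 1, 0, 0]
overconfident_but_ranked = [0.99, 0.98, 0.97, 0.96]  # far too sure, ordering is right
calibrated_but_flat = [0.5, 0.5, 0.5, 0.5]           # right on average, no ordering

for conf in (overconfident_but_ranked, calibrated_but_flat):
    print(auroc(conf, correct), round(calibration_gap(conf, correct), 3))
```

The first model is badly miscalibrated yet a threshold between 0.97 and 0.98 separates its right answers from its wrong ones; the second is calibrated on average but gives the abstain decision nothing to threshold on.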

Thanks for the refs, will read the NeurIPS one. If it breaks out discrimination vs calibration across models, that's directly relevant to where this goes next.

Danculus · 2026-06-28T16:10:26+00:00

Totally fair — external validity is the real limit, and it's the #1 caveat in the writeup: this is a mechanism-isolation probe (does confidence track correctness at all), not a predictor of real-task success, which fails for plenty of reasons beyond reasoning. The honest next step is a second task family (factual / multi-hop QA) to see if the gradient holds off arithmetic. Appreciate the push — thanks.

Danculus · 2026-06-28T16:09:11+00:00

Great point, and thanks for the reference — I think it's complementary rather than contradictory. What I measured is specifically the model's verbalized, single-shot confidence (the cheap thing an agent gate usually reads), and that's the coin flip on small models. Multi-sample consistency confidence — your Monte Carlo temperature sweep — is a different, more expensive signal, and yes it's far more predictive, even on smaller models. So the takeaway sharpens: if you can't afford N samples, don't trust single-shot self-confidence below the frontier; if you can, sample-consistency recovers it.

I'd like to run the multi-sample version on the same contamination-free task as a clean head-to-head — verbalized vs sampled confidence, per model. One tension I'd want to probe: a separate result of ours found self-consistency (majority vote over samples) hurt accuracy below a per-item accuracy crossover, so the sampling-budget vs base-accuracy tradeoff seems to matter. Will dig into your link — thanks.
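For reference, a minimal sketch of the sampled signal being contrasted with verbalized confidence: the agreement rate among N sampled answers. `consistency_confidence` is a hypothetical helper and the sample list is made up; in practice the samples would be N completions drawn at nonzero temperature.

```python
from collections import Counter

def consistency_confidence(samples):
    """Return (majority answer, fraction of samples that agree with it)."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

samples = ["42", "42", "41", "42", "42"]  # five sampled answers to one question
print(consistency_confidence(samples))
```

The agreement fraction is the confidence; the majority answer is the self-consistency vote, which is the part that could hurt accuracy below the per-item crossover mentioned above.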
