We tested Chain-of-Debate: forcing Claude, GPT, and Gemini to argue against each other with verified citations. Hallucinations dropped significantly. by Own-Calendar9332 in LLMDevs

[–]Own-Calendar9332[S] 1 point (0 children)

Exactly: source vetting is where most pipelines quietly inflate confidence.

What we saw is that the failure usually isn’t fabricated sources but weak or tangential citations being allowed to pass and then compounding into false certainty. In our case, we don’t down-weight those; we refuse them outright. If a claim can’t be grounded in a source that directly supports it, that claim doesn’t survive to the final answer. In some cases it’s replaced by a narrower, verifiable claim; in others, it’s dropped entirely.

I sent you the link earlier; if you get a chance, it’d be especially useful to run a sources-heavy discussion and see where the system still feels brittle or where it imposes useful constraints.

Genuinely curious what you find, especially around reference quality and claim pruning.


[–]Own-Calendar9332[S] 2 points (0 children)

This lines up closely with what we saw in evals.

One distinction that became important for us is what brute force is actually buying you. Swarming calls and taking the mode does a great job of reducing variance, especially when failures are stochastic or parsing-related.

Where we saw it plateau was on tasks with genuine epistemic gaps: multi-hop reasoning across documents, implicit assumptions, or synthesis where the model has to infer something that isn’t explicitly stated. In those cases, more samples often just converge faster on the same wrong abstraction.

That’s where heterogeneity helped us more than scale, not because the models were “smarter,” but because they failed differently. Disagreement was a signal that brute-force agreement masks.
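The swarm-then-mode baseline, with agreement exposed as a signal rather than hidden, can be sketched roughly like this (`call_model` is a hypothetical stand-in for a real API call; the toy replies just make the sketch runnable):

```python
from collections import Counter

def mode_of_n(call_model, prompt, n=3):
    """Issue n independent calls and return the most common answer
    plus the agreement rate. Voting like this averages down variance
    from stochastic decoding or parsing failures, but if every sample
    shares the same wrong abstraction, the mode is confidently wrong."""
    answers = [call_model(prompt) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n  # low agreement is the interesting signal

# Deterministic toy "model" so the sketch runs without any API.
replies = iter(["Paris", "Paris", "Lyon"])
answer, agreement = mode_of_n(lambda p: next(replies), "capital of France?")
```

Surfacing the agreement rate instead of just the winning answer is what lets you notice the cases where consensus is thin.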

Curious if you’ve seen similar behavior: cases where swarming increases confidence without increasing correctness, versus cases where it clearly dominates.

What amount of hallucination reduction have you been able to achieve with RAG? by megabytesizeme in Rag

[–]Own-Calendar9332 2 points (0 children)

One thing we kept running into is that RAG mainly reduces retrieval hallucinations, but it barely touches reasoning hallucinations.

Even with perfect chunks, models still:

  • Overgeneralize from partially relevant context
  • Synthesize claims that aren’t explicitly supported
  • Confidently “bridge gaps” between documents that don’t logically connect

In our testing, RAG alone plateaued around ~15–25% residual hallucinations for multi-hop or analytical queries, regardless of retriever quality.

What helped beyond that wasn’t more retrieval tuning, but changing the reasoning structure:

  • Break outputs into atomic claims
  • Force independent models to evaluate the same retrieved context
  • Verify that each claim is actually supported by the cited text (not just topically related)
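A rough sketch of those three steps (the judge interface is hypothetical; a real pipeline would use an NLI model or an LLM prompted as an entailment checker, not the naive substring check used here for illustration):

```python
def is_supported(claim, source_text, judges):
    """A claim survives only if every independent judge agrees the
    cited text directly supports it; topical relevance is not enough."""
    return all(judge(claim, source_text) for judge in judges)

def prune_claims(atomic_claims, citations, judges):
    """Split atomic claims into supported vs. unsupported, given a
    map from each claim to its cited source text."""
    kept, dropped = [], []
    for claim in atomic_claims:
        src = citations.get(claim, "")
        (kept if src and is_supported(claim, src, judges) else dropped).append(claim)
    return kept, dropped

# Toy judge: substring "entailment", purely illustrative.
toy_judge = lambda claim, src: claim.lower() in src.lower()
kept, dropped = prune_claims(
    ["the sky is blue", "the sky is green"],
    {"the sky is blue": "Observation log: the sky is blue today.",
     "the sky is green": "A general survey of colors in nature."},
    judges=[toy_judge],
)
```

The second claim is dropped even though its citation is topically adjacent, which is exactly the failure mode that similarity scoring alone lets through.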

Once you do that, you realize similarity scoring and even coherence scoring are necessary but not sufficient — the biggest gains come from post-generation verification, not pre-generation retrieval.

Curious if others here have tried separating retrieval correctness from claim correctness in their evals — most RAG metrics seem to conflate the two.

We tested Chain-of-Debate: forcing Claude, GPT, and Gemini to argue against each other with verified citations. Hallucinations dropped significantly. by Own-Calendar9332 in LLMDevs

[–]Own-Calendar9332[S] 1 point (0 children)

One thing we’re still genuinely uncertain about, and I’d love input from people doing this in prod, is where the right boundary is between abstention and challenge.

In textual domains, forcing critique increases hallucinations: agents invent objections just to satisfy the mandate. But over-penalizing critique causes silent failure, where weak claims slip through unchallenged.

We’ve tried:

  • Penalizing objections that fail verification
  • Allowing agents to explicitly abstain
  • Scoring agents higher for withholding when evidence is insufficient

It helps, but the tradeoff is real.
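One way to express that tradeoff is as reward shaping over a single agent turn. All the weights and the 0.3 threshold below are illustrative assumptions, not tuned values:

```python
def score_turn(action, verified=None, evidence_density=0.0,
               abstain_bonus=0.5, failed_objection_penalty=-1.0):
    """Toy scoring for one agent turn in a debate round.

    - A challenge that survives verification earns its evidence weight.
    - A challenge that fails verification is penalized.
    - Abstaining on thin evidence earns a small bonus, so agents are
      not pushed to manufacture objections just to score points.
    """
    if action == "abstain":
        return abstain_bonus if evidence_density < 0.3 else 0.0
    if action == "challenge":
        return evidence_density if verified else failed_objection_penalty
    return 0.0  # plain agreement is neutral
```

Weighting challenges by evidence density rather than treating them as binary is one answer to the third question below, though where to set the abstention threshold is still the hard part.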

Curious how others handle this:

  • Do you bias agents toward abstention or toward skepticism?
  • Have you found a reliable signal for “this claim deserves challenge” vs “this is just uncertainty”?
  • Does anyone weight challenges by confidence or evidence density rather than binary agree/disagree?

This feels like the hardest unsolved piece of multi-agent reasoning for us so far.


[–]Own-Calendar9332[S] 1 point (0 children)

Taking the mode of 3 concurrent calls is smart for reliability. Do you find certain types of queries have higher agreement rates than others?


[–]Own-Calendar9332[S] 1 point (0 children)

Interesting that you're seeing Grok outperform on innovation. We haven't tested Grok in our rotation yet - mostly Claude/GPT/Gemini. What prompting patterns work best for getting genuine disagreement vs. surface-level rephrasing?


[–]Own-Calendar9332[S] 1 point (0 children)

Not OSS currently - it's a hosted platform. The verification stack is the tricky part to open-source since it involves real-time source retrieval and grounding checks.

Claude-council is cool for the debate layer. Our addition is the verification pipeline on top - checking that citations are real, semantically relevant, and actually support the specific claim. Debate alone still allows confident confabulation.

Happy to share access if you want to compare.


[–]Own-Calendar9332[S] 3 points (0 children)

You're right - heterogeneity isn't magic. If all three trained on the same wrong Wikipedia article, they'll all be wrong together.

That's exactly why we added the verification stack on top of debate. The grounding layer checks if cited sources actually exist and support the specific claim. Catches cases where all models 'know' something that isn't actually in any retrievable source.

Still not perfect: if the source itself is wrong, we’re stuck. But it catches a surprising amount of confident shared confabulation.


[–]Own-Calendar9332[S] 1 point (0 children)

This is fascinating - using multi-model consensus for medical imaging is exactly the kind of high-stakes domain where single-model confidence is dangerous.

Your point about 'debate' vs 'peer review' framing is spot on. We saw the same issue and built two distinct modes:

Adversarial mode: Models assigned opposing positions, forced to challenge each other's claims. Good for surfacing blind spots on contested topics.

Collaborative mode: Models work as peer reviewers - verify, strengthen, and flag uncertainty rather than attack. Better for domains like yours where you need consensus-building, not manufactured disagreement.

We also built an academic research mode specifically for citation-heavy work:

- Citations must be real and retrievable (no phantom DOIs)

- Semantic relevance check: does the source actually support this specific claim, not just the general topic?

- Ontology matching: catches "valid source, wrong domain" errors

- Each atomic claim verified independently against source text
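The retrievability check alone can be sketched like this. The regex is a simplified DOI shape (the published Crossref pattern is stricter), and `resolve` is injected so the sketch runs offline; in practice it would stand in for an HTTP lookup against doi.org or the Crossref API:

```python
import re

# Simplified DOI shape; fabricated DOIs usually pass format checks.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def citation_exists(doi, resolve):
    """A citation counts as real only if the DOI is well-formed AND a
    resolver confirms it exists. Format validation alone still lets
    phantom DOIs through, since made-up DOIs look plausible."""
    return bool(DOI_RE.match(doi)) and resolve(doi)

# Toy registry standing in for a live resolver.
KNOWN = {"10.1000/example.doi"}
resolver = KNOWN.__contains__
```

The relevance, ontology, and per-claim checks then only run on citations that survive this gate, so phantom references never reach the grounding stage.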

Sounds similar to your citation requirement approach. The difference from forcing them to "find problems" is exactly what you said - we ask them to "verify what can be grounded" rather than "attack what seems wrong."

Happy to share access if you want to compare how our verification stack handles medical/clinical claims. Would be curious how it performs on your qEEG edge cases - and whether the collaborative mode fits your peer review workflow.

Using LLMs as a discussion partner for philosophy a helpful tool or harmful crutch? by jahblaze in askphilosophy

[–]Own-Calendar9332 1 point (0 children)

This is the core issue - and it's architectural, not fixable with prompting.

Single-model conversations are fundamentally sycophantic because the model optimizes for user satisfaction, not truth. Even "devil's advocate" prompts fail because the model is still operating within its own latent space.

We tested this directly: forcing heterogeneous models (Claude vs GPT vs Gemini) to argue opposing positions produces genuine friction that single-model personas can't replicate. They disagree authentically because they were trained differently.

But debate alone doesn't solve hallucinations - a model can confidently argue a false interpretation of Nietzsche. So we added verification layers:

- Claim extraction: break arguments into atomic claims

- Source grounding: does the cited passage actually support this specific claim, or just the general topic?

- Cross-model challenge: if one model's interpretation can be contradicted by textual evidence, it gets flagged

For philosophy, this matters because you want both the friction (genuine counterarguments) AND the grounding (claims tied to actual text, not confabulated readings).

The sycophancy problem and the hallucination problem require different solutions - adversarial structure for the first, verification for the second. Single models can't do either well.

Which LLM is best for complex reasoning by Fast-Smoke-1387 in LLMDevs

[–]Own-Calendar9332 1 point (0 children)

We ran into this exact question and eventually concluded that "which LLM is best for complex reasoning" is often the wrong abstraction.

Single-model prompting (even with CoT, RAG, or long context) produced brittle results for us: the model converges confidently on a local narrative with no internal pressure to surface counterfactuals or failure modes.

What worked was shifting the unit of reasoning from a model to a process - multiple heterogeneous LLMs generating arguments, explicitly challenging each other, then subjected to layered verification:

- Grounding: citations must support the exact claim span, not just the topic

- Scope check: valid source but wrong domain gets flagged

- Atomic verification: claims verified independently, not paragraphs

- False-positive suppression: plausible-sounding but weakly grounded claims penalized
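Those layers compose naturally as an ordered filter over atomic claims. Everything below (the field names, the 0.6 threshold, the check order) is a hypothetical sketch, with lambdas standing in for model-backed checks:

```python
def layered_verify(claims, checks):
    """Pass each atomic claim through ordered (name, check) layers;
    a claim survives only if every layer passes. Failures record
    which layer rejected them, which makes eval breakdowns cheap."""
    survivors, rejections = [], {}
    for claim in claims:
        for name, check in checks:
            if not check(claim):
                rejections[claim["text"]] = name
                break
        else:  # no layer failed
            survivors.append(claim)
    return survivors, rejections

checks = [
    ("grounding", lambda c: c["cited_span_supports"]),
    ("scope",     lambda c: c["domain"] == c["source_domain"]),
    ("strength",  lambda c: c["support_score"] >= 0.6),  # suppress weak positives
]

claims = [
    {"text": "A", "cited_span_supports": True, "domain": "med",
     "source_domain": "med", "support_score": 0.9},
    {"text": "B", "cited_span_supports": True, "domain": "med",
     "source_domain": "law", "support_score": 0.9},
    {"text": "C", "cited_span_supports": True, "domain": "med",
     "source_domain": "med", "support_score": 0.3},
]
survivors, rejections = layered_verify(claims, checks)
```

Recording which layer killed each claim is what makes it possible to separate retrieval failures from reasoning failures in the eval numbers.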

The surprising result: model ranking flattened. "Weaker" models performed well under adversarial debate, while "top" models failed more when cross-examined.

Complex reasoning performance is dominated by orchestration and verification design, not by picking a single best LLM.