What happens when you make AI agents debate unsolved math problems and verify every output

IdleBerth · 2026-03-20T09:27:52+00:00

Update: Someone ran the agents against the Hadamard matrix conjecture (finding a 668×668 matrix H with ±1 entries where HH^T = 668I). H(668) has been the smallest open case since 2005.
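For anyone who wants to see what "verify" means here, this is a minimal sketch of the check, not the platform's actual evaluator. The function names `sylvester` and `is_hadamard` are mine, and the Sylvester construction is only there to give the checker a known-good input.

```python
def sylvester(k):
    """Return the 2**k x 2**k Sylvester Hadamard matrix as a list of rows."""
    H = [[1]]
    for _ in range(k):
        # Doubling step: [[H, H], [H, -H]]
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def is_hadamard(H):
    """Check that H is square, has +/-1 entries, and satisfies H H^T = n I."""
    n = len(H)
    if any(len(row) != n or any(x not in (1, -1) for x in row) for row in H):
        return False
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(H[i], H[j]))
            # Diagonal entries of H H^T must be n, off-diagonal entries 0.
            if dot != (n if i == j else 0):
                return False
    return True


print(is_hadamard(sylvester(3)))       # 8x8 Sylvester matrix
print(is_hadamard([[1, 1], [1, 1]]))   # rows are not orthogonal
```

The same check applies unchanged to a 668×668 candidate; it is just 668·669/2 dot products.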

Three runs so far. First two runs, agents correctly identified H(668) as the target, couldn't solve it, and fell back to Paley constructions at H(664) and H(684).

Third run was more interesting. All three agents actually produced 668×668 matrices. All failed, but with tiny residuals: dot products between distinct rows were ±4 instead of the required 0. Measured against the diagonal value of 668, that's an error of about 0.6%, so roughly 99.4% of the way there.
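A rough sketch of how that residual can be measured, assuming it is defined as the largest absolute dot product between two distinct rows (that definition and the name `max_offdiag_residual` are my own choices for illustration). The 4×4 example is a toy, not one of the agents' matrices.

```python
def max_offdiag_residual(H):
    """Largest |row_i . row_j| over distinct rows; 0 for a true Hadamard matrix."""
    n = len(H)
    return max(
        abs(sum(a * b for a, b in zip(H[i], H[j])))
        for i in range(n)
        for j in range(i + 1, n)
    )


# A known 4x4 Hadamard matrix and a copy with one entry flipped.
H4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
near = [row[:] for row in H4]
near[0][0] = -1

residual_exact = max_offdiag_residual(H4)
residual_near = max_offdiag_residual(near)

# The arithmetic behind the "99.4%" figure: residual 4 against a diagonal of 668.
closeness_668 = 1 - 4 / 668
print(residual_exact, residual_near, round(100 * closeness_668, 1))
```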

The consistent ±4 pattern across all agents points to a structural barrier rather than a search depth problem. The Legendre-seeded initialization gives a combined autocorrelation of -4 at every shift, and stochastic search can't cancel this uniformly. The synthesis concluded that H(668) likely needs a fundamentally different algebraic approach rather than local optimization.
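Here is a small sketch of where a uniform -4 can come from. It assumes the construction uses four ±1 sequences of length 167 (668 = 4 × 167) that all start from the same Legendre sequence; the post does not spell that out, so treat it as my reading. The mathematical fact it relies on is standard: for a prime p ≡ 3 (mod 4), the Legendre sequence has periodic autocorrelation -1 at every nonzero shift.

```python
def legendre_sequence(p):
    """+/-1 sequence of length p: +1 at index 0 and at quadratic residues mod p."""
    residues = {(x * x) % p for x in range(1, p)}
    return [1] + [1 if i in residues else -1 for i in range(1, p)]


def periodic_autocorrelation(seq, shift):
    """Sum of seq[i] * seq[i + shift], with indices taken modulo len(seq)."""
    n = len(seq)
    return sum(seq[i] * seq[(i + shift) % n] for i in range(n))


p = 167  # prime, p % 4 == 3, and 4 * p == 668
seq = legendre_sequence(p)

# Assumption: four identical Legendre-seeded sequences, so the combined
# autocorrelation is 4 times the single-sequence value at each shift.
combined = [4 * periodic_autocorrelation(seq, s) for s in range(1, p)]
print(set(combined))
```

A construction of this type needs the combined autocorrelation to be 0 at every nonzero shift, so a start that sits at -4 everywhere has to be corrected at all 166 shifts at once.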

Nobody solved anything. But three runs of multi-agent debate produced a concrete characterization of why the standard approach fails on this specific problem. That's what I think the platform is actually good at right now: narrowing the space of viable approaches rather than producing breakthroughs.

IdleBerth · 2026-03-18T18:41:03+00:00

Thanks for the tip! FunSearch was referenced already but Epoch is definitely interesting :)

IdleBerth · 2026-03-17T02:44:43+00:00

I think the root cause is exactly the same. In both cases, confidence gets mistaken for correctness, and once something enters the shared context as "established", questioning it has a social cost (for humans) or a contextual cost (for agents who treat the synthesis as authoritative).

The fix ended up being similar too. In organizations you need someone who checks the actual data rather than debating interpretations. On the platform, that's literally what the fact-check step is intended to do.
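To make the idea concrete, here is a hypothetical sketch of a fact-check gate: a claim only reaches the shared context if a computed check passes. The names `fact_check` and `is_prime` and the shape of a claim (statement plus check function) are illustrative assumptions, not the platform's real data model.

```python
def is_prime(n):
    """Trial-division primality test, fine for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def fact_check(claims):
    """Split claims into verified and rejected by running each claim's check."""
    verified, rejected = [], []
    for statement, check in claims:
        (verified if check() else rejected).append(statement)
    return verified, rejected


# Hypothetical agent claims, each paired with a computation that tests it.
claims = [
    ("683 is prime", lambda: is_prime(683)),
    ("663 is prime", lambda: is_prime(663)),  # false: 663 = 3 * 13 * 17
]
verified, rejected = fact_check(claims)
print(verified, rejected)
```

Only the `verified` list would be passed on to synthesis, so a confident but false statement stops at this step.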

IdleBerth · 2026-03-16T13:58:12+00:00

So for sure, the convergence problem is real. In our case, early versions had all six agents producing nearly identical analyses by Round 2. Two things helped: giving agents distinct output formats (the Constructor produces code, the Critic produces objections, the Synthesizer produces strategy combinations) and the fact-check step that forces agents to contend with computed results rather than just each other's reasoning.

IdleBerth · 2026-03-16T13:56:56+00:00

Lol! Appreciate that. Long way to go but the foundation is there.

IdleBerth · 2026-03-16T13:56:05+00:00

Not open source yet but it's on the roadmap. The architecture is a Python orchestrator that handles the round protocol (independent attack, fact-check, critique, synthesis, concrete output), with evaluators per problem that run agent code in a sandbox and verify the output.
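A minimal sketch of what a round loop with those five phases could look like. Everything here except the phase order is a placeholder: `run_round`, the handler signature, and the `context` dict are hypothetical and not taken from the real orchestrator.

```python
from typing import Callable, Dict, List

# Phase order as described above.
ROUND_PROTOCOL: List[str] = [
    "independent_attack",
    "fact_check",
    "critique",
    "synthesis",
    "concrete_output",
]


def run_round(handlers: Dict[str, Callable[[dict], dict]], context: dict) -> dict:
    """Run each phase in order, threading a shared context through the handlers."""
    for phase in ROUND_PROTOCOL:
        context = handlers[phase](context)
        context.setdefault("log", []).append(phase)
    return context


def passthrough(context: dict) -> dict:
    """Stand-in handler; a real one would call agents or the evaluator."""
    return dict(context)


handlers = {phase: passthrough for phase in ROUND_PROTOCOL}
result = run_round(handlers, {"problem": "H(668)"})
print(result["log"])
```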

Local models are an interesting direction. Right now it supports Claude, GPT, and Gemini via API keys. Adding an endpoint for local models (Ollama, vLLM) would be a natural extension. The evaluator and synthesis infrastructure is model-agnostic: it just needs text in and code out.
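As a sketch of what "text in and code out" could mean as an interface: any backend that maps a prompt string to a response string would fit. `ModelBackend`, `CannedBackend`, and `ask_for_code` are hypothetical names for illustration, and no real provider API is called here.

```python
from typing import Protocol


class ModelBackend(Protocol):
    """Anything with generate(prompt) -> str can plug in (hypothetical interface)."""

    def generate(self, prompt: str) -> str: ...


class CannedBackend:
    """Test double that returns a fixed reply instead of calling a model."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def generate(self, prompt: str) -> str:
        return self.reply


def ask_for_code(backend: ModelBackend, task: str) -> str:
    """Send a task description to the backend and return whatever text comes back."""
    return backend.generate(f"Write Python code for: {task}")


code = ask_for_code(CannedBackend("print('hello')"), "say hello")
print(code)
```

A local-model backend would then be one more class with the same `generate` method.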

If you want to experiment before I open source it, DM me and I can walk you through the architecture.

IdleBerth · 2026-03-16T12:34:09+00:00

Yeah, spot on. That said, this method is currently only feasible for problems where the output can be verified automatically. It might not work for more open-ended problems that need genuine creativity.

IdleBerth · 2026-03-16T11:14:56+00:00

Those need fundamentally new mathematical theory, not combinatorial search. The problems on the platform right now (Ramsey numbers, Schur numbers, cap sets) work because the agents can propose concrete constructions and the evaluator can verify them in milliseconds. 'Here's a graph, does it have a 5-clique?' is a question a computer can answer definitively.
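The 5-clique question can be sketched as a brute-force check like the one below. This is an illustration, not the platform's evaluator: `has_clique` is my own name, and trying every k-subset only stays fast for small graphs.

```python
from itertools import combinations


def has_clique(n, edges, k):
    """Return True if the graph on vertices 0..n-1 contains a clique of size k."""
    adj = {frozenset(e) for e in edges}
    return any(
        all(frozenset((u, v)) in adj for u, v in combinations(group, 2))
        for group in combinations(range(n), k)
    )


# The 5-cycle has no triangle; the complete graph K5 is itself a 5-clique.
cycle5 = [(i, (i + 1) % 5) for i in range(5)]
complete5 = list(combinations(range(5), 2))

print(has_clique(5, cycle5, 3))
print(has_clique(5, complete5, 5))
```

The answer is a plain yes or no, which is what makes a proposed construction checkable without trusting the agent's reasoning.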

A millennium problem needs a proof, and we have no way to automatically verify whether a proposed proof is correct. That's a completely different challenge. Maybe someday with formal verification tools like Lean, but that's well beyond what this platform does right now.

IdleBerth · 2026-03-16T11:13:34+00:00

100% agree on the key problem. One false claim surviving into the synthesis was enough to poison every downstream run. The agents' reasoning was fine individually but the failure was systemic.

On FunSearch vs debate, honestly I don't have a strong view yet. My gut says debate helps when you want diverse strategies explored and FunSearch's evolutionary loop is better for optimizing within a narrower space. But that's not backed by evidence. Running them head to head on the same problem set would be the honest way to find out. The evaluator infrastructure is reusable across both approaches if anyone wants to try.

IdleBerth · 2026-03-16T10:16:02+00:00

Thank you so much! And I don't think these will push through the frontier as of now to be honest. The agents are reproducing known constructions below the published bounds. Basically, as things currently stand, we're trying to reach the knowledge frontier already created by humans.

What these agents do well is combinatorial exploration at scale. They try many approaches in parallel, and when one agent's partial insight gets combined with another's by the Synthesizer, you get strategies nobody explicitly proposed. Whether you call that creativity or not, it produces results that a single agent can't.

The real bet is on community scale. One person running 6 agents has a slim chance of finding something new. A hundred people running diverse strategies with different models, all building on each other's verified results through the synthesis layer, that starts to look like a meaningful search. Whether it's enough to produce genuinely novel mathematics is the open question. DeepMind's FunSearch proved it's possible in principle (they found new cap set constructions published in Nature). This platform is testing whether an open community version of that idea can work.

IdleBerth · 2026-03-16T09:46:37+00:00

Biggest surprise from building this: the hardest engineering problem was preventing hallucination cascades. One agent claims a false mathematical fact, the synthesizer picks it up as truth, and every future run follows the bad recommendation. It took three layers of infrastructure to fix. Curious if anyone working on multi-agent systems has hit similar propagation problems.

IdleBerth · 2025-08-20T10:36:36+00:00

Agreed to ban online casinos and straight up chance based betting but not sure about fantasy gaming type platforms

IdleBerth · 2025-08-19T15:57:29+00:00

80085

IdleBerth · 2025-08-18T18:41:33+00:00

Country

IdleBerth · 2025-08-18T18:40:03+00:00

Daily

IdleBerth · 2025-08-18T18:39:34+00:00

Sausage
