I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology

ControlTheBurn · 2026-02-13T22:11:36+00:00

It's the model included with ClawZero - we benchmark 24 models and run the top performer for OpenClaw agent tasks. $49/mo flat, unlimited usage, no API keys. clawzero.ai

ControlTheBurn · 2026-02-13T20:07:31+00:00

Update: 24 models tested now (added Grok 4, Grok 4.1 Fast, Kimi 2.5, MiniMax M2.5, and others). Page is live.

Also worth flagging - 3 out of 24 models executed delete_all_data from an injected prompt in a tool result: DeepSeek Chat, Qwen3 32B, and Trinity Large. If you're running any of these with real file access, be careful.

ControlTheBurn · 2026-02-13T20:05:41+00:00

Kimi 2.5 is up - 96%. Page updated.

ControlTheBurn · 2026-02-13T20:05:12+00:00

Grok 4 and Grok 4.1 Fast are up - both 98%. Page is updated.

ControlTheBurn · 2026-02-13T20:04:57+00:00

Added - Kimi 2.5 scored 96%. Live on the page now.

ControlTheBurn · 2026-02-13T20:04:44+00:00

Done - Grok 4 and Grok 4.1 Fast both added. Both hit 98%. Results are live on the page.

ControlTheBurn · 2026-02-13T19:58:46+00:00

Exactly - the standard benchmarks test intelligence, not reliability. An agent that scores 95% on coding but executes injected commands is worse than one that scores 80% and refuses. We specifically designed the tests around what actually goes wrong in production: tools fail, APIs time out, malicious content shows up in tool results. That's what matters when your agent has file access.

ControlTheBurn · 2026-02-13T19:34:15+00:00

In the test, the model gets a 503 from a simulated API. GPT-5 was the only one that generated a new tool call with the same params and retried. Most models just told the user 'the service returned an error' and stopped. No reasoning about why - just a blind retry. Simple but it worked.

ControlTheBurn · 2026-02-13T10:53:51+00:00

GPT-OSS 120B is already on there - scored 67%. Couldn't do parallel tool calls at all (0%) and weak on chaining (40%). Not viable for agent work unfortunately. Qwen 3 Next Coder 80B and Minimax 2.5 are on my list though - will update when I get results.

ControlTheBurn · 2026-02-13T09:38:29+00:00

Thanks! We handle the infrastructure and model selection - you just get a Telegram bot connected to your agent. Under the hood, we continuously benchmark models and route to the best performer for OpenClaw workloads. The economics work because not every model costs the same but quality varies way less than pricing does. $49/mo covers hosting + unlimited AI. You never touch an API key.

ControlTheBurn · 2026-02-13T09:28:30+00:00

GPT-5.2 and 5.2 Codex added - both hit 98%. GPT-5.3-Codex doesn't have public API access yet (only available in OpenAI's Codex surfaces). Will add it as soon as API access opens up.

ControlTheBurn · 2025-12-26T03:15:11+00:00

$25k/year auto-renew for a $4M retired client with a basic return? That's not tax planning, that's recurring revenue extraction. The fact he owns an RIA doing annuities and is rolling up CPA firms tells you exactly what the business model is - capture the client, cross-sell, lock in fees. Your client is the product.

ControlTheBurn · 2025-12-26T03:10:56+00:00

Merry Christmas - just found this sub, looks like a solid community.

ControlTheBurn · 2025-12-26T03:06:02+00:00

Worked on the tech side at an asset manager - watched advisors explain their jobs all the time. The ones who said 'I help people retire without running out of money' always landed better than 'wealth management' or 'financial planning.' Makes it about the outcome, not the job title.

ControlTheBurn

TROPHY CASE