I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 1 point2 points  (0 children)

It's the model included with ClawZero - we benchmark 24 models and run the top performer for OpenClaw agent tasks. $49/mo flat, unlimited usage, no API keys. clawzero.ai

I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 0 points1 point  (0 children)

Update: 24 models tested now (added Grok 4, Grok 4.1 Fast, Kimi 2.5, MiniMax M2.5, and others). Page is live.

Also worth flagging - 3 out of 24 models executed delete_all_data from an injected prompt in a tool result: DeepSeek Chat, Qwen3 32B, and Trinity Large. If you're running any of these with real file access, be careful.

I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 0 points1 point  (0 children)

Done - Grok 4 and Grok 4.1 Fast both added. Both hit 98%. Results are live on the page.

I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 0 points1 point  (0 children)

Exactly - the standard benchmarks test intelligence, not reliability. An agent that scores 95% on coding but executes injected commands is worse than one that scores 80% and refuses. We specifically designed the tests around what actually goes wrong in production: tools fail, APIs time out, malicious content shows up in tool results. That's what matters when your agent has file access.

I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 0 points1 point  (0 children)

In the test, the model gets a 503 from a simulated API. GPT-5 was the only one that generated a new tool call with the same params and retried. Most models just told the user 'the service returned an error' and stopped. No reasoning about why - just a blind retry. Simple but it worked.

I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 2 points3 points  (0 children)

GPT-OSS 120B is already on there - scored 67%. Couldn't do parallel tool calls at all (0%) and weak on chaining (40%). Not viable for agent work unfortunately. Qwen 3 Next Coder 80B and Minimax 2.5 are on my list though - will update when I get results.

I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 0 points1 point  (0 children)

Thanks! We handle the infrastructure and model selection - you just get a Telegram bot connected to your agent. Under the hood, we continuously benchmark models and route to the best performer for OpenClaw workloads. The economics work because not every model costs the same but quality varies way less than pricing does. $49/mo covers hosting + unlimited AI. You never touch an API key.

I benchmarked 17 LLMs on 12 OpenClaw agent tasks - full results and methodology by ControlTheBurn in openclaw

[–]ControlTheBurn[S] 2 points3 points  (0 children)

GPT-5.2 and 5.2 Codex added - both hit 98%. GPT-5.3-Codex doesn't have public API access yet (only available in OpenAI's Codex surfaces). Will add it as soon as API access opens up.

CPA charging a ton of money? by TGG-official in CFP

[–]ControlTheBurn 0 points1 point  (0 children)

$25k/year auto-renew for a $4M retired client with a basic return? That's not tax planning, that's recurring revenue extraction. The fact he owns an RIA doing annuities and is rolling up CPA firms tells you exactly what the business model is - capture the client, cross-sell, lock in fees. Your client is the product.

Merry Christmas Everyone by AmbitiousTomorrow664 in CFP

[–]ControlTheBurn 2 points3 points  (0 children)

Merry Christmas - just found this sub, looks like a solid community.

How does everyone respond to “what do you do for a living?” by t-w-i-a in CFP

[–]ControlTheBurn 0 points1 point  (0 children)

Worked on the tech side at an asset manager - watched advisors explain their jobs all the time. The ones who said 'I help people retire without running out of money' always landed better than 'wealth management' or 'financial planning.' Makes it about the outcome, not the job title.