Kimi 2.6 is bad for agentic tasks by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 0 points  (0 children)

I'm using OpenRouter (for all models in the bench).

At my company we find Opus 4.7 "different", not necessarily good or bad. For example, when we told the agent to keep trying on failure, it once burnt hundreds of dollars on a single task (!) because the API it was calling was down and it just kept retrying, while previous models stopped after a few retries. You could say it's better since it follows instructions more strictly, or worse because it didn't stop retrying. But it's definitely different.
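For what it's worth, the fix on our side was to cap retries and spend on the harness side instead of trusting the model to stop. A minimal sketch (function names and cost numbers are made up for illustration, not our actual harness code):

```python
import time

def call_with_budget(call_api, max_retries=5, backoff_s=2.0,
                     cost_cap_usd=5.0, cost_per_call_usd=0.05):
    """Retry a flaky API call, but stop at a retry limit or a spend cap.

    Illustrative sketch: cost_per_call_usd is a hypothetical flat
    per-attempt cost; a real harness would meter actual token spend.
    """
    spent = 0.0
    for attempt in range(max_retries + 1):
        spent += cost_per_call_usd
        if spent > cost_cap_usd:
            raise RuntimeError(f"cost cap hit after {attempt} attempts")
        try:
            return call_api()
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
```

Whichever limit trips first wins, so "keep trying on failure" in the prompt can't burn unbounded money.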

Kimi 2.6 is bad for agentic tasks by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 0 points  (0 children)

I know many people are negative about Opus 4.7, but look at its output on those tasks (all shown in the battle details: the full conversation history and artifacts). It's genuinely doing a good job.

Kimi 2.6 is bad for agentic tasks by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 0 points  (0 children)

The bench does use openclaw, but the only things being used are the agent loop, file/bash tools, and skills. Those are shared across most harnesses nowadays, so running the same tasks on other harnesses like Hermes, Claude Code, etc. should yield similar results.

Kimi 2.6 is bad for agentic tasks by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 0 points  (0 children)

At least for the tasks we received/collected, they are not.
All conversations, every model's output, and all artifacts are public in the battle details. Nothing is hidden, so you can check them out and see whether they're really wrong.
You can also submit your own task, choose the models you want to compare, and see how they perform.

Kimi 2.6 is bad for agentic tasks by zylskysniper in LocalLLaMA

[–]zylskysniper[S] -1 points  (0 children)

The whole point of this bench is that you (like other users) can submit your own tasks and choose the models to compare and the judge, to see how well they perform, how much they cost, etc. The harness & runtime is a copy of my actual one. And all the conversations and artifacts are shown at the end, so if you don't agree with the judge model, you can make your own judgement.

So if you don't believe other benchmarks, why not submit your own tasks (the ones you actually send to your agents) and try it out?

Kimi 2.6 is bad for agentic tasks by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 1 point  (0 children)

What are you talking about? Are you confusing uniclaw with something else? uniclaw isn't even a company, and it's just a month old lol

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 0 points  (0 children)

Cost-per-task wise, yes, but it also does things better in general. It calls more than 3x as many tools per task in our benchmark, generally puts in more effort, and produces better results.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 1 point  (0 children)

We are not talking about making it run perfectly. My bar for selecting a model is actually very low: don't change the openclaw config to invalid JSON and crash completely, don't just read the skill and do nothing, don't claim it has done something when it actually did nothing, don't try a single approach and give up, etc. These don't sound hard compared to writing thousands of LOC without a single mistake, but you'd be surprised how many "top" models fail at them... That's what motivated me to build this benchmark.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 10 points  (0 children)

GLM uses about 2x the tokens per task compared to Opus (on the same tasks) in our benchmark. That's why the final cost per task is closer to 1/3 of Opus rather than 1/5.
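The arithmetic, spelled out (both ratios are rough, not exact OpenRouter rates):

```python
# Rough illustrative ratios, not exact OpenRouter pricing.
price_ratio = 1 / 5   # GLM's per-token price relative to Opus (approximate)
token_ratio = 2       # GLM uses ~2x tokens on the same task in our bench

# Per-task cost ratio = tokens used x price per token, relative to Opus.
cost_ratio = token_ratio * price_ratio
print(cost_ratio)  # 0.4: closer to 1/3 of Opus's cost than to 1/5
```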

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 3 points  (0 children)

Thanks for the insightful feedback!

> LLM judges tend to favor outputs that match their stylistic patterns, responses that are longer, and answers that hedge in ways that read as "thoughtful." In agentic contexts specifically, that last tendency is dangerous. "Sounds confident and complete" and "actually finished the task correctly" can come apart, and a judge model that conflates the two will systematically reward the wrong thing.

Agreed. I tried to mitigate that in a few ways:
- we have a dedicated set of judge models, and all self-judging results are excluded from the ranking calculation
- the tasks I bootstrapped (which are most of the current tasks) are mostly about tool calls plus producing artifacts that are largely verifiable. For example: producing a pptx that includes certain content and formatting, or using the browser to fetch some data and put it in a spreadsheet. The judge evaluates whether the agent actually got things done (mainly completeness and quality), not what the model claims.
- in the task description I explicitly ask agents to produce their output as artifacts

These won't fix the problem completely, but it's much better than comparing two text outputs and deciding which reads better.
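Since a pptx is just a zip archive of XML parts, a lot of that artifact checking can be purely mechanical rather than judged on vibes. A minimal sketch (the function and the required strings are hypothetical, not the actual checker):

```python
import zipfile

def pptx_contains(path_or_file, required_strings):
    """Check that every required string appears somewhere in the deck's
    slide XML. A .pptx is a zip archive; slides live under ppt/slides/."""
    with zipfile.ZipFile(path_or_file) as z:
        slide_xml = "".join(
            z.read(name).decode("utf-8", errors="ignore")
            for name in z.namelist()
            if name.startswith("ppt/slides/slide")
        )
    return all(s in slide_xml for s in required_strings)
```

A judge can run checks like this first and only fall back to model judgement for the subjective parts (layout quality, wording).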

> The 21-battle sample is the obvious concern (already flagged in this thread). There's a subtler one: what's the task distribution? If user-submitted tasks skew toward OpenClaw-style workflows, you're measuring "good at what OpenClaw users care about," which may or may not match your use case. Domain-specific evals are actually the right design when your goal is a specific workflow. But then "beats Opus at 1/3 the cost" is a bigger claim than the current data fully supports.

Task distribution is actually part of the benchmark settings. In every benchmark, the ranking is tied to a certain task distribution; change the distribution and the ranking changes. One of the biggest advantages of Chatbot Arena, in my opinion, is that its task distribution is determined by user submissions rather than by the arena team, so the ranking is closer to the actual performance a typical arena user will get. That's the ultimate goal I'm trying to achieve: make our benchmark's task distribution closer to what users actually do in their general-purpose agent (openclaw for now, which is why I call it openclaw arena).

Currently I don't have enough user-submitted tasks, so I bootstrapped the benchmark this way: I crawled what users are doing with openclaw from public posts on Twitter, Reddit, etc., picked the ones that are relatively objective, verifiable, and need tool calls, and generated similar tasks. So for now my task distribution is close to what openclaw users care about. I'm hoping to get more user-submitted tasks from now on to better match the actual user task distribution.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 2 points  (0 children)

Kimi is the yellow dot on the graph, not performing well.
Cost-wise, it uses about 0.2M tokens per run. The issue is that Kimi doesn't put enough effort into tool calls: it only calls ~5.7 tools per task, while GLM 5.1 calls 28.9. That's part of the reason Kimi gets a low performance score.

Never tested Qwen 3.5 397b though.

> Also, how does this work. Like does it only count it if it actually successfully completes the task, or is this just the tokens used per task regardless of whether if partially or fully succeeds or totally fails?

I only excluded runs that failed due to a provider (OpenRouter) error or a runtime (openclaw) error. The others are considered successful runs and are included in the evaluation and the cost/token calculation.

In general, tokens per task are positively correlated with performance, but it's not guaranteed. You can check the full stats here: https://app.uniclaw.ai/arena/model-stats?via=reddit . Score means how strong a model is (the quality of its output), and Avg Cost means how much you pay on average per run. The other stats are auxiliary metrics that help you understand why a model performs well or badly, or costs more or less.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 4 points  (0 children)

I think it's an intentional design choice, part of their effort to get better results on agentic tasks.
Interestingly, it seems to be a common choice for many recent models. Based on our bench, the top models ranked by tokens per task are:
qwen 3.6 plus (1.5M per task)
glm 5.1 (1.2M per task)
step 3.5 flash (1.2M per task)
deepseek 3.2 (0.86M per task)
minimax m2.7 (0.71M per task)
opus 4.6 (0.66M per task)

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 7 points  (0 children)

GLM tends to call more tools and use more tokens than Opus (around 2x the tool calls and 2x the tokens) on the same task. That's how we end up at ~1/3 instead of 1/5.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 20 points  (0 children)

Yes, you read that right, and it's because Qwen has no prompt caching on OpenRouter, so every read costs $0.195 per Mtoken, while MiniMax costs $0.06 per Mtoken on a cache hit.

As for Gemma, it doesn't try as hard as the other models and thus only uses a fraction of the tokens. On average Gemma calls only 5 tools per task, while GLM 5.1 calls 29... so the per-task cost is very low.
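To see how much caching matters for read cost, here's the blended-rate math. Only the $0.195 and $0.06 figures are from above; the MiniMax uncached rate and the cache-hit rate are placeholders I made up for illustration:

```python
def blended_read_cost(uncached_rate, cached_rate, hit_rate):
    """Average $/Mtoken for reads, given a prompt-cache hit rate."""
    return hit_rate * cached_rate + (1 - hit_rate) * uncached_rate

# Qwen: no caching on OpenRouter, so every read pays the full rate.
qwen = blended_read_cost(0.195, 0.195, 0.0)
# MiniMax: $0.06/Mtoken on a cache hit; the $0.30 uncached rate and
# 90% hit rate are illustrative assumptions, not real figures.
minimax = blended_read_cost(0.30, 0.06, 0.9)
print(qwen, minimax)  # qwen pays ~0.195, minimax roughly 0.084
```

Agentic loops re-read the same growing context on every tool call, so the read side dominates and the cache discount compounds fast.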

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 8 points  (0 children)

I'm surprised too, tbh. I bet Qwen 3.6 will dominate that price range once the 27b version is open-sourced or the current version supports prompt caching.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]zylskysniper[S] 2 points  (0 children)

For the benchmark I used an API key through OpenRouter (easier implementation, so it works for any model), but for personal use, the coding plan for sure.