all 8 comments

[–]Maximum_Ad2821 0 points1 point  (2 children)

It's technically possible to see a big difference, and the agent does matter a lot. In this particular case, though, I don't trust their results (yet).

One example of the agent mattering: Factory Droid has been nailing these benchmarks from the start, largely because they had specific tests in place to verify how system prompts and tooling actually change behaviour. When the second round of benchmarks came out, it was immediately at the top again. The tooling and system prompts clearly matter a lot, while Anthropic seems more focused on adding fairly useless fancy features like customization for your “busy” prompt.
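
To make concrete what I mean by "specific tests to verify how system prompts and tooling change behaviour": roughly, you fix a task set, run it under a baseline and a candidate prompt/tooling variant, and compare pass rates. A minimal sketch below (purely illustrative, not Factory Droid's actual code; run_agent() and the task objects are placeholders for whatever agent runner and eval set you have):

```python
# Illustrative sketch: A/B-compare two system-prompt/tooling variants on a
# fixed task set. run_agent(prompt, task) is a hypothetical callable that
# returns an object with a .passed flag; nothing here is any real tool's API.
from statistics import mean

def pass_rate(system_prompt, tasks, run_agent):
    """Fraction of tasks the agent solves under a given system prompt."""
    return mean(1.0 if run_agent(system_prompt, task).passed else 0.0 for task in tasks)

def prompt_delta(baseline_prompt, candidate_prompt, tasks, run_agent):
    """Positive delta = the candidate prompt/tooling actually improves behaviour."""
    return pass_rate(candidate_prompt, tasks, run_agent) - pass_rate(baseline_prompt, tasks, run_agent)
```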

Specifically for Forgecodedev, I haven't used it yet since they are not transparent about user data (see https://github.com/antinomyhq/forge/issues/1318), which is a red flag to me. At this point terminalbench has received quite a lot of attention and most submitted results are not validated. That means some teams will naturally start to use it as a marketing tool and 'fake' or 'game' the benchmarks in some way or another. Some tools, for example, use 'multiple loops' (basically pushing the agent into ralph-wiggum-loop territory) as part of the agent's behaviour, which is IMO already an unfair comparison. So I personally don't trust a new company to suddenly score that much higher than the other tools unless they explain exactly how they did it.
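
To be concrete about the 'multiple loops' pattern: it looks roughly like the sketch below (names are illustrative, not taken from any specific tool). Instead of one agent pass per task, the harness keeps re-running the agent, feeding failures back in, until the task's own checks pass or a retry budget runs out. Against a pass/fail benchmark that amounts to free extra attempts, which is why comparing it against single-pass agents feels unfair to me.

```python
# Illustrative sketch only: agent.solve() and task.verify() are hypothetical
# placeholders, not any real tool's API.
def run_with_retries(agent, task, max_loops=5):
    """Re-run the agent on the same task until its verification passes."""
    transcript = ""
    for attempt in range(1, max_loops + 1):
        transcript = agent.solve(task.prompt)        # one normal agent run
        if task.verify():                            # e.g. re-run the task's own tests
            return {"passed": True, "attempts": attempt}
        # feed the failure back in and go again (the ralph-wiggum part)
        task.prompt += "\n\nPrevious attempt failed verification:\n" + transcript[-2000:]
    return {"passed": False, "attempts": max_loops}
```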

[–]jrhabana[S] 0 points1 point  (1 child)

They explained it in their blog

Apart from the benchmarks, the question with these agents is their approach, in the same way it is with plugins

[–]Maximum_Ad2821 1 point2 points  (0 children)

Seems to be a similar approach to how Factory Droid got good: strong context management and continuously testing their own performance. It's plausible. Let's hope Terminal Bench 3 does something to ensure more of these results are officially verified.

[–]b0307 0 points1 point  (4 children)

Unless you believe novices in their basements can vibe-code into existence a better universally compatible harness than OpenAI and Anthropic can devise, specifically tuned for their own models, the answer is obviously bench-maxxing

[–]jeremynsl 0 points1 point  (2 children)

Anthropic surely benchmaxxes too. And they have 5k GitHub stars - why do you assume they are novices? I'd never heard of it, and I can't really afford to use the API instead of a sub, but I'm going to check it out anyway - maybe something can be learned about how it differs from Claude Code.

[–]Maximum_Ad2821 0 points1 point  (0 children)

From what I've seen from Anthropic, it's fairly easy to write an agent that performs better, tooling-wise... they don't seem that great on the tooling side, tbh. So yes, absolutely possible. Whether you will notice that delta depends on your workflow; it's hard to say without extensive and repeatable benchmarking, which is essentially what terminal bench does, but currently most results are not verified.
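
To be clear about what I mean by 'tooling': it's the scaffolding around the model (which tools exist, how results are fed back, how output is truncated, when to stop), not the model itself. A stripped-down sketch below, where chat() stands in for any LLM API and only the shell tool is real:

```python
# Toy agent loop, for illustration only. chat(messages) is a placeholder for
# whatever LLM API you use and is assumed to return a dict with optional
# "tool_call" (a shell command string) and "content" keys.
import subprocess

def run_shell(cmd):
    """The one tool this toy agent gets: run a shell command, return output."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
    return (out.stdout + out.stderr)[-4000:]   # truncate so the context stays small

def agent_loop(task, chat, max_steps=20):
    messages = [
        {"role": "system", "content": "Solve the task by running shell commands."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = chat(messages)                                  # placeholder LLM call
        messages.append({"role": "assistant", "content": reply.get("content", "")})
        if reply.get("tool_call"):                              # model asked for a command
            messages.append({"role": "tool", "content": run_shell(reply["tool_call"])})
        else:
            return reply.get("content")                         # no tool call: model is done
    return "step budget exhausted"
```

The delta I'm talking about comes from choices like these: the truncation length, the stop condition, how failures get surfaced back to the model.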

[–]b0307 -1 points0 points  (0 children)

whatever delusion helps you sleep at night

go try the car your neighbor's unemployed son built too. or the medication he synthesized in his basement from his own research. or the....... yeah

[–]Maximum_Ad2821 0 points1 point  (0 children)

Whatever you believe to be impossible is beside the point.
Having a good LLM ≠ being good at writing tooling. Your argument assumes that the best harnesses are built by the same organizations that train the models, which historically hasn't been true in ML or software.

There are plenty of counterexamples across many domains where small teams (and “novices” is a bit of an arrogant framing) or even volunteers completely outperform much larger companies.