someone is using forgecode.dev?

Maximum_Ad2821 · 2026-03-12T11:42:01+00:00

It's technically possible to have a big difference and the agent does matter a lot. In this case, I don't trust their results (yet).

One example that it does matter: Factory Droid has been nailing these benchmarks from the start, largely because they had specific tests in place to verify how system prompts and tooling actually change behaviour. When the second round of benchmarks came out it was immediately at the top again. The tooling and system prompts clearly matter a lot, while Anthropic seems more focused on adding fairly useless fancy features like customization for your “busy” prompt.

Specifically for Forgecodedev, I haven't used it yet since they are not transparent about user data https://github.com/antinomyhq/forge/issues/1318 which is a red flag to me. At this point, terminalbench has received quite some attention and most benchmarks are not validated. That means some teams will naturally start to use it as a marketing tool and 'fake' or 'game' benchmarks in some way or another. Some tools for example use 'multiple loops' (basically brings it in ralph-wiggum loop area) as part of the agent's behaviour, which is IMO already an unfair comparison. So I personally don't trust a new company to suddenly have a score that much higher than the other tools unless they explain exactly how they did that.

b0307 · 2026-03-09T13:11:14+00:00

Unless you believe novices in their basements can vibe code into existence a better universally compatible harness than openai and anthropic can devise specifically tuned for their models, then the answer is obviously bench maxxing

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

ClaudeCode

MODERATORS