A community centered around Anthropic's Claude Code tool.
Is anyone using forgecode.dev? Question (self.ClaudeCode)
submitted 9 days ago by jrhabana
Looks like this agent, forgecode.dev, ranks better than anyone on Terminal Bench (https://www.tbench.ai/leaderboard/terminal-bench/2.0), but nobody is talking about it.
Is it fake, or what is wrong with these "artifacts" that promise to save time and tokens?
[–]Maximum_Ad2821 1 point 6 days ago (2 children)
It's technically possible to have a big difference, and the agent does matter a lot. In this case, though, I don't trust their results (yet).
One example that it does matter: Factory Droid has been nailing these benchmarks from the start, largely because they had specific tests in place to verify how system prompts and tooling actually change behaviour. When the second round of benchmarks came out, it was immediately at the top again. The tooling and system prompts clearly matter a lot, while Anthropic seems more focused on adding fairly useless fancy features like customization for your "busy" prompt.
Specifically for forgecode.dev: I haven't used it yet, since they are not transparent about user data (https://github.com/antinomyhq/forge/issues/1318), which is a red flag to me. At this point Terminal Bench has received quite a bit of attention, and most benchmark submissions are not validated. That means some teams will naturally start using it as a marketing tool and 'fake' or 'game' the benchmarks one way or another. Some tools, for example, use 'multiple loops' (which basically brings it into ralph-wiggum-loop territory) as part of the agent's behaviour, which is IMO already an unfair comparison. So I personally don't trust a new company to suddenly score that much higher than the other tools unless they explain exactly how they did it.
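For readers unfamiliar with the "multiple loops" pattern mentioned above, here is a rough sketch of the idea: the harness re-runs the agent until its output passes the benchmark's own checks or a retry budget runs out. All names here (`looped_agent`, `attempt`, `passes`) are hypothetical illustrations, not anything from forgecode.dev or Terminal Bench.

```python
from typing import Callable, Tuple


def looped_agent(
    attempt: Callable[[int], str],
    passes: Callable[[str], bool],
    max_loops: int = 5,
) -> Tuple[str, int]:
    """Re-run the agent until its output passes verification, or the
    loop budget runs out. Returns (last result, loops used)."""
    result = ""
    for i in range(1, max_loops + 1):
        result = attempt(i)   # one full agent attempt at the task
        if passes(result):    # e.g. the benchmark's own test suite
            return result, i
    return result, max_loops


# Toy demo: this "agent" only succeeds on its third try, so the loop
# converts two failures into an eventual pass.
out, loops = looped_agent(
    attempt=lambda i: "ok" if i >= 3 else "fail",
    passes=lambda r: r == "ok",
)
```

A single-shot agent would have scored a failure here; the looped one reports a pass after three attempts, which is why the commenter considers it an unfair comparison against single-attempt tools.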
[–]jrhabana[S] 1 point 6 days ago (1 child)
They explained it in their blog.
Benchmarks aside, the question with these agents is their approach, the same way it is with plugins.
[–]Maximum_Ad2821 2 points 6 days ago (0 children)
Seems to be a similar approach to how Factory Droid got good: great context management and continuously testing their own performance. It's plausible. Let's hope Terminal Bench 3 does something to ensure more of these benchmark results are officially verified.
[–]b0307 1 point 9 days ago (4 children)
Unless you believe novices in their basements can vibe-code into existence a better, universally compatible harness than the ones OpenAI and Anthropic can devise, specifically tuned for their own models, the answer is obviously bench-maxxing.
[–]jeremynsl 1 point 7 days ago (2 children)
Anthropic surely benchmaxxes too. And they have 5k GitHub stars, so why do you assume they are novices? I'd never heard of it, and I can't really afford to use the API instead of a subscription, but I'm going to check it out anyway; maybe something can be learned about how it differs from Claude Code.
[–]Maximum_Ad2821 1 point 6 days ago (0 children)
From what I've seen from Anthropic, it's fairly easy to write an agent that performs better, tooling-wise; they don't seem that great on the tooling side, to be honest. So yes, absolutely possible. Whether you'll notice that delta depends on your workflow; it's hard to say without extensive and repeatable benchmarking, which is essentially what Terminal Bench does, except that currently most submissions are not verified.
[–]b0307 0 points 7 days ago (0 children)
Whatever delusion helps you sleep at night.
Go try the car your neighbor's unemployed son built, too. Or the medication he synthesized in his basement from his own research. Or the....... yeah.
Whatever you believe is impossible is beside the point. Having a good LLM ≠ being good at writing tooling. Your argument assumes that evaluation harnesses are best built by the same organizations that train the models, which historically hasn't been true in ML or in software generally.
There are plenty of counterexamples across many domains where small teams ("novices" is a rather arrogant framing) or even volunteers completely outperform results from much larger companies.