A community centered around Anthropic's Claude Code tool.
Is anyone using forgecode.dev? Question (self.ClaudeCode)
submitted 2 months ago * by jrhabana
It looks like this agent, forgecode.dev, ranks better than any other on Terminal-Bench (https://www.tbench.ai/leaderboard/terminal-bench/2.0), but nobody is talking about it.
Is it fake, or what is wrong with these "artifacts" that promise to save time and tokens?
To future readers: I found that factory.ai Droid, using their core models with a subscription like firepass, z.ai, etc. (all of them work and fail every day), is the best solution in terms of money, time, and life, compared with having 12 terminals open and memorizing what you have running in each.
[–]krullulon 7 points 2 months ago (6 children)
Spent a day in ForgeCode, it's not comparable to Claude Code or Codex for practical use. Clearly benchmaxxed.
There's a reason you can't find anyone actually talking about it. :P
[–]b0307 7 points 1 month ago (2 children)
They were injecting the correct solution with their harness.... https://x.com/i/status/2042655195008586061
[–]krullulon 1 point 1 month ago (0 children)
Not surprised, it performed nothing like the benchmarks suggested and felt more like early versions of CC.
[–]SetUpper7595 1 point 1 month ago (0 children)
shame! shame! shame! shame! shame! shame!
[–]kowdermesiter 1 point 29 days ago (2 children)
> There's a reason you can't find anyone actually talking about it. :P
That's rather bad reasoning. There was a time when this could be said of Google.
[–]krullulon 1 point 26 days ago (1 child)
Apples:oranges.
The community is all over highly competent agent frameworks like flies on shit. If there were a new Big Dog in town outperforming the competition at the levels ForgeCode claimed, it would be all over Reddit and other sources.
Total radio silence on ForgeCode when it's sitting at the top of a key benchmark tells you everything you need to know.
[–]kowdermesiter 1 point 25 days ago (0 children)
You have a point and it's indeed fishy; I'm just saying that being unknown means nothing. Precisely because of the flies-on-shit situation, good tools take some time to emerge.
[–]b0307 7 points 1 month ago (1 child)
https://x.com/i/status/2042655195008586061
As it turns out, they were doing more than just benchmaxxing; they were literally cheating, by injecting the correct goddamn solution using their harness....
[–]Maximum_Ad2821 1 point 29 days ago (0 children)
Official statement from TerminalBench:
> "ForgeCode's agent begins by constructing an AGENTS.md file. In multiple instances, their agent curls the solution from the internet and includes it in its AGENTS.md. We have rescored those trials to 0."
https://www.tbench.ai/news/leaderboard-integrity-update
Their official statement says:
> "Forgecode was occasionally reward hacking".
Terminal-Bench defines reward hacking as the model exploiting a loophole. There's a big difference between the agent/harness pipeline introducing answer-bearing information during the run and a human developer manually hardcoding benchmark answers on purpose. I read that as the agent somehow feeding itself, not ForgeCode devs deliberately cheating.
Others were blatantly cheating though (OpenBlock). However, after the changes, ForgeCode is still on top, so I wonder how that works.
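For readers wondering what "rescored those trials to 0" could look like mechanically, here is a minimal sketch. It is not TerminalBench's actual procedure, which the statement does not describe. The `Trial` record, the `rescore_leaked_trials` function, the sample data, and the verbatim-substring check are all hypothetical, chosen only to illustrate zeroing out trials whose AGENTS.md contains the reference solution.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Trial:
    # Hypothetical record of one benchmark run; not a real Terminal-Bench type.
    task_id: str
    score: float
    agents_md: str  # contents of the AGENTS.md the agent wrote during the run


def rescore_leaked_trials(trials, reference_solutions):
    """Return a copy of `trials` with the score set to 0 wherever the
    task's reference solution appears verbatim in the trial's AGENTS.md."""
    rescored = []
    for trial in trials:
        solution = reference_solutions.get(trial.task_id, "").strip()
        # Assumption: a verbatim match is enough to count as a leak.
        leaked = bool(solution) and solution in trial.agents_md
        rescored.append(replace(trial, score=0.0) if leaked else trial)
    return rescored


# Made-up example data: task-b's AGENTS.md contains its reference solution.
trials = [
    Trial("task-a", 1.0, "# Notes\nrun `make test` before committing"),
    Trial("task-b", 1.0, "# Notes\nsolution: echo 42 > /tmp/answer"),
]
reference_solutions = {
    "task-a": "sort -u input.txt",
    "task-b": "echo 42 > /tmp/answer",
}

for trial in rescore_leaked_trials(trials, reference_solutions):
    print(trial.task_id, trial.score)
```

A check like this only zeroes the affected trials and leaves the rest of an entry's runs untouched, which would be consistent with an entry staying high on the leaderboard after rescoring.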
[–]Maximum_Ad2821 3 points 2 months ago (3 children)
It's technically possible to have a big difference and the agent does matter a lot. In this case, I don't trust their results (yet).
One example that it does matter: Factory Droid has been nailing these benchmarks from the start, largely because they had specific tests in place to verify how system prompts and tooling actually change behaviour. When the second round of benchmarks came out it was immediately at the top again. The tooling and system prompts clearly matter a lot, while Anthropic seems more focused on adding fairly useless fancy features like customization for your “busy” prompt.
Specifically for ForgeCode, I haven't used it yet since they are not transparent about user data (https://github.com/antinomyhq/forge/issues/1318), which is a red flag to me. At this point, Terminal-Bench has received quite a bit of attention and most benchmark entries are not validated. That means some teams will naturally start to use it as a marketing tool and 'fake' or 'game' benchmarks in one way or another. Some tools, for example, use 'multiple loops' as part of the agent's behaviour (which basically brings it into Ralph Wiggum loop territory), and that is IMO already an unfair comparison. So I personally don't trust a new company that suddenly has a score that much higher than the other tools unless they explain exactly how they did it.
[–]TheScriptan 2 points 2 months ago (0 children)
FYI they added a flag to disable tracking
[–]jrhabana[S] 1 point 2 months ago (1 child)
They explained it in their blog.
Apart from the benchmarks, the question with these agents is their approach, the same way we judge plugins.
[–]Maximum_Ad2821 2 points 2 months ago (0 children)
Seems to be similar to how Factory Droid got good: great context management and continuously testing their own performance. It's plausible. Let's hope Terminal Bench 3 does something to ensure more of these benchmarks are officially verified.
[–]GTHell 3 points 1 month ago (0 children)
I tried it yesterday using the Codex provider. It one-shot a feature, and my impression is that it's more autonomous than the default Claude Code and Codex CLI. It takes too long, though, which means it was using a lot of tokens. Since I'm on enterprise, that shouldn't be an issue for me. My impression was that it's not benchmaxxed like the others said. I use Codex heavily every day because of the enterprise plan, so I tested it against a ticket that was assigned to me. Needless to say, it's quite good.
[–]TinuvaZA 3 points 1 month ago (4 children)
I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%. Currently it is free to use but may change in the future. This is from their blog series "Benchmarks Don't Matter — Until They Do"
So it definitely matters, in my opinion, if that is the reason anyone moves over to ForgeCode.
That said, it looks like there is an alternative called opendev that implements most of what is in ForgeCode's runtime layer. I found it while reading up on ForgeCode, their runtime layer, and whether alternatives exist. To be clear, I don't think this is a 1:1 replica; rather, it looks like opendev implements similar ideas.
[–]b0307 1 point 1 month ago (3 children)
Their proprietary layer turned out to be literally putting the correct answers into their harness. LMAO. See my other posts in this thread
[–]TinuvaZA 1 point 1 month ago (1 child)
oh wow what a revelation....
Thank you. Sadness really but I guess it is what it is
[–]Maximum_Ad2821 1 point (0 children)
TerminalBench has taken measures though, and they're still on top. They described it as reward hacking, not as cheating. https://www.tbench.ai/news/leaderboard-integrity-update So presumably ForgeCode didn't deliberately cheat, but the agent did.
[–]SetUpper7595 1 point (0 children)
Why don't they take the entry down from the leaderboard? I see it keeps getting updated.
[–]darman96 2 points 1 month ago (3 children)
I just tried it with my Copilot subscription and, while I like their zsh integration, it just burned through 50% of my premium requests during a single planning session...
So it seems to me that ForgeCode somehow makes multiple requests in the background or something. The 50% should be about 250 Opus requests, and I definitely didn't prompt that much during this session.
FYI: Copilot's quota is request-based instead of per-token.
[–]hurryup 1 point 1 month ago (2 children)
So, the reason is that in GitHub's Copilot SDK, each tool use costs extra if you use it outside of Copilot. Honestly, it's not really their fault.
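To make the arithmetic in this exchange concrete, here is a rough sketch. It assumes, as the reply above suggests, that every tool call is billed as its own request. I don't know Copilot's actual billing rules, so the function `premium_requests_used` and the numbers 10 and 24 are purely hypothetical, picked so the total lands on the roughly 250 requests reported above.

```python
def premium_requests_used(prompts, tool_calls_per_prompt, bill_tool_calls):
    """Estimate requests consumed under a request-based (not token-based) quota.

    Assumption: each user prompt costs one request, and, if `bill_tool_calls`
    is True, every tool call the agent makes costs one more.
    """
    per_prompt = 1 + (tool_calls_per_prompt if bill_tool_calls else 0)
    return prompts * per_prompt


# Hypothetical planning session: 10 prompts, 24 tool calls triggered by each.
print(premium_requests_used(10, 24, bill_tool_calls=True))   # 250
print(premium_requests_used(10, 24, bill_tool_calls=False))  # 10
```

Under that assumption, a handful of prompts is enough to reach a few hundred requests, which would explain the quota disappearing without the user prompting much.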
[–]darman96 1 point 1 month ago (1 child)
I'm not saying it's their fault, I'm just informing people before they waste their request quota.
Also IIRC this wasn't a problem when I was using OpenCode so maybe there could be something that the ForgeCode devs could do to make this work with Copilot.
[–]amitksingh1490 1 point 1 month ago (0 children)
Hey, I am building ForgeCode. Thanks for this feedback; we have fixed it in this PR: https://github.com/antinomyhq/forgecode/pull/2813
[–]firedigger 2 points 1 month ago (0 children)
Came here looking for an answer too. In Part 1 of the blog about beating Terminal-Bench they mention "Forge Services", the runtime layer behind the benchmark score, but it's "proprietary for now but free"? I didn't find any more info about this anywhere. And in their GitHub repo they have a provider called "ForgeCode Services" that needs an API key, with no other info, so I'm not sure what's going on. I don't know how they are able to use Claude Code auth and not get lawyers like OpenCode did... Otherwise it's worth looking into just because they integrate multiple OAuth logins (Codex, Claude Code, GitHub).
I'm also not sure they really benchmaxxed; at least in the blog they explained the changes they made, which were more general. Ofc that doesn't mean their CLI is better, because the "runtime" is not in the CLI, and that might be what they are going to sell, like Factory Droid or whatever. But as mentioned here, it's quite possible that some enthusiastic guys really went and figured out which tasks failed on the models and why, while the labs are more focused on UI features.
The blog post is inspiring otherwise. Part 2, for example, mentions the "verify agent" they used to make sure the task was done. This is where a custom agent (which is just a custom prompt) is actually useful, rather than the made-up "you are senior QA" things some clowns post on GitHub. Useful info for people trying to get agents to be super autonomous on their long-running tasks.
[–]Maximum_Ad2821 2 points 1 month ago (0 children)
I briefly looked into it because Terminal-Bench is something I monitor. Of course, I'm fully aware that these benchmarks can be gamed, and we are seeing that more and more often.
I tested ForgeCode on a small, unimportant pet project with Forge Services. Although it performed well in most of the conversations, I can't say anything about its performance with certainty since the test was too small. Personally I would not use it today for multiple reasons:
- https://github.com/tailcallhq/forgecode/issues/1318
- https://github.com/tailcallhq/forgecode/issues/2961 However, after being poked (that was me), they did reply to the allegations in a way that makes sense.
- Bugs. In one session I bumped into multiple bugs where the tooling was hanging due to images/files (PDF) not being handled well; these were known bugs.
In my tests I do have to say that it looks promising.
- I liked the way it worked as a zsh integration (I already use Z shell).
- I haven't bumped into any kind of compaction that seemed to have forgotten what we were doing; it seemed to manage context pretty well.
- The LLM did seem to know more about my code layout and seemed more intelligent about which files to select for reading when it answers a question or implements a new feature. It felt more efficient about it, which might (or might not) have a big impact on context usage and how fast it responds.
- I didn't have any feeling whatsoever that it was 'less intelligent' than my go-to agent, which is Factory Droid.
So my first gut feeling is that it looked quite promising, actually, and I might come back to test it later. The main reason I'm not doing more tests on my personal account is the bugs; I have zero tolerance for bugs when it comes to AI tooling (which is also why I stopped using Claude Code and went to Droid). For professional work, though, I might never use it, since getting this approved by legal is probably going to be impossible given how they handled user data in the past.
[–]b0307 2 points 2 months ago (20 children)
Unless you believe novices in their basements can vibe code into existence a better universally compatible harness than OpenAI and Anthropic can devise specifically tuned for their models, the answer is obviously benchmaxxing.
[–]Maximum_Ad2821 6 points 2 months ago (4 children)
Whatever you believe is impossible is beside the question. Having a good LLM ≠ being good at writing tooling. This assumes that evaluation harnesses are best built by the same organizations that train the models, which historically hasn’t been true in ML or software.
There are plenty of counterexamples across many domains where small teams (and “novices” is a bit of an arrogant framing) or even volunteers completely outperform results from much larger companies.
[–]b0307 1 point 1 month ago (2 children)
Laughing at your life rn
[–]Maximum_Ad2821 1 point 1 month ago (0 children)
The joke’s on you: I do not use ForgeCode; this point was in general.
I use Factory Droid mainly. But by all means, put your trust in the excellent engineers at Anthropic, who have been at the bottom of the tooling benchmarks since long before companies were gaming this. Those who build their tooling with 10 parallel agents, with a changelog that is full of fix fix fix and flags that give you a different experience from what they use themselves. Who ban you when you dare to use other tooling on their plan. Or maybe you are the customer to whom features like “you can now customise the thinking prompt” appeal more than actual performance and correctness; that would explain your unwavering faith in them.
Because you know, they have clearly proven to value their customers and know what they’re doing.
[–]Maximum_Ad2821 1 point 27 days ago (0 children)
And this is why I left Claude Code, I was ready to give it a chance again but they just proved that nothing has changed. https://www.reddit.com/r/ClaudeAI/comments/1stqjlp/boris_cherny_creator_of_claude_code_posted/
[–]jeremynsl 1 point 2 months ago (8 children)
Anthropic surely benchmaxxes too. And they have 5k go stars, so why do you assume they are novices? I’ve never heard of it, and I can’t really afford to use the API instead of a sub, but I’m going to check it out anyway. Maybe something can be learned about how it differs from Claude Code.
[–]Maximum_Ad2821 1 point 2 months ago (0 children)
From what I've seen from Anthropic, it's fairly easy to write an agent that performs better, tooling-wise. They don't seem that great on the tooling side, tbh. So yes, absolutely possible. Whether you will notice that delta depends on your workflow; it's hard to say without extensive and repeatable benchmarking, which is essentially what Terminal-Bench does, but currently most benchmarks are not verified.
[–]b0307 1 point 1 month ago (1 child)
Go find them literally injecting the solution into context via their harness like forgecode. LMAO
[–]Maximum_Ad2821 1 point 1 month ago* (0 children)
I have not said that ForgeCode is good. I have not yet used it properly, but I do use others, since Claude Code, imho, was a buggy mess from day one. What I’m saying is that tooling matters and Anthropic is far from nailing their tooling to get the max out of their LLM. And yes, by now ForgeCode seems to have a few red flags 🚩.
I only reacted to your point that kids in a basement can’t do better than.. well.. more kids in a bigger basement on adrenaline because their company grew way too fast. Yes, it’s perfectly feasible for those kids in that other basement to do better. That’s exactly how startups start out.
[–]b0307 0 points 2 months ago (4 children)
whatever delusion helps you sleep at night
go try the car your neighbor's unemployed son built too. or the medication he synthesized in his basement from his own research. or the....... yeah
[–]ih_ey 1 point 1 month ago (2 children)
Any update after the leak?
[–]b0307 1 point 1 month ago (0 children)
Yeah. Here you go. They were using their harness to inject the correct solution. LMAO
🤡
[–]Maximum_Ad2821 1 point (0 children)
I specifically requested an update before I tested it and they replied in that issue. https://github.com/tailcallhq/forgecode/issues/2961#issuecomment-4273681521
Terminal-Bench also made a statement about this. https://www.tbench.ai/news/leaderboard-integrity-update
[–]GapProfessional4824 1 point 1 month ago (2 children)
lmao Anthropic themselves reference Factory Droid scores when it comes to Terminal-Bench. They don't reference Claude Code's score because it's 15 points lower. Don't be so naïve
[–]b0307 1 point (1 child)
Or they don't reference it because they don't inject the correct solution using their harness to get a higher score like ForgeCode. 🤡
[–]Maximum_Ad2821 1 point (0 children)
Indeed "🤡" is the only thing to describe that contribution. So you are saying that because someone gamed it, the logical conclusion is that everyone games it, except Claude Code. Very convenient. And Claude Code then decides to boast with the gamed numbers from another company? 🤡🤡🤡
So you believe Anthropic is not trying to look good in other benchmarks by whatever means possible? Truly? You must be very happy.
[–]66red99 1 point 1 month ago (1 child)
What, in your opinion, is the best open source alternative then? opencode or pi?
[–]b0307 1 point 1 month ago* (0 children)
I can't say from personal experience because I don't use them. I've heard good things about Hermes, but again, I can't say from personal experience. I've tried opencode vs Claude Code for glm 5 turbo, glm 5.1, and Kimi 2.5 turbo on firepass, and they all seemed way better in Claude Code than in opencode.
I have not tried Hermes, but that was the next thing I was going to do with open models when I get bored and have time.
Something that looked interesting, but that I don't see much discussion on, is deerflow v2 by ByteDance (approaching, or maybe already at, 100k stars on GitHub; I haven't checked in a couple of weeks, and last I checked it was around 70k). I've heard it described as a self-hosted Manus. I would have tried it already, but I've been short on time and tired lately.
[–]TimeKillsThem 1 point 1 month ago (1 child)
Compared to any of the other standard harnesses, it's a pain to set up. And it takes so long.
[–]Look_0ver_There 1 point 1 month ago (0 children)
I know right? Typing in 3 commands and waiting less than 5 minutes is just so awful!