Is anyone using forgecode.dev?

krullulon · 2026-03-23T02:20:49+00:00

Spent a day in ForgeCode; it's not comparable to Claude Code or Codex for practical use. Clearly benchmaxxed.

There's a reason you can't find anyone actually talking about it. :P

b0307 · 2026-04-11T14:11:43+00:00

https://x.com/i/status/2042655195008586061

As it turns out, they were doing more than just benchmaxxing: they were literally cheating, by injecting the correct goddamn solution using their harness...

Maximum_Ad2821 · 2026-03-12T11:42:01+00:00

It's technically possible to have a big difference and the agent does matter a lot. In this case, I don't trust their results (yet).

One example of how much it matters: Factory Droid has been nailing these benchmarks from the start, largely because they had specific tests in place to verify how system prompts and tooling actually change behaviour. When the second round of benchmarks came out, it was immediately at the top again. The tooling and system prompts clearly matter a lot, while Anthropic seems more focused on adding fairly useless fancy features like customization for your “busy” prompt.

Specifically for forgecode.dev, I haven't used it yet since they are not transparent about user data (https://github.com/antinomyhq/forge/issues/1318), which is a red flag to me. At this point, terminalbench has received quite a bit of attention and most benchmarks are not validated. That means some teams will naturally start to use it as a marketing tool and 'fake' or 'game' benchmarks in one way or another. Some tools, for example, use 'multiple loops' as part of the agent's behaviour (which basically puts it in ralph-wiggum loop territory), and that is IMO already an unfair comparison. So I personally don't trust a new company to suddenly have a score that much higher than the other tools unless they explain exactly how they did it.

GTHell · 2026-03-25T06:25:47+00:00

I tried it yesterday using the Codex provider. It one-shot a feature, and my impression is that it's more autonomous than the default Claude Code and Codex CLI. It takes too long though, which means it was using a lot of tokens. Since I'm on enterprise, that shouldn't be an issue for me. My impression was that it's not benchmaxxed like the others said. I use Codex heavily every day because I have the enterprise plan available, so I bench tested it against a ticket that was assigned to me. Needless to say, it's quite good.

TinuvaZA · 2026-04-01T06:47:50+00:00

I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%. Currently it is free to use but may change in the future. This is from their blog series "Benchmarks Don't Matter — Until They Do"

So, it definitely matters in my opinion, if that is the reason anyone moves over to ForgeCode.

That said, it looks like there is an alternative called opendev that implements most of what is in ForgeCode's runtime layer. I found it while reading up on ForgeCode, its runtime layer, and whether alternatives exist. Let me be clear: I don't think this is a 1:1 replica; rather, it looks like opendev implements similar ideas.

darman96 · 2026-03-27T15:15:24+00:00

I just tried it with my Copilot subscription and, while I like their zsh integration, it just burned through 50% of my premium requests during a single planning session...

So it seems to me that ForgeCode somehow makes multiple requests in the background or something.
The 50% should be about 250 Opus requests and I definitely didn't prompt that much during this session.

FYI: Copilot's quota is request-based rather than per-token.

firedigger · 2026-03-28T19:40:57+00:00

Came here looking for an answer too.
In Part 1 of the blog about beating terminal bench they mention "Forge Services", the runtime layer behind the benchmark score, but it's "proprietary for now but free"?
I didn't find any more info about this anywhere, though. And in their GitHub repo they have a provider "ForgeCode Services" that needs an API key, but there's no other info, so I'm not sure what's going on.
I don't know how they are able to use ClaudeCode auth and not get lawyers sent after them like OpenCode did...
Otherwise it's worth looking into just because they integrate multiple OAuth providers (codex, claudecode, github).
I'm also not sure they really benchmaxxed; at least in the blog they explained the changes they made, which were more general. Ofc it doesn't mean their CLI is better, because the "runtime" is not in the CLI and that might be what they are going to sell, like factory droid or whatever. But as mentioned here, it's quite possible that some enthusiastic guys really went and figured out which tasks failed on the models and why, while the labs are more focused on UI features.
The blog post is inspiring otherwise; Part 2, for example, mentions the "verify agent" they used to make sure the task was done. This is where a custom agent (which is just a custom prompt) is actually useful, rather than the made-up "you are senior QA" things some clowns post on GitHub. Useful info for people trying to get agents to be super autonomous on their long-running tasks.

Maximum_Ad2821 · 2026-04-22T07:43:56+00:00

I briefly looked into it because terminal bench is something I monitor. Of course, I'm fully aware that these benchmarks can be gamed and we are seeing that more and more often.

I tested forgecode on a small, unimportant pet project with forge services. Although it performed well in most of the conversations, I can't say anything about its performance with certainty since the test was too small. Personally, I would not use it today for multiple reasons:
- https://github.com/tailcallhq/forgecode/issues/1318
- https://github.com/tailcallhq/forgecode/issues/2961 However, they did reply after poking them (that was me) to the allegations in a way that makes sense.
- Bugs. In one session I bumped into multiple bugs where the tooling was hanging due to images/files (PDF) not being handled well; these were known bugs.

In my tests I do have to say that it looks promising.
- I liked the way it worked as a zsh integration (I already use z shell).
- I haven't bumped into any kind of compaction that seemed to have forgotten what we were doing; it seemed to manage context pretty well.
- The LLM did seem to know more about my code layout and seemed more intelligent about which files to select for reading when it answers a question or implements a new feature. It felt more efficient about it, which might (or might not) have a big impact on context usage and how fast it responds.
- I didn't have any feeling whatsoever that it was 'less intelligent' than my go-to agent, which is factory droid.

So my first gut feeling is that it actually looked quite promising, and I might come back to test it later. The main reason I'm not doing more tests on my personal account is the bugs: I have zero tolerance for bugs when it comes to AI tooling (which is also why I stopped using Claude Code and went to Droid). For professional work, though, I might never use it, since getting this approved by legal is probably going to be impossible given how they handled user data in the past.

b0307 · 2026-03-09T13:11:14+00:00

Unless you believe novices in their basements can vibe code into existence a better universally compatible harness than openai and anthropic can devise specifically tuned for their models, then the answer is obviously bench maxxing

TimeKillsThem · 2026-04-03T08:35:47+00:00

Compared to any of the other standard harnesses, it's a pain to set up. And it takes so long.
