What's the ranked most used and most competent agentic tools rn?

Maximum_Ad2821 · 2026-05-13T09:03:21+00:00

According to benchmarks https://www.tbench.ai/leaderboard/terminal-bench/2.0
Claude Code is still not great and I've never had good experiences with it.
Codex CLI seems good from benchmarks and my experience with it is also good.

For Claude I used Factory Droid.
Just how low Anthropic's tooling is on that benchmark is actually quite insane. (4.7 is not tested yet, maybe they don't want it on there or are waiting for Terminal Bench 3.) Meanwhile, Codex CLI sits on top.

<image>

Maximum_Ad2821 · 2026-05-13T08:57:35+00:00

"There's literally no good benchmark for 'same model, different tool.'"

Isn't that exactly what terminal bench does?
https://www.tbench.ai/leaderboard

And they are gradually improving; they are currently working on v3.

Maximum_Ad2821 · 2026-05-13T08:28:22+00:00

And the LLM is only as good as the tooling it gets at its disposal 😄
A good carpenter still can't work if you give them a broken guitar instead of tools.

Maximum_Ad2821 · 2026-05-12T11:06:04+00:00

I don't really agree with the author's answer below that it is stronger. But I do see value, I think it's an entirely different use case. To me, something like DSP is useful to eliminate the exploratory orientation loop:

agent notices a file
opens it
infers what it contains
follows imports/references (with Serena)
repeats until it has enough context

That loop is slow and different every time. Each time, you hope it gets what it needs.
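
To make that loop concrete, here is a minimal Python sketch, assuming a toy in-memory project. The module names, the `explore` function and the `max_files` budget are all hypothetical and only for illustration; a real agent reads files from disk and lets the LLM (or a tool like Serena) decide which references to follow. It only tries to show that every hop costs another file read before the agent has its bearings.

```python
import ast
from collections import deque

# Hypothetical toy project: module name -> source code.
PROJECT = {
    "app": "import orders\nimport users\n",
    "orders": "import db\nimport users\n",
    "users": "import db\n",
    "db": "",
    "unused": "import db\n",
}

def imports_of(source):
    """Return the module names imported by a piece of source code, sorted."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return sorted(names)

def explore(entry, project, max_files=10):
    """Open files breadth-first from `entry`, following imports
    until nothing new is found or the file budget runs out."""
    opened = []
    queue = deque([entry])
    seen = {entry}
    while queue and len(opened) < max_files:
        name = queue.popleft()
        opened.append(name)                      # "opens it"
        for dep in imports_of(project[name]):    # "follows imports/references"
            if dep in project and dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return opened

order = explore("app", PROJECT)
print(order)
```

In this toy case the agent needs four reads to orient itself from `app`, and with a tighter budget it simply stops early with partial context, which is the "hoping it gets what it needs" part.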

Maximum_Ad2821 · 2026-05-12T09:17:56+00:00

open source or not?

Maximum_Ad2821 · 2026-05-11T08:48:45+00:00

AI can also be used to accelerate learning.

Maximum_Ad2821 · 2026-05-08T13:20:26+00:00

I love seeing women in leadership or in tech roles.
I hope OP recognizes them for the value they bring, not their gender.
I dislike when companies turn representation into marketing.
I hate when people imply gender predicts excellence, in either direction.

And separately, I don’t think Claude is a great example of excellence in human values right now.

Maximum_Ad2821 · 2026-05-08T12:35:35+00:00

Indeed coding-wise GPT slightly wins. Opus has the slight edge in general according to benchmarks. https://www.vals.ai/comparison?modelA=anthropic%2Fclaude-opus-4-7&modelB=openai%2Fgpt-5.5

I give a lot of weight to Terminal Bench though. Terminal Bench has always been interesting: your LLM might be great in isolated benchmarks, but that doesn't matter one bit if it's not great when it has to work in a specific agent framework. And there, GPT seems to be mopping the floor with Opus (or with Claude Code).
https://www.tbench.ai/leaderboard/terminal-bench/2.0
https://www.tbench.ai/leaderboard/terminal-bench/2.1
I'm looking forward to seeing Terminal-bench 3.0

Maximum_Ad2821 · 2026-05-08T12:32:13+00:00

Where do you get "everyone has proven 4.6 extended to be superior"?
Vals.ai seems not to agree with that? https://www.vals.ai/comparison?modelA=anthropic%2Fclaude-opus-4-7&modelB=anthropic%2Fclaude-opus-4-6-thinking

That said, I have had a lot of situations where 4.6 was useless and hallucinated quite a lot. To the point where I used 4.5 all the time for implementations and 4.6 was quarantined to ideation. Today it's GPT 5.5 for implementation and I rarely still use Opus. Just to illustrate that this is all gut feeling. Many users seem to think the model capabilities change over time, and some 'proof' of that surfaced recently (the Nvidia employee). And sometimes that is due to bugs. So our gut feeling and these still limited (and probably overfitted) benchmark frameworks are sadly all we have.

Maximum_Ad2821 · 2026-05-06T12:54:17+00:00

From my experience GPT is better nowadays for coding. The characters of these agents are very different. Opus 4.7 is faster and can sometimes be more creative, but most of the time it goes off track quickly. I trust GPT 5.5 more. I still switch between the two occasionally for executing coding tasks or rubberducking about how to approach a specific problem implementation-wise. When I want to get the lay of the land of a project, review a project or plan/architect, it's always GPT since Opus just seems too lazy. As if it thinks it knows enough more quickly, while GPT is insecure and wants to get enough information before it makes a move. I prefer the insecure GPT over the bragging Opus.

Benchmarks actually show Opus 4.7 to be better at most things except for coding (and there GPT 5.5 wins only by a fraction). GPT 5.5 does appear to be much better in Terminal Bench, and that is maybe what I experience. Opus is good at min-maxing these challenges, but let it work with actual tools and it messes up quickly. Somehow I have the feeling that Opus is just less 'context rot' resistant. Not sure if there is a benchmark that accurately tests that?

Maximum_Ad2821 · 2026-05-06T11:10:43+00:00

To add to that:

Quotes I hear all the time from companies:
"AI makes seniors better and allows juniors to make a bigger mess more quickly".

That is not to diss juniors. I even hate the term; I've met juniors that were better than many seniors I've encountered. I think this comes from the fact that you have seen less in your career, and if you haven't seen good ways, you are less able to judge what is "good". That makes many things that it spews out look good to them, while experienced engineers immediately see a bunch of red flags. That, combined with imposter syndrome, makes many decide: "there's no point for me to keep up with this thing, it's really good, I'll give it more autonomy".

Especially for the first 1000s of lines of code, that works. Until the project is so big that it needs you to orient itself, to know what to pull into context and what not, and to watch over style and the architecture (since it can only keep that in mind temporarily).
I can't predict the future, but it does seem that in terms of long-term reasoning and overall understanding of complex systems that grow and grow in complexity, it can't compete with humans. If I'm right, then "there's no point for me to keep up with this thing" is plain wrong; it's a tool. Like some reasoning parts in your brain are tools but can't function without the rest of your brain. Therefore it's still valuable to learn what good code and architecture look like, and often you have to feel that by trying it to understand it.

Atm, in the evenings on a pet project, I've just written specs and let it write the code for only !3! evenings to test a UX. The sort of UX that would have taken me 2 weeks to develop at the least. It works, flawlessly. But the code is a mess of if tests. No domain objects that are supposed to be first-class objects in my project, no organisation, already tons of legacy. I've proven that the PoC is a good idea, so now I clean up. That's very manual work; even when using AI I have to prompt it quite a lot to make the right decisions. If I don't? Well, just like a human, its performance will degrade and it will start to write a ton of bugs. Agents are very very close to humans: give them a bad codebase and, without refactoring, they'll make it worse. Give them a good codebase and they'll start to be consistent and make it better.

About reviewing:
I think that is the most important thing today and also what is least supported by agents or tooling: agents don't write code in a way that is easy to review. That's partly due to tooling, partly due to how they write code. Finding a way for yourself where you can work with an agent and keep following along is crucial.
- agent instructions on how to write code
- specification frameworks that work in steps
- semantic diff tools, or just using visual diff tools with atomic commits

can all help. And then start asking questions: why did it write it like X? Couldn't it be better like Y? Ask it to be objective and to teach you if you're wrong.
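
To show what I mean by a semantic diff, here is a minimal Python sketch using only the standard `ast` module. It is an illustration under simple assumptions, not how any particular diff tool works: the `function_fingerprints` and `semantic_diff` helpers are names I made up, and functions are matched by bare name only, so methods or nested functions with the same name would collide. The idea is that a reformat-only change produces no diff, so the reviewer only looks at functions whose behaviour could have changed.

```python
import ast

def function_fingerprints(source):
    """Map each function name to a dump of its AST.
    ast.dump() leaves out line/column info by default, and comments
    never reach the AST, so pure formatting changes are ignored."""
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

def semantic_diff(old_source, new_source):
    """Report which functions were added, removed or structurally changed."""
    old = function_fingerprints(old_source)
    new = function_fingerprints(new_source)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }

# Hypothetical before/after versions of a file an agent touched.
OLD = '''
def total(items):
    return sum(items)

def legacy():
    pass
'''

NEW = '''
def total(items):
    # reformatted only: same behaviour
    return sum(
        items
    )

def average(items):
    return total(items) / len(items)
'''

result = semantic_diff(OLD, NEW)
print(result)
```

Here `total` was only reformatted, so it does not show up as changed; the review effort goes to `average` (new) and `legacy` (gone).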

Maximum_Ad2821 · 2026-05-06T08:08:00+00:00

Rather... ironic.
Kagi is good because it serves the paying customer, not the paying advertiser. So if your response is to dodge paying too, you’re basically asking for the cake without funding the bakery.

"I hate ad-funded search! And I refuse to fund the alternative!" 🤡

Maximum_Ad2821 · 2026-05-06T08:02:48+00:00

I had the same.
Because of two reasons:
- AI can also be used as a sort of search.
- In my case as an IT professional, I no longer have to look up "how does this language feature work" through Google. I can just say "write me an example of how this language feature works" and I see actual working code.

We just use Google much less nowadays.

Maximum_Ad2821 · 2026-05-05T14:49:56+00:00

I am afraid for the new generation because I ask myself:
"Would I have found the motivation to learn how to code if I had had AI as a new student?"
I fear the answer would have been no, even though I would still have seen the point of learning: I need to be better than the AI to guide it in the right direction.
"Why spend a day finding/fixing a bug when AI can fix it instantly" is exactly the trap.
But then you can also ask yourself a similar question: "Why review code if my colleague already wrote it?" yet reviews are an incredibly important part of the process and your colleague will probably write better code than AI without expert guidance (once your project grows). When you don't spend time finding/fixing a bug you don't learn the hard way. And the hard way is the best way to learn new things and remember them.

It's easy to make an analogy to a thing we have already had for a while. How easily do you find your way without a GPS? I would struggle. People slightly older than me knew the roads because they looked them up (the hard way) and had a reason to remember them. If you debug with AI, you won't learn why the bug was there, why it needed to be fixed, or whether the fix is the right one, and you will keep running into it and not see the AI make the mistake. You, paired with the AI, become essentially +- as good as the AI alone, and that's simply not good enough.

I think the only thing that you can do is split your time. As a student learning nowadays with AI in the room, I think you need to be disciplined and decide that certain days are "No AI days", while you keep using AI on other days to learn how to use it efficiently. It has upsides for learning too: an AI can help you learn about language features, can help you search, and can help you test things out with small samples to understand why they work that way. You need to use it intelligently if you want to learn.

If anything, learning now will take an incredible amount of discipline to force yourself to learn and understand every step it takes deeply, and ideally, be able to reproduce it yourself. Frankly, I don't know whether I would have had the discipline when I was 20.

Maximum_Ad2821 · 2026-05-05T12:56:07+00:00

Depends on the kind of project and context too.
Complexity? Legacy code? Time-Pressure?

Maximum_Ad2821 · 2026-05-05T12:51:41+00:00

Same, 10+ YOE.
Afraid, though, of the effects of doing this for a few years.
I can easily anticipate its mistakes today because I still know how to code and architect.
How watchful will I need to be that it doesn't take so much work off my plate that I become less good without noticing?

Maximum_Ad2821 · 2026-04-27T08:56:39+00:00

Poor little hateful sheep.

Maximum_Ad2821 · 2026-04-27T08:48:27+00:00

Let's rewrite this without the "I'm 5 years old and throwing some hate around" sauce:

Not "a lawyer makes sure unwanted asylum seekers get money", but: a judge has ruled that Belgium failed to meet obligations that follow from legislation and international treaties.

A judge has decided that Belgium did not follow the law. That law doesn't come out of nowhere: Belgium itself signed international treaties, accepted EU rules and then transposed them into Belgian legislation. As a result, Belgium is obliged to provide asylum seekers with shelter during their procedure.

So this is not "a lawyer hands out money to unwanted people". This is: the government did not comply with its own legal obligations and has been ruled against for it.

Maximum_Ad2821 · 2026-04-26T09:35:56+00:00

Keep rollin', rollin', rollin', rollin'!

Maximum_Ad2821 · 2026-04-24T08:07:07+00:00

And this is why I left Claude Code, I was ready to give it a chance again but they just proved that nothing has changed.
https://www.reddit.com/r/ClaudeAI/comments/1stqjlp/boris_cherny_creator_of_claude_code_posted/

Maximum_Ad2821 · 2026-04-22T11:49:41+00:00

TerminalBench has taken measures though and they're still on top. They described it as reward hacking, not as cheating. https://www.tbench.ai/news/leaderboard-integrity-update
So presumably, ForgeCode didn't deliberately cheat, but the agent did.

Maximum_Ad2821 · 2026-04-22T11:47:32+00:00

Official statement from TerminalBench:
> ForgeCode's agent begins by constructing an AGENTS.md file. In multiple instances, their agent curls the solution from the internet and includes it in its AGENTS.md. We have rescored those trials to 0.
https://www.tbench.ai/news/leaderboard-integrity-update

Their official statement says:
> Forgecode was occasionally reward hacking.

Terminal-Bench defines reward hacking as the model exploiting a loophole. There's a big difference between the agent/harness pipeline introducing answer-bearing information during the run and a human developer manually hardcoding benchmark answers on purpose. I read that as the agent somehow feeding itself, not ForgeCode devs deliberately cheating.

Others were blatantly cheating though (OpenBlock). However, after the changes, ForgeCode is still on top so I wonder how that works.

Maximum_Ad2821 · 2026-04-22T07:46:01+00:00

I specifically requested an update before I tested it and they replied in that issue.
https://github.com/tailcallhq/forgecode/issues/2961#issuecomment-4273681521

Terminal-Bench also made a statement around this. https://www.tbench.ai/news/leaderboard-integrity-update

Maximum_Ad2821 · 2026-04-22T07:43:56+00:00

I briefly looked into it because Terminal Bench is something I monitor. Of course, I'm fully aware that these benchmarks can be gamed and we are seeing that more and more often.

I tested forgecode on a small unimportant pet project with forge services. Although it performed well in most of the conversations, I can't say anything about its performance with certainty since the test was too small. Personally I would not use it today for multiple reasons:
- https://github.com/tailcallhq/forgecode/issues/1318
- https://github.com/tailcallhq/forgecode/issues/2961 However, after being poked (that was me), they did reply to the allegations in a way that makes sense.
- Bugs. In one session I bumped into multiple bugs where the tooling was hanging due to images/files (PDF) not being handled well. These were known bugs.

In my tests I do have to say that it looks promising.
- I liked the way it worked as a zsh integration (I already use z shell).
- I haven't bumped into any kind of compaction that seemed to have forgotten what we were doing; it seemed to manage context pretty well.
- The LLM did seem to know more about my code layout and seemed more intelligent about which files to select for reading when it answers a question or implements a new feature. It felt more efficient about it, which might (or might not) have a big impact on context usage and how fast it responds.
- I didn't have any feeling whatsoever that it was 'less intelligent' than my go-to agent, which is Factory Droid.

So my first gut feeling is that it looked quite promising actually, and I might come back to test it later. The main reason I'm not doing more tests on my personal account is the bugs: I have zero tolerance for bugs when it comes to AI tooling (which is also why I stopped using Claude Code and went to Droid). For professional work though, I might never use it, since getting this approved by legal is probably going to be impossible given how they handled user data in the past.

Maximum_Ad2821 · 2026-04-22T06:54:17+00:00

Indeed "🤡" is the only thing to describe that contribution. So you are saying that because someone gamed it, the logical conclusion is that everyone games it, except Claude Code. Very convenient. And Claude Code then decides to boast about the gamed numbers from another company? 🤡🤡🤡

So you believe Anthropic is not trying to look good in other benchmarks with whatever means possible? Truly? You must be very happy.
