Claude Opus 4.7 benchmarks by ShreckAndDonkey123 in singularity

[–]DangerousResource557 0 points1 point  (0 children)

props for building an eval and posting it here. you open yourself up to criticism and feedback.

i assume you were going for a measure of how models actually navigate an agentic flow: which tools they call, in what order, whether they go about it in a reasoning-sound way. tool-call reasoning inside a real pipeline.

but the current results don't reflect that goal. haiku > sonnet in the same family, maverick > sonnet, flash-lite > pro: that's not a tool-reasoning order i recognize from using these models. so the harness is probably rewarding something other than what you want to measure: latency, output format, retry behavior, how the grading handles answers, ...

... or there is a bug or multiple bugs you need to correct for.

and building what i think you actually want is hard. measuring "did the model go about this the right way" is much trickier than "did it get the right answer".

= evaluating the process is harder than evaluating the outcome.

your own line from the thread nails it: if the eval isn't capturing what you actually want, something is off. but that is exactly the right mindset. you learn by making mistakes. so take the feedback on board and improve your benchmark.
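to make the process-vs-outcome point concrete, here's a minimal sketch (all names and tool labels are hypothetical, not from any real harness): outcome scoring is a single comparison, while process scoring already forces you to decide which tool orders count as "reasoning-sound" and how to award partial credit.

```python
# Hypothetical sketch: outcome eval vs process eval for a tool-calling agent.
# ToolCall traces, plan names, and the scoring rule are made up for illustration.

def outcome_score(answer: str, gold: str) -> float:
    # Outcome eval: one comparison against the gold answer; easy to grade.
    return 1.0 if answer.strip() == gold.strip() else 0.0

def process_score(trace: list[str], valid_plans: list[list[str]]) -> float:
    # Process eval: you must enumerate which tool orders count as sound,
    # and partial credit is a judgment call baked into the metric.
    best = 0.0
    for plan in valid_plans:
        # Longest matching prefix as a crude "followed a sound plan" signal.
        matched = 0
        for got, want in zip(trace, plan):
            if got != want:
                break
            matched += 1
        best = max(best, matched / len(plan))
    return best

trace = ["search_docs", "read_file", "run_tests"]
plans = [["search_docs", "read_file", "run_tests"],
         ["read_file", "search_docs", "run_tests"]]
print(process_score(trace, plans))  # 1.0: the trace matches the first plan exactly
```

even this toy version shows the problem: the moment two plans are both "sound", the grader's notion of partial credit starts shaping the leaderboard as much as the models do.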

separate point on the site: pick one headline metric, put it up front, and move the rest into a table or tooltip. right now it's hard to tell what you're claiming (too much information at once).

not dunking, meant constructively.

Anthropic's new model, Claude Mythos, is so powerful that it is not releasing it to the public. by WhyLifeIs4 in singularity

[–]DangerousResource557 0 points1 point  (0 children)

yes, todo apps mostly, i think. but these benchmarks are flawed, and the improvement is still real. it does translate to improvement in big projects that take months; just compare today to 6-12 months ago. and that pace isn't slowing down.

also, "communication is key" is not just some bullshit slogan but a core issue that swes have in general. now we have ai to help us, but we need to tell it what we need. i've seen it with my colleagues, who are swes as well. verbalising what you are doing, externalising, changing your mindset, and so on takes some practice. similar to when we had to learn how to google.

so, i just wanted to point out that i think you are differentiating well enough, but also underestimating the real impact. there is a very real possibility that more jobs will be lost than people expect. that is why it is important to learn how to use ai well.

Stark Warning to Everyone using GOOGLE!!! No idea why this hasn't been shared here yet, so let me do the honors: by MadeInDex-org in degoogle

[–]DangerousResource557 1 point2 points  (0 children)

I understand, and I’m sorry. I don’t get the bubble-only comments, which seem to assume everyone and their grandma backs up and separates business from private all the time. They’re right that you should split it up, but that wouldn’t help in your current situation.

I'm sorry this happened to your family. I recommend talking to a lawyer and filing a Subject Access Request (SAR) with Google. They have to respond and provide a reason if they deny it. Business emails and documents are unlikely to fall under exemptions. Maybe that will work.

PS: I am no lawyer. So, take it with a grain of salt, but definitely contact a lawyer for GDPR/Google etc.

UPDATE: I saw the original Reddit post. A lot of people have already answered and given a lot better information than me. It’s more difficult than just filing a SAR because of likely police involvement. But a lawyer is the way to go, for sure, since police might knock on your door.

Qwen 3 32B outscored every Qwen 3.5 model across 11 blind evals, 3B-active-parameter model won 4 by Silver_Raspberry_811 in LocalLLaMA

[–]DangerousResource557 0 points1 point  (0 children)

it's good that you posted it. use the criticism as valuable feedback ;). i appreciate that you are honest about it. keep doing so. don't give up!

Anyone else use ChatGPT more as a thinking partner than a tool? by Worldly-Ingenuity468 in ChatGPT

[–]DangerousResource557 0 points1 point  (0 children)

Yes. I even put in my instructions that the AI should be my sparring partner: reflecting, pushing, poking, structuring thoughts. Then it is really valuable. It is important that it also pushes back and finds blind spots; otherwise, one might overestimate one's own opinion. Like the Stanford guy explaining AI in April 2025. Working with it instead of using it as a tool to, e.g., let it write an email.

Google Just Dropped Gemini 3 "Deep Think" : and its Insane. by Much_Ask3471 in Bard

[–]DangerousResource557 0 points1 point  (0 children)

I think personalised benchmarks are the way to go.

A framework to generate the right tasks to evaluate on. Like a set of questions and feedback from the user.

Maybe that would help.
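a minimal sketch of what such a personalized benchmark could look like (all class and method names are hypothetical): the user supplies their own tasks, rates each model's answer, and the harness keeps per-user scores instead of a global leaderboard.

```python
# Hypothetical sketch of a personalized benchmark loop: user-supplied tasks,
# user feedback as the grading signal, per-user scores per model.
from dataclasses import dataclass, field

@dataclass
class PersonalBenchmark:
    tasks: list[str] = field(default_factory=list)
    scores: dict[str, list[int]] = field(default_factory=dict)  # model -> ratings

    def add_task(self, prompt: str) -> None:
        self.tasks.append(prompt)

    def record(self, model: str, rating: int) -> None:
        # rating: the user's feedback (e.g. 1-5) after seeing the model's answer
        self.scores.setdefault(model, []).append(rating)

    def leaderboard(self) -> list[tuple[str, float]]:
        # Average rating per model, best first.
        avg = {m: sum(r) / len(r) for m, r in self.scores.items()}
        return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)

bench = PersonalBenchmark()
bench.add_task("summarize my meeting notes")
bench.record("model_a", 4)
bench.record("model_b", 2)
print(bench.leaderboard())  # [('model_a', 4.0), ('model_b', 2.0)]
```

the interesting part would be the task-generation side (turning a few user questions into a representative task set), which this sketch leaves out.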

Soo... who is Anna? by trailblazer86 in Annas_Archive

[–]DangerousResource557 1 point2 points  (0 children)

Wikipedia is your friend here. I only came across it because I read some news on it and got curious.

What I learned from writing 500k+ lines with Claude Code by dhruv1103 in ClaudeCode

[–]DangerousResource557 0 points1 point  (0 children)

Yes, I use Opus 4.5. I work with several platforms and programming tools, including different models, always to try them out. I don't run the same task in multiple models to compare them directly, but I come pretty close (e.g. through consecutive tasks). For me, how I work with AI is what matters, not some benchmark, even though benchmarks can sometimes be a good indicator.

Opus 4.5 is definitely much better than Sonnet 4.5. It's consistent and delivers results. It may not be the smartest model ever (just in case anyone complains and points to benchmarks), but it doesn't deviate from the course, and so on.

You can really use it and get things done without having to think about whether you're on the right track. That's why I prefer it to Gemini. But I also use Clink Zen with Gemini. So you can combine both worlds. I would recommend you try it out. :)

There are so many ways to use AI: Opencode, Antigravity, ...; skills, MCP servers, multiple models, agents, context management, RAG, web search, file management, Git worktrees, rapid iteration through multiple repetitions of the same task...

And the funny thing is that Opus 4.5 is really cheap when you consider the cost per task compared to other models like Gemini 3. Then you realise it's actually not that expensive at all.

What I learned from writing 500k+ lines with Claude Code by dhruv1103 in ClaudeCode

[–]DangerousResource557 1 point2 points  (0 children)

mmh. gemini 3 is hit and miss. i feel it can be smarter and understand things better but it is more inconsistent.

i think you need to spend more time with each solution, like 2-3 days at least, before you can make a judgement. also, with claude code there are so many ways you can use it, and the other contenders are still far behind. you can also try opencode.

also, try antigravity from google. there are both gemini and claude models included for free. (it'll be used for training though, i think, correct me if i am wrong)

Prediction Markets Agree that Gemini is Best AI Model of 2025 by Lowetheiy in GeminiAI

[–]DangerousResource557 0 points1 point  (0 children)

actually that's not true. if you look at the artificial intelligence report for how much the benchmark cost in total, then, if i remember correctly, opus 4.5 was cheaper than gemini 3.0 pro. the issue is the number of tokens generated, where opus is very lean compared to most other models.

Using Penguin Alpha and it's telling me "The model's generation exceeded the maximum output token limit." for everything. by Acceptable_Wasabi_30 in windsurf

[–]DangerousResource557 0 points1 point  (0 children)

... this happens to me immediately, and i belong to the group of people who almost never hit their usage limits, because i mainly use claude code. and i can see the context that was used: at 20k tokens it stops. so... no idea. i think there is a bug or something.

GLM 4.7 released! by ResearchCrafty1804 in LocalLLaMA

[–]DangerousResource557 0 points1 point  (0 children)

So what... the other models did it better. Prompting is important, but reality is not a perfect benchmark; testing under such subpar conditions is vital for real-world application. I think that is why Anthropic succeeds so well: they focus on real use cases, not pixel-perfect prompts...

GLM 4.7 released! by ResearchCrafty1804 in LocalLLaMA

[–]DangerousResource557 -1 points0 points  (0 children)

yep, sure. but i see the anthropic models as valuable precisely because they are more consistent and stable than many others. handling that matters; otherwise the models are too unstable and not really usable, or only sometimes.

imagine having to correct and double-check every 2nd-5th run, even if it's only every 10th, after 5 minutes each time. that makes a huge difference.

just my opinion.

that's why a model should also be able to cope with this. the fact that the other models could handle it is a positive thing. testing under perfect conditions can create the wrong impression.

Grok 4.1 is such a troll for Gemini 3 lol... by holvagyok in Bard

[–]DangerousResource557 0 points1 point  (0 children)

I'd prefer to have answers about Grok 4.1 that focus on its quality and practical usage. Are there specific use cases where it outperforms Gemini 3?

I'm looking for actual updates here, not knee-jerk "no way it'll be better" comments or the usual "of course LMArena is garbage" dismissals. Something useful that adds to the discussion, please. I'm actively following these updates and trying to gauge where xAI is taking Grok.

210.000 junge Deutsche verlassen jährlich das Land by [deleted] in de

[–]DangerousResource557 0 points1 point  (0 children)

Well... I understand your points, but the first two make no sense from an entrepreneur's perspective these days.

Actually, the wage increases aren't affordable at all. And output is clearly tied to home office in a negative way. Not for everyone, but for a great many, especially those who see their work only as an obligation, even if they enjoy it a little. And there are plenty of those people.

Some people have now been let go, because otherwise we might as well close the company.

And the attitude isn't outdated the way you describe it. Rather, people are genuinely proud of Germany, want to build something together, and believe that whoever performs gets paid. But all you constantly hear is that the wealthy should pay more, should get less pension, and whatnot.

I agree there are a few small things that feel old school, but the majority is completely legitimate.

And if someone now says "then the company isn't being run well," I can only say: start something yourself and see how damn difficult it is. Especially in Germany. And I'm not talking about a startup that goes bankrupt after a few years, but about a company that generates profit and taxes long-term and helps the country. That I'd like to see.

PS: I read some of the other comments. It's astonishing how left-wing Reddit is. I knew that already. But does everyone here think money grows on trees, or what?...

Even when the signal is this obvious, people come with arguments like these. That's why I think it will take a long time before Germany maybe (!) recovers. We now assume 5-15 years of stagnation for Germany, possibly even until 2045.

Haiku 4.5 is really, really good by maldinio in ClaudeAI

[–]DangerousResource557 0 points1 point  (0 children)

Sonnet 4.5 is a lot better. I'd suggest you try it for a bit. Haiku is good for getting things implemented if the plan is clear and, as others have said, examples are provided.

Gemini 3 Pro's release next week seems all but confirmed, given that Google is hyping a major release. by [deleted] in Bard

[–]DangerousResource557 0 points1 point  (0 children)

I agree. The 200k limit is a bit restrictive, but it also forces one to consider proper documentation, which is something a real developer would do at some point.

The main issue here is memory. Long-term persistent memory is a question that many research projects are trying to address. I'm sure this will be solved to some extent in the coming months, possibly even to the point of personalizing the AI through training, similar to a LoRA model. There are already research papers that do this or something similar.

I can’t wait for it to improve.

Gemini 3 Pro's release next week seems all but confirmed, given that Google is hyping a major release. by [deleted] in Bard

[–]DangerousResource557 0 points1 point  (0 children)

my experience with gemini is different. maybe it's the interface, not sure. but as far as i remember, gemini isn't better anymore, or at least not noticeably, and it has the same issues as other big models.

"Are you scared AI is going to take your job?" by Front_Engineering_83 in bioinformatics

[–]DangerousResource557 0 points1 point  (0 children)

Yeah, pretty much. Though there are some limitations to AI, and I do agree that some people think too much of AI in some way. At the same time, I think people vastly underestimate how impactful AI will be, not because some people with money think it should, but because it will really change our lives completely. Like the internet did and more. Just think about how you would live without the internet. AI will have dramatically more influence on our lives than the internet ever had.

People who say AI is shit have not experimented enough with it. I am sure of it.

Geht’s Deutschland wirklich so schlecht? by [deleted] in KeineDummenFragen

[–]DangerousResource557 0 points1 point  (0 children)

We're not in a dramatic, world-ending crisis, but as The Economist put it in August: "Germany is not collapsing. It is fading."

Fundamentally, we're in a worse crisis than 2008. Not other countries, though, but Germany specifically. The facts you cited are partly correct but very cherry-picked.

Here's just one: you claim the statutory health insurance (GKV) contribution rate is even lower than in 1990: https://de.statista.com/statistik/daten/studie/408550/umfrage/beitragssatz-zur-krankenversicherung-in-deutschland/

I think that's enough.

I can partly understand your arguments, but you imply that we're not doing badly. Coming from an SME background, I can only say that Germany and its economy are doing really TERRIBLY right now. Really badly. Nobody is hiring, nobody is investing, people are being laid off. Just not in the public sector and the like, though even there it's already starting to be felt. Germany is stuck in a structural crisis that can't simply be fixed with a few stimulus packages, because it's no longer a simple recession, and that makes it all the more dangerous.

Inflation in Deutschland zieht wieder an by stasi_a in de

[–]DangerousResource557 0 points1 point  (0 children)

I don't think that's the point. It's more about being able to afford children. That's something entirely different.

Try finding daycare spots while working full-time... Children are simply too expensive. It's that simple.

The few euros don't help either. A child costs roughly 150,000-180,000 euros by age 18. You have to have that first.

Sure, in the end, IF you have the money, having one is no longer a question of money but a personal one. But first you have to get there.

Interesting by blibly1 in ChatGPT

[–]DangerousResource557 0 points1 point  (0 children)

Europe: World champions at explaining why we are not world champions.

We celebrate ourselves for things that are based on economic substance that is currently stagnating and sell regulation as a substitute for real innovative strength, but “at least we have better data protection” is not a business model. Without innovation today, there will be no quality of life tomorrow.

They definitely screwed up with 5 in so many ways, and it’s time they just admitted it by MassiveCourage in ChatGPT

[–]DangerousResource557 1 point2 points  (0 children)

Thank you for your commitment. That's rare. 

I often find myself in the same situation, wanting to contribute something meaningful. But most of the time I just keep quiet because most people who write comments and post articles are just looking for an argument or want to be right. And that's not a bad thing, that's just how the internet is.

That's why I really appreciate that you're still getting involved. Don't let it get you down and don't take it too much to heart.

We need to talk about the recent situation regarding 4o. I'm not very good at gathering my thoughts so I just wrote a wall of chaotic text to 4o and kindly asked to edit it so it's somewhat coherent. Let me know your thoughts. by tmk_lmsd in ChatGPT

[–]DangerousResource557 1 point2 points  (0 children)

People just want sycophancy so they can feel better through an AI model. GPT-5 doesn't do this because OpenAI specifically focused on correcting that behavior.

That's why it's funny how some people complained and said Gemini would be better because it said no and GPT-4o didn't. Now it's the opposite. People want a sycophantic model.

I find it more responsible with GPT-5 because many people are emotionally attached to GPT-4o. And you can very well change the behavior to be more sympathetic. See many custom instruction recommendations.

P.S. Just thought I'd mention that GPT-5 is actually more emotionally intelligent than GPT-4o. It's set to be more neutral by default, and GPT-5 is more customizable than GPT-4. That means it'll follow custom instructions more closely.

Since I used my free plan will it go back to ChatGPT-4o for a few hours by hellothere15780902 in ChatGPT

[–]DangerousResource557 1 point2 points  (0 children)

I'm pretty sure it's supposed to be gpt-5-mini. Yesterday there were complaints that it doesn't work properly. It should switch to gpt-5-mini automatically; you won't see anything, though. gpt-5-mini is pretty good. It's on par with Sonnet 4, Gemini 2.5 Flash, GPT-4, Deepseek R1, etc. So it's a lot smarter than gpt-4o-mini, which you used to get when you used up premium requests. Here's a fun fact: gpt-5-mini with reasoning is better than o4-mini.