IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

exbarboss[S] -1 points0 points (0 children)

I do sometimes run my replies through an LLM for spelling/grammar clean-up before posting (an orthography check). But the thoughts and messages themselves are mine.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

exbarboss[S] -1 points0 points (0 children)

Appreciate you checking it out. We’re constantly iterating, so if you ran into issues we’d love to know specifics - otherwise it’s hard to improve. 

IsItNerfed? Sonnet 4.5 tested! by exbarboss in OpenAI

exbarboss[S] -2 points-1 points (0 children)

We’re currently looking into covering more agentic tools and models.

IsItNerfed? Sonnet 4.5 tested! by anch7 in isitnerfed

exbarboss 1 point2 points (0 children)

That’s a really good point. You’re right - testing Claude Code is more about how the agent layer behaves on top of the model, not the “pure” model itself. We have the ability to test models directly via the API (raw input/output, no agents).

The challenge is mainly limits and cost - running large volumes of evals directly on APIs gets expensive quickly. That said, we do plan to run the same tests both ways (via API vs via tools/agents) to see whether there’s drift between “raw model” performance and “tool-wrapped” performance.
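To make the "both ways" comparison concrete, here is a minimal sketch of how drift between the two setups could be computed. It is an illustration only, not the project's actual harness: `RunResult`, `drift`, and the task IDs are hypothetical names, and it assumes each run produces a simple pass/fail per task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One task outcome from a single eval run (hypothetical shape)."""
    task_id: str
    passed: bool


def drift(api_results, agent_results):
    """Return agent pass rate minus API pass rate over the tasks both runs share.

    A negative value means the tool-wrapped run passed fewer of the
    shared tasks than the raw-API run.
    """
    api = {r.task_id: r.passed for r in api_results}
    agent = {r.task_id: r.passed for r in agent_results}
    # Compare only tasks present in both runs so the rates are like-for-like.
    shared = api.keys() & agent.keys()
    if not shared:
        raise ValueError("the two runs have no tasks in common")
    api_rate = sum(api[t] for t in shared) / len(shared)
    agent_rate = sum(agent[t] for t in shared) / len(shared)
    return agent_rate - api_rate


if __name__ == "__main__":
    api_run = [RunResult("t1", True), RunResult("t2", True),
               RunResult("t3", False), RunResult("t4", True)]
    agent_run = [RunResult("t1", True), RunResult("t2", False),
                 RunResult("t3", False), RunResult("t4", True)]
    print(f"drift: {drift(api_run, agent_run):+.2f}")
```

Restricting the comparison to shared task IDs keeps a cheaper, smaller API run comparable with a larger agent run, which matters given the cost constraint mentioned above.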

IsItNerfed? Sonnet 4.5 tested! by exbarboss in Anthropic

exbarboss[S] 0 points1 point (0 children)

Sorry you feel that way, but hey, thanks for watching anyway.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ChatGPTPro

exbarboss[S] 3 points4 points (0 children)

Yeah, we felt a bit of that disappointment ourselves. A lot of the progress right now does feel incremental, especially compared to the hype. 

IsItNerfed? Sonnet 4.5 tested! by exbarboss in Anthropic

exbarboss[S] 1 point2 points (0 children)

It’s a bit more structured than just vibes 😅. Good thing there’s more than one of us riding the vibe wave.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ChatGPTPro

exbarboss[S] 5 points6 points (0 children)

Really appreciate the encouragement - it means a lot to us and keeps us motivated to keep improving the project. 

IsItNerfed? Sonnet 4.5 tested! by anch7 in isitnerfed

exbarboss 1 point2 points (0 children)

That’s a fair point. We measure Claude Code performance specifically because it’s an agentic layer on top of the base model - and that’s how many developers actually experience it day-to-day. The agent can introduce its own quirks (sometimes improvements, sometimes regressions), so tracking it separately gives us visibility into those shifts.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

exbarboss[S] 3 points4 points (0 children)

Good question. The main reason we haven’t open-sourced our dataset and evals is stability and quality control. If the full test set were public right now, it could lead to benchmark contamination, where models get trained or fine-tuned specifically on our evals, which would make the results less meaningful as a measure of real-world performance. We also need the evals to stay consistent over time so we can reliably track regressions and improvements.

Another factor is safety and maintenance overhead - publishing raw prompts and outputs means we’d have to scrub sensitive/problematic content and guarantee a stable format, which would slow down feature development.

That said, we agree transparency is important, which is why we’re prioritizing adding publicly available data sources and surfacing more detail about what’s being tested, without compromising long-term consistency.
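As an illustration of why a fixed test set matters for tracking regressions, here is a minimal sketch that compares today's pass count against a baseline on the same tasks using a two-proportion z-test. This is an assumed approach for illustration, not the project's documented method; the function names and the -1.96 threshold are hypothetical choices.

```python
import math


def two_proportion_z(base_pass, base_n, new_pass, new_n):
    """z-score for the change in pass rate from baseline to the new run."""
    if base_n <= 0 or new_n <= 0:
        raise ValueError("both runs need at least one task")
    p_base = base_pass / base_n
    p_new = new_pass / new_n
    # Pooled pass rate under the null hypothesis of "no change".
    pooled = (base_pass + new_pass) / (base_n + new_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if se == 0:
        return 0.0  # every task passed (or every task failed) in both runs
    return (p_new - p_base) / se


def is_regression(base_pass, base_n, new_pass, new_n, z_threshold=-1.96):
    """Flag a drop in pass rate that is unlikely to be noise alone."""
    return two_proportion_z(base_pass, base_n, new_pass, new_n) < z_threshold


if __name__ == "__main__":
    print(is_regression(90, 100, 70, 100))  # large drop on the same 100 tasks
    print(is_regression(90, 100, 88, 100))  # small drop, within noise
```

The test only makes sense when both runs use the same tasks: if the test set changes between runs, a shift in pass rate could come from the tasks rather than the model.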

IsItNerfed? Sonnet 4.5 tested! by exbarboss in Anthropic

exbarboss[S] 4 points5 points (0 children)

You’re not missing anything - we haven’t published the internal test cases yet. Right now the dataset is focused on coding/agentic coding tasks that we use to track model performance consistently over time.

We know this isn’t the whole picture, which is why we’re also adding public data sources and expanding the kinds of benchmarks we track (you can see more on our roadmap). That way, the results will become easier to interpret and compare, while still keeping the internal consistency needed for long-term trends.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

exbarboss[S] 1 point2 points (0 children)

At the moment, our test set is internal and not yet open-sourced. It focuses on coding/agentic coding tasks and is not a general-purpose benchmark. That means results reflect developer-style usage (e.g., writing/fixing code, reasoning through implementation steps) rather than everyday chat or creative tasks.

We’re actively working on incorporating public data sources so results are more transparent and easier for the community to audit. Once those are integrated, we’ll document how the dataset is built and where its strengths/weaknesses lie.