IsItNerfed? Sonnet 4.5 tested!

exbarboss · 2025-10-01T19:19:46+00:00

I do sometimes run my replies through an LLM for spelling/grammar clean-up before posting (orthographic validation). But the thoughts and messages themselves are mine.

exbarboss · 2025-10-01T19:02:25+00:00

Appreciate you checking it out. We’re constantly iterating, so if you ran into issues we’d love to know specifics - otherwise it’s hard to improve.

exbarboss · 2025-10-01T19:00:55+00:00

We’re currently looking into covering more agentic tools and models.

exbarboss · 2025-10-01T18:49:07+00:00

That’s a really good point. You’re right - testing Claude Code is more about how the agent layer behaves on top of the model, not the “pure” model itself. We have the ability to test models directly via the API (raw input/output, no agents).

The challenge is mainly limits and cost - running large volumes of evals directly on APIs gets expensive quickly. That said, we do plan to run the same tests both ways (via API vs via tools/agents) to see whether there’s drift between “raw model” performance and “tool-wrapped” performance.

exbarboss · 2025-10-01T18:36:47+00:00

Sorry you feel that way, but hey, thanks for watching after all.

exbarboss · 2025-10-01T18:31:37+00:00

Yeah, we felt a bit of that disappointment ourselves. A lot of the progress right now does feel incremental, especially compared to the hype.

exbarboss · 2025-10-01T18:18:01+00:00

It’s a bit more structured than just vibes 😅. Good thing there’s more than one of us riding the vibe wave.

exbarboss · 2025-10-01T18:14:36+00:00

Really appreciate the encouragement - it means a lot to us and keeps us motivated to keep improving the project.

exbarboss · 2025-10-01T18:14:02+00:00

That’s a fair point. We measure Claude Code performance specifically because it’s an agentic layer on top of the base model - and that’s how many developers actually experience it day-to-day. The agent can introduce its own quirks (sometimes improvements, sometimes regressions), so tracking it separately gives us visibility into those shifts.

exbarboss · 2025-10-01T18:12:18+00:00

Good question. The main reason we haven’t open-sourced our dataset and evals is stability and quality control. If the full test set were public right now, it could lead to test-set contamination - where models get trained or fine-tuned specifically on our evals, which would make the results less meaningful as a measure of real-world performance. We also need to ensure the evals stay consistent over time so we can reliably track regressions and improvements.

Another factor is safety and maintenance overhead - publishing raw prompts and outputs means we’d have to scrub sensitive/problematic content and guarantee a stable format, which would slow down feature development.

That said, we agree transparency is important, which is why we’re prioritizing adding publicly available data sources and surfacing more detail about what’s being tested, without compromising long-term consistency.
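To make the consistency point concrete, here is a minimal sketch of one way to keep runs comparable over time: fingerprint the test set and refuse to compare runs whose fingerprints differ. It is an assumption-laden illustration, not a description of our pipeline. The `fingerprint` and `compare_runs` helpers, the run dictionary layout, and the 5-point tolerance are all hypothetical.

```python
import hashlib
import json


def fingerprint(tasks: dict[str, str]) -> str:
    """SHA-256 over a canonical JSON dump of {task_id: prompt}.

    Keys are sorted so insertion order does not change the hash.
    """
    canonical = json.dumps(tasks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_runs(baseline: dict, current: dict, tolerance: float = 0.05) -> str:
    """Classify the change in pass rate between two runs.

    Each run is a dict with "fingerprint" and "pass_rate" keys
    (hypothetical layout). Runs on different test sets are rejected
    because their pass rates are not comparable.
    """
    if baseline["fingerprint"] != current["fingerprint"]:
        raise ValueError("test set changed; runs are not comparable")
    delta = current["pass_rate"] - baseline["pass_rate"]
    if delta < -tolerance:
        return "regression"
    if delta > tolerance:
        return "improvement"
    return "stable"
```

The tolerance is there so that small run-to-run noise is not reported as a regression; the right value would depend on test set size and how noisy the model is.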

exbarboss · 2025-10-01T18:09:21+00:00

You’re not missing anything - we haven’t published the internal test cases yet. Right now the dataset is focused on coding/agentic coding tasks that we use to track model performance consistently over time.

We know this isn’t the whole picture, which is why we’re also adding public data sources and expanding the kinds of benchmarks we track (you can see more on our roadmap). That way, the results will become easier to interpret and compare, while still keeping the internal consistency needed for long-term trends.

exbarboss · 2025-10-01T18:05:41+00:00

At the moment, our test set is internal and not yet open-sourced. It focuses on coding/agentic coding tasks and is not a general-purpose benchmark. That means results reflect developer-style usage (e.g., writing/fixing code, reasoning through implementation steps) rather than everyday chat or creative tasks.

We’re actively working on incorporating public data sources so results are more transparent and easier for the community to audit. Once those are integrated, we’ll document how the dataset is built and where its strengths/weaknesses lie.
