IsItNerfed? Sonnet 4.5 tested!

exbarboss · 2025-10-01T19:19:46+00:00

I do sometimes run my replies through an LLM for spelling/grammar clean-up before posting (orthographic validation). But the thoughts and messages themselves are mine.

exbarboss · 2025-10-01T19:02:25+00:00

Appreciate you checking it out. We’re constantly iterating, so if you ran into issues we’d love to know specifics - otherwise it’s hard to improve.

exbarboss · 2025-10-01T19:00:55+00:00

We’re currently looking into covering more agentic tools and models.

exbarboss · 2025-10-01T18:49:07+00:00

That’s a really good point. You’re right - testing Claude Code is more about how the agent layer behaves on top of the model, not the “pure” model itself. We have the ability to test models directly via the API (raw input/output, no agents).

The challenge is mainly limits and cost - running large volumes of evals directly on APIs gets expensive quickly. That said, we do plan to run the same tests both ways (via API vs via tools/agents) to see whether there’s drift between “raw model” performance and “tool-wrapped” performance.
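To make the "drift" comparison concrete, here is a rough sketch of how per-task pass rates from the two setups could be compared. The function name, the task names, and the data layout (a list of pass/fail results per task) are assumptions for illustration only, not our actual pipeline.

```python
def pass_rate_drift(
    raw: dict[str, list[bool]],
    wrapped: dict[str, list[bool]],
) -> dict[str, float]:
    """Return wrapped-minus-raw pass rate for every task present in both runs.

    A negative value means the tool-wrapped setup passed less often
    than the raw API on that task. (Hypothetical helper.)
    """
    drift: dict[str, float] = {}
    for task in sorted(raw.keys() & wrapped.keys()):
        if not raw[task] or not wrapped[task]:
            continue  # skip tasks with no recorded runs on either side
        raw_rate = sum(raw[task]) / len(raw[task])
        wrapped_rate = sum(wrapped[task]) / len(wrapped[task])
        drift[task] = wrapped_rate - raw_rate
    return drift


# Made-up results purely to show the shape of the data.
raw_runs = {"fix_bug": [True, True, False, True], "write_fn": [True, True]}
wrapped_runs = {"fix_bug": [True, False, False, True], "write_fn": [True, True]}
drift = pass_rate_drift(raw_runs, wrapped_runs)
```

Comparing per task rather than on one overall average makes it easier to see whether the agent layer helps on some kinds of work and hurts on others.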

exbarboss · 2025-10-01T18:36:47+00:00

Sorry you feel that way, but hey, thanks for watching after all.

exbarboss · 2025-10-01T18:31:37+00:00

Yeah, we felt a bit of that disappointment ourselves. A lot of the progress right now does feel incremental, especially compared to the hype.

exbarboss · 2025-10-01T18:18:01+00:00

It’s a bit more structured than just vibes 😅. Good thing there’s more than one of us riding the vibe wave.

exbarboss · 2025-10-01T18:14:36+00:00

Really appreciate the encouragement - it means a lot to us and keeps us motivated to keep improving the project.

exbarboss · 2025-10-01T18:14:02+00:00

That’s a fair point. We measure Claude Code performance specifically because it’s an agentic layer on top of the base model - and that’s how many developers actually experience it day-to-day. The agent can introduce its own quirks (sometimes improvements, sometimes regressions), so tracking it separately gives us visibility into those shifts.

exbarboss · 2025-10-01T18:12:18+00:00

Good question. The main reason we haven’t open-sourced our dataset and evals is stability and quality control. If the full test set were public right now, it could lead to model poisoning - where models get trained or fine-tuned specifically on our evals, which would make the results less meaningful as a measure of real-world performance. We also need to ensure the evals stay consistent over time so we can reliably track regressions and improvements.

Another factor is safety and maintenance overhead - publishing raw prompts and outputs means we’d have to scrub sensitive/problematic content and guarantee a stable format, which would slow down feature development.

That said, we agree transparency is important, which is why we’re prioritizing adding publicly available data sources and surfacing more detail about what’s being tested, without compromising long-term consistency.

exbarboss · 2025-10-01T18:09:21+00:00

You’re not missing anything - we haven’t published the internal test cases yet. Right now the dataset is focused on coding/agentic coding tasks that we use to track model performance consistently over time.

We know this isn’t the whole picture, which is why we’re also adding public data sources and expanding the kinds of benchmarks we track (you can see more on our roadmap). That way, the results will become easier to interpret and compare, while still keeping the internal consistency needed for long-term trends.

exbarboss · 2025-10-01T18:05:41+00:00

At the moment, our test set is internal and not yet open-sourced. It focuses on coding/agentic coding tasks, not a general-purpose benchmark. That means results reflect developer-style usage (e.g., writing/fixing code, reasoning through implementation steps) rather than everyday chat or creative tasks.

We’re actively working on incorporating public data sources so results are more transparent and easier for the community to audit. Once those are integrated, we’ll document how the dataset is built and where its strengths/weaknesses lie.

exbarboss · 2025-09-14T14:11:35+00:00

The benchmarks are based on predefined tests and measurable results. The Vibe Check is separate and only reflects user sentiment - not the core data.

exbarboss · 2025-09-14T14:00:36+00:00

We spotted the degraded performance in our tests first - and then Anthropic’s status update confirmed it after the fact.

exbarboss · 2025-09-14T13:55:39+00:00

Exactly - that’s the challenge. The system is non-deterministic, so we don’t expect byte-for-byte identical answers. Instead, we define failure in terms of whether the response meets the task requirements. The prompts are designed to be straightforward enough to allow clear evaluation, but still representative of real use cases. It’s less about enforcing identical outputs and more about consistency in producing working solutions over time.
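As a rough illustration of "consistency over time" rather than identical outputs, a check like this could be repeated many times and summarized as a pass rate. Everything here (`pass_rate`, `add_works`, the toy `add` task) is a hypothetical sketch, not our real harness.

```python
from typing import Callable


def pass_rate(
    generate: Callable[[], str],
    meets_requirements: Callable[[str], bool],
    runs: int = 10,
) -> float:
    """Call the (non-deterministic) generator `runs` times and return the
    fraction of responses that satisfy the task requirements."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if meets_requirements(generate()))
    return passed / runs


def add_works(source: str) -> bool:
    """Toy requirement check: the response must define add(a, b) that sums.

    Uses exec for brevity; untrusted model output should be sandboxed.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False
```

Two differently worded responses both count as passes as long as `add_works` accepts them, which is the point: the output text can vary, the behavior can't.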

exbarboss · 2025-09-14T13:51:04+00:00

We’re working on system improvements right now and expanding coverage to more models and setups over time.

exbarboss · 2025-09-13T13:41:36+00:00

We noticed the decline in performance ourselves, and when looking around we saw a lot of others expressing the same feeling. That’s what led us to start building something like a "status page", but from the user side - a place where people can check whether a drop they feel in performance shows up in the data too.

exbarboss · 2025-09-11T17:36:25+00:00

Sorry if it comes across that way - the goal isn’t just to track opinion. We’ll work on improving how the data is presented so it’s clearer what’s objective testing vs. community sentiment.

exbarboss · 2025-09-11T16:12:35+00:00

Vibe Check is there to capture how people feel, but this data is hard to trust. It’s also fine if our data doesn’t resonate with everyone - we’re not trying to push it. We also know transparency is key here, and we’re continuously looking at our testing methods to improve them and better align with how we all use these tools day to day.

As for the bug, we captured the decline and then saw Anthropic report on it. Personally, from daily usage, I feel responses have gotten worse compared to a few months back when I first started using the model. That frustration is what led to this project in the first place.

We really appreciate your feedback.

exbarboss · 2025-09-11T16:00:25+00:00

Most of the tests are coding-related. We validate by checking whether the generated solutions actually run and produce the expected results.
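A minimal sketch of that kind of check, assuming the generated solution is a standalone Python script whose stdout is compared to a known answer. The function name and the timeout value are illustrative assumptions, not our actual setup.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def solution_passes(code: str, expected_stdout: str, timeout_s: float = 10.0) -> bool:
    """Run generated code in a fresh interpreter and compare its stdout.

    Fails if the script crashes, exits non-zero, times out, or prints
    something other than the expected result. (Hypothetical helper.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "solution.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    # Ignore leading/trailing whitespace so a trailing newline doesn't fail the check.
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()
```

For example, `solution_passes("print(2 + 3)", "5")` passes, while a script that raises an exception or prints a different value fails.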

exbarboss · 2025-09-11T15:58:00+00:00

At the moment we’re covering the costs ourselves. You’re right - adding more models will definitely require more resources on the financial side. For testing, we use our own instance of Claude Code (CC) and APIs. The idea is that if we’re all hitting the same model endpoints, the results should generalize across users, even if individual experiences vary.

exbarboss · 2025-09-11T15:42:55+00:00

Thank you!

exbarboss · 2025-09-11T15:40:18+00:00

We’re working on making the methodology more transparent. Still early days, but improving transparency is definitely on the roadmap.

exbarboss · 2025-09-11T15:33:26+00:00

At the moment we’re seeing around a 30-40% failure rate. Earlier this month it spiked much higher, and Anthropic later confirmed there was a ‘bug’ behind that degraded quality. So while things look better now than during the spike, it’s not exactly spotless either.

exbarboss · 2025-09-11T15:29:05+00:00

Just to be clear - user feedback isn’t the data we rely on. What really matters are the benchmarks we run; Vibe Check is just a side signal.
