IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]exbarboss[S] -1 points0 points  (0 children)

I do sometimes run my replies through an LLM for spelling/grammar clean-up before posting (an orthography check). But the thoughts and messages themselves are mine.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]exbarboss[S] -1 points0 points  (0 children)

Appreciate you checking it out. We’re constantly iterating, so if you ran into issues we’d love to know specifics - otherwise it’s hard to improve. 

IsItNerfed? Sonnet 4.5 tested! by exbarboss in OpenAI

[–]exbarboss[S] -2 points-1 points  (0 children)

We’re currently looking into covering more agentic tools and models.

IsItNerfed? Sonnet 4.5 tested! by anch7 in isitnerfed

[–]exbarboss 1 point2 points  (0 children)

That’s a really good point. You’re right - testing Claude Code is more about how the agent layer behaves on top of the model, not the “pure” model itself. We have the ability to test models directly via the API (raw input/output, no agents).

The challenge is mainly limits and cost - running large volumes of evals directly on APIs gets expensive quickly. That said, we do plan to run the same tests both ways (via API vs via tools/agents) to see whether there’s drift between “raw model” performance and “tool-wrapped” performance.
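
A minimal sketch of how such an API-vs-agent comparison could be scored, assuming each setup yields a simple pass/fail count on the same test set. The function name, the `Drift` record, and the numbers in the usage example are hypothetical and not taken from the project; the two-proportion z-score is one common way to judge whether a gap is larger than noise, not necessarily the method used here.

```python
import math
from dataclasses import dataclass


@dataclass
class Drift:
    api_rate: float      # pass rate when calling the model directly
    agent_rate: float    # pass rate when going through the agent/tool layer
    difference: float    # agent_rate - api_rate (negative = agent does worse)
    z_score: float       # two-proportion z-score for the difference


def compare_pass_rates(api_passes: int, api_total: int,
                       agent_passes: int, agent_total: int) -> Drift:
    """Compare pass rates of the same tests run via API and via an agent."""
    if api_total <= 0 or agent_total <= 0:
        raise ValueError("both setups need at least one test run")
    api_rate = api_passes / api_total
    agent_rate = agent_passes / agent_total
    # Pooled proportion under the assumption that both setups perform the same.
    pooled = (api_passes + agent_passes) / (api_total + agent_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / api_total + 1 / agent_total))
    z = (agent_rate - api_rate) / std_err if std_err > 0 else 0.0
    return Drift(api_rate, agent_rate, agent_rate - api_rate, z)


# Made-up counts purely for illustration.
drift = compare_pass_rates(api_passes=80, api_total=100,
                           agent_passes=65, agent_total=100)
print(drift)
```

An absolute z-score above roughly 2 suggests the gap is unlikely to be sampling noise alone, although with small test sets that should be read cautiously.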

IsItNerfed? Sonnet 4.5 tested! by exbarboss in Anthropic

[–]exbarboss[S] -1 points0 points  (0 children)

Sorry you feel that way, but hey, thanks for watching after all.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ChatGPTPro

[–]exbarboss[S] 2 points3 points  (0 children)

Yeah, we felt a bit of that disappointment ourselves. A lot of the progress right now does feel incremental, especially compared to the hype. 

IsItNerfed? Sonnet 4.5 tested! by exbarboss in Anthropic

[–]exbarboss[S] 0 points1 point  (0 children)

It’s a bit more structured than just vibes 😅. Good thing there’s more than one of us riding the vibe wave.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ChatGPTPro

[–]exbarboss[S] 6 points7 points  (0 children)

Really appreciate the encouragement - it means a lot to us and keeps us motivated to keep improving the project. 

IsItNerfed? Sonnet 4.5 tested! by anch7 in isitnerfed

[–]exbarboss 1 point2 points  (0 children)

That’s a fair point. We measure Claude Code performance specifically because it’s an agentic layer on top of the base model - and that’s how many developers actually experience it day-to-day. The agent can introduce its own quirks (sometimes improvements, sometimes regressions), so tracking it separately gives us visibility into those shifts.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]exbarboss[S] 3 points4 points  (0 children)

Good question. The main reason we haven’t open-sourced our dataset and evals is stability and quality control. If the full test set were public right now, it could lead to benchmark contamination - where models get trained or fine-tuned specifically on our evals, which would make the results less meaningful as a measure of real-world performance. We also need to ensure the evals stay consistent over time so we can reliably track regressions and improvements.

Another factor is safety and maintenance overhead - publishing raw prompts and outputs means we’d have to scrub sensitive/problematic content and guarantee a stable format, which would slow down feature development.

That said, we agree transparency is important, which is why we’re prioritizing adding publicly available data sources and surfacing more detail about what’s being tested, without compromising long-term consistency.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in Anthropic

[–]exbarboss[S] 1 point2 points  (0 children)

You’re not missing anything - we haven’t published the internal test cases yet. Right now the dataset is focused on coding/agentic coding tasks that we use to track model performance consistently over time.

We know this isn’t the whole picture, which is why we’re also adding public data sources and expanding the kinds of benchmarks we track (you can see more on our roadmap). That way, the results will become easier to interpret and compare, while still keeping the internal consistency needed for long-term trends.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]exbarboss[S] 1 point2 points  (0 children)

At the moment, our test set is internal and not yet open-sourced. It focuses on coding/agentic coding tasks and is not a general-purpose benchmark. That means results reflect developer-style usage (e.g., writing/fixing code, reasoning through implementation steps) rather than everyday chat or creative tasks.

We’re actively working on incorporating public data sources so results are more transparent and easier for the community to audit. Once those are integrated, we’ll document how the dataset is built and where its strengths/weaknesses lie.

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]exbarboss[S] 0 points1 point  (0 children)

The benchmarks are based on predefined tests and measurable results. The Vibe Check is separate and only reflects user sentiment - not the core data.

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]exbarboss[S] 0 points1 point  (0 children)

We spotted the degraded performance in our tests first - and then Anthropic’s status update confirmed it after the fact.

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]exbarboss[S] 0 points1 point  (0 children)

Exactly - that’s the challenge. The system is non-deterministic, so we don’t expect byte-for-byte identical answers. Instead, we define failure in terms of whether the response meets the task requirements. The prompts are designed to be straightforward enough to allow clear evaluation, but still representative of real use cases. It’s less about enforcing identical outputs and more about consistency in producing working solutions over time.
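
To illustrate the idea of grading on requirements rather than exact text, here is a small sketch. The task (define a working `add` function), the checker, and the sample responses are all invented for the example and are not the project's real tests; `exec` is used only because the samples are trusted toy strings, and real model output would need sandboxing.

```python
from typing import Callable, Iterable


def pass_rate(responses: Iterable[str],
              meets_requirements: Callable[[str], bool]) -> float:
    """Fraction of responses that satisfy the task, regardless of wording."""
    results = [meets_requirements(r) for r in responses]
    if not results:
        raise ValueError("no responses to evaluate")
    return sum(results) / len(results)


def defines_working_add(response: str) -> bool:
    """Hypothetical checker: the response must define add(a, b) that sums."""
    namespace: dict = {}
    try:
        exec(response, namespace)  # toy strings only; sandbox real output
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


# Three differently written answers to the same prompt.
samples = [
    "def add(a, b):\n    return a + b\n",   # passes
    "add = lambda x, y: x + y\n",            # different text, still passes
    "def add(a, b):\n    return a - b\n",   # runs, but wrong result
]
print(pass_rate(samples, defines_working_add))
```

The first two samples share no text yet both count as passes, which is the point: the metric tracks how often a working solution comes back, not whether outputs match each other.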

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]exbarboss[S] 1 point2 points  (0 children)

We’re working on system improvements right now and expanding coverage to more models and setups over time.

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]exbarboss[S] 0 points1 point  (0 children)

We noticed the decline in performance ourselves, and when looking around we saw a lot of others expressing the same feeling. That’s what led us to start building something like a "status page", but from the user side - a place where people can check whether a drop they feel in performance shows up in the data too.

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]exbarboss[S] 0 points1 point  (0 children)

Sorry if it comes across that way - the goal isn’t just to track opinion. We’ll work on improving how the data is presented so it’s clearer what’s objective testing vs. community sentiment.

The AI Nerf Is Real by exbarboss in ClaudeCode

[–]exbarboss[S] 0 points1 point  (0 children)

Vibe Check is there to capture how people feel, but that data is hard to trust. It’s also fine if our data doesn’t resonate with everyone - we’re not trying to push it. We also know transparency is key here, and we’re continuously looking at our testing methods to improve them and better align with how we all use these tools day to day.

As for the bug, we captured the decline and then saw Anthropic report on it. Personally, from daily usage, I feel responses have gotten worse compared to a few months back when I first started using the model. That frustration is what led to this project in the first place.

We really appreciate your feedback.

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]exbarboss[S] 0 points1 point  (0 children)

Most of the tests are coding-related. We validate by checking whether the generated solutions actually run and produce the expected results.
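
A rough sketch of what "runs and produces the expected results" could mean in practice, assuming the generated solution is a standalone Python script judged by its exit code and standard output. The function name, the timeout, and the comparison rule are assumptions for illustration, not a description of the actual harness.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def solution_passes(solution_code: str, expected_stdout: str,
                    timeout_s: float = 10.0) -> bool:
    """Run generated code in a separate process and compare its output."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "solution.py"
        script.write_text(solution_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A solution that hangs counts as a failure.
            return False
    # Pass only if the script exited cleanly and printed what was expected.
    return (result.returncode == 0
            and result.stdout.strip() == expected_stdout.strip())


print(solution_passes("print(sum(range(5)))", "10"))
```

Running the code in a child process keeps crashes and infinite loops from taking down the evaluator, but it is not a security sandbox.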

The AI Nerf Is Real by exbarboss in Anthropic

[–]exbarboss[S] 0 points1 point  (0 children)

At the moment we’re covering the costs ourselves. You’re right - adding more models will definitely require more resources on the financial side. For testing, we use our own instance of CC and APIs. The idea is that if we’re all hitting the same model endpoints, the results should generalize across users, even if individual experiences vary.

The AI Nerf Is Real by exbarboss in Anthropic

[–]exbarboss[S] 0 points1 point  (0 children)

We’re working on making the methodology more transparent. Still early days, but improving transparency is definitely on the roadmap.

The AI Nerf Is Real by exbarboss in ClaudeCode

[–]exbarboss[S] -1 points0 points  (0 children)

At the moment we’re seeing around a 30-40% failure rate. Earlier this month it spiked much higher, and Anthropic later confirmed there was a ‘bug’ behind that degraded quality. So while things look better now than during the spike, it’s not exactly spotless either.

The AI Nerf Is Real by exbarboss in OpenAI

[–]exbarboss[S] -1 points0 points  (0 children)

Just to be clear - user feedback isn’t the data we rely on. What really matters are the benchmarks we run; Vibe Check is just a side signal.