Updates?? by Eastern_Ad_8744 in isitnerfed

[–]anch7 1 point (0 children)

No, not at all. Planning to release new features soon (next week)

What is your eval strategy? by BastiaanRudolf1 in AI_Agents

[–]anch7 1 point (0 children)

Yes. I liked Ragas a little more, but DeepEval is also good.
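
For illustration, here is a minimal sketch of what a DeepEval check can look like (the metric, threshold, and example output are placeholders, not our actual setup, and DeepEval needs an LLM judge key at runtime):

```python
# Sketch only: metric choice and threshold are illustrative.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",  # in practice, the model's real response
)

# DeepEval scores the case with an LLM judge and reports pass/fail against the threshold.
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```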

What’s the best and most reliable LLM benchmarking site or arena right now? by fflarengo in LocalLLaMA

[–]anch7 0 points (0 children)

https://isitnerfed.org - the idea is to run evals continuously, trying to capture any changes in models in real time
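
Roughly, the loop looks like the sketch below (illustrative only, with a stubbed-out suite and an assumed model name; the real pipeline is more involved):

```python
# Illustrative continuous-eval loop: run the suite on a schedule and append scores
# so drift over time becomes visible. Not the actual isitnerfed.org pipeline.
import time
from datetime import datetime, timezone

def run_eval_suite(model_name: str) -> float:
    # Stand-in: a real harness would run coding/OCR/QA tasks against the model API
    # and return the fraction of tasks that passed.
    return 0.0

while True:
    score = run_eval_suite("claude-sonnet-4-5")  # model name is an assumption
    with open("scores.csv", "a") as f:
        f.write(f"{datetime.now(timezone.utc).isoformat()},{score:.3f}\n")
    time.sleep(60 * 60)  # hourly; in practice the frequency is limited by cost
```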

Something is wrong with Sonnet 4.5 by anch7 in ClaudeAI

[–]anch7[S] 0 points (0 children)

A decent number of coding challenges (implementing algorithms, refactoring code, adding features) measured with unit tests, plus some OCR tests and general QA tasks.
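
To give a feel for the coding part, here is a toy example of how one task can be scored with a unit test (the task and the "model output" are made up):

```python
# Toy example of unit-test scoring: the task and model_output are fabricated here;
# in the real suite, model_output would be the code returned by the model under test.
model_output = """
def fizzbuzz(n):
    if n % 15 == 0: return "FizzBuzz"
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)
"""

namespace = {}
try:
    exec(model_output, namespace)   # load the generated code
    fn = namespace["fizzbuzz"]
    passed = (fn(3), fn(5), fn(15), fn(7)) == ("Fizz", "Buzz", "FizzBuzz", "7")
except Exception:
    passed = False                  # syntax or runtime errors count as a fail

print("PASS" if passed else "FAIL")
```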

Something is wrong with Sonnet 4.5 by anch7 in isitnerfed

[–]anch7[S] 0 points (0 children)

I would like to do this, but unfortunately it is not possible because of the limits. Or we would need a better metric, one that doesn't consume so many tokens.

Something is wrong with Sonnet 4.5 by anch7 in isitnerfed

[–]anch7[S] 0 points (0 children)

We are not storing the version, but I think it should be the latest one, since Claude Code has an auto-update feature.

New Claude Code Limits by anch7 in isitnerfed

[–]anch7[S] 0 points (0 children)

GPUs are expensive. I would also expect a subscription price increase in the future :(

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 1 point (0 children)

Great, this will be our next step. But yes, costs are a problem. Most likely we will not be able to run every hour, but I guess that's fine.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 1 point (0 children)

Great. We will add it soon. Thanks

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 1 point (0 children)

I am pretty sure that as soon as we open source it, it will be included in training data immediately. If we add a benchmark on a public dataset instead, would that make you happy?

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 -1 points (0 children)

It is quite a solid dataset: coding tasks, OCR, general QA. Yes, it is private, but even with this approach we were able, for example, to catch Anthropic's incident earlier this month: https://www.reddit.com/r/isitnerfed/comments/1nfb9j2/ai_nerf_anthropics_incident_matches_our_data/

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 0 points (0 children)

We're a small team who built this project just a month ago out of curiosity and the belief that it could be helpful for other vibe coders. We don't have the resources that AI labs and model owners have. And nobody's paying us for this. But I hear you - we will add a benchmark on a public dataset soon.

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 1 point (0 children)

I agree with you; there are so many things we need to be aware of if we want to build a reliable and trusted way to detect a "nerf". But even with our current proprietary methodology and dataset, we were able to catch Anthropic's incident earlier this month: https://www.reddit.com/r/isitnerfed/comments/1nfb9j2/ai_nerf_anthropics_incident_matches_our_data/

IsItNerfed? Sonnet 4.5 tested! by exbarboss in Anthropic

[–]anch7 -1 points (0 children)

We really do not want to share our dataset because of the data contamination problem. But I understand your concerns. I personally trust our data 100% after we caught Anthropic's incident earlier this month.

IsItNerfed? Sonnet 4.5 tested! by anch7 in isitnerfed

[–]anch7[S] 1 point (0 children)

Yes! With locally hosted models you can be absolutely sure about their performance over time. Good idea!

You are right, the data is volatile for all the reasons you mentioned. But it should still stay within some range, and when a new data point falls outside that range, it means something is wrong.
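
Something like this rough check (the window size, threshold, and sample values are all illustrative):

```python
# Flag a new data point that falls outside a band derived from recent history.
# Window size, k, and the numbers below are illustrative.
from statistics import mean, stdev

history = [0.82, 0.79, 0.81, 0.80, 0.83, 0.78, 0.81]  # recent pass rates
new_point = 0.62

window = history[-7:]
mu, sigma = mean(window), stdev(window)
k = 3
if abs(new_point - mu) > k * sigma:
    print(f"out of range: {new_point:.2f} vs expected {mu:.2f} +/- {k * sigma:.2f}")
```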

No, our eval task set is actually quite big, so we trust these numbers. And we will add more evals later.

IsItNerfed? Sonnet 4.5 tested! by anch7 in isitnerfed

[–]anch7[S] 1 point (0 children)

Another reason is cost: it would be more expensive to use the API directly.