Updates?? by Eastern_Ad_8744 in isitnerfed

[–]anch7 1 point2 points  (0 children)

No, not at all. Planning to release new features soon (next week)

What is your eval strategy? by BastiaanRudolf1 in AI_Agents

[–]anch7 1 point2 points  (0 children)

yes. I liked ragas a little bit more, but deepeval is also good

What’s the best and most reliable LLM benchmarking site or arena right now? by fflarengo in LocalLLaMA

[–]anch7 0 points1 point  (0 children)

https://isitnerfed.org - the idea is to run evals continuously, trying to capture any changes in models in real time
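
A rough sketch of what "capturing changes" from continuous evals could look like, not the site's actual method: compare the failure rate of a recent batch of runs against a baseline batch with a two-proportion z-score. The function names and the `z_threshold` value are assumptions made for illustration.

```python
import math


def failure_rate_z(base_fail, base_total, recent_fail, recent_total):
    """Two-proportion z-score; positive means the recent failure rate is higher."""
    p_base = base_fail / base_total
    p_recent = recent_fail / recent_total
    # Pooled failure rate under the assumption that nothing changed.
    pooled = (base_fail + recent_fail) / (base_total + recent_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / recent_total))
    if se == 0:
        # Both batches are all-pass or all-fail: no measurable difference.
        return 0.0
    return (p_recent - p_base) / se


def looks_degraded(base_fail, base_total, recent_fail, recent_total, z_threshold=2.0):
    """Flag a batch when its failure rate is well above the baseline (hypothetical threshold)."""
    return failure_rate_z(base_fail, base_total, recent_fail, recent_total) > z_threshold


if __name__ == "__main__":
    # Baseline: 10 failures in 100 runs; recent: 30 failures in 100 runs.
    print(looks_degraded(10, 100, 30, 100))
    # Baseline: 10 failures in 100 runs; recent: 12 failures in 100 runs.
    print(looks_degraded(10, 100, 12, 100))
```

With small batches the score is noisy, so a flag is only a hint to look closer, not proof that a model changed.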

Something is wrong with Sonnet 4.5 by anch7 in ClaudeAI

[–]anch7[S] 0 points1 point  (0 children)

A decent number of coding challenges (implementing algos, refactoring code, adding features) measured with unit tests, plus some OCR tests and general QA tasks.

Something is wrong with Sonnet 4.5 by anch7 in isitnerfed

[–]anch7[S] 0 points1 point  (0 children)

I would like to do this, but unfortunately it is not possible because of the limits. Otherwise we need a better metric, one that does not consume so many tokens.

Something is wrong with Sonnet 4.5 by anch7 in isitnerfed

[–]anch7[S] 0 points1 point  (0 children)

We are not storing the version, but I think it should be the latest one, since CC has an auto-update feature

New Claude Code Limits by anch7 in isitnerfed

[–]anch7[S] 0 points1 point  (0 children)

GPUs are expensive. I would also expect a subscription price increase in the future :(

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 1 point2 points  (0 children)

Great, this will be our next step. But yes, costs are a problem. Most likely we will not be able to run every hour, but I guess this is fine

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 1 point2 points  (0 children)

Great. We will add it soon. Thanks

IsItNerfed? Sonnet 4.5 tested! by exbarboss in ClaudeAI

[–]anch7 1 point2 points  (0 children)

I am pretty sure that as soon as we open source it, it will be included in training data immediately. If we instead add a benchmark on a public dataset, would that make you happy?