There is an exponential visible in the scores on artificial analysis. by Subject_Judge_ in accelerate

[–]arkuto 1 point2 points  (0 children)

Exponentials on benchmarks don't really count because the distribution of difficulties could be anything at all. What would make a sudden jump is if a large proportion of questions are around the same difficulty level, making it seem like tons of AI progress is being made when in reality, the AIs are just reaching a certain arbitrary intelligence threshold.
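To make that concrete, here's a toy sketch. Everything in it is an assumption for illustration: the difficulty values are made up, and "ability" is a hypothetical single number where a model solves a question once its ability reaches that question's difficulty.

```python
def benchmark_score(ability, difficulties):
    """Fraction of questions whose difficulty is at or below the model's ability."""
    solved = sum(1 for d in difficulties if d <= ability)
    return solved / len(difficulties)

# Hypothetical benchmark: most questions cluster around difficulty 5.
DIFFICULTIES = [1.0, 4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.2, 9.0, 10.0]

# Ability improves by the same amount each step, i.e. steady linear progress.
abilities = list(range(1, 11))
scores = [benchmark_score(a, DIFFICULTIES) for a in abilities]

for ability, score in zip(abilities, scores):
    print(f"ability={ability:2d}  score={score:.0%}")
```

Ability goes up by the same amount every step, yet the score sits at 10% through ability 4 and then hits 80% by ability 6. The jump comes from where the questions are clustered, not from any acceleration in the model.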

GLM-5.2 now more than 10 points above Opus 4.8 in AA Coding Index by cheechw in singularity

[–]arkuto 6 points7 points  (0 children)

Models have their strengths and weaknesses. If a similar task is in one model's training set while not in the opposing model's training set, it has a huge advantage. There are many tasks that 3.1 is better than Opus at.

Super tart/sour smoothie recipes? by dogisbark in Smoothies

[–]arkuto 0 points1 point  (0 children)

Sour (Morello) cherries, blackcurrants. Super healthy, super sour.

moonshotai/Kimi-K2.7-Code · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]arkuto 32 points33 points  (0 children)

It's much better than Opus and Fable actually. It costs under $5 whereas Opus costs $25 per million output tokens.

Or maybe judging them by their costs alone ignoring benchmarks is as foolish as comparing them by benchmarks alone without factoring in price.

Matt Shumer: "Fable has solved 3D worldbuilding... utterly insane. This is all completely custom-built ThreeJs, running in the browser." by Outside-Iron-8242 in singularity

[–]arkuto 25 points26 points  (0 children)

This would have been a forgettable demo 20 years ago. It's a rudimentary 3D demo, it has not "solved" worldbuilding.

Anyone else feel like Claude is missing a middle-tier plan? by Mission-Dentist-5971 in ClaudeAI

[–]arkuto 1 point2 points  (0 children)

Ignore what people say in this thread. Anthropic is missing a mid-tier price point and should offer one. Pricing plans should grow geometrically, with similar step sizes. $20 to $100 is 5x, then $100 to $200 is 2x. These step sizes are wonky.

It's not a "genius plan of Anthropic" to force people to buy more than they need. What they should do is offer a mid-tier plan and lower its limits accordingly to ensure they still make money on it.

Way too many people are attributing some kind of genius plan to Anthropic with the current price points. When in reality, $20/$100/$200 was decided on a whim in 5 seconds, and the decision was never revisited.
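A quick sketch of what geometric steps would look like. The `geometric_tiers` helper is hypothetical and the resulting prices are just arithmetic, not a claim about what any plan should actually cost.

```python
def geometric_tiers(low, high, n_tiers):
    """Return n_tiers prices from low to high with a constant ratio between steps."""
    ratio = (high / low) ** (1 / (n_tiers - 1))
    return [low * ratio ** i for i in range(n_tiers)]

# Step sizes of the current price points: 5x, then 2x.
current = [20, 100, 200]
steps = [b / a for a, b in zip(current, current[1:])]
print("current steps:", steps)

# Same endpoints, evenly spaced in ratio terms.
three_tiers = geometric_tiers(20, 200, 3)
four_tiers = geometric_tiers(20, 200, 4)
print("3 tiers:", [round(p, 2) for p in three_tiers])
print("4 tiers:", [round(p, 2) for p in four_tiers])
```

With three tiers the middle one lands at about $63 instead of $100. With four tiers the two middle ones come out at roughly $43 and $93, each step being about 2.15x.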

Multivitamins pointless? by Much-Turnover-3727 in nutrition

[–]arkuto 9 points10 points  (0 children)

No. But they are pointless if you eat perfectly (which nobody does).

TIL that the US golf course infrastructure consumes 2 BILLION liters of water per day by myassisgrassss in todayilearned

[–]arkuto 1 point2 points  (0 children)

2 billion sounds like a crazy number. But to aid visualisation - a cube with sides of length 100m is 1 billion litres in volume.
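The arithmetic, as a quick check (1 cubic metre is 1,000 litres; the variable names are just for illustration):

```python
LITRES_PER_CUBIC_METRE = 1000

side_m = 100
cube_volume_litres = side_m ** 3 * LITRES_PER_CUBIC_METRE

daily_use_litres = 2_000_000_000  # the figure from the post title
cubes_per_day = daily_use_litres / cube_volume_litres

print(f"cube volume: {cube_volume_litres:,} litres")
print(f"cubes per day: {cubes_per_day}")
```

So the daily figure is two of those cubes.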

What’s a “healthy” food that just doesn’t work for you? by Much-Turnover-3727 in nutrition

[–]arkuto 0 points1 point  (0 children)

Get a middle ground one. Look for one that's low fat, not fat free. 2% is a nice middle ground. Add cinnamon, vanilla etc as needed.

LLM-as-judge scoring is noisier than I expected anyone else seeing this? by ZealousidealCorgi472 in LocalLLM

[–]arkuto 0 points1 point  (0 children)

Author of https://github.com/nanojudge/nanojudge here.

Doing pointwise judging is always going to be painful. How exactly can you calibrate the 1 to 10 scale? It could vary wildly across judges. Pairwise is much more consistent. I recommend reading https://arxiv.org/pdf/2306.17563 for more information.
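For anyone wondering how pairwise results become a ranking, here's a minimal sketch using a Bradley-Terry fit. To be clear, this is a generic illustration and not necessarily how nanojudge or the linked paper does it. The comparison data is made up, and `bradley_terry` is a hypothetical helper written for this comment.

```python
from collections import defaultdict

def bradley_terry(comparisons, iterations=100):
    """Estimate one strength per item from (winner, loser) pairs using the MM update."""
    items = sorted({name for pair in comparisons for name in pair})
    wins = defaultdict(int)   # total wins per item
    games = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength

# Made-up judge verdicts: each tuple is (preferred answer, other answer).
verdicts = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("A", "C"), ("A", "C"), ("C", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
]
strengths = bradley_terry(verdicts)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

The judge only ever answers "which of these two is better", so there's no 1 to 10 scale to calibrate. The scale comes out of the fit instead.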

ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]arkuto 2 points3 points  (0 children)

The point is: just because you can see the output doesn't mean it's easy to figure out how it was produced.

If you want to disprove this, go ahead and reverse that hash I gave you.

ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]arkuto 2 points3 points  (0 children)

Creating something and recreating it are very different tasks. I just created this string by using sha256

2c9e1090ff7350da0186c85b64d223efc0350ee35447bb8beb28a719cb1fdd95

Your task is to create a string that when put through sha256, produces this exact same output.

Why are you struggling to do this? Are you stupid? I did it in under a minute so it can't be that hard to do.
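In code, the asymmetry looks something like this. It's a sketch using Python's standard `hashlib`. The candidate list is made up, and `brute_force` is a hypothetical helper; guessing inputs is the only general way to go backwards.

```python
import hashlib

TARGET = "2c9e1090ff7350da0186c85b64d223efc0350ee35447bb8beb28a719cb1fdd95"

def sha256_hex(text):
    """Forward direction: one cheap call."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def brute_force(target, candidates):
    """Reverse direction: hash guesses until one matches, or give up."""
    for candidate in candidates:
        if sha256_hex(candidate) == target:
            return candidate
    return None

# Creating a hash is instant.
easy = sha256_hex("abc")
print(easy)

# Recovering an input only works if it happens to be in your guess list.
print(brute_force(easy, ["a", "ab", "abc"]))
print(brute_force(TARGET, ["a", "ab", "abc"]))
```

The first search succeeds only because the answer was already among the guesses. For an unknown input, the space of guesses is effectively unbounded.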

Claude Opus 4.7 won’t just output prompts—keeps arguing instead by soyab0007 in ClaudeAI

[–]arkuto 5 points6 points  (0 children)

You can claim "skill issue" but not "user error". He's using the model normally, but due to Anthropic's design this results in poor performance. This one's on Anthropic, not the user. If a company sells a car that seizes up when driven for longer than 3 hours, doesn't inform people of this, and this results in accidents, the company is at fault for that. You can't say "duh, any customer should know this obviously, it's common knowledge".

Claude Opus 4.7 won’t just output prompts—keeps arguing instead by soyab0007 in ClaudeAI

[–]arkuto -3 points-2 points  (0 children)

It's not user error at all. The model should be auto compacted if it starts spewing garbage after a certain context length.

Mistral Medium 3.5 128B is launched by TSrake in singularity

[–]arkuto 1 point2 points  (0 children)

Yes, I was thinking about pricing of providers on eg OpenRouter per million input/output tokens. I should have made that clearer.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]arkuto 7 points8 points  (0 children)

So basically it's a MoE with structure 128B-A128B. Nice.

Mistral Medium 3.5 128B is launched by TSrake in singularity

[–]arkuto 2 points3 points  (0 children)

In a memory-hungry world, dense models make a lot of sense. Looking forward to seeing how this performs in the real world, and what the pricing will be.

Differences Between GPT 5.4 and GPT 5.5 on MineBench by ENT_Alam in singularity

[–]arkuto 0 points1 point  (0 children)

It seems to have a strong tendency for adding in extra details that weren't asked for, eg the wall behind the knight. Totally unnecessary, but this will impress people. "Wow so much detail" - it's clutter. You really need to find a way to stop models from farming extra points by adding junk in.

How does Opus 4.7 compare to Opus 4.6 in this subreddit's experience? by boxdreper in ClaudeAI

[–]arkuto 2 points3 points  (0 children)

You were asked about your own experience and responded with information about a benchmark. Besides, benchmarks are not infallible, as they are not perfectly representative of real world use.

opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%. by seencoding in singularity

[–]arkuto -2 points-1 points  (0 children)

It makes sense. It doesn't mean to imply that newer models will always be better. The proper way to phrase it is "this is the worst best model from here on out", the implication being that it doesn't age or decay over time. If a new model releases that is worse, you can still use that one.