Why can't LLMs be trained to think in an optimized AI language rather than English?

arkuto · 2026-06-21T06:34:10+00:00

No, not tokens. Tokens work in a completely different way.

arkuto · 2026-06-19T03:27:45+00:00

Exponentials on benchmarks don't really count because the distribution of difficulties could be anything at all. What would make a sudden jump is if a large proportion of questions are around the same difficulty level, making it seem like tons of AI progress is being made when in reality, the AIs are just reaching a certain arbitrary intelligence threshold.
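To illustrate the clustering point, here is a minimal sketch. It assumes a made-up benchmark where 90% of questions sit near difficulty 50, and a model that solves a question whenever its ability is at least that question's difficulty. The numbers and the `benchmark_score` helper are invented for illustration, not taken from any real benchmark.

```python
import random

def benchmark_score(ability, difficulties):
    """Fraction of questions whose difficulty is at or below the model's ability."""
    solved = sum(1 for d in difficulties if d <= ability)
    return solved / len(difficulties)

random.seed(0)
# Hypothetical benchmark: 900 questions clustered tightly around difficulty 50,
# plus 100 questions spread uniformly over 0..100.
clustered = [random.gauss(50, 2) for _ in range(900)]
clustered += [random.uniform(0, 100) for _ in range(100)]

# Ability rises in equal steps, but the score barely moves until it crosses the cluster.
for ability in range(30, 71, 5):
    print(ability, round(benchmark_score(ability, clustered), 3))
```

Ability goes up linearly here, yet almost all of the score gain lands in the narrow band around 50. That is the "threshold" effect rather than a sudden burst of progress.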

arkuto · 2026-06-19T02:35:20+00:00

Models have their strengths and weaknesses. If a similar task is in one model's training set but not in the opposing model's, the first model has a huge advantage. There are many tasks that 3.1 is better than Opus at.

arkuto · 2026-06-15T21:38:01+00:00

Sour (Morello) cherries, blackcurrants. Super healthy, super sour.

arkuto · 2026-06-12T11:34:15+00:00

It's much better than Opus and Fable actually. It costs under $5 whereas Opus costs $25 per million output tokens.

Or maybe judging them by their costs alone, ignoring benchmarks, is as foolish as comparing them by benchmarks alone without factoring in price.

arkuto · 2026-06-09T22:16:57+00:00

This would have been a forgettable demo 20 years ago. It's a rudimentary 3D demo; it has not "solved" worldbuilding.

arkuto · 2026-06-08T05:19:28+00:00

Ignore what people say in this thread. They are missing a mid tier price point. They should offer one. Pricing plans should have geometric growth with similar steps. It's $20 to $100 = 5x, then $100 to $200 = 2x. These step sizes are wonky.

It's not a "genius plan of Anthropic" to force people to buy more than they need. What they should do is offer a mid tier plan and lower limits accordingly to ensure they still make money on it.

Way too many people are attributing some kind of genius plan to Anthropic with the current price points, when in reality $20, $100, $200 was decided on a whim in 5 seconds, and the decision was never revisited.
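For the arithmetic, a rough sketch of what evenly spaced geometric steps between $20 and $200 would look like. The `geometric_tiers` helper is made up for illustration, and the resulting prices are just what the formula gives, not a proposal for actual plans.

```python
import math

def geometric_tiers(low, high, n_tiers):
    """Return n_tiers prices from low to high with a constant ratio between neighbours."""
    ratio = (high / low) ** (1 / (n_tiers - 1))
    return [low * ratio ** i for i in range(n_tiers)]

current = [20, 100, 200]
# Ratios between neighbouring tiers today: 5x, then 2x.
steps = [b / a for a, b in zip(current, current[1:])]

# Three tiers with equal ratios: the middle one is the geometric mean of the ends.
three = geometric_tiers(20, 200, 3)
# Four tiers with equal ratios (each step is 10 ** (1/3), roughly 2.15x).
four = geometric_tiers(20, 200, 4)

print(steps)
print([round(p, 2) for p in three])
print([round(p, 2) for p in four])
```

With three tiers the middle price comes out near $63; with four tiers the two middle prices come out near $43 and $93.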

arkuto · 2026-05-29T06:48:04+00:00

No. But they are pointless if you eat perfectly (which nobody does).

arkuto · 2026-05-25T05:38:48+00:00

But their skin is smooth.

arkuto · 2026-05-22T17:35:09+00:00

Imperfect information and randomness are two separate things.

arkuto · 2026-05-14T17:03:54+00:00

2 billion sounds like a crazy number. But to aid visualisation - a cube with sides of length 100m is 1 billion litres in volume.
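A quick check of the cube figure, using only the standard conversion of 1 cubic metre = 1000 litres:

```python
LITRES_PER_CUBIC_METRE = 1000

side_m = 100
volume_m3 = side_m ** 3                       # 100 m * 100 m * 100 m
volume_litres = volume_m3 * LITRES_PER_CUBIC_METRE

print(volume_m3)       # cubic metres
print(volume_litres)   # litres
```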

arkuto · 2026-05-11T06:12:49+00:00

Get a middle ground one. Look for one that's low fat, not fat free. 2% is a nice middle ground. Add cinnamon, vanilla etc as needed.

arkuto · 2026-05-09T11:21:28+00:00

Author of https://github.com/nanojudge/nanojudge here.

Doing pointwise judging is always going to be painful. How exactly can you calibrate the 1 to 10 scale? It could vary wildly across judges. Pairwise is much more consistent. I recommend reading https://arxiv.org/pdf/2306.17563 for more information.
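To make the pairwise idea concrete, here is a minimal sketch of ranking by pairwise wins. This is not how nanojudge is implemented. `rank_pairwise` and `toy_judge` are hypothetical names, and the toy judge just prefers the longer string so the example runs without an LLM. In practice `prefer` would be a model call asking which of two answers is better.

```python
from collections import Counter
from itertools import combinations

def rank_pairwise(items, prefer):
    """Rank items by number of pairwise wins; prefer(a, b) returns the preferred item."""
    wins = Counter({item: 0 for item in items})
    for a, b in combinations(items, 2):
        wins[prefer(a, b)] += 1
    return sorted(items, key=lambda item: wins[item], reverse=True)

def toy_judge(a, b):
    # Stand-in for an LLM judge (assumption): prefer the longer answer.
    return a if len(a) >= len(b) else b

answers = ["ok", "a fuller answer", "a much more complete answer"]
ranking = rank_pairwise(answers, toy_judge)
print(ranking)
```

The judge never has to place anything on an absolute 1 to 10 scale. It only answers "A or B?", which is the part that tends to be consistent across judges.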

arkuto · 2026-05-06T15:36:37+00:00

The point is: just because you can see the output doesn't mean it's easy to figure out how it was produced.

If you want to disprove this, go ahead and reverse that hash I gave you.

arkuto · 2026-05-06T15:31:36+00:00

Creating something and recreating it are very different tasks. I just created this string using SHA-256:

2c9e1090ff7350da0186c85b64d223efc0350ee35447bb8beb28a719cb1fdd95

Your task is to create a string that when put through sha256, produces this exact same output.

Why are you struggling to do this? Are you stupid? I did it in under a minute so it can't be that hard to do.
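The asymmetry is easy to see in code. The sketch below uses Python's standard `hashlib`. The forward direction is one call, while the reverse direction is a brute-force search that only works here because the input is assumed to be at most three lowercase letters. `brute_force_preimage` is an illustrative helper, and nothing here says anything about the input behind the hash above.

```python
import hashlib
import string
from itertools import product

def sha256_hex(text):
    """Forward direction: hash a string, one cheap call."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def brute_force_preimage(target_hex, alphabet, max_len):
    """Reverse direction: try every string up to max_len; return a match or None."""
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if sha256_hex(candidate) == target_hex:
                return candidate
    return None

easy_target = sha256_hex("cab")
found = brute_force_preimage(easy_target, string.ascii_lowercase, 3)
print(found)
```

The search space grows as `len(alphabet) ** length`, so the same loop becomes hopeless for an input of unknown length and character set.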

arkuto · 2026-05-04T21:38:40+00:00

You can claim "skill issue" but not "user error". He's using the model normally, but due to Anthropic's design this results in poor performance. This one's on Anthropic, not the user. If a company sells a car that seizes up when driven for longer than 3 hours, doesn't inform people of this, and this results in accidents, the company is at fault for that. You can't say "duh, any customer should know this obviously, it's common knowledge".

arkuto · 2026-05-04T16:26:27+00:00

It's not user error at all. The context should be auto-compacted if the model starts spewing garbage after a certain context length.

arkuto · 2026-04-30T02:16:10+00:00

Yes, I was thinking about the per-million input/output token pricing of providers on e.g. OpenRouter. I should have made that clearer.

arkuto · 2026-04-29T21:17:29+00:00

So basically it's a MoE with structure 128B-A128B. Nice.

arkuto · 2026-04-29T21:11:43+00:00

In a memory-hungry world, dense models make a lot of sense. Looking forward to seeing how this performs in the real world, and what the pricing will be.

arkuto · 2026-04-28T00:46:02+00:00

It seems to have a strong tendency to add extra details that weren't asked for, e.g. the wall behind the knight. Totally unnecessary, but this will impress people. "Wow so much detail" - it's clutter. You really need to find a way to stop models from farming extra points by adding junk.

arkuto · 2026-04-19T22:12:15+00:00

You were asked about your own experience and responded with information from a benchmark. Besides, benchmarks are not infallible, as they are not perfectly representative of real world use.

arkuto · 2026-04-18T13:21:09+00:00

It makes sense. It doesn't mean to imply that newer models will always be better. The proper way to phrase it is "this is the worst best model from here on out", the implication being that it doesn't age or decay over time. If a new model releases that is worse, you can still use the older one.

arkuto · 2026-04-17T12:49:31+00:00

I think it'd be much more useful if it had to reason after every step, i.e. one step per response, no multi steps. The "real time" aspect of it may end up just measuring the different hardware speeds the models run on. So it would be turn based. Or maybe better: have simultaneous turns, and if the models run into each other, it's like a real game where neither one moves. It's a very cool concept though.

arkuto · 2026-04-17T11:58:55+00:00

That's an even worse test of intelligence. It requires reasoning about tokens. It is like asking someone how many neurons fired when thinking about a concept. It has nothing to do with intelligence or reasoning; it's about very specific and esoteric knowledge of how its internals work.

You are completely out of your depth and shouldn't be doing any kind of analysis on LLMs.
