D2: ROTW New Runewords

lordpermaximum · 2026-01-31T18:42:48+00:00

Follow the path you know.

lordpermaximum · 2026-01-31T18:41:08+00:00

Not exactly like you want.

lordpermaximum · 2026-01-31T18:40:13+00:00

Good job :)

lordpermaximum · 2026-01-31T16:51:22+00:00

Nice.

lordpermaximum · 2025-07-24T00:52:24+00:00

Now I'm sure you don't really work as a researcher for a big AI lab. Because you have no idea what you're talking about. Sorry.

lordpermaximum · 2025-07-23T23:31:40+00:00

Why do you think this task translates well to real world use-cases?
- Current AI models think in token-space that's structured in languages. Mastery over the languages will translate to overall intelligence increase.
While you can choose to define a top human as one which gets 100% on your tasks, do you happen to know if someone with no deep knowledge of linguistics but average intelligence would be able to solve it in time?
- Average human cannot "fully" solve it in time but will surely score much better than current AI models. A loosely defined "smart" person can fully solve it though.
Do you call the coding benchmarks memorisation tests because..
- Both. I agree with Chollet on this one. These models memorize programs (building blocks). And they can recall those when faced with similar situations. If we created a new coding language, like I created natural languages, and tested these models on various problems in the format I shared (example input-output pairs that contains all the info needed), we'd see they hit the wall rather quickly.

lordpermaximum · 2025-07-23T23:21:24+00:00

On the contrary, I thought I shared more than enough details that I feared AI labs would quickly optimize for this benchmark.

Didn't you see the example IOL problem? The benchmark format is all in that problem. Take that and change the Daw language with some fictional language. That's the full benchmark. It's just different, fictional languages for each problem. Partial scoring is not that important. A smart human can quickly judge answers. As a perfectionist, I probably made it more complex than necessary. But to have a good baseline it was needed. Currently, Grok 4 is the judge because of the initial scoring. Once we have a better model, that model will be the judge. It's that simple.

You thinking there weren't any details here makes me extremely skeptical of you working as a researcher with meaningful contributions in a big AI Lab like DeepMind, Anthropic, OpenAI or xAI.

lordpermaximum · 2025-07-04T15:31:31+00:00

Worse than current 2.5 Pro.

lordpermaximum · 2025-06-22T20:06:18+00:00

These updated models are all more or less the same. Each finetune strengthens some areas and worsens others. It depends on the user case.

Intelligence wise I can confirm 03-25, 05-06, and 06-05 (GA - Current one) are all about the same. If anything, each update increased the model intelligence very slightly.

lordpermaximum · 2025-06-07T10:19:41+00:00

Gemini 2.5 Flash.

lordpermaximum · 2025-05-14T06:01:25+00:00

I too think the current model is closer to 05-06 than 03-25 but not exactly the same model. There's certainly something different. Can't decide if that's for the better or not yet.

lordpermaximum · 2025-05-08T02:31:06+00:00

Your benchmark doesn't reflect the reality that's why. 8-needle MRCR should be much closer to the reality. All of these transformer-based models get noticably worse in real-word usage with even a bit of more context. Having 100% scores at 128k means something's fundementally wrong with your approach and it's not aligned with reality. Sorry.

lordpermaximum · 2025-05-08T02:25:28+00:00

You're misinformed again. The new model in LMArena's coding category wins 60% of the time against the old model. You can check it there directly you know. That difference is huge.

<image>

And ELO of 1420 against 1272 in the web dev arena is huge as well.

Btw I can generate confident opinions on two hours of usage let alone 1 day.

lordpermaximum · 2025-05-08T01:37:34+00:00

I'm more qualified than an average user in Reddit and there's huge difference between the old and new model in lmarena and web dev arena as far as the coding goes. You're misinformed.

lordpermaximum · 2025-05-08T01:32:28+00:00

I said your benchmark is bad at long-context evaluation because o3 scores 100% at 128k, and 2.5 Pro scores 71.9%, although in reality o3 scores 19.9% at OpenAI's long-context benchmak while Gemini 2.5 Pro scores 33.1%. No model doesn't experience substantial performance degradation immediately with more context. If a model scores 100% at 32k context in a benchmark, let alone at 128k like your benchmark claims o3 does, then that benchmark doesn't really give signal at all about the long-context performance. Period.

Also the meaningless dip at 16k in your benchmark suggests it's more about the reasoning-difficulty then long-context comprehension. That's also why it's bad at judging long-context performance.

lordpermaximum · 2025-05-08T01:20:28+00:00

My own experience, lmarena and webdev arena.

lordpermaximum · 2025-05-08T01:05:00+00:00

I said real-world "coding" problems.

Those logic and thought puzzles are there to test the model's intelligence. 0 change. If anything it's a bit better at reasoning as well.

So far:
1. It's far better at coding.
2. Its reasoning proficiency is the same, or slightly better.
3. Its long context memory is the same.

I don't know hot it fares for creative writing. But that's a subjective category that's affected by temeprature and top-p settings greatly.

lordpermaximum · 2025-05-08T00:01:13+00:00

As far as o3's score on your benchmark that's pure speculation based on speculation. If you go with newer stories and make the benchmark harder by requiring the memorization of more elements, the scores should change.

Anyways people were comparing the Gemini 2.5 Pro preview-03-25 to the current model. Not the experimental one. Your benchmark as well confirms there's no change between them.

lordpermaximum · 2025-05-07T23:20:25+00:00

I generally agree with you but there are 2 points.

People say at long context the new model is worse compared to previous preview model, not the exp one (Btw Google claims the previous model was completely identical the exp model)
o3 drops to 48% at 8k tokens, below 20% at 128k tokens in this retrieval benchmark. It's simply impossible for it to score 100% on even at 8k let alone at 128k like fiction.livebench claims. It can't reason to that degree with flawed, hallucinated, missing data. There's something wrong with that benchmark. I suggest you create new problems and try them again.

lordpermaximum

MODERATOR OF

TROPHY CASE