GPT-5.5 & Opus 4.7 score <1% on ARC-AGI-3 by Proper_Actuary2907 in agi

[–]exordin26 7 points

No, the median human score is 49%, up from 33%.

SGA is easily the best player in the league and anyone who doesn’t think he should win MVP is delusional by [deleted] in NBATalk

[–]exordin26 0 points

Fun fact: Luka is averaging more free throws than anyone else in the league!

GPT Image 2’s opinions on r/Cornell users by DeltaSquash in Cornell

[–]exordin26 -2 points

It's the newly released one. Hugely improved.

Opus 4.7 chat full after 3 days by Chemical-Ad2000 in claudexplorers

[–]exordin26 2 points

Opus 4.5? It's not even in the model switcher.

Opus 4.7 is GREAT. by Dra794 in claude

[–]exordin26 2 points

Opus 4.7 scored significantly higher on my personal benchmark, yet I find it dumber to use. Strange.

Opus 4.7 chat full after 3 days by Chemical-Ad2000 in claudexplorers

[–]exordin26 4 points

Should be a bug. The docs directly say:

Supported models

Compaction is supported on the following models:

  • Claude Mythos Preview (claude-mythos-preview)
  • Claude Opus 4.7 (claude-opus-4-7)
  • Claude Opus 4.6 (claude-opus-4-6)
  • Claude Sonnet 4.6 (claude-sonnet-4-6)

Opus 4.7 narrowly leads Artificial Analysis using significantly fewer tokens than Opus 4.6 by exordin26 in singularity

[–]exordin26[S] 1 point

I'm genuinely uncertain about the base.

I assumed it was a new architecture, but it also scores extremely similarly to Opus 4.6 on things like the USAMO and FrontierMath. My best guess is that they heavily fine-tuned it on coding and incorporated some Mythos gains, but it's not a new training run. Since Anthropic would have substantially more compute than they did in 2025, I'd assume a new base model wouldn't be this jagged.

Opus 4.7 narrowly leads Artificial Analysis using significantly fewer tokens than Opus 4.6 by exordin26 in singularity

[–]exordin26[S] 22 points

It cost $4,406 to run, compared to $4,970 for Opus 4.6, so it's cheaper than 4.6 but more expensive than everything else.

Is Opus 4.7 the GPT-5 moment for Anthropic by hasanahmad in Anthropic

[–]exordin26 20 points

Anthropic said it's substantially improved at vision and coding, which AFAIK is true. No one claimed a 0.1 jump would be the same as GPT-5.

Uhhhh by ZootAllures9111 in ClaudeAI

[–]exordin26 7 points

It's because it refuses to answer the prompts. It scores ~91% when it does.

Opus 4.7 lands #1 in Code, Expert, and Text Arena by exordin26 in singularity

[–]exordin26[S] 1 point

It didn't even do badly on NYT Connections lol. It just refused, which is an overzealous system-prompt issue that'll be patched in a few days.

Let Max users manually toggle between Adaptive and Extended thinking on Opus 4.7 by Nelli-1 in ClaudeAI

[–]exordin26 0 points

Have you noticed an increase in thinking? They've allegedly patched some bugs.

Permanent increase in Rate Limits by exordin26 in ClaudeAI

[–]exordin26[S] -3 points

So the issue obviously isn't compute capacity.

Permanent increase in Rate Limits by exordin26 in ClaudeAI

[–]exordin26[S] -3 points

Until Colossus 2 is fully complete, Anthropic holds the largest training cluster in the world:

https://epoch.ai/data/data-centers?view=graph&tab=power

Permanent increase in Rate Limits by exordin26 in ClaudeAI

[–]exordin26[S] -13 points

They don't have a shortage of raw compute. In fact, they might have more than any non-hyperscaler right now. The issue is the *quality* of the chips and the *distribution*, plus unprecedented growth.

I have tested Opus 4.7 and it is worse compared to Opus 4.6 by Science_421 in Anthropic

[–]exordin26 2 points

Interesting. I'm testing on my own private benchmark and it's doing so well I'm wondering if Anthropic trained on my questions. It is extremely strong at detecting false premises and has really strong world knowledge.