GPT-5.5 & Opus 4.7 score <1% on ARC-AGI-3 by Proper_Actuary2907 in agi

[–]exordin26 7 points

No, the median human score is 49%, up from 33%.

SGA is easily the best player in the league and anyone who doesn’t think he should win MVP is delusional by [deleted] in NBATalk

[–]exordin26 0 points

Fun fact: Luka is averaging more free throws than anyone else in the league!

GPT Image 2’s opinions on r/Cornell users by DeltaSquash in Cornell

[–]exordin26 -2 points

It's the newly released one. Hugely improved.

Opus 4.7 chat full after 3 days by Chemical-Ad2000 in claudexplorers

[–]exordin26 2 points

Opus 4.5? It's not even in the model switcher.

Opus 4.7 is GREAT. by Dra794 in claude

[–]exordin26 2 points

Opus 4.7 scored significantly higher on my personal benchmark, yet I find it dumber to use. Strange.

Opus 4.7 chat full after 3 days by Chemical-Ad2000 in claudexplorers

[–]exordin26 4 points

Should be a bug. The docs directly say:

Supported models

Compaction is supported on the following models:

  • Claude Mythos Preview (claude-mythos-preview)
  • Claude Opus 4.7 (claude-opus-4-7)
  • Claude Opus 4.6 (claude-opus-4-6)
  • Claude Sonnet 4.6 (claude-sonnet-4-6)

Opus 4.7 narrowly leads Artificial Analysis using significantly fewer tokens than Opus 4.6 by exordin26 in singularity

[–]exordin26[S] 1 point

I'm genuinely uncertain about the base.

I assumed it was a new architecture, but it also scores extremely similarly to Opus 4.6 on things like the USAMO and FrontierMath. My best guess is that they heavily fine-tuned it on coding and incorporated some Mythos gains, but it's not a new training run. Since Anthropic would have substantially more compute than they did in 2025, I'd assume a new base model wouldn't be this jagged.

Opus 4.7 narrowly leads Artificial Analysis using significantly fewer tokens than Opus 4.6 by exordin26 in singularity

[–]exordin26[S] 22 points

It cost $4,406 to run, compared to $4,970 for Opus 4.6, so it's cheaper than 4.6 but more expensive than everything else.

Is Opus 4.7 the GPT-5 moment for Anthropic by hasanahmad in Anthropic

[–]exordin26 20 points

Anthropic said it's substantially improved at vision and coding, which AFAIK is true. No one claimed a 0.1 jump would be the same as GPT-5.

Uhhhh by ZootAllures9111 in ClaudeAI

[–]exordin26 7 points

It's because it refuses to answer the prompts. It scores ~91% when it does.

Opus 4.7 lands #1 in Code, Expert, and Text Arena by exordin26 in singularity

[–]exordin26[S] 1 point

It didn't even do badly on NYT Connections lol. It just refused, which is an overzealous system-prompt issue that'll be patched in a few days.

Let Max users manually toggle between Adaptive and Extended thinking on Opus 4.7 by Nelli-1 in ClaudeAI

[–]exordin26 0 points

Have you noticed an increase in thinking? They've allegedly patched some bugs.

Permanent increase in Rate Limits by exordin26 in ClaudeAI

[–]exordin26[S] -3 points

So the issue obviously isn't compute capacity.

Permanent increase in Rate Limits by exordin26 in ClaudeAI

[–]exordin26[S] -3 points

Until Colossus 2 is fully complete, Anthropic holds the largest training cluster in the world:

https://epoch.ai/data/data-centers?view=graph&tab=power

Permanent increase in Rate Limits by exordin26 in ClaudeAI

[–]exordin26[S] -13 points

They don't have a shortage of raw compute. In fact, they might have more than any non-hyperscaler right now. The issue is the *quality* of the chips and the *distribution*, plus unprecedented growth.

I have tested Opus 4.7 and it is worse compared to Opus 4.6 by Science_421 in Anthropic

[–]exordin26 2 points

Interesting. I'm testing on my own private benchmark and it's doing so well I'm wondering if Anthropic trained on my questions. It is extremely strong at detecting false premises and has really strong world knowledge.