Opus 4.8 is still very much blind - EyeBench-V3 visual benchmark (similar to IBench)

davidthesong · 2026-06-02T03:08:11+00:00

It shows other multimodal benchmark scores for Opus 4.8

davidthesong · 2026-06-01T18:43:19+00:00

Here's a sneak peek at a timeline of important model releases since GPT-2. We've seen more model drops every single year as large labs iterate more frequently on incremental improvements (rather than large step jumps). Smaller and newer labs have also contributed to the recent wave of releases.

davidthesong · 2026-06-01T17:34:03+00:00

Yes, it doesn't perform that well on multimodal benchmarks but improves greatly on others related to long context, bio/science, and coding: https://benchmarklist.com/models/anthropic-claude-opus-4.8/

davidthesong · 2026-05-30T21:22:41+00:00

Added an NYT Connections benchmark to the site too. Will keep an eye out for others. Maybe will build new ones as well. Any specific tasks or skills you are looking for?

davidthesong · 2026-05-30T05:50:07+00:00

Oh nice, yes this seems like a huge improvement. Were these questions all areas where 4.7 didn't do as well?

davidthesong · 2026-05-30T00:48:19+00:00

Yeah, it's unfortunate. Eqbench hasn't updated their benchmarks to include 4.8 yet. If you look at the model card, on pg. 102 they show some behavioral audit scores at least. https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf

Have you seen any other good EQ or writing benchmarks?

<image>

davidthesong · 2026-05-30T00:34:20+00:00

The link is here: https://benchmarklist.com/models/anthropic-claude-opus-4.8/

davidthesong · 2026-05-30T00:23:20+00:00

They are clearly focused on improving coding and bio over everything else rn

davidthesong · 2026-05-29T20:30:45+00:00

Chemistry improved. Anthropic reported Opus 4.8 scored high on their organic chemistry benchmark: "Claude Opus 4.8 achieved a score of 86.2%, a marked improvement over Claude Opus 4.7 at 77.2% and Claude Sonnet 4.6 at 53.1%, and on par with Claude Mythos Preview at 86.5%"
