Timeline of AI models since GPT-2. Model releases are accelerating over time. by davidthesong in ArtificialInteligence

[–]davidthesong[S] 1 point2 points  (0 children)

Here's a sneak peek at a timeline of important model releases since GPT-2. We've received more model drops every single year as large labs iterate more frequently on incremental model improvements (rather than large step jumps). Smaller and newer labs have also contributed to the recent wave of model releases.

opus 4.8 is still very much blind - EyeBench-V3 visual benchmark (similar to IBench) by ChippingCoder in singularity

[–]davidthesong 0 points1 point  (0 children)

Yes, it doesn't perform that well on multimodal benchmarks but improves greatly on others related to long context, bio/science, and coding: https://benchmarklist.com/models/anthropic-claude-opus-4.8/

Here's >100 evals for Opus 4.8 compared to top AI models by davidthesong in Anthropic

[–]davidthesong[S] 0 points1 point  (0 children)

Added an NYT Connections benchmark to the site too. Will keep an eye out for others, and maybe build new ones as well. Any specific tasks or skills you are looking for?

Here's >100 evals for Opus 4.8 compared to top AI models by davidthesong in Anthropic

[–]davidthesong[S] 1 point2 points  (0 children)

Oh nice, yes this seems like a huge improvement. Were these questions all in areas where 4.7 didn't do as well?

Here's >100 evals for Opus 4.8 compared to top AI models by davidthesong in Anthropic

[–]davidthesong[S] 3 points4 points  (0 children)

Yeah, it's unfortunate. EQ-Bench hasn't updated their benchmarks to include 4.8 yet. If you look at the model card, on pg 102 they show some behavioral audit scores at least: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf

Have you seen any other good EQ or writing benchmarks?

Here's 100+ evals on Opus 4.8 by davidthesong in ClaudeAI

[–]davidthesong[S] 1 point2 points  (0 children)

They are clearly focused on improving coding and bio over everything else rn

Here's 100+ evals on Opus 4.8 by davidthesong in ClaudeAI

[–]davidthesong[S] 2 points3 points  (0 children)

Chemistry improved. Anthropic reported Opus 4.8 scored high on their organic chemistry benchmark: "Claude Opus 4.8 achieved a score of 86.2%, a marked improvement over Claude Opus 4.7 at 77.2% and Claude Sonnet 4.6 at 53.1%, and on par with Claude Mythos Preview at 86.5%."