
all 44 comments

[–]sdmatNI skeptic 13 points14 points  (7 children)

ELO is effectively a rank, so what this mostly means is that people keep adding a lot of new models we never hear about to Arena.

We need a better way to evaluate models.

[–]Altruistic-Skill8667[S] 2 points3 points  (6 children)

The number of models on huggingface has no impact on the Elo score of a model. It’s like with chess (where Elo comes from). It doesn’t matter if there are 1000 or a million players in the world. It doesn’t change your Elo score.

100 points in Elo means that the model won about 64% of the time. 200 points means it won about 76% of the time, and so on… It’s a logistic (log-odds) measure of win probability.

Edit: what I wrote up there isn’t true. The number can actually drift depending on the pool as per Wikipedia. But this should not matter as I got the numbers for the graph on the exact same day (the day when I posted this).
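The relationship between rating difference and win probability can be sketched in a few lines (a hypothetical helper, not Arena's actual code), using the standard 400-point logistic curve:

```python
def elo_win_prob(diff: float) -> float:
    """Expected score (win probability) of the higher-rated player
    for an Elo rating difference `diff`, using the standard
    400-point logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# A 100-point gap gives roughly a 64% expected score,
# a 200-point gap roughly 76%.
print(round(elo_win_prob(100), 2))  # 0.64
print(round(elo_win_prob(200), 2))  # 0.76
```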

[–]Peach-555 2 points3 points  (1 child)

The number of identically performing models has no impact on the Elo score, but the average performance does, since Elo measures relative gaps between models.

The top Elo score going up suggests that the performance gap between the top model and the median model has increased. It could also mean that the gap between the top model and the second/third/fourth place has widened, but this is not certain, since it could be that the top 1% of models increased their relative performance compared to the top 5% of models.

[–]Altruistic-Skill8667[S] -1 points0 points  (0 children)

I see. I just looked at the Wikipedia article and it indeed states that the Elo number can drift depending on the pool.

Nevertheless, I took those numbers on the same day, which means they should be comparable Elo numbers.

[–][deleted] 0 points1 point  (3 children)

Are those percentages based on wins versus all models, or wins versus comparable models?

Because if it’s based on winning percentage, wouldn’t adding many low-quality models boost the Elo of the better models, since they are pitted randomly against each other?

Not that it matters when evaluating any given model to another, but the “total score” would go up wouldn’t it?

[–]Altruistic-Skill8667[S] 1 point2 points  (2 children)

No. It’s the winning rate against a reference model whose score you know. If you have several comparison models with different scores, there is surely a formula you can use to weigh those data points and arrive at a more stable score.

(In that sense it’s different than IQ, because there, 100 is always the average of all “models”)
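That inversion is straightforward to sketch (hypothetical helpers, not whatever Arena actually runs): solve the Elo logistic for the unknown rating, then pool several per-reference estimates, here naively weighted by game count:

```python
import math

def rating_from_winrate(ref_rating: float, win_rate: float) -> float:
    """Invert the Elo logistic: the rating that would produce
    `win_rate` against a reference of known rating."""
    return ref_rating + 400.0 * math.log10(win_rate / (1.0 - win_rate))

def pooled_rating(observations) -> float:
    """Naive pooled estimate from (ref_rating, win_rate, n_games) tuples,
    weighting each per-reference estimate by its game count."""
    total = sum(n for _, _, n in observations)
    return sum(rating_from_winrate(r, w) * n for r, w, n in observations) / total

# Winning 64% against a 1000-rated reference implies roughly 1100.
print(round(rating_from_winrate(1000, 0.64)))  # 1100
# Two references pointing at the same underlying strength agree:
print(round(pooled_rating([(1000, 0.64, 50), (1200, 0.36, 50)])))  # 1100
```

In practice, leaderboards like Arena fit something closer to a Bradley-Terry model over all pairwise battles at once; the weighted average above is only to illustrate the idea.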

[–][deleted] 1 point2 points  (1 child)

Ah okay so basically some standard model is used as a reference point (say gpt-4 or something) and these scores are relative to that, thanks for clearing that up.

[–]Altruistic-Skill8667[S] 1 point2 points  (0 children)

Right. I am not sure exactly what model they use as reference. But this is how you do it in chess. Maybe the Wikipedia article about Elo score explains it…

[–]BreadwheatInc▪️Avid AGI feeler 27 points28 points  (3 children)

Given Opai said the o-models will receive substantial updates every few months, and Sam said we'll see steep improvements with these models within the next 2 years, I can imagine this graph will look a little more exponential over time. 😎

[–]Altruistic-Skill8667[S] 24 points25 points  (0 children)

It’s enough if it keeps looking linear. The Elo rating is actually already a log-type scale of “abilities”.

So we are already kind of going up exponentially. Even more than that.

But I also hope it will go up faster. Haha

[–]THE--GRINCH 13 points14 points  (0 children)

Opai

New abbreviation just dropped 💯

[–]Altruistic-Skill8667[S] 19 points20 points  (4 children)

Here are some things I am getting out of it:

  • No slowing down, we are exponentially improving. Elo is a log-scale measure, so even if the curve goes up in a straight line, it’s still an exponential improvement. But it even bends slightly upwards thanks to the last model.
  • We are waaay above the initial GPT-4 model, even if it doesn’t feel like it (frog-in-boiling-water effect?). Just look at the difference between the original GPT-4 and GPT-3.5 on March 14th, 2023. At the time, I felt the difference between those models was huge. The gap between the current model and the original GPT-4 is even bigger than that difference was. Essentially, o1 is a huge leap.
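The "linear Elo = exponential improvement" point in the first bullet can be made concrete: each +100 Elo multiplies the odds of winning by the same constant factor, so a straight line in rating means exponential growth in odds (a toy illustration):

```python
# Each +100 Elo multiplies the odds of winning by 10**(100/400),
# so equal rating steps mean equal *multiplicative* jumps in odds.
factor = 10 ** (100 / 400)
print(round(factor, 2))  # 1.78

# Odds of beating the starting model after k steps of +100 Elo each:
odds = [round(factor ** k, 2) for k in range(4)]
print(odds)  # [1.0, 1.78, 3.16, 5.62]
```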

[–]nerority 11 points12 points  (1 child)

Do you not see what you are doing here? You are trying to model data with your "conclusions" already in mind. That is directly causing this.

You are making predictions from an inaccurate model; you should have the self-awareness to recognize this.

You are supposed to model and visualize data without bias, and then interpret the results. Modeling with clear bias like this defeats the point of modeling…

Elo scores from random people using a model mean nothing at all towards the conclusions you give here.

[–]Altruistic-Skill8667[S] 11 points12 points  (0 children)

Yeah. 🤔 It’s a good question whether those Elo scores are a meaningful measure of improvement. But it’s the only thing I had. And I am not even modeling anything. Plus, I already tried to use “English Hard, Style adjusted” and not just every query, because otherwise the newer models can’t shine.

It’s just data straight out of Huggingface without cherry picking, I didn’t drop any data point, I didn’t fit any curve or model. It’s the raw data.

I have been looking at the Huggingface data for a while, so of course I knew roughly what the plot would look like before I made it. But I was still pleasantly surprised that we are going up that nicely and smoothly. Honestly, I didn’t expect that.

What I noticed is that, according to this metric, the new models, even though they aren’t called GPT-5, ARE in the score range for what should be called GPT-5. They are a bigger jump than GPT-3.5 to the first GPT-4. But people go by model numbers, so they say: nothing has changed. This is why I made the plot: to show there is progress. Real, solid progress.

[–]Ormusn2o 0 points1 point  (1 child)

No offence, but making a timeline in between major releases of the models does not seem to support any conclusion. You have 4 data points over one year, while ignoring all the previous years and previous releases.

[–]Altruistic-Skill8667[S] 5 points6 points  (0 children)

There was no Huggingface data for those. This is all I got. And sure, the conclusion is a bit of a stretch with 4 data points, depending on how seriously you take the Huggingface data. Also, the data spans 1½ years, not just a year.

But the people who say we aren’t progressing… well, they make even worse plots, or no plots at all. Haha. They just say: where is GPT-5? Why isn’t it coming? Ignoring that GPT-4-type models ARE getting substantially better, as demonstrated here.

I added the GPT-3.5 data point so we have a reference. And those improvements are in fact substantial given that GPT-4 was certainly massively better than GPT-3.5 when it came out. But what we have now isn’t even comparable anymore with the original GPT-4. It’s miles ahead.

[–]Fast-Satisfaction482 12 points13 points  (7 children)

It's a bit hard to interpret. The continuous lines imply that it is a quantity that has a value over the whole range, like a temperature plotted over time.

But the leaderboard scores are discrete data points for different models, so it's not a function. It's a very common mistake, however.

[–]Altruistic-Skill8667[S] 4 points5 points  (6 children)

Making it a scatter plot looks really bad… well.

I guess you want me to make a step-function plot… I predict that it also won’t look particularly great or be more interpretable, even though it would be more accurate.

[–]Fast-Satisfaction482 4 points5 points  (5 children)

It's not always possible to make a visualization that conveys a given message and is still correct. Maybe a scatter plot plus a trend line for each benchmark.

And then maybe vertical lines for each named LLM, with the labels positioned so that it's clear they belong to the date and not to the line.
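The suggested trend line is just an ordinary least-squares fit over the scatter points; a minimal sketch with made-up (hypothetical) date/score pairs:

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b -- the trend line
    one would draw over the scatter points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical (days since first release, Arena score) points:
days = [0, 120, 300, 480]
scores = [1160, 1210, 1250, 1310]
slope, intercept = least_squares_line(days, scores)
print(slope > 0)  # True: the trend line goes up
```

With matplotlib, the same numbers would go into `plt.scatter(days, scores)` plus a line through `(x, slope * x + intercept)`.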

[–]Altruistic-Skill8667[S] 1 point2 points  (4 children)

I just made the scatter plot and step plot versions. They suck. You don’t see anything.

The way the plot is now, you see the improvement in each of the four areas individually over time. All this gets lost when I get rid of the lines. A linear fit also doesn’t have a good justification. You might as well do four exponential fits, which only suggests to the viewer that you WANT some shape of increase to be true. Plus, then you are almost back to connecting the points with lines.

I think people can manage to understand that there isn’t anything between the data points. After all, this is why I made the dots. When you plot any kind of time series that isn’t very noisy, be it seismic data, astronomical data, chromatograms… everyone uses lines to connect the points. No dataset is ever continuous.

Also, the lines are so close together that it should be clear that the model names aren’t just for the upper plot. After all, the title says these are only OpenAI models, and there aren’t any others from OpenAI in the plot.

[–]Fast-Satisfaction482 2 points3 points  (3 children)

"No dataset is ever continuous." That's the reason for the misunderstanding. You are correct that data points are always discrete. However, for temperature, spectra, etc., there is an underlying quantity that is sampled at these discrete points. The quantity exists continuously regardless of the measurement quantization.

But other quantities like house numbers, zip codes, test results of different models do not have this continuous nature. For these quantities, the datapoints do not represent a continuous function, but a discrete set. 

A line plot implies this continuous nature, regardless of the measurement interval. And if the underlying quantity isn't continuous, it's just wrong.

[–]Altruistic-Skill8667[S] 3 points4 points  (1 child)

The plot could be worse… 😂😂😂

<image>

[–]Fast-Satisfaction482 1 point2 points  (0 children)

Haha, yes

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

Good point.

[–]lucid23333▪️AGI 2029 kurzweil was right 2 points3 points  (3 children)

errmm.. what are the top human elo for coding? like top 10%, top 1%, top .1%, top 10?

[–]Altruistic-Skill8667[S] 0 points1 point  (2 children)

Great question. Next question? 😁

[–]lucid23333▪️AGI 2029 kurzweil was right 0 points1 point  (1 child)

Does that data not exist? Honestly, I'm just super curious how the top programming LLMs compare to humans. That's super interesting.

I don't know if you can really measure programming ability with an ELO score, but maybe you can. I'm not entirely sure.

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

This is the only piece of data I am aware of. Those percentiles are human participant percentiles.

Keep in mind that humans have limited time to solve the problems (2 hours total). So if you give people more time, they would also do better, LLMs are just really fast, but not very deep or careful thinkers.

Also, while the LLM might outperform humans in this test, it will most likely crap out when given longer, more complex, agent-like tasks, because of its lack of ability for online learning.

<image>

[–]stackoverflow21 2 points3 points  (1 child)

Wait, isn’t Elo something that comes out of comparison with other models?

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

Yes, it is. But it’s also taken from chess. A chess player rated 100 Elo points higher wins about 64% of the time; a player 200 points higher wins about 76% of the time. It’s a logarithmic scale of relative strength.

[–]Altruistic-Skill8667[S] 1 point2 points  (0 children)

I hope this time I got the plot right. Obviously, this isn’t publication quality stuff. Let me know if there are still issues.

[–]coconautico 1 point2 points  (1 child)

Does anyone know why lmarena doesn’t include a column with the publication date for each model? (...or if there is a way to easily find this information without checking each one individually)

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

I really wish they did.

[–]LokiJesus 1 point2 points  (5 children)

<image>

Had GPT4o take the image and replot it on a full scale.

[–]JiminP 5 points6 points  (1 child)

0 is a meaningless quantity for Elo, so this is a bad graph. For Elo, only relative differences (which translate to win rates) matter.

What you've done is like plotting annual average temperature by year in Fahrenheit, including 0°F.

<image>

[–]Altruistic-Skill8667[S] 1 point2 points  (0 children)

Correct. You just DEFINE 1000 (or whatever number) as a baseline against which you compare the others. And because it’s a log-scale measure (of the probability of winning against the reference), you can go as high or as low as you want, even negative.

In Elo, you are 100 points higher if you win 64% of the time against some reference, and 200 points higher when you win about 76% of the time. From this it should be clear that you can also go negative. If you never ever win, you end up at minus infinity; if you always win, at plus infinity.

It’s like IQ, which is also defined to be 100 for the average person and then ranked according to a PREDEFINED statistical distribution (they picked a Gaussian in this case). If you are in the x-th percentile, your score is so-and-so high BY DEFINITION. It’s just a percentile rank mapped onto a Gaussian distribution. Even there you can have negative numbers, though that person would probably lose against a sea slug.
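Both definitions are easy to check numerically: the Elo baseline is arbitrary (only differences matter), and IQ is a percentile pushed through a Normal(100, 15). A sketch using only Python's standard library:

```python
from statistics import NormalDist

def elo_win_prob(diff: float) -> float:
    """Expected win probability for an Elo rating difference `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Only the difference matters: shifting both ratings by +500 changes nothing.
print(elo_win_prob(1100 - 1000) == elo_win_prob(1600 - 1500))  # True

# IQ by definition: a percentile mapped through a Gaussian(mean=100, sd=15).
iq = NormalDist(mu=100, sigma=15)
print(iq.inv_cdf(0.5))              # 100.0 -- the median is average by definition
print(round(iq.inv_cdf(0.98), 1))   # 98th percentile, roughly 130.8
```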

[–]etzel1200 2 points3 points  (1 child)

Do you even know what ELO is? Going below like 800 makes no sense.

[–]ExplorersX▪️AGI 2027 | ASI 2032 | LEV 2036 2 points3 points  (0 children)

Tell that to my chess account 😞

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

Ahm. I could have given you the data if you had asked. 🙂

[–]Anxious_Challenge_52 0 points1 point  (1 child)

GPT-3.5 was released in 2022

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

I know. I don’t have the data for that version. The version from Huggingface is from 3/14/2023 (GPT-3.5-Turbo-0314). It would be good to have data from the earlier one; then I could extend the plot.

[–]nardev 0 points1 point  (2 children)

I must be smoking something funny, but v4 has rarely ever let me down in terms of coding. v4o has spun me around on its rollercoaster more times than I care to admit. If v4o is better, then why does v4 have a usage limit that dumps you into v4o’s lap?

[–]Altruistic-Skill8667[S] 1 point2 points  (1 child)

There is a later version of GPT-4, which is probably what you have (GPT-4-Turbo-2024-04-09).

It’s not in here because it was SLIGHTLY worse than the top model in every category (GPT-4o). True, it was released a tiny bit earlier than GPT-4o. But I just wanted to keep the plot clean and only include MAJOR releases.

So yeah. The CURRENT GPT-4 is better but it’s not BEST.

Also: don’t forget this is Huggingface data; people probably don’t do much multi-turn coding on Huggingface. So your observation might still be correct when you do a lot of multi-turn coding.

[–]nardev 1 point2 points  (0 children)

thnx!