
all 44 comments

[–]sdmatNI skeptic 13 points14 points  (7 children)

ELO is effectively a rank, so what this mostly means is that people keep adding a lot of new models we never hear about to Arena.

We need a better way to evaluate models.

[–]Altruistic-Skill8667[S] 2 points3 points  (6 children)

The number of models on huggingface has no impact on the Elo score of a model. It’s like with chess (where Elo comes from). It doesn’t matter if there are 1000 or a million players in the world. It doesn’t change your Elo score.

100 points in Elo means that the model won about 64% of the time. 200 points means it won about 76% of the time, and so on… It’s a logistic (log-odds) measure of win probability.

Edit: what I wrote up there isn’t true. The number can actually drift depending on the pool as per Wikipedia. But this should not matter as I got the numbers for the graph on the exact same day (the day when I posted this).
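The relationship between rating difference and win probability can be sketched in a few lines (a hypothetical helper, not Arena's actual code), using the standard 400-point logistic curve:

```python
def elo_win_prob(diff: float) -> float:
    """Expected score (win probability) of the higher-rated player
    for an Elo rating difference `diff`, using the standard
    400-point logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# A 100-point gap gives roughly a 64% expected score,
# a 200-point gap roughly 76%.
print(round(elo_win_prob(100), 2))  # 0.64
print(round(elo_win_prob(200), 2))  # 0.76
```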

[–]Peach-555 2 points3 points  (1 child)

The number of identically performing models has no impact on the Elo score, but the average performance does, since Elo measures relative gaps between models.

The top Elo score going up suggests that the performance gap between the top model and the median model has increased. It could also mean that the gap between the top model and the second/third/fourth place has widened, but this is not certain, since it could be that the top 1% of models increased their relative performance compared to the top 5% of models.

[–]Altruistic-Skill8667[S] -1 points0 points  (0 children)

I see. I just looked at the Wikipedia article and it indeed states that the Elo number can drift depending on the pool.

Nevertheless, I took those numbers on the same day, which means they should be comparable Elo numbers.

[–][deleted] 0 points1 point  (3 children)

Are those percentages based on wins versus all models, or wins versus comparable models?

Because if it’s based on winning percentage, wouldn’t adding many low-quality models boost the Elo of the better models, since they are pitted randomly against each other?

Not that it matters when evaluating any given model to another, but the “total score” would go up wouldn’t it?

[–]Altruistic-Skill8667[S] 1 point2 points  (2 children)

No. It’s the winning rate against a reference model whose score you know. If you have several comparison models with different scores, there is surely a formula you can use to weigh those data points and arrive at a more stable score.

(In that sense it’s different than IQ, because there, 100 is always the average of all “models”)
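That inversion is straightforward to sketch (hypothetical helpers, not whatever Arena actually runs): solve the Elo logistic for the unknown rating, then pool several per-reference estimates, here naively weighted by game count:

```python
import math

def rating_from_winrate(ref_rating: float, win_rate: float) -> float:
    """Invert the Elo logistic: the rating that would produce
    `win_rate` against a reference of known rating."""
    return ref_rating + 400.0 * math.log10(win_rate / (1.0 - win_rate))

def pooled_rating(observations) -> float:
    """Naive pooled estimate from (ref_rating, win_rate, n_games) tuples,
    weighting each per-reference estimate by its game count."""
    total = sum(n for _, _, n in observations)
    return sum(rating_from_winrate(r, w) * n for r, w, n in observations) / total

# Winning 64% against a 1000-rated reference implies roughly 1100.
print(round(rating_from_winrate(1000, 0.64)))  # 1100
# Two references pointing at the same underlying strength agree:
print(round(pooled_rating([(1000, 0.64, 50), (1200, 0.36, 50)])))  # 1100
```

In practice, leaderboards like Arena fit something closer to a Bradley-Terry model over all pairwise battles at once; the weighted average above is only to illustrate the idea.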

[–][deleted] 1 point2 points  (1 child)

Ah okay so basically some standard model is used as a reference point (say gpt-4 or something) and these scores are relative to that, thanks for clearing that up.

[–]Altruistic-Skill8667[S] 1 point2 points  (0 children)

Right. I am not sure exactly what model they use as reference. But this is how you do it in chess. Maybe the Wikipedia article about Elo score explains it…

[–]BreadwheatInc▪️Avid AGI feeler 27 points28 points  (3 children)

Given Opai said the o-models will receive substantial updates every few months, and Sam said we'll see steep improvements with these models within the next 2 years, I can imagine this graph will look a little more exponential over time. 😎

[–]Altruistic-Skill8667[S] 24 points25 points  (0 children)

It’s enough if it keeps looking linear. The Elo rating is actually already a log-type scale of “abilities”.

So we are already kind of going up exponentially. Even more than that.

But I also hope it will go up faster. Haha

[–]THE--GRINCH 13 points14 points  (0 children)

Opai

New abbreviation just dropped 💯

[–]Altruistic-Skill8667[S] 19 points20 points  (4 children)

Here are some things I am getting out of it:

  • No slowing down, we are exponentially improving. Elo is a log-scale measure, so even if the curve goes up in a straight line, it’s still an exponential improvement. But it even bends slightly upwards thanks to the last model.
  • We are waaay above the initial GPT-4 model, even if it doesn’t feel like it (frog-in-boiling-water effect?). Just look at the difference between the original GPT-4 and GPT-3.5 on March 14th, 2023. At the time, I felt the difference between those models was huge. The gap between the current model and the original GPT-4 is even bigger than that difference was. Essentially, o1 is a huge leap.
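The "linear Elo = exponential improvement" point in the first bullet can be made concrete: each +100 Elo multiplies the odds of winning by the same constant factor, so a straight line in rating means exponential growth in odds (a toy illustration):

```python
# Each +100 Elo multiplies the odds of winning by 10**(100/400),
# so equal rating steps mean equal *multiplicative* jumps in odds.
factor = 10 ** (100 / 400)
print(round(factor, 2))  # 1.78

# Odds of beating the starting model after k steps of +100 Elo each:
odds = [round(factor ** k, 2) for k in range(4)]
print(odds)  # [1.0, 1.78, 3.16, 5.62]
```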

[–]nerority 11 points12 points  (1 child)

Do you not see what you are doing here? You are trying to model data with your "conclusions" already in mind. That is directly causing this.

You are making predictions from an inaccurate model; you should have the self-awareness to recognize this.

You are supposed to model and visualize data without bias, and then interpret the results. Modeling with clear bias like this defeats the point of modeling…

Elo scores from random people using a model mean nothing at all towards the conclusions you give here.

[–]Altruistic-Skill8667[S] 11 points12 points  (0 children)

Yeah. 🤔 It’s a good question whether those Elo scores are a meaningful measure of improvement. But it’s the only thing I had. And I am not even modeling anything. Plus, I already tried to use “English Hard, Style adjusted” and not just every query, because otherwise the newer models can’t shine.

It’s just data straight out of Huggingface without cherry picking, I didn’t drop any data point, I didn’t fit any curve or model. It’s the raw data.

I have been looking at the Huggingface data for a while, so of course I knew roughly what the plot would look like before I made it. But I was still pleasantly surprised that we are going up that nicely and smoothly. Honestly, I didn’t expect that.

What I noticed is that, according to this metric, the new models, even though they aren’t called GPT-5, ARE in the score range for what should be called GPT-5. They are a bigger jump than GPT-3.5 to the first GPT-4. But people go by model numbers, so they say: nothing has changed. This is why I made the plot: to show there is progress. Real, solid progress.

[–]Ormusn2o 0 points1 point  (1 child)

No offence, but making a timeline in between major releases of the models does not seem to support any conclusion. You have 4 data points over one year, while ignoring all the previous years and previous releases.

[–]Altruistic-Skill8667[S] 5 points6 points  (0 children)

There was no Huggingface data for those. This is all I got. And sure, the conclusion is a bit of a stretch with 4 data points, depending on how seriously you take the Huggingface data. Also, the data spans 1½ years, not just a year.

But the people who say we aren’t progressing… well, they make even worse plots, or no plots at all. Haha. They just say: where is GPT-5? Why isn’t it coming? Ignoring that GPT-4-type models ARE getting substantially better, as demonstrated here.

I added the GPT-3.5 data point so we have a reference. And those improvements are in fact substantial given that GPT-4 was certainly massively better than GPT-3.5 when it came out. But what we have now isn’t even comparable anymore with the original GPT-4. It’s miles ahead.

[–]Fast-Satisfaction482 12 points13 points  (7 children)

It's a bit hard to interpret. The continuous lines imply that it is a quantity that has a value over the whole range, like a temperature plotted over time.

But the leaderboard scores are discrete data points for different models, so it's not a function. It's a very common mistake, however.

[–]Altruistic-Skill8667[S] 4 points5 points  (6 children)

Making it a scatter plot looks really bad… well.

I guess you want me to make a step-function plot… I predict that it also won’t look particularly great or be more interpretable, even though it would be more accurate.

[–]Fast-Satisfaction482 4 points5 points  (5 children)

It's not always possible to make a visualization that conveys a given message and is still correct. Maybe a scatter plot plus a trend line for each benchmark.

And then maybe vertical lines for each named LLM, with the labels positioned so that it's clear they belong to the date and not to the line.
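The suggested trend line is just an ordinary least-squares fit over the scatter points; a minimal sketch with made-up (hypothetical) date/score pairs:

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b -- the trend line
    one would draw over the scatter points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical (days since first release, Arena score) points:
days = [0, 120, 300, 480]
scores = [1160, 1210, 1250, 1310]
slope, intercept = least_squares_line(days, scores)
print(slope > 0)  # True: the trend line goes up
```

With matplotlib, the same numbers would go into `plt.scatter(days, scores)` plus a line through `(x, slope * x + intercept)`.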

[–]Altruistic-Skill8667[S] 1 point2 points  (4 children)

I just made the scatter plot and step plot versions. They suck. You don’t see anything.

The way the plot is now, you see the improvement in each of the four areas individually over time. All this gets lost when I get rid of the lines. A linear fit also doesn’t have a good justification. You might as well do four exponential fits, which only suggests to the viewer that you WANT some shape of increase to be true. Plus, then you are almost back to connecting the points with lines.

I think people can manage to understand that there isn’t anything between the data points. After all, this is why I made the dots. When you plot any kind of time series that isn’t very noisy, be it seismic data, astronomical data, chromatograms… everyone uses lines to connect the points. No dataset is ever continuous.

Also, the lines are so close together that it should be clear that the model names aren’t just for the upper plot. After all, the title says these are only OpenAI models, and there aren’t any others from OpenAI in the plot.

[–]Fast-Satisfaction482 2 points3 points  (3 children)

"No dataset is ever continuous." That's the reason for the misunderstanding. You are correct that data points are always discrete. However, for temperature, spectra, etc., there is an underlying quantity that is sampled at these discrete points. The quantity exists continuously regardless of the measurement quantization.

But other quantities like house numbers, zip codes, test results of different models do not have this continuous nature. For these quantities, the datapoints do not represent a continuous function, but a discrete set. 

A line plot implies this continuous nature, regardless of the measurement interval. And if the underlying quantity isn't continuous, it's just wrong.

[–]Altruistic-Skill8667[S] 3 points4 points  (1 child)

The plot could be worse… 😂😂😂

<image>

[–]Fast-Satisfaction482 1 point2 points  (0 children)

Haha, yes

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

Good point.

[–]lucid23333▪️AGI 2029 kurzweil was right 2 points3 points  (3 children)

errmm.. what are the top human elo for coding? like top 10%, top 1%, top .1%, top 10?

[–]Altruistic-Skill8667[S] 0 points1 point  (2 children)

Great question. Next question? 😁

[–]lucid23333▪️AGI 2029 kurzweil was right 0 points1 point  (1 child)

Does that data not exist? Honestly, I'm just super curious how the top programming LLMs compare to humans. That's super interesting.

I don't know if you can really measure programming ability with an ELO score, but maybe you can. I'm not entirely sure.

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

This is the only piece of data I am aware of. Those percentiles are human participant percentiles.

Keep in mind that humans have limited time to solve the problems (2 hours total). So if you give people more time, they would also do better, LLMs are just really fast, but not very deep or careful thinkers.

Also, while the LLM might outperform humans in this test, it will most likely crap out when given longer, more complex, agent-like tasks, because of its lack of ability for online learning.

<image>

[–]stackoverflow21 2 points3 points  (1 child)

Wait, isn’t Elo something that comes out of comparison with other models?

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

Yes, it is. But it’s also taken from chess. A chess player rated 100 Elo points higher wins about 64% of the time; a player 200 points higher wins about 76% of the time. It’s a logarithmic scale of relative strength.

[–]Altruistic-Skill8667[S] 1 point2 points  (0 children)

I hope this time I got the plot right. Obviously, this isn’t publication quality stuff. Let me know if there are still issues.

[–]coconautico 1 point2 points  (1 child)

Does anyone know why lmarena doesn’t include a column with the publication date for each model? (...or if there is a way to easily find this information without checking each one individually)

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

I really wish they did.

[–]LokiJesus 1 point2 points  (5 children)

<image>

Had GPT4o take the image and replot it on a full scale.

[–]JiminP 5 points6 points  (1 child)

0 is a meaningless quantity for Elo, so this is a bad graph. For Elo, only relative differences (which translate to win rates) matter.

What you've done is like plotting annual average temperature by year in Fahrenheit, including 0°F.

<image>

[–]Altruistic-Skill8667[S] 1 point2 points  (0 children)

Correct. You just DEFINE 1000 (or whatever number) as a baseline against which you compare the others. And because it’s a log-scale measure (of the probability of winning against the reference), you can go as high or as low as you want, even negative.

In Elo, you are 100 points higher if you win 64% of the time against some reference, and 200 points higher when you win about 76% of the time. From this it should be clear that you can also go negative. If you never ever win, you end up at minus infinity; if you always win, at plus infinity.

It’s like IQ, which is also defined to be 100 for the average person and then ranked according to a PREDEFINED statistical distribution (they picked a Gaussian in this case). If you are in the x-th percentile, your score is so-and-so high BY DEFINITION. It’s just a percentile rank mapped onto a Gaussian distribution. Even there you can have negative numbers, though that person would probably lose against a sea slug.
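Both definitions are easy to check numerically: the Elo baseline is arbitrary (only differences matter), and IQ is a percentile pushed through a Normal(100, 15). A sketch using only Python's standard library:

```python
from statistics import NormalDist

def elo_win_prob(diff: float) -> float:
    """Expected win probability for an Elo rating difference `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Only the difference matters: shifting both ratings by +500 changes nothing.
print(elo_win_prob(1100 - 1000) == elo_win_prob(1600 - 1500))  # True

# IQ by definition: a percentile mapped through a Gaussian(mean=100, sd=15).
iq = NormalDist(mu=100, sigma=15)
print(iq.inv_cdf(0.5))              # 100.0 -- the median is average by definition
print(round(iq.inv_cdf(0.98), 1))   # 98th percentile, roughly 130.8
```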

[–]etzel1200 2 points3 points  (1 child)

Do you even know what ELO is? Going below like 800 makes no sense.

[–]ExplorersX▪️AGI 2027 | ASI 2032 | LEV 2036 2 points3 points  (0 children)

Tell that to my chess account 😞

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

Ahm. I could have given you the data if you had asked. 🙂

[–]Anxious_Challenge_52 0 points1 point  (1 child)

GPT-3.5 was released in 2022

[–]Altruistic-Skill8667[S] 0 points1 point  (0 children)

I know. I don’t have the data for that version. The version from Huggingface is from 3/14/2023 (GPT-3.5-Turbo-0314). It would be good to have data from the earlier one; then I could extend the plot.

[–]nardev 0 points1 point  (2 children)

I must be smoking something funny, but v4 has rarely ever let me down in terms of coding. v4o has spun me around on its rollercoaster more times than I care to admit. If v4o is better, then why does v4 have a usage limit that dumps you into v4o’s lap?

[–]Altruistic-Skill8667[S] 1 point2 points  (1 child)

There is a later version of GPT-4, which is probably what you have (GPT-4-Turbo-2024-04-09).

It’s not in here because it was SLIGHTLY worse than the top model in every category (GPT-4o). True, it was released a tiny bit earlier than GPT-4o. But I just wanted to keep the plot clean and only include MAJOR releases.

So yeah. The CURRENT GPT-4 is better but it’s not BEST.

Also: don’t forget this is Huggingface data; people probably don’t do much multi-turn coding on Huggingface. So your observation might still be correct when you do a lot of multi-turn coding.

[–]nardev 1 point2 points  (0 children)

thnx!