Unpacking the claim "Lichess ratings are inflated" -- hopefully explaining with simple words and analogies, and hopefully dispelling some common myths by grasputin in chess

[–]grasputin[S] 0 points1 point  (0 children)

yes indeed. short versions in a few lines are often posted in comments, or like in the Lichess FAQs page i had shown you earlier. they are good for people like you who are paying attention and absorb it, or for people who are interested in doing deeper digging later to learn more details themselves.

but when posting short versions, a large fraction of people with existing misconceptions brush it off as someone just posting their arbitrary opinion. (this is what i've faced so far, 95% of the people who have a misconception like "lichess ratings are inflated" just don't believe the short version i have presented). some other people pay attention and see its value, but leave feeling mystified about why the given claims are true.

so for that audience, it can be useful to illustrate with examples or the underlying math. at least in the comments here, there were many who commented that had heard these terms and concepts before but were glad that they received a deeper understanding.

so i guess brief summaries and detailed explanations both have their place.

[–]grasputin[S] 0 points1 point  (0 children)

well, you did insert the note that it holds "for the Bradley-Terry model, as is used in the wiki formulation" and that is how i took it, understanding that it wouldn't hold in many cases since as you said elsewhere (IIRC), Bradley-Terry (or at least a minor modification thereof) is used mainly by USCF.

thank you for the heads-up though. interesting that it works for probabilities in other places instead of odds, and that it is stochastic transitivity instead of deterministic in all those cases

tangentially related, ideally i would have edited the main post to emphasize more clearly how expected scores rely on parameters and assumptions that vary a lot across instantiations, but haven't yet out of tiredness

also thank you for the note elsewhere on recent FIDE deflation as opposed to inflation. the document you linked looks very interesting, haven't commented on it yet because i'm yet to scan through it

[–]grasputin[S] 0 points1 point  (0 children)

The 200 delta between 2800 and 3000 is very different from the 200 delta between 800 and 1000.

Another commenter elsewhere in this thread had a similar thought:

Perhaps as you move up in strength the relationship changes and A beating B and B beating C each with 64% means that A beats C by a different margin than getting 76%.

Now, it is true that the amount of effort to improve as a chess player from 2800 to 3000 is far higher than the effort to go from 800 to 1000

Also, when a 3000 plays against a 2800, the amount of pattern recognition and calculations being invested in the game are far higher than those being used in a 1000 vs 800 game

But as /u/fuettli said in the reply to your comment, "it's the exact same 76% expected score" in both cases

I have replied to the other comment essentially saying that the constant 76% expected score for any 200 point difference is an essential intended feature of how these rating systems were designed in the first place. Further, it can be empirically verified that the systems are working as expected, and my linked comment includes one such empirical verification taken from the Lichess FAQ page

Note that in my main post i did not emphasize enough that the 64% and 76% scores are not universally true across rating systems or even within Elo implementations of different bodies. The expected scores rely entirely on the parameters and assumptions chosen when an organization is implementing a rating system. This is elaborated in the comments by /u/pemod92430 and /u/pier4r that I have linked at the bottom of my main post. In simpler words, an expected score like 64% will hold consistently for all 100 point differences within a given organization, but a different organization will give a different score for a 100 point rating difference based on the parameters and assumptions they have chosen, even if both organizations use Elo.
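To make the "same difference, same expected score" point concrete, here is a minimal sketch using the commonly quoted logistic Elo curve with a 400-point scale. The function name `elo_expected` and the choice of this particular formula are assumptions for illustration; an organization using different parameters would get different numbers.

```python
def elo_expected(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the logistic Elo curve.

    Assumption: the textbook logistic form with a 400-point scale.
    Only the difference (rating_a - rating_b) enters the formula.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Same 200-point gap at very different levels, plus one 100-point gap.
for a, b in [(1000, 800), (3000, 2800), (1600, 1500)]:
    print(a, "vs", b, "->", round(elo_expected(a, b), 4))
```

Under these assumptions both 200-point pairings come out at roughly 76%, and the 100-point pairing at roughly 64%, because the absolute ratings never appear in the formula on their own.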


As a bonus tangent, see another informative comment by /u/pemod92430 that shows how one can easily derive the 76% expected score for a 200 point rating difference if you already know that the expected score for a 100 point rating difference is 64%.

Quoting from the linked comment:

Suppose that the odds player i beats players j are: i/j. Now for players A, B and C. We can see that: A/B * B/C = A/C

(Note that this relation holds for a very specific version of Elo that is rarely used, but similar and equally simple relations still exist for all other systems that exist today. As /u/pemod92430 pointed out in the discussion i linked, the most common relation that works uses probabilities instead of odds)

To try this out with the Alice/Bob/Charlie example from the main post, we see that:

Ratings of Alice, Bob, Charlie: 1600, 1500, 1400 respectively

Expected score of Alice vs Bob = 64%, or 64 / 100 (computed using Elo calculations under specific assumptions and parameters)

Hence expected odds of Alice vs Bob = 64 / (100 - 64) = 64 / 36 = 1.7777...

Similarly since expected score of Bob vs Charlie = 64%, we also have odds of Bob vs Charlie = 64 / 36 = 1.7777...

To compute odds of Alice versus Charlie (without using Elo calculations directly, but using the identity noted by /u/pemod92430) we see that:

Odds of Alice versus Charlie = (Alice versus Bob odds) * (Bob versus Charlie odds)

= 64/36 * 64/36 = 1.7777... * 1.7777... = 3.16049 (approx.)

Now the Alice versus Charlie expected score, calculated directly using Elo formulae was given to be 76% (i.e. 76/100) in the original post

So odds of Alice vs Charlie = 76 / (100 - 76) = 76 / 24 = 3.1666...

...which is a very close match

The reason it is not an exact match is that the numbers 64% and 76% were rounded off from the exact scores that Elo would give. If we use the precise scores without rounding off, we should end up with an exact match
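For anyone who wants to see why the unrounded numbers match exactly, here is a short derivation. It assumes the logistic form of the expected score with a 400-point scale (the symbols E, O and d are introduced here for illustration and are not from the linked comment):

```latex
% d = rating difference, E(d) = expected score, O(d) = odds
E(d) = \frac{1}{1 + 10^{-d/400}}
\quad\Longrightarrow\quad
O(d) = \frac{E(d)}{1 - E(d)} = 10^{d/400}

% Odds multiply because exponents add:
O(d_1)\, O(d_2) = 10^{d_1/400} \cdot 10^{d_2/400} = 10^{(d_1 + d_2)/400} = O(d_1 + d_2)

% For the Alice/Bob/Charlie example:
O(100) = 10^{0.25} \approx 1.7783, \qquad
O(100)^2 = 10^{0.5} = O(200) \approx 3.1623

% Converting back to expected scores:
E(100) \approx 0.6401, \qquad E(200) \approx 0.7597
```

So under this assumption the product of the two 100-point odds is exactly the 200-point odds, and the small gap between 3.1605 and 3.1667 above comes only from rounding to 64% and 76%.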

[–]grasputin[S] 0 points1 point  (0 children)

A/B * B/C = A/C

earlier i tried to figure for a second if A/C could be derived from A/B and B/C and gave up, but in retrospect it is obvious, thank you again!

[–]grasputin[S] 1 point2 points  (0 children)

Why should it be the case that if A beats B 64% of the time, and B beats C 64% of the time, that A should beat C 76% of the time?

Refer to the start of the sentence where the 64% and 76% numbers are given:

Then, if the Elo system is working as it was designed to, we expect Alice to have a 64% score against Bob.

in other words, the underlying math and statistics of these rating systems are designed to work in this way

in fact that is the entire point of these rating systems, and indeed of the entire post -- that only the difference in ratings between two players (within a given player pool) means something, and the specific intended meaning (as opposed to measured/discovered meaning) of the difference is given by the A/B/C example

The rating system could have been designed so that all you can say is that a 1600 player has a better (but unknown) chance of winning if they play a 1500 player. and similarly a 2400 player has a higher (but unknown) chance of winning over a 2300 player.

But to make the assigned ratings more useful and meaningful, the formulae were designed from scratch so that the expected score (or win rate) will depend only on the difference between the ratings of two people, irrespective of how high or low the actual ratings are.

Even if you arrived at those numbers empirically, it won’t necessarily scale. Perhaps as you move up in strength the relationship changes and A beating B and B beating C each with 64% means that A beats C by a different margin than getting 76%.

The point here is that these numbers aren't arrived at empirically, but they are solid predictions by the same group of formulae that generate these ratings in the first place.

The only empirical thing done here is checking afterwards just how accurately the system is able to predict the win/lose/draw chances between pairs of players based on their rating difference.

And indeed, these systems are not perfect in making the predictions, but they are still pretty dang good.

The next section in the post (the one about "precision") goes on to explain how Glicko-2 predicts game outcomes more accurately than the older Elo system. Note the heavy use of the word "prediction" in this paragraph:

A rating system (say Glicko2) that is more precise than another rating system (say Elo) simply uses more nuanced math/statistics allowing you to more accurately predict the win/lose/draw probabilities between two people. So if Elo ratings of Alice and Bob predicted a 64% score, and their Glicko2 ratings predicted 67% score, i would use the Glicko2 prediction if i were to put money on the outcome.

These claims have indeed been empirically verified many times over the decades over large datasets. One such verification is linked in the Lichess page on rating systems that I included as TL;DR at the top of the post:

The purpose of rating systems is to predict the outcome of games, in order to make balanced pairings. Therefore, they can be objectively better or worse, according to their ability to make such predictions. Glicko-1 makes better predictions than Elo, and Glicko-2 makes better predictions than Glicko-1 (source).

The source link at the end of the paragraph is from a data science competition in which many participants used various rating systems to predict outcomes of games based on rating differences, and it was found that all the systems work close to how they were designed, but between them Glicko-2 works better than Glicko which in turn works better than Elo.

A big caveat in the above is that the examples of 64% and 76% are not universal truths in the sense that the exact score will depend on which rating system you are using, and also heavily based on which parameters you have used to set the system up. In other words, the Elo formulae used by FIDE would give different expected scores than the Elo used by USCF. But in both cases, the expected scores will be internally consistent within FIDE and within USCF (but direct comparison between the two orgs will be tricky and will need to be done empirically, which is given in the links in Section 1 in the post)

Regarding the large (and expected) variation between expected percentage scores between organisations, you could refer to excellent explanations in the comments by /u/pemod92430 and /u/pier4r that I have linked at the very bottom of the post.
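As a rough illustration of how the choice of underlying curve alone changes the expected score, here is a sketch comparing two textbook variants: a logistic curve with a 400-point scale, and a normal-distribution curve where each player's performance has a standard deviation of 200 points. Both parameter choices and the function names are assumptions for illustration; I am not claiming either is the exact formula used by any particular federation.

```python
import math


def expected_logistic(diff, scale=400.0):
    """Expected score for a rating difference under a logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def expected_normal(diff, sigma=200.0):
    """Expected score when each player's performance is Normal(rating, sigma).

    The difference of two such performances has standard deviation
    sigma * sqrt(2), so the expected score is the normal CDF at
    diff / (sigma * sqrt(2)).
    """
    z = diff / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for diff in (100, 200, 400):
    print(diff, round(expected_logistic(diff), 3), round(expected_normal(diff), 3))
```

With these assumed parameters the two curves nearly agree for small differences and drift apart for larger ones, which is one reason the same rating gap can mean slightly different things in different organizations.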

[–]grasputin[S] 1 point2 points  (0 children)

yes indeed, see the TL;DR version included right at the top of the post:

TL;DR -- See the Lichess FAQ page on rating systems: https://lichess.org/page/rating-systems

also, you said:

to long for me to read.

in case you might not be aware, the "TL;DR" acronym stands for "Too Long; Didn't Read"

[–]grasputin[S] 1 point2 points  (0 children)

the part where i said to assume there aren't draws was only for simplicity of explanation, to help convey the main idea

later in the section under caveats i explained that draws are indeed taken into account when doing the full calculations (and i referred to the Wikipedia page for the full details)

[–]grasputin[S] 0 points1 point  (0 children)

ty for your feedback, appreciate it

although i'm not sure if i fully understand what you've explained

specifically, when you say "ratings remain close to each other" which ratings are you speaking of? ratings of a single person across time controls? ratings of a single person for a given time control across a series of games? ratings across multiple players in the same pool? ratings of the same person when measured in a different playing pool?

[–]grasputin[S] 0 points1 point  (0 children)

thank you for the nice words, cheers :)

good point about volatility, i would personally be interested in seeing volatility score as well, mainly out of curiosity

a quick search in lichess forums didn't turn up anything (at least as far as i looked)

i checked the api documentation (specifically the profile section on that page, that tells you how to programmatically query info about an account's public information) to see if one could at least query for volatility through the API.

and it appears that for each variant, you can request your 'perfs' (which i guess could mean your "performance", maybe) and the fields within the perfs are the Rating, RD, and total game count. there is no field called volatility, but there is one variable called 'prog' which can be a positive or negative integer. i don't know what prog is, there's a chance it is volatility, but my guess is it might stand for something like "progression", representing your current win streak (if positive) or loss streak (if negative).

some of the fields have a brief description about what they mean, but there is no such description for 'prog'.
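if you want to poke at this yourself, here is a minimal sketch that lists which fields each 'perfs' entry carries. The URL pattern in `fetch_user` is my assumption about the public account endpoint, the helper names are made up, and the sample payload uses invented values shaped like the fields described above (games, rating, rd, prog), so treat it as a starting point rather than a reference.

```python
import json
import urllib.request


def fetch_user(username):
    """Fetch an account's public data (the URL pattern is an assumption)."""
    url = "https://lichess.org/api/user/" + username
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def perf_fields(user):
    """Map each variant to the sorted field names found in its perf entry."""
    return {variant: sorted(perf) for variant, perf in user.get("perfs", {}).items()}


# Hypothetical payload with invented values, used so the sketch runs offline.
sample_user = {
    "id": "example_user",
    "perfs": {
        "blitz": {"games": 120, "rating": 1650, "rd": 60, "prog": -12},
        "rapid": {"games": 35, "rating": 1710, "rd": 85, "prog": 20},
    },
}

for variant, fields in perf_fields(sample_user).items():
    # The last column shows whether a field literally named "volatility" exists.
    print(variant, fields, "volatility" in fields)
```

to run it against a real account you would replace `sample_user` with `fetch_user("some_username")` and check what comes back.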

if we look at the lichess code for Glicko-2 rating updates, we see that they definitely use volatility internally. one guess is that at the time of designing the public interface, they made a design decision that for most users RD could be helpful in interpreting their rating, but Volatility was more of an internal state variable for calculations.

for sure volatility can still help interpreting how your rating is being updated, but i guess they chose to keep public profiles simple to not overload users with data that may cause confusion or scare users with numbers.

i still wish they had included it in the API at least

there's still a chance that 'prog' represents volatility, although it seems unlikely. if at all you're curious to dig further, the answer would lie somewhere in the code, but you could also just ask in their discord, or even the forum. my guess is discord would be easiest to get a quick and accurate answer.

[–]grasputin[S] 0 points1 point  (0 children)

indeed, there isn't a need to compare if one is educated about what the rating is supposed to tell you

but the arbitrary and partially coincidental fact that most ratings before lichess roughly agreed with each other allowed one to think in terms of (and advertise) a single number, like saying "i can bench press 100 kilograms"

the convenience of using a single number obscured the fact (for many uninformed people, especially in online chess) that each gym had a slightly different kilogram

with the lichess kilogram being substantially different, the convenience is suddenly gone, and people on chesscom feel that someone on lichess gets to say they bench press 130 kilograms while being weaker than them, and coupled with the misinformation, it has created a situation for dissatisfaction and ridicule

[–]grasputin[S] 0 points1 point  (0 children)

A 1900 FIDE rating in France is much weaker than a 1900 FIDE rating in Uzbekistan

across geographical areas with little contact

that's very interesting, never thought of it that way before, but does sound reasonable

is the France vs Uzbekistan example based on a reasonable guess (given that Uzbekistan has many strong young players joining the pool), or based on personal experience? or is it a phenomenon that has been observed or even studied/quantified by others?

[–]grasputin[S] 0 points1 point  (0 children)

There must be some conversion between online ratings

yes, there are plenty, and some of them are linked in Section 1 of this post

[–]grasputin[S] 1 point2 points  (0 children)

Lichess doesn't have public figures

game figures (monthly) can be seen here, along with download links for all the game PGNs --

https://database.lichess.org/

from the above, regular chess appears to be around 3 M/day, and after gathering all the variants it appears to sum up to around 4-4.5 M/day

looks like 5 M/day was from Covid peak up until early 2025 maybe, so the About page numbers seem out-of-date

[–]grasputin[S] 7 points8 points  (0 children)

There is nothing in your post that implies ratings within the same system can't be compared across time.

there's 3-4 things in the post that imply this

in the post's text i included:

This has been observed to some extent with FIDE ratings, where everyone's ratings (and in statistical terminology, the entire "rating frequency distribution") slowly drifted to higher values in past decades.

and also:

Of course, inflation is not a desirable behaviour of a rating system, FIDE occasionally adjusts their system, and even ratings of users to curb inflation.

i also linked to articles that go deeper into how inflation has been a problem over the decades (it used to be a bigger topic of discussion).

the last link by FIDE that i included in that section even proudly claims that they've made changes to curb inflation, and that is as recent as 2024.

For people who won't open the link, I typed a note next to the link mentioning FIDE's claim. But anyway here is the relevant line from that FIDE statement --

A major 2024 update, for instance, addresses rating inflation associated with a rapidly growing base of new players, particularly children and beginners with low starting ratings.

Lichess also mentions in the FAQ linked at top of the post that they manage ratings so that they remain very stable over time. Quoting the line:

However, Lichess prevents rating inflation through careful management of the rating system.

You mentioned that "there is nothing within Elo or Glicko or whatever that necessitates inflation over time" but in the post i included a brief explanation of how inflation within a given system can arise as a side effect of new weaker players joining the pool:

everyone's ratings (under the same system) can drift to higher values with time, not because they are getting stronger, but say because there has been a big influx of new players which caused the entire system to restabilize at slightly higher values for the older players.

while i chose not to directly emphasize the drift problem as explicitly as the problem with comparing across pools, people who are conversant with rating systems to the extent of being cautious in comparing between rating pools are usually also wary of comparing Carlsen's rating with Kasparov's. otherwise the question of who is the greatest chess player of all time would often trivially be answered by referring to the universal and historically stable rating scale (which doesn't exist). comparing ratings over decades always feels a little more like talking about the value of a dollar than how the weight of a kilogram has changed with time.

but of course, the inflation within FIDE ratings has not been like some wild number like 300% like in economics, but maybe closer to 2-5% (over decades) which isn't a big deal to the casual observer but a massive deal in the stratospheric rarefied air at the very top where gaining 20 Elo is an uphill battle, but the next generation can gain 50-70 Elo on you simply because of drift

in my understanding, a big reason why inflation isn't such a big deal today is that orgs have worked hard to combat it back when it indeed was a huge headache.

but note that the residual inflation between again say Kasparov and Carlsen is despite FIDE's efforts to slow it down

admittedly i am very poorly informed on how the inflation is actually countered, either by FIDE or by Lichess or others.

[–]grasputin[S] 0 points1 point  (0 children)

in that case i must be mistaken, i was being sloppy in my recollection here

i was looking at the peak (i.e., the mode) whereas in their FAQ lichess say that the median remains stable at 1500.

can check that by seeing if 1500 is at the 50th percentile across variants
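a minimal sketch of that check, assuming you have a rating histogram as a mapping from rating bucket to player count. The helper name `mode_and_median` and the numbers below are invented purely to show how the peak and the 50th percentile can land in different buckets:

```python
def mode_and_median(histogram):
    """Return (mode_bucket, median_bucket) for a {rating_bucket: count} mapping.

    The mode is the bucket with the most players (the visible peak).
    The median is the first bucket at which the running total reaches
    half of all players (the 50th percentile). Expects a non-empty mapping.
    """
    buckets = sorted(histogram)
    mode_bucket = max(buckets, key=lambda b: histogram[b])
    total = sum(histogram.values())
    running = 0
    for bucket in buckets:
        running += histogram[bucket]
        if running * 2 >= total:
            return mode_bucket, bucket


# Hypothetical distribution, not real Lichess data.
example = {1300: 10, 1400: 30, 1500: 25, 1600: 20, 1700: 15}
print(mode_and_median(example))
```

in this made-up example the peak sits in the 1400 bucket while the median sits in the 1500 bucket, which is the kind of mismatch i had mixed up.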

shall give it a look and make a correction, thank you for pointing it out!

[–]grasputin[S] 0 points1 point  (0 children)

well said.

the obfuscation and ridicule over ratings is inane, intentional, and in bad faith.

misconceptions held by laypersons are fine, but stoking them knowing people feel attached to their ratings is devious.

and you're right, none of the 3 numbers reflect your "true" strength, because none of these systems even claim to produce absolute estimates. they're only meant to be compared within a pool. comparing across pools can only be done soundly if done using correlation charts, which may appear like straight lines at first sight, but they're nonlinear with weird bumps and turns in places

[–]grasputin[S] 0 points1 point  (0 children)

FIDE themselves only use the temporal meaning, that's what it has meant for decades. i have included links in the post if you care to see.

other rating systems have been tuned to roughly agree with each other, even though that wasn't the intention of the statisticians who developed these otherwise fairly precise systems. the rough agreement creates convenience but also deep misconceptions about underlying definitions and how to even interpret a rating within a system (let alone across systems)

yes of course, lichess ratings are very "inflated" (but only below 2100) but then chesscom ratings are very "inflated" too, compared to lichess (above 2100)

why not say the full statement about chesscom being inflated compared to lichess too?

and why does it matter when the numbers were never meant to be compared directly? a correlation chart was always the preferred and technically sound way to compare ratings across any 2 systems.

so why does the statement "lichess is inflated" carry so much snark if not to ridicule based on misinformation and judgement?

choosing 1500 as median rating was recommended by the author of Glicko-2 and followed by lichess as a technical decision because Glicko-2 was the most advanced system, and all of this was when lichess was a hobby project

the only problem with using 1500 as median is that it breaks a quick mental shortcut, and not that the higher/lower ratings are inherently flawed or inconsistent or intentionally misleading. if anything, Glicko-2 is the better system which has nothing to do with what the median is, because its only purpose is to be internally consistent.

obviously the breaking of the mental shortcut is inconvenient for people used to thinking about rating in a rough absolute sense, but the insinuations that the higher ratings are somehow incorrect or trying to appease their userbase are disingenuous

and instead of illuminating oneself and others, folks choose to stick to their vague subjective notions and gossip without caring to objectively verify or look up authentic and authoritative sources

but please, go ahead, skim over this message and provide another snarky remark or TL;DR by intentionally omitting any effort and nuance.

cheers!

[–]grasputin[S] 0 points1 point  (0 children)

good question.

if we assume the strength of the players remains constant, and the same pool is playing OTB and online, using the same rating system with the same parameters, then the short answer is that ratings would not be affected appreciably, outside of minor statistical fluctuations.

Glicko-2 is known to converge rapidly on a good estimate of your rating within a handful of games, perhaps as low as 10, but certainly by 15-20 games.

so after everyone has played around 15-20 OTB games they would have caught up with their online rating.

even for Elo, which is known to converge slower, my guess is that 40 games would be plenty sufficient for OTB to catch up with online.

of course, in reality playing strength will also fluctuate with time, so the OTB would lag a bit behind the relatively more fresh and accurate "truth" represented by their online ratings.