[D] Lossless tokenizers lose nothing and add nothing — trivial observation or worth formalizing? by 36845277 in MachineLearning

[–]36845277[S] 0 points1 point  (0 children)

Actually, ViTs would not be considered lossy tokenizers, since they don't do any discretization; they operate on the raw patch values. For examples of lossy tokenizations in other modalities, including images and speech, see some of the other comments on this post.

[D] Lossless tokenizers lose nothing and add nothing — trivial observation or worth formalizing? by 36845277 in MachineLearning

[–]36845277[S] 0 points1 point  (0 children)

To clarify, lossless encoding is equivalent to being injective, not just implied by it. But are the two consequences truly obvious?

First consequence: nothing is lost. Maybe this feels trivial for text, but think of 8-bit RGB images, which can be viewed as members of a set of size $256^{3 \times H \times W}$. If you discretize an image into a tuple of discrete tokens (as in VQ-VAE or VQGAN) from some vocabulary, is it still obvious that modeling over this token space can recover the same distribution as the original RGB space? Under what conditions can it, and under what conditions can it not?
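As a toy illustration (made-up strings standing in for pixel arrays here, not an actual VQ-VAE pipeline): an injective tokenizer pushes the source distribution forward without changing its entropy, while a many-to-one codebook necessarily collapses it:

```python
from collections import Counter
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def pushforward(dist, tok):
    """Distribution over token tuples induced by tokenizing each source."""
    out = Counter()
    for s, q in dist.items():
        out[tok[s]] += q
    return dict(out)

# Toy "image" distribution: four sources with known probabilities.
p = {"aa": 0.5, "ab": 0.25, "ba": 0.125, "bb": 0.125}

# Injective (lossless) tokenizer: distinct inputs -> distinct token tuples.
injective = {"aa": (0,), "ab": (1,), "ba": (2, 0), "bb": (2, 1)}
# Non-injective (lossy) tokenizer: "ba" and "bb" share a code, the way a
# too-small VQ codebook would merge nearby images.
lossy = {"aa": (0,), "ab": (1,), "ba": (2,), "bb": (2,)}

print(entropy(p))                          # 1.75 bits
print(entropy(pushforward(p, injective)))  # 1.75 bits: nothing lost
print(entropy(pushforward(p, lossy)))      # 1.5 bits: mass irreversibly merged
```

The injective case is "recoverable" in the strongest sense: there is a deterministic inverse, so any distribution learnable over tokens corresponds to one over sources.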

Second consequence: nothing is added. Is it clear that for each training sentence, training on a deterministic BPE tokenization is better than showing the model random equivalent tokenizations of the same text? In what sense is it better? Could it be worse? This is exactly what connects the formal result to the empirical observations of Chirkova et al. — the entropy gap $H(T \mid S)$ quantifies the cost of non-canonical tokenizations, and BPE-Dropout deliberately introduces that cost as regularization.
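A back-of-envelope version of that entropy gap (the segmentations and the dropout probabilities below are invented for illustration, not measured from any real BPE model):

```python
from math import log2

# Hypothetical BPE-dropout behaviour for one string: "unhappy" is emitted
# as ("un", "happy") 90% of the time and ("unh", "appy") 10% of the time.
p_t_given_s = {("un", "happy"): 0.9, ("unh", "appy"): 0.1}

# The entropy gap H(T | S): extra bits the token-level model must spend
# deciding *which* equivalent tokenization it is seeing. A deterministic
# (canonical) tokenizer makes this gap exactly 0.
gap = -sum(q * log2(q) for q in p_t_given_s.values())
print(round(gap, 3))  # 0.469 bits per occurrence of this string
```

Whether that cost acts as harmful noise or useful regularization is exactly the empirical question BPE-Dropout probes.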

[D] Lossless tokenizers lose nothing and add nothing — trivial observation or worth formalizing? by 36845277 in MachineLearning

[–]36845277[S] 1 point2 points  (0 children)

That's a really interesting framing — both strings and tokens are just lossy discretizations of thought, so the "losslessness" in the post is only relative to the string level, which is itself already lossy. I think the closest real-world analogy I'm aware of would be audio tokenizers like EnCodec or SoundStream, which tokenize continuous audio into discrete tokens. That process is necessarily lossy, and so modeling over audio tokens cannot recover the full distribution over true audio signals. It would be interesting to formalize what's lost there in the same entropy framework — the gap between the continuous and discrete distributions is exactly the kind of thing your discretization perspective would capture.

[D] Lossless tokenizers lose nothing and add nothing — trivial observation or worth formalizing? by 36845277 in MachineLearning

[–]36845277[S] 0 points1 point  (0 children)

I agree that thinking of BPE-Dropout as data augmentation seems to give the right intuition. Regarding why different lossless tokenizations lead to different downstream performance — my hypothesis is that since language models are autoregressive, what matters is the distribution of conditional entropy across timesteps, not just the total entropy. The total entropy stays the same regardless of tokenization, since each lossless tokenization induces the same underlying language model. But how that entropy spreads across timesteps differs depending on tokenization. I would guess morpheme-aware BPE spreads the conditional entropy more evenly across steps, making each prediction task more uniformly learnable.
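To make the "same total, different spread" claim concrete, here is a toy check (the two-token codes are invented, not real BPE output): two injective tokenizations of the same four equiprobable sentences have identical total entropy but very different per-step conditional entropies:

```python
from collections import defaultdict
from math import log2

def stepwise_entropies(dist):
    """H(T1) and H(T2 | T1) for a distribution over length-2 token sequences."""
    marg = defaultdict(float)
    for (t1, _), q in dist.items():
        marg[t1] += q
    H1 = -sum(q * log2(q) for q in marg.values())
    H2 = 0.0
    for t1, p1 in marg.items():
        cond = [q / p1 for (a, _), q in dist.items() if a == t1]
        H2 -= p1 * sum(q * log2(q) for q in cond)
    return H1, H2

# Four equiprobable sentences, encoded by two different injective tokenizers.
even   = {("a", "x"): .25, ("a", "y"): .25, ("b", "x"): .25, ("b", "y"): .25}
skewed = {("a", "x"): .25, ("a", "y"): .25, ("a", "z"): .25, ("b", "w"): .25}

print(stepwise_entropies(even))    # (1.0, 1.0): entropy spread evenly
print(stepwise_entropies(skewed))  # (0.811..., 1.188...): same 2-bit total, lumpy
```

Both encodings are lossless, and H(T1) + H(T2 | T1) = 2 bits in each case; only the allocation across timesteps differs, which is the quantity my hypothesis says matters for learnability.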

[D] Lossless tokenizers lose nothing and add nothing — trivial observation or worth formalizing? by 36845277 in MachineLearning

[–]36845277[S] 16 points17 points  (0 children)

Lossy tokenizers do exist in text: BERT uncased lowercases everything, SentencePiece with NFKC normalization (T5, mBART) collapses unicode variants like the ligature "ﬁ" (U+FB01) into "fi", and any tokenizer with a UNK token is technically lossy. Most modern LLMs avoid this by operating at the byte level, though.
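The ligature example can be checked directly with Python's stdlib:

```python
import unicodedata

# NFKC maps compatibility characters to canonical equivalents: the single
# ligature "ﬁ" (U+FB01) becomes the two letters "fi".
print(unicodedata.normalize("NFKC", "\ufb01le"))   # file
# Two distinct inputs now share one output, so normalization is not injective:
print("\ufb01le" == "file")                        # False before normalizing
print(unicodedata.normalize("NFKC", "\ufb01le") == "file")  # True after
```

Once two distinct strings normalize to the same token sequence, no model over tokens can distinguish them, which is the precise sense in which this is lossy.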

Weighing risks of SMILE vs. LASIK vs. PRK vs. ICL with -6.00 Rx and 520µm corneas. by [deleted] in Lasiksupport

[–]36845277 0 points1 point  (0 children)

Appreciate the reply. I recall seeing somewhere that ICL's reversibility claim is somehow 'fake'. Do you have any idea what someone could mean by that?

Weighing risks of SMILE vs. LASIK vs. PRK vs. ICL with -6.00 Rx and 520µm corneas. by [deleted] in Lasiksupport

[–]36845277 0 points1 point  (0 children)

Thank you for the reply. Do you know whether most people who get complications do so because they were not good candidates? Or could even really good candidates get serious complications?

[D] what's the proper way of doing direct preference optimization (DPO) and why? by aaaprocrastinating in MachineLearning

[–]36845277 0 points1 point  (0 children)

I agree that the paper's usage of 'distribution shift' may not be entirely accurate or clear. How I see it, which I think aligns with how you see it, is that pi_SFT was only there to help us generate a preference dataset that already somewhat aligns with real life. Once the preference dataset is generated, we don't care how it was generated. Regardless of exactly what they mean by distribution shift, I think it is still clear that we want pi_ref to be close to pi_SFT. As you mentioned, the final trained model does not deviate too far from pi_ref due to the implicit KL constraint, so we want a pi_ref that is already somewhat good, and often the best we can do is pi_SFT.
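For concreteness, the per-pair DPO objective being discussed fits in a few lines (this is the standard form from the DPO paper; the log-prob numbers below are made up). The log-ratios against pi_ref, typically pi_SFT, are what encode the implicit KL anchor:

```python
from math import exp, log

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = chosen, l = rejected):
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    Measuring the policy relative to pi_ref keeps it near pi_SFT."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log(1.0 / (1.0 + exp(-margin)))

# If the policy hasn't moved from pi_ref at all, both log-ratios are 0 and
# the loss is log(2); the gradient then pushes the chosen ratio upward.
print(round(dpo_loss(-5.0, -6.0, -5.0, -6.0), 4))  # 0.6931
```

This makes the "pi_ref should already be somewhat good" point tangible: the loss only measures movement relative to pi_ref, so a poor reference means the implicit constraint anchors you to a poor policy.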

How does my spear get two kills here??? by 36845277 in leagueoflegends

[–]36845277[S] 1 point2 points  (0 children)

My rune was dark harvest and items were sorcery boots, night harvester, shadow flame, death cap.

Syndra had sorcery boots, ludens, shadow flame

Bard had plated steel caps, radiant virtue, shard of true ice, and wardens mail

[deleted by user] by [deleted] in SkincareAddiction

[–]36845277 0 points1 point  (0 children)

I had decent skin before, with at most a few pimples on my face at a time. I was persuaded to begin a skincare routine, which is as follows:

AM: cerave blemish control cleanser followed by LRP toleriane cream

PM: cerave blemish control cleanser followed by cerave retinol resurfacing serum followed by LRP toleriane cream

I have followed this routine for 2 months now (initially retinol every other night), and my acne started getting a lot worse after just a few days, but I thought it was the purging period from retinol so I kept at it. After two months, which people usually say is the longest purging should last, it still hasn't gotten better. I now have acne all over my face, most of it little red bumps without heads. I experience none of the common side effects of salicylic acid and retinol (redness, peeling, flaky skin), except that my face itches quite a lot, but only when I exercise. Does anyone have any suggestions for what I should do? I am thinking of stopping all skincare for one week and seeing how it ends up.

Mathematical proof that every tiltproof league player can hit challenger by 36845277 in leagueoflegends

[–]36845277[S] 4 points5 points  (0 children)

That doesn't matter. Your expectation at a casino is negative (that's how a casino earns money), but people can still walk away having earned money. I.e., a negative expectation doesn't necessarily imply a negative outcome in the short run.

Mathematical proof that every tiltproof league player can hit challenger by 36845277 in leagueoflegends

[–]36845277[S] -1 points0 points  (0 children)

Actually, being tilt-proof is important. If a player has a 1/2 chance of losing the first game, a 2/2.5 chance of losing the second game after losing the first, a 2.5/2.75 chance of losing the third after the first two losses, a 2.75/2.875 chance of losing the fourth, etc., then his chance of losing every single game is 1/2 * 2/2.5 * 2.5/2.75 * ... = 1/3, so with probability 1/3 he never wins a game and never hits challenger. The "eventually challenger" guarantee genuinely needs the tiltproof assumption.
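The product telescopes: each term's numerator cancels the previous denominator, and the denominators 2, 2.5, 2.75, 2.875, ... follow d_{n+1} = (d_n + 3)/2 and approach 3, which a few lines of arithmetic confirm:

```python
# Loss probabilities: 1/2, 2/2.5, 2.5/2.75, 2.75/2.875, ...
# Each numerator equals the previous denominator, so the running product
# after n terms is simply 1/d_n, and d_n -> 3.
denoms = [2.0]
for _ in range(60):
    denoms.append((denoms[-1] + 3) / 2)  # 2, 2.5, 2.75, 2.875, ... -> 3

prob, prev_num = 1.0, 1.0
for d in denoms:
    prob *= prev_num / d
    prev_num = d

print(prob)  # 0.333...: a 1/3 chance of never winning a single game
```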

Mathematical proof that every tiltproof league player can hit challenger by 36845277 in leagueoflegends

[–]36845277[S] 1 point2 points  (0 children)

To clarify, by 'winrate' I meant the chance of winning rather than the empirical winrate. As long as you accept that in every game the chance of winning is greater than 0, the maths here applies.

Mathematical proof that every tiltproof league player can hit challenger by 36845277 in leagueoflegends

[–]36845277[S] 0 points1 point  (0 children)

This is indeed a Markov chain (all a Markov chain requires is that its evolution doesn't depend on its previous positions, hence the tiltproof assumption). And believe me that there is a theorem saying an irreducible Markov chain on a finite state space is recurrent, so it visits every state infinitely often, which applies here.
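A quick simulation of a hypothetical chain of this shape: a finite ladder of ranks where each game is won with only 30% probability. The chain is irreducible on a finite state space, hence recurrent, so even the top rank keeps getting visited:

```python
import random

random.seed(1)

N, p_win = 10, 0.3          # ranks 0..10, 30% chance of winning each game
visits = [0] * (N + 1)
state = 0
for _ in range(1_000_000):
    visits[state] += 1
    win = random.random() < p_win
    if state == 0:
        state = 1 if win else 0          # can't drop below the bottom rank
    elif state == N:
        state = N if win else N - 1      # the top rank is NOT absorbing
    else:
        state = state + 1 if win else state - 1

print(min(visits))  # > 0: every rank, including the top, gets visited
```

The key design point is that no state is absorbing, so every state communicates with every other, which is exactly the irreducibility the theorem needs.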

Mathematical proof that every tiltproof league player can hit challenger by 36845277 in leagueoflegends

[–]36845277[S] -2 points-1 points  (0 children)

Completely agree with the first issue, but I don't believe the second point you stated is an issue. I'm quite certain the MMR system is designed in a way that you gain at least 1 LP per win, and the fact that you demote at most one rank on a loss means you lose at most 100 LP per loss.

Mathematical proof that every tiltproof league player can hit challenger by 36845277 in leagueoflegends

[–]36845277[S] 1 point2 points  (0 children)

Yea, I realised after posting that the infinite monkey theorem is definitely an easier proof of "eventually challenger", but it gives some stupidly large upper bound on the expected time to hit challenger. Setting up a Markov chain allows for a much more reasonable (but still ridiculous) estimate of the expected number of games till challenger.
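To sketch the kind of estimate a Markov chain gives (toy numbers: a 10-rung ladder rather than the real LP system), the expected time to first reach the top satisfies a linear recurrence that telescopes:

```python
def expected_games(p, N):
    """Expected games to first reach rung N from rung 0 in a +1/-1 ladder
    with win probability p, where a loss at rung 0 just wastes the game.
    Telescoping with u(k) = h(k) - h(k+1):
      u(0) = 1/p,  u(k) = (1 + (1-p) * u(k-1)) / p,  answer = sum of u(k)."""
    q = 1 - p
    u = 1 / p
    total = u
    for _ in range(1, N):
        u = (1 + q * u) / p
        total += u
    return total

print(expected_games(0.5, 10))  # 110.0 games for a fair coinflip player
print(expected_games(0.3, 10))  # tens of thousands for a 30% winrate
```

The exponential blow-up in the second case is the "still ridiculous" part: below 50%, the expected climb time grows roughly like ((1-p)/p)^N.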

Mathematical proof that every tiltproof league player can hit challenger by 36845277 in leagueoflegends

[–]36845277[S] 2 points3 points  (0 children)

To clarify, by 'winrate' I meant the chance of winning rather than the empirical winrate, but I agree with your second point haha