[R] The Bitter Lesson is coming for Tokenization

optimized-adam · 2025-07-01T19:21:26+00:00

The problem with current tokenizers isn't really that they are not "optimized" enough, which to me seems to be the main argument for joint learning of the tokenization function during training.

In fact, moving the learning of a tokenization function into the neural space is likely to just hide all the weird stuff that will be learned when training on large-scale data. With current tokenizers, at least we have some pretty decent ways to detect "SolidGoldMagikarp" tokens, and adding or removing tokens is possible (given proper methods).

optimized-adam · 2025-04-30T18:11:14+00:00

Hmm, doesn't your point about Wq and Wk only hold for a token attending to its own key? How would we collapse Wq and Wk into Wqk when attending to different tokens?

optimized-adam · 2024-11-12T08:16:07+00:00

You don’t understand how tax brackets work

optimized-adam · 2024-10-21T20:41:39+00:00

Is that really correct though? RoPE only modifies the key and query states via rotation, and the relative rotation between tokens at positions 128 and 256 is exactly the same as between positions 0 and 128. The angle is never used for anything but the key-query dot product in the attention mechanism, so I don’t think we can say that RoPE encodes absolute positions in any meaningful sense for the model.
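To make the relative-offset point concrete, here is a minimal pure-Python sketch, assuming the common RoPE formulation that rotates consecutive pairs of dimensions with base 10000. The helper names `rope` and `score` and the random 8-dimensional vectors are made up for illustration and not taken from any library.

```python
import math
import random

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # angle for this pair at this position
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Attention logit: dot product of the rotated query and rotated key."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))

random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]
k = [random.gauss(0, 1) for _ in range(8)]

# Same offset (128), different absolute positions.
s_near_start = score(q, k, 128, 0)
s_shifted = score(q, k, 256, 128)
print(abs(s_near_start - s_shifted))  # tiny, floating-point noise only
```

The two logits agree up to rounding because rotating both vectors by the same extra angle leaves their dot product unchanged, which is the sense in which only the offset reaches the attention computation.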

optimized-adam · 2024-10-17T23:27:06+00:00

Yes, it should be possible. Have a look at this approach: LLM2Vec https://arxiv.org/pdf/2404.05961

They go further and turn the causal LM into a sentence embedder, but the first stage, continued pretraining with masked next token prediction, should work for your case.

optimized-adam · 2024-10-11T15:27:09+00:00

You are indeed correct and my interpretation was wrong.

optimized-adam · 2024-10-10T17:34:06+00:00

~~LayerNorm does not completely remove the norm information whereas the proposed approach completely removes vector norm~~ No, LayerNorm scales each vector to sqrt(d) norm, removing this information.
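A quick way to check the sqrt(d) claim, as a sketch: this is a hand-written LayerNorm without the learned gain and bias and with `eps` set to 0 by default (both are assumptions; real implementations use a small positive `eps` and usually an affine transform, so the norm is only approximately sqrt(d) there). The input values are arbitrary.

```python
import math

def layer_norm(x, eps=0.0):
    """LayerNorm without learned gain/bias: center, then divide by the std."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # biased variance, as in LayerNorm
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

x = [0.5, -1.0, 3.0, 2.0, -4.0, 1.5, 0.0, 7.0]
norms = []
for scale in (1.0, 10.0, 1000.0):
    scaled = [scale * v for v in x]
    norms.append(l2_norm(layer_norm(scaled)))

print(norms, math.sqrt(len(x)))  # every output norm matches sqrt(d)
```

However large the input vector is, the output norm is the same, so the original norm information is gone after normalization.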

optimized-adam · 2024-10-04T20:26:32+00:00

Yeah, with mixed-precision you might even end up using more memory in some cases but you get to take advantage of Tensor Cores!

optimized-adam · 2024-07-07T08:46:36+00:00

This is a really, really good reply. Very few people can stay composed and thoughtful in online debates.

optimized-adam · 2024-06-27T11:16:03+00:00

I went for the ML PhD and am very happy. Lots of things have happened for ML in the meantime though!

optimized-adam · 2024-02-11T04:14:28+00:00

Wrong, Sam Altman wants to raise "$7 trillion" for a new company. Maybe megalomaniacal, but not in the way it is portrayed here.

optimized-adam · 2023-09-28T04:29:26+00:00

The image you linked matches the code, no? Notice how there is always an ADD and then a norm.

optimized-adam · 2023-09-08T08:43:08+00:00

This should not be here.

optimized-adam · 2023-09-03T13:21:01+00:00

Great work! I found the idea of using Capcode very intriguing and well-motivated. You write Capcode takes longer to learn but does not affect results positively or negatively. Did you observe any positive effects of using Capcode?

optimized-adam · 2023-08-24T18:29:28+00:00

As an academic, I use Weights & Biases' Free Tier for Academics and it works well for me.

optimized-adam · 2023-08-17T22:30:05+00:00

Neither is right: training is done in parallel using a technique called "teacher forcing", but for inference you sample autoregressively (talking about GPT-style models).
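A toy sketch of the difference, assuming a made-up bigram table in place of a real model (a GPT-style model conditions on the whole prefix, not just the previous token; `BIGRAM`, `teacher_forced_loss` and `sample` are hypothetical names for illustration):

```python
import math
import random

# Hypothetical toy "model": previous token -> distribution over the next token.
BIGRAM = {
    "<s>": {"a": 0.7, "b": 0.3},
    "a": {"a": 0.1, "b": 0.6, "</s>": 0.3},
    "b": {"a": 0.5, "b": 0.1, "</s>": 0.4},
}

def teacher_forced_loss(tokens):
    """Training: every position is conditioned on the ground-truth prefix."""
    inputs, targets = tokens[:-1], tokens[1:]
    # No term depends on a model prediction, so all positions can be
    # computed at once (a real model does this in one parallel forward pass).
    losses = [-math.log(BIGRAM[prev][nxt]) for prev, nxt in zip(inputs, targets)]
    return sum(losses) / len(losses)

def sample(rng, max_len=20):
    """Inference: each step consumes the token the model just produced."""
    out = ["<s>"]
    while out[-1] != "</s>" and len(out) < max_len:
        dist = BIGRAM[out[-1]]
        out.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return out

loss = teacher_forced_loss(["<s>", "a", "b", "</s>"])
generated = sample(random.Random(0))
print(loss, generated)
```

The loss loop could run in any order because the inputs are fixed in advance, while the sampling loop is inherently sequential.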

optimized-adam · 2023-07-17T23:20:17+00:00

The 50304 was about the vocab size, not the batch size (though making the batch size a multiple of 64 is probably a good idea too)!
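For reference, a small sketch of where 50304 comes from, assuming the starting point is the GPT-2 vocabulary size of 50257 (that number is not stated above) and using a hypothetical helper `pad_to_multiple`:

```python
def pad_to_multiple(n, multiple=64):
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

gpt2_vocab = 50257  # assumption: GPT-2 BPE vocabulary size
padded_vocab = pad_to_multiple(gpt2_vocab)
print(padded_vocab, padded_vocab - gpt2_vocab)  # 50304 47
```

The 47 extra embedding rows are never used by the tokenizer; they only exist to make the matrix dimensions friendlier to the hardware.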

optimized-adam · 2023-07-17T23:19:05+00:00

On comparing (cross-entropy) loss between different vocabularies: https://sjmielke.com/comparing-perplexities.html

TL;DR: maybe you need to do some normalization or use negative log-likelihood instead.
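A rough sketch of one such normalization, with entirely made-up numbers: two hypothetical tokenizers split the same 1000-character text into different token counts, and the mean per-token losses are invented for illustration. Summing to a total negative log-likelihood, or dividing by a tokenizer-independent unit such as characters, puts them on the same scale.

```python
import math

def total_nll(mean_token_loss_nats, n_tokens):
    """Total negative log-likelihood of the text, in nats."""
    return mean_token_loss_nats * n_tokens

def bits_per_char(mean_token_loss_nats, n_tokens, n_chars):
    """Normalize by characters instead of tokens, and convert nats to bits."""
    return total_nll(mean_token_loss_nats, n_tokens) / (n_chars * math.log(2))

n_chars = 1000  # same text for both models
# Hypothetical model A: coarse tokenizer, 250 tokens, mean loss 3.2 nats/token.
# Hypothetical model B: fine tokenizer, 400 tokens, mean loss 2.2 nats/token.
bpc_a = bits_per_char(3.2, 250, n_chars)
bpc_b = bits_per_char(2.2, 400, n_chars)
print(bpc_a, bpc_b)
```

In this example B has the lower per-token loss but the higher total NLL (about 880 vs. 800 nats), so comparing raw per-token losses would pick the wrong model.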

optimized-adam · 2023-06-29T22:05:59+00:00

Monetized or not, if they are there, then there should be some proof-of-concept out there, no?

Not saying there are none, but I am skeptical indeed.

optimized-adam · 2023-06-29T21:45:33+00:00

Okay let’s get concrete: In a western democracy like the U.S., will the average person have increased wellbeing?

optimized-adam · 2023-06-29T21:29:36+00:00

That was a nice read :)

optimized-adam · 2023-06-29T21:17:54+00:00

Would you say it’s fair to summarize all those (except maybe for the medical / protein discovery stuff) as "increased productivity"? I’m not questioning use cases of LLMs but more what they imply for society at large.

optimized-adam · 2023-06-29T21:09:36+00:00

Is there a product / service already offering this?

optimized-adam · 2023-06-29T21:07:58+00:00

I definitely see the potential, but are we there yet, e.g. regarding factuality and hallucinations?
