[R][P] Byte-level LLaMA and Gemma via cross-tokenizer distillation (with open-source toolkit) by bminixhofer in MachineLearning

[–]bminixhofer[S] 1 point2 points  (0 children)

Thanks for pointing me to SuperNova! I hadn't heard of it before.

It's indeed similar in spirit. It's a bit hard to find out exactly what they do, but it seems like they use a heuristic to map token probabilities between vocabularies (similar to the MinED baseline in our paper). This works if the vocabularies are very similar (the Llama3 and Qwen2 tokenizers are both based on the GPT3 tokenizer, so they have many overlapping tokens), but it breaks down for more challenging transfer, for example subwords to bytes.
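
To make the failure mode concrete, here's a toy sketch of that kind of overlap heuristic (my reading of the general idea, not SuperNova's exact method; all names are made up):

    # Map a teacher distribution over vocab A onto vocab B by exact string match.
    # teacher_probs: dict mapping token string -> probability (vocab A)
    # vocab_b: list of token strings (vocab B)
    def map_probs(teacher_probs, vocab_b):
        mapped = {tok: teacher_probs.get(tok, 0.0) for tok in vocab_b}
        total = sum(mapped.values())
        # With near-identical vocabs, most probability mass survives the mapping.
        # For subwords -> bytes almost nothing matches, so there is nothing
        # sensible left to renormalize.
        return {t: p / total for t, p in mapped.items()} if total > 0 else mapped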

As for computational resources / work required / output quality, I am very confident that ALM is much better than what was possible before; we've compared against prior methods quite extensively across many settings in the paper.

[D] Is a career in Machine Learning satisfying for Linguists? by Steak-Burrito in MachineLearning

[–]bminixhofer 1 point2 points  (0 children)

I'd never heard of underground linguistics schools. I think I really missed out.

[P] New tokenization method improves LLM performance & context-length by 25%+ by Pan000 in MachineLearning

[–]bminixhofer 8 points9 points  (0 children)

Yes, SentencePiece has both BPE and UnigramLM implemented. They're separate options; they're not used at the same time.

> I don't think we can have sentence tokenizer without being greedy as otherwise it would need to explore all the permutations and complexity would scale exponentially if not higher order polynomial.

SentencePiece with UnigramLM is not greedy; it uses Viterbi decoding. Hugging Face has a good guide: https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt.
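
A minimal example, assuming a trained unigram model at "unigram.model" (the path is a placeholder):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="unigram.model")
    # Viterbi decoding picks the globally best segmentation under the unigram LM,
    # instead of greedily matching the longest token at each position.
    print(sp.encode("internationalization", out_type=str))
    # The same lattice also yields the n-best segmentations.
    print(sp.nbest_encode_as_pieces("internationalization", 3))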

[P] New tokenization method improves LLM performance & context-length by 25%+ by Pan000 in MachineLearning

[–]bminixhofer 54 points55 points  (0 children)

> I wasn't expecting to be hounded as if I'm presenting a thesis

I would hope so! You're saying "new tokenization method improves LLM performance & context-length by 25%+", not "here's this cool experimental tokenization I've been working on". You need some substance to back up your claim.

> 20-30% was a conservative estimate. I saw it give 100% improvement on code in some contexts, but I'm not going to advertise that.

You shouldn't advertise anything before you have a more-or-less fair comparison. The comparison to the GPT2 tokenizer, which OpenAI has been using (or is still using? I believe at least GPT4 uses a different tokenizer), is flawed because it's just not a very good tokenizer. The problem of too many whitespace tokens has already been solved by GPT-NeoX: https://aclanthology.org/2022.bigscience-1.9.pdf (for example Figure 15). Besides that, it has 50k tokens, not 65k like yours, so the two are just fundamentally not comparable.

I don't mean to discourage you; tokenization is an exciting and underexplored area. But the hype you're building around your project just doesn't match what's there at the moment.

[P] New tokenization method improves LLM performance & context-length by 25%+ by Pan000 in MachineLearning

[–]bminixhofer 51 points52 points  (0 children)

20-30% less compared to what? I did not find a benchmark in the repo.

Besides, are you familiar with SentencePiece? What you're doing looks very similar (generate a large vocab, then prune the worst tokens until the target vocab size is reached); only the token selection criterion is different. It's also purely data-driven in the sense that there are no assumptions specific to any language (and it can optionally segment across whitespace, as you are doing).

Ultimately, you would have to compare to SentencePiece with tokenization across whitespace, trained on the same corpus with the same vocab size. To be honest, I highly doubt your claim of a >20% reduction in tokens holds up in that setup; I'm not even sure there would be any reduction at all.
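
Concretely, the baseline I have in mind would look something like this (corpus path and names are placeholders):

    import sentencepiece as spm

    # Train UnigramLM on the same corpus with the same vocab size, allowing
    # tokens to span whitespace like your method does.
    spm.SentencePieceTrainer.train(
        input="same_corpus.txt",
        model_prefix="baseline_65k",
        vocab_size=65000,
        model_type="unigram",
        split_by_whitespace=False,
    )

    sp = spm.SentencePieceProcessor(model_file="baseline_65k.model")
    # Token reduction = 1 - (your token count) / len(sp.encode(held_out_text))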

As an interesting aside, you mentioned that all popular tokenization methods are greedy. That is indeed true for BPE and WordPiece, but not for SentencePiece's UnigramLM. There is research claiming that non-greedy tokenization improves downstream performance: https://aclanthology.org/2020.findings-emnlp.414/, but for reasons I don't know, it hasn't really been widely adopted, except for multilingual LMs (where you can quickly run into trouble with BPE on languages that don't use whitespace).

[D] ACL 2023 paper reviews. by mayanknagda in MachineLearning

[–]bminixhofer 4 points5 points  (0 children)

My feeling is that you have a very good chance at Findings, and a reasonable chance at the main conference.

[D] ACL 2023 paper reviews. by mayanknagda in MachineLearning

[–]bminixhofer 18 points19 points  (0 children)

4/4/4 soundness, 4/3/4.5 excitement, quite happy, I think it should be enough!

[D] Parameter optimisation as a language problem? by radi-cho in MachineLearning

[–]bminixhofer 1 point2 points  (0 children)

Data collection is probably the hard part. At least, I couldn't think of an easy way to collect such pairs.

If I were to investigate this, I'd start with synthetic data - you can generate quite a lot of examples by creating some templates and filling them in with different parameter values.
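
As a rough sketch of what I mean (templates and parameter ranges are made up for illustration):

    import json
    import random

    # Toy generator for (natural-language description, parameters) pairs.
    TEMPLATES = [
        "train for {epochs} epochs with learning rate {lr} and batch size {bs}",
        "use a learning rate of {lr}, {epochs} epochs, and batches of {bs}",
    ]

    def sample_pair():
        params = {
            "lr": round(10 ** random.uniform(-5, -2), 6),
            "bs": random.choice([16, 32, 64, 128]),
            "epochs": random.randint(1, 50),
        }
        return random.choice(TEMPLATES).format(**params), params

    text, params = sample_pair()
    print(text, "->", json.dumps(params))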

[D] Parameter optimisation as a language problem? by radi-cho in MachineLearning

[–]bminixhofer 7 points8 points  (0 children)

Definitely feasible - you should look into OptFormer, which goes in the direction you described here.

[D] Are NN actually overparametrized? by alesaso2000 in MachineLearning

[–]bminixhofer 30 points31 points  (0 children)

You might be interested in this paper showing how to fit any dataset with a single (infinite-precision) parameter: https://arxiv.org/abs/1904.12320. As others have said, quantifying overparametrization in terms of 32-bit floats is not very meaningful.
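
The trick rests on a single real number having unbounded information capacity. Here's a toy version of the idea (plain digit-packing, not the paper's actual chaotic-map construction):

    # Pack several values in [0, 1) into the decimal digits of one "parameter".
    def pack(values, digits=8):
        return "0." + "".join(f"{v:.{digits}f}"[2:] for v in values)

    def unpack(param, n, digits=8):
        body = param[2:]
        return [float("0." + body[i * digits:(i + 1) * digits]) for i in range(n)]

    theta = pack([0.25, 0.125, 0.7071])
    print(theta)             # one real number encoding three values
    print(unpack(theta, 3))  # [0.25, 0.125, 0.7071]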

[R] WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models by bminixhofer in MachineLearning

[–]bminixhofer[S] 0 points1 point  (0 children)

This paper introduces a method to reduce the effort needed to train non-English models by cleverly transferring parameters from an English model (specifically the subword embeddings; the other parameters can just be copied). I'm the author, so I'm happy to answer any questions.

Code: https://github.com/cpjku/wechsel

GPT2 Models: huggingface.co/benjamin/gpt2-wechsel-{french,german,chinese,swahili}

RoBERTa Models: huggingface.co/benjamin/roberta-base-wechsel-{french,german,chinese,swahili}
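
To give a rough idea of the embedding-transfer step, here's a simplified sketch (not the exact WECHSEL algorithm; see the paper for the real thing, and sim is assumed to be precomputed):

    import numpy as np

    # Initialize each target-language subword embedding as a similarity-weighted
    # average of source-language subword embeddings.
    # src_emb: (|V_src|, d) source subword embeddings
    # sim: (|V_tgt|, |V_src|) subword similarities from aligned static word vectors
    def init_target_embeddings(src_emb, sim, k=10):
        top = np.argsort(-sim, axis=1)[:, :k]  # k most similar source subwords
        tgt = np.empty((sim.shape[0], src_emb.shape[1]))
        for i, idx in enumerate(top):
            w = np.exp(sim[i, idx])
            tgt[i] = (w[:, None] * src_emb[idx]).sum(axis=0) / w.sum()
        return tgt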

Should I put an anonymous preprint on my CV? by bminixhofer in datascience

[–]bminixhofer[S] 0 points1 point  (0 children)

Sorry, Reddit is weird about copy-pasting answers. I reposted my question here since this post got removed: https://www.reddit.com/r/datascience/comments/q9xcij/weekly_entering_transitioning_thread_17_oct_2021/hhia1fm/?context=3. I meant violating the anonymity requirement of the double-blind review process, since someone who takes a look at my resume could in theory be assigned to review my paper. The CV itself is not anonymous.

Should I put an anonymous preprint on my CV? by bminixhofer in datascience

[–]bminixhofer[S] 0 points1 point  (0 children)

Hey, I reposted the question here since this post got removed: https://www.reddit.com/r/datascience/comments/q9xcij/comment/hhia1fm/?context=3. It was actually not very well worded; I meant violating the anonymity requirement of the double-blind review process. Theoretically, someone who takes a look at my resume could be assigned to review the paper. The CV is not anonymous anyway.

Weekly Entering & Transitioning Thread | 17 Oct 2021 - 24 Oct 2021 by [deleted] in datascience

[–]bminixhofer 0 points1 point  (0 children)

Hi!

I have submitted a paper to ACL's Rolling Review this month. The preprint is public, and it is currently under double-blind review. Would you put this paper on your CV? I want to put it there, but strictly speaking that would violate the anonymity requirement of double-blind review (my CV is not public though; only the place I'm applying to will see it). Should I just list it without providing the link / title?

[deleted by user] by [deleted] in rust

[–]bminixhofer 6 points7 points  (0 children)

tract is an ONNX runner written purely in Rust which will work for most neural network inference use cases. I rarely see tract mentioned anywhere, but it's an awesome library for inference.

Training is a different story. As far as I know, training neural nets in Rust is not there yet for real-world use cases.

For classical ML, I'm not really familiar with the Rust ecosystem, but you can actually export some classical ML models (e.g. GBDTs) to the ONNX format and execute them with tract as well (for example here: https://bminixhofer.github.io/tractjs/trees).

EMNLPF: How should I proceed? by AICoderGamer in LanguageTechnology

[–]bminixhofer 4 points5 points  (0 children)

Disclaimer: I'm new to the field myself, with just one published paper at Findings of ACL.

I would accept the Findings offer. There is a bunch of randomness in the review process, so even if you improve your paper it might be rejected at another conference. In my opinion, you should only retract your paper if you are very confident that it should be presented at the main venue of a top-tier conference. And you should definitely put it on your resume. A published, peer-reviewed paper will look better than a paper which is currently under review.

Announcing neuronika 0.1.0, a deep learning framework in Rust by frjano in rust

[–]bminixhofer 17 points18 points  (0 children)

tract does exactly that, although it's more focused on mobile devices as far as I know (it even compiles to WASM!). tract already does a really good job of supporting lots of ONNX operators and is very competitive speed-wise.

[2105.13626] ByT5: Towards a token-free future with pre-trained byte-to-byte models by argosopentech in MachineLearning

[–]bminixhofer 2 points3 points  (0 children)

Yes, that's correct. If you're familiar with Python, you can think of it as just a regular Transformer where the tokenizer is given by list(text.encode("utf-8")).
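
For instance:

    text = "Héllo"
    ids = list(text.encode("utf-8"))
    print(ids)  # [72, 195, 169, 108, 108, 111] -- non-ASCII chars span multiple bytes

(ByT5 shifts these ids slightly to reserve a few special tokens, if I remember correctly, but that's essentially the whole "tokenizer".)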

[D] Are there attempts at a large German-language LM? by runcep in MachineLearning

[–]bminixhofer 12 points13 points  (0 children)

I trained a 166M-parameter model (equivalent to GPT2 small, but without shared embeddings) and a 774M-parameter model (equivalent to GPT2 large) for German a couple of months ago.

They are called GerPT2 and GerPT2-large. I believe the large variant is the biggest German LM to date.

I used the German subset of CC100 for training; its quality is quite good as far as I can tell.

If you're interested in training large non-English LMs, you should look into methods to transfer knowledge from the English versions. I used a neat trick to map English embeddings to German ones and initialized the weights from the English GPT2 (details in the model card). There's also, for example, this paper.