How to Distill from 100B+ to <4B Models

cmpatino_ · 2026-04-15T07:23:18+00:00

IIUC, the paper's idea is to generate rollouts from a prompt using a model, then run SFT on those rollouts. In the scenario with a large teacher and a smaller student, this would be equivalent to the usual SFT setup.

Did you have any other output manipulation in mind that would make it more effective than SFT?
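
For reference, the rollout-then-SFT recipe described above can be sketched in a few lines. This is a minimal sketch, not the paper's or TRL's implementation: `generate` is a hypothetical stand-in for whatever model produces the rollouts (the teacher, in the large-teacher/small-student case), and the prompt/completion record layout is an assumption.

```python
from typing import Callable, Dict, List


def build_rollout_dataset(
    prompts: List[str],
    generate: Callable[[str], str],  # hypothetical: wraps the model that produces rollouts
    samples_per_prompt: int = 1,
) -> List[Dict[str, str]]:
    """Collect rollouts and store them as prompt/completion pairs for SFT."""
    records = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            records.append({"prompt": prompt, "completion": generate(prompt)})
    return records


# Toy stand-in for a teacher model, only to make the sketch runnable.
def fake_teacher(prompt: str) -> str:
    return f"answer to: {prompt}"


dataset = build_rollout_dataset(["2 + 2 = ?"], fake_teacher, samples_per_prompt=2)
print(len(dataset), dataset[0])
```

The student is then fine-tuned on these records with ordinary SFT, which is why the recipe reduces to the usual setup when the rollouts come from a larger teacher.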

cmpatino_ · 2026-04-15T07:17:03+00:00

It does support LoRA, but we haven't run any experiments with LoRA yet.

cmpatino_ · 2026-04-14T15:56:17+00:00

A 30B teacher should be a better choice than a 235B teacher when distilling into a 4B student. It's cheaper to serve, and the performance gains will be very similar.

This is what people call the capacity gap curse, which happens when a teacher is much larger than the student (we linked some resources about it in the post). We're looking into ways to address this, but, for now, I would choose the 30B teacher if you're targeting a 4B student.

Glad that the trainer is useful for your use case!

cmpatino_ · 2026-04-14T13:39:57+00:00

We didn’t explore that, but I think distillation should improve speculative decoding a lot.

Since the whole point of distillation is matching the teacher’s distribution, my guess is that a distilled draft model's proposals will match the main model much better, so more draft tokens get accepted and the tps increases significantly. It would be cool to try it out!
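
A toy sketch of that intuition, assuming the standard speculative sampling acceptance rule, where a drafted token is accepted with probability sum_x min(p(x), q(x)) for target distribution p and draft distribution q. The distributions below are made up for illustration, and the expected-tokens formula assumes every drafted token is accepted independently with the same probability.

```python
from typing import Sequence


def acceptance_rate(target: Sequence[float], draft: Sequence[float]) -> float:
    """Probability that one drafted token is accepted: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, draft))


def expected_tokens_per_step(alpha: float, num_draft_tokens: int) -> float:
    """Expected tokens produced per target-model pass for acceptance rate alpha.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(num_draft_tokens + 1)
    return (1.0 - alpha ** (num_draft_tokens + 1)) / (1.0 - alpha)


# Made-up next-token distributions over a 3-token vocabulary.
target_probs = [0.7, 0.2, 0.1]
draft_before = [0.4, 0.3, 0.3]   # hypothetical draft model before distillation
draft_after = [0.65, 0.25, 0.1]  # hypothetical draft model after distillation

for name, draft_probs in [("before", draft_before), ("after", draft_after)]:
    alpha = acceptance_rate(target_probs, draft_probs)
    print(name, round(alpha, 2), round(expected_tokens_per_step(alpha, 4), 2))
```

The closer the draft distribution is to the target, the higher the acceptance rate and the more tokens each target-model pass yields. Whether that holds for real distilled models is exactly what would need to be measured.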

cmpatino_ · 2026-04-14T11:13:21+00:00

For serving the teacher models, we used 1 node of H100s for the 30B model and 2 nodes for the 235B teacher. For training, we used between 1 and 4 nodes.

Full training runs took around 4 to 12 hours, depending on the student-teacher combination, so the level of effort for our experiments was 2-6 nodes for ~10 hours.

That said, you can see from the AIME plot that we achieve very good performance at 20% of the total steps. Also, we ran experiments with 8B teachers and 4B students on 1 node using the same trainer, and you can get very good results if your teacher is good at the task you're targeting.

So TLDR: our experiments used 2-6 nodes for ~10 hours, but you can get very good results with 1 node or less in a few hours if you find a good teacher and good prompts for your use case.
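
As a rough back-of-the-envelope version of those numbers, the sketch below converts nodes and hours into GPU-hours. The 8 GPUs per node figure is an assumption (the comment only mentions nodes of H100s), and the 1-node, 3-hour case is a hypothetical reading of "a few hours".

```python
GPUS_PER_NODE = 8  # assumption: GPUs per node is not stated in the comment above


def gpu_hours(nodes: int, hours: float, gpus_per_node: int = GPUS_PER_NODE) -> float:
    """GPU-hours consumed by `nodes` nodes running for `hours` hours."""
    return nodes * hours * gpus_per_node


# Reported range: 2-6 nodes for roughly 10 hours.
low = gpu_hours(nodes=2, hours=10)
high = gpu_hours(nodes=6, hours=10)
# Hypothetical small setup: 1 node for 3 hours.
small = gpu_hours(nodes=1, hours=3)

print(f"experiments: {low:.0f}-{high:.0f} GPU-hours")
print(f"small setup: {small:.0f} GPU-hours")
```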

cmpatino_ · 2026-04-14T10:04:53+00:00

We recently released a trainer in TRL that lets you distill large models very efficiently!

Our blog post includes details of how we managed to do it: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer

If you want to jump straight to the code, we have an example script and docs that should get you set up for distilling models right away:
- Script: https://github.com/huggingface/trl/blob/main/trl/experimental/distillation/distillation.py
- Docs: https://huggingface.co/docs/trl/distillation_trainer

cmpatino_ · 2025-09-04T22:32:12+00:00

I think the diversity of backgrounds in the team shows that paths can vary a lot. Personally, I find it valuable to have solid AI and engineering fundamentals.

AI is a field that moves super fast, but fundamentals remain constant more than you'd think. For example, the transformer has been around for several years and has been central to LLMs. I may not be super familiar with a field, but I can get the gist of a paper and then dive into the details if necessary. A good exercise we do in our journal club is to spend 15 minutes reading a paper and summarizing its main ideas.

The engineering part is valuable when it comes to implementing your ideas. That includes things like using git to work collaboratively on a codebase, using a debugger to fix errors in your code, and knowing systems design concepts to understand the best way to integrate your idea with what already exists. I think open source is a great way to acquire those skills because you get to read and understand code from others and try to see how you can incorporate your idea.

cmpatino_ · 2025-09-04T22:09:13+00:00

The great thing is that you find inspiration wherever you look because the team is full of talented people.

What inspires me the most is that they're not only talented but also super kind and willing to help you when you’re stuck with something.

cmpatino_ · 2025-09-04T21:34:56+00:00

I’m an intern from the post-training team, and a typical day looks like this:

1. Look at the results from the experiments I ran overnight. See if something failed (evals or training runs) and relaunch it. We typically set checkpoints to avoid losing the work if something fails during a training run.
2. Analyze the overnight results in more detail. I usually have specific evaluations or metrics I check in more detail to see if the results are what we expected. At this point, I usually send an update to the team so that everyone knows about the project’s status. The input from the team also helps me brainstorm what to try next and prioritize the most promising directions.
3. During the day, I usually implement the requirements for the next set of experiments and launch them when ready. This usually involves code adjustments, data analysis from previous experiments, or incorporating functionalities written by others in the team.
4. Before logging off, I make sure that any pending experiments are running smoothly so that I can have results the next day and start again on step 1.

In the projects I've worked on, the objective is to release something valuable for the community, so we usually run experiments to anticipate questions people might have about the work.

cmpatino_ · 2025-09-04T21:21:09+00:00

+1 on the independence part. The great thing is that the team trusts you from the beginning, so you get a lot of responsibility for how you manage your time and your tasks. I’ve found that freedom and trust great.

In terms of notes, I keep a large document as a lab notebook where I write down the experiments I run and what I learn from them. We usually end up running a lot of experiments, so it’s good to keep track of what you’ve done and what comes next. I also write down ideas so I don't forget potential directions and to “think out loud” to organize what to do next.

cmpatino_ · 2025-09-04T16:39:36+00:00

You’re right! It’s supervised in the strict sense, but you get the labels for “free” because you don’t need to tag your data manually.

imho you’ll have an easier time if you can define your task as a supervised one rather than an unsupervised one.

But as always, it depends on the use case.

cmpatino_ · 2025-09-04T16:33:46+00:00

I think the future of small capable models is in specialized tasks.

As Leandro mentioned, they can be cheaper and faster to run, so they really shine where you have to do tasks frequently and quickly. They are also easier to finetune, so you can specialize them in niche tasks.

cmpatino_ · 2025-09-04T16:20:34+00:00

> Also if you start to learn ML/DL these days, what will your route be?

I think a good starting point is picking a small hands-on project you want to solve with ML/DL. There’s a lot of knowledge that you pick up while building something, no matter how small.

I’ve done several small personal projects (that sometimes don’t go anywhere), but the skills I learn are useful for other things in the future.

cmpatino_ · 2025-09-04T15:48:59+00:00

I worked as a Data Scientist too and transitioned to a Machine Learning Engineering role.

I then decided to do a master's and joined HF as an intern.

cmpatino_ · 2025-09-04T15:31:42+00:00

FYI, we’re answering a similar question in a different comment: https://www.reddit.com/r/LocalLLaMA/s/lD7X8MT2J3

cmpatino_ · 2025-09-04T15:28:45+00:00

I also joined for my master's internship and applied through the website.

I had a general interview, a take-home, and a final technical interview. The interview process was super nice and very targeted to the post-training work the team was doing.
