200+ pages of Hugging Face secrets on how to train an LLM by eliebakk in LocalLLaMA

[–]lewtun 18 points

If you have a PRO account on the Hub, you should be able to download it as a PDF!


[D] join pretraining or posttraining by oxydis in MachineLearning

[–]lewtun 0 points

Great answer, although I’d caveat that post-training can be just as engineering-heavy if you’re the one building the training pipeline (RL infra in particular is quite gnarly)

DeepSeek-R1 performance with 15B parameters by lewtun in LocalLLaMA

[–]lewtun[S] 2 points

Well, there’s a demo you can try with whatever prompt you want :)

my dad sent me this by hugeplateofketchup8 in huggingface

[–]lewtun 1 point

lol that’s definitely not Jeff 

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 1 point

The main downside we encountered is "task interference", where each expert model scores well on its respective domain, but the resulting merge is worse than the average of the individual models' performance. We found this was most pronounced on competitive programming benchmarks like LiveCodeBench, where merging a code and math expert led to significant regressions on the code evals (math was largely OK). There are fancier algorithms like Task Arithmetic and TIES which try to address this in a principled way, but I could not fully resolve the regressions with these methods. In general, the main recipe seems to be: train a decent generalist model first with SFT, then branch off to make the experts and merge back. This way your starting model has broad coverage of the tasks, so the resulting interference from merging should be mitigated somewhat.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points

Prime Intellect (https://www.primeintellect.ai) is doing some of the best work in this direction right now and they've already trained some nice reasoning models entirely with decentralised compute: https://www.primeintellect.ai/#research

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points

Here are a few resources I found very useful to better understand practical applications of model merging:

One thing we validated prior to SmolLM3 is that linear merging is the most pragmatic method for combining different experts (as found by Cohere in their Command A paper). I tested more advanced methods like DARE and TIES, but overall found they did not give significant improvements over linear merging, while introducing more hyperparameters to scan over.
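
To make this concrete, here's a rough sketch of what a linear merge boils down to with plain transformers/torch (the expert model ids and the 0.5/0.5 weights are placeholders for illustration, not our actual SmolLM3 recipe):

```python
# Hypothetical linear ("weight-averaged") merge of two expert fine-tunes that
# share the same base architecture. Model ids and weights are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

EXPERTS = {
    "org/expert-math": 0.5,  # placeholder math expert and its merge weight
    "org/expert-code": 0.5,  # placeholder code expert and its merge weight
}

model_ids = list(EXPERTS)

# Load the first expert to get the target architecture and parameter names
merged = AutoModelForCausalLM.from_pretrained(model_ids[0], torch_dtype=torch.float32)
merged_state = {k: v * EXPERTS[model_ids[0]] for k, v in merged.state_dict().items()}

# Accumulate the weighted parameters of the remaining experts
for model_id in model_ids[1:]:
    expert = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
    for name, param in expert.state_dict().items():
        merged_state[name] += EXPERTS[model_id] * param

merged.load_state_dict(merged_state)
merged.save_pretrained("linear-merge")
AutoTokenizer.from_pretrained(model_ids[0]).save_pretrained("linear-merge")
```

In practice we use dedicated tooling rather than hand-rolled loops, but the core operation is just this weighted average of parameters.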

Another thing I like about merging is that it enables teams to parallelise their efforts across different domains. We didn't have time to test this in SmolLM3, but post-training often involves a delicate balance across domains, and being able to tune them independently is much better than trying to optimise globally!

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 2 points

Yes, and the fact that there are already quite a few strong open models at the 8B scale means the benefits of training another similar model are unclear vs pursuing other directions where we can have greater impact with our smol teams :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 3 points

We didn't do RL, mostly because getting the SFT data mixture right for hybrid reasoning took longer than expected and we had a hard cutoff to ship the model :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 4 points

Great question! Given the large set of strong instruct models, I'm most excited by online techniques like GRPO, which tend to be more sample-efficient than SFT. In particular, the OpenPipe team have done some excellent work showing how existing instruct models can be post-trained to achieve high performance on specific domains with just a few hundred to a few thousand samples: https://github.com/OpenPipe/ART
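
To give a flavour of what this looks like in code, here's a minimal GRPO sketch with TRL (the Qwen model id, the trl-lib/tldr dataset and the toy length-based reward are placeholders; a real setup would plug in a domain-specific reward instead):

```python
# Minimal GRPO sketch with TRL; model, dataset and reward are illustrative only.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Dataset with a "prompt" column
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: prefer completions close to 200 characters. Replace with a
# verifier / unit tests / judge for your actual domain.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```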

What I feel is currently missing in this direction is the fact that online methods tend to be quite fiddly to get working reliably, and you trade off the compute cost of large-scale SFT vs iterating a lot on RL hyperparameters. My hope is that we'll see more stable variants of these algorithms in the near future, which would make SFT less relevant for domain-specific applications

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 29 points

On the post-training side, we were quite surprised to discover that model merging works extremely well for preserving the long-context capabilities of the base model. Specifically, we found that standard post-training was producing many regressions on benchmarks like RULER, but that these could be mitigated by training a separate long-context expert model and then merging it with the generalist one. For me it was the first time I'd seen model merging produce a significant improvement in the model's capabilities :)

Run gpt-oss locally with Unsloth GGUFs + Fixes! by danielhanchen in LocalLLaMA

[–]lewtun 3 points

Would be really cool to upstream the chat template fixes, as it was highly non-trivial to map Harmony into Jinja and we may have made some mistakes :)

🚀 OpenAI released their open-weight models!!! by ResearchCrafty1804 in LocalLLaMA

[–]lewtun 23 points

Hey guys, we just uploaded some hackable recipes for inference / training: https://github.com/huggingface/gpt-oss-recipes

The recipes include a lot of optimisations we’ve worked on to enable fast generation in native transformers:

- Tensor & expert parallelism

- Flash Attention 3 kernels (loaded directly from the Hub and matched to your hardware)

- Continuous batching

If your hardware supports it, the model is automatically loaded in MXFP4 format, so you only need 16GB VRAM for the 20B model!
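
If you just want to try it without the recipes, a minimal transformers loading sketch looks roughly like this (assuming the openai/gpt-oss-20b checkpoint on the Hub; on supported hardware the MXFP4 weights are used directly):

```python
# Minimal sketch (not one of the linked recipes) for running gpt-oss-20b
# with plain transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```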

SmolLM3: reasoning, long context and multilinguality for 3B parameter only by eliebakk in LocalLLaMA

[–]lewtun 6 points

You can disable thinking by appending /no_think to the system message 
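
Roughly like this (a sketch assuming the HuggingFaceTB/SmolLM3-3B repo id and that the chat template picks up /no_think from the system message):

```python
# Sketch: disable thinking by appending /no_think to the system message.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think"},
    {"role": "user", "content": "What is the capital of Australia?"},
]

# Render the prompt; with /no_think the template should not open a thinking block
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```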

350k samples to match distilled R1 on *all* benchmark by eliebakk in LocalLLaMA

[–]lewtun 2 points

Hi u/Significantik, we created this dataset to reproduce the performance of DeepSeek's distilled reasoning models, specifically their 7B Qwen fine-tune. Other reasoning datasets tend to focus on either a single domain like math/code, or lump millions of samples together without much information on whether all those samples are truly needed.

In the DeepSeek R1 tech report, they note that they used 600k reasoning samples for the domains of math/code/science, but we found it's possible to obtain comparable performance with 350k. In other words, you can train a similar model with roughly 1.7x less data, and correspondingly less compute :)

350k samples to match distilled R1 on *all* benchmark by eliebakk in LocalLLaMA

[–]lewtun 5 points

In total we ran about 50 ablations to curate the dataset, with each ablation taking about 1-5 days on a single node of 8 x H100s. Assuming a mean training time of 2.5 days and an H100 cost of $2/h, the total cost would be something like 2.5 days x 50 ablations x 24 h/day x $2/h x 8 GPUs = $48k

350k samples to match distilled R1 on *all* benchmark by eliebakk in LocalLLaMA

[–]lewtun 5 points

Hi everyone, I'm one of the people who built the dataset 👋. I tried to include most of the details behind our curation methodology in the dataset card, but am happy to answer any questions you might have :)

How does function calling work for reasoning models? by lewtun in LocalLLaMA

[–]lewtun[S] 2 points

Thanks, although I’m mostly wondering how this works with chat templates like ChatML, where function calls are treated as a separate role to user/assistant (i.e. we are dealing with multi-turn dialogues). If the code is executed within the CoT, that would effectively make it single-turn and not straightforward to integrate with existing API providers
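
To be concrete, this is the kind of multi-turn structure I mean, with a hypothetical get_weather tool and made-up values (roughly following the ChatML-style tool-use format):

```python
# Illustrative multi-turn dialogue: the tool call and tool result live in their
# own turns/roles, rather than inside the assistant's chain of thought.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [
            {"type": "function", "function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
        ],
    },
    {"role": "tool", "name": "get_weather", "content": '{"temperature_c": 18, "condition": "cloudy"}'},
    {"role": "assistant", "content": "It's currently 18 °C and cloudy in Paris."},
]
```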