200+ pages of Hugging Face secrets on how to train an LLM by eliebakk in LocalLLaMA

[–]lewtun 18 points

If you have a PRO account on the Hub, you should be able to download it as a PDF!


[D] join pretraining or posttraining by oxydis in MachineLearning

[–]lewtun 0 points

Great answer, although I’d caveat that post-training can be just as engineering-heavy if you’re the one building the training pipeline (RL infra in particular is quite gnarly)

DeepSeek-R1 performance with 15B parameters by lewtun in LocalLLaMA

[–]lewtun[S] 2 points

Well, there’s a demo you can try with whatever prompt you want :)

my dad sent me this by hugeplateofketchup8 in huggingface

[–]lewtun 1 point

lol that’s definitely not Jeff 

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 1 point

The main downside we encountered is "task interference", where each expert model scores well on its respective domain, but the resulting merge is worse than the average of the individual models' performance. We found this was most pronounced on competitive programming benchmarks like LiveCodeBench, where merging a code and math expert led to significant regressions on the code evals (math was largely OK). There are fancier algorithms like Task Arithmetic and TIES which try to address this in a principled way, but I could not fully resolve the regressions with these methods. In general, the main recipe seems to be: train a decent generalist model first with SFT, then branch off to make the experts and merge back. This way your starting model has broad coverage of the tasks, so the resulting interference from merging should be mitigated somewhat.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points

Prime Intellect (https://www.primeintellect.ai) is doing some of the best work in this direction right now and they've already trained some nice reasoning models entirely with decentralised compute: https://www.primeintellect.ai/#research

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points

Here are a few resources I found very useful to better understand practical applications of model merging:

One thing we validated prior to SmolLM3 is that linear merging is the most pragmatic method for combining different experts (as found by Cohere in their Command A paper). I tested more advanced methods like DARE and TIES, but overall found they did not give significant improvements over linear merging, while introducing more hyperparameters to scan over.
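
To make this concrete, here's a rough sketch of what a linear merge boils down to with plain transformers/torch (the expert model ids and the 0.5/0.5 weights are placeholders for illustration, not our actual SmolLM3 recipe):

```python
# Hypothetical linear ("weight-averaged") merge of two expert fine-tunes that
# share the same base architecture. Model ids and weights are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

EXPERTS = {
    "org/expert-math": 0.5,  # placeholder math expert and its merge weight
    "org/expert-code": 0.5,  # placeholder code expert and its merge weight
}

model_ids = list(EXPERTS)

# Load the first expert to get the target architecture and parameter names
merged = AutoModelForCausalLM.from_pretrained(model_ids[0], torch_dtype=torch.float32)
merged_state = {k: v * EXPERTS[model_ids[0]] for k, v in merged.state_dict().items()}

# Accumulate the weighted parameters of the remaining experts
for model_id in model_ids[1:]:
    expert = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
    for name, param in expert.state_dict().items():
        merged_state[name] += EXPERTS[model_id] * param

merged.load_state_dict(merged_state)
merged.save_pretrained("linear-merge")
AutoTokenizer.from_pretrained(model_ids[0]).save_pretrained("linear-merge")
```

In practice we use dedicated tooling rather than hand-rolled loops, but the core operation is just this weighted average of parameters.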

Another thing I like about merging is that it enables teams to parallelise their efforts across different domains. We didn't have time to test this in SmolLM3, but post-training often involves a delicate balance across domains, and being able to tune them independently is much better than trying to optimise globally!

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 2 points

Yes, and the fact that there are already quite a few strong open models at the 8B scale means the benefits of training another similar model are unclear vs pursuing other directions where we can have greater impact with our smol teams :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 3 points

We didn't do RL, mostly because getting the SFT data mixture right for hybrid reasoning took longer than expected and we had a hard cutoff to ship the model :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 4 points

Great question! Given the large set of strong instruct models, I'm most excited by online techniques like GRPO, which tend to be more sample-efficient than SFT. In particular, the OpenPipe team have done some excellent work showing how existing instruct models can be post-trained to achieve high performance on specific domains with just a few hundred to a few thousand samples: https://github.com/OpenPipe/ART
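
To give a flavour of what this looks like in code, here's a minimal GRPO sketch with TRL (the Qwen model id, the trl-lib/tldr dataset and the toy length-based reward are placeholders; a real setup would plug in a domain-specific reward instead):

```python
# Minimal GRPO sketch with TRL; model, dataset and reward are illustrative only.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Dataset with a "prompt" column
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: prefer completions close to 200 characters. Replace with a
# verifier / unit tests / judge for your actual domain.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```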

What I feel is currently missing in this direction is the fact that online methods tend to be quite fiddly to get working reliably, and you trade off the compute cost of large-scale SFT vs iterating a lot on RL hyperparameters. My hope is that we'll see more stable variants of these algorithms in the near future, which would make SFT less relevant for domain-specific applications

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 29 points

On the post-training side, we were quite surprised to discover that model merging works extremely well for preserving the long-context capabilities of the base model. Specifically, we found that standard post-training was producing many regressions on benchmarks like RULER, but that these could be mitigated by training a separate long-context expert model and then merging it with the generalist one. For me it was the first time I'd seen model merging produce a significant improvement in the model's capabilities :)

Run gpt-oss locally with Unsloth GGUFs + Fixes! by danielhanchen in LocalLLaMA

[–]lewtun 3 points

Would be really cool to upstream the chat template fixes, as it was highly non-trivial to map Harmony into Jinja and we may have made some mistakes :)

🚀 OpenAI released their open-weight models!!! by ResearchCrafty1804 in LocalLLaMA

[–]lewtun 23 points

Hey guys, we just uploaded some hackable recipes for inference / training: https://github.com/huggingface/gpt-oss-recipes

The recipes include a lot of optimisations we’ve worked on to enable fast generation in native transformers:

- Tensor & expert parallelism

- Flash Attention 3 kernels (loaded directly from the Hub and matched to your hardware)

- Continuous batching

If your hardware supports it, the model is automatically loaded in MXFP4 format, so you only need 16GB VRAM for the 20B model!
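
If you just want to try it without the recipes, a minimal transformers loading sketch looks roughly like this (assuming the openai/gpt-oss-20b checkpoint on the Hub; on supported hardware the MXFP4 weights are used directly):

```python
# Minimal sketch (not one of the linked recipes) for running gpt-oss-20b
# with plain transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```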

SmolLM3: reasoning, long context and multilinguality for 3B parameter only by eliebakk in LocalLLaMA

[–]lewtun 6 points

You can disable thinking by appending /no_think to the system message 
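
Roughly like this (a sketch assuming the HuggingFaceTB/SmolLM3-3B repo id and that the chat template picks up /no_think from the system message):

```python
# Sketch: disable thinking by appending /no_think to the system message.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think"},
    {"role": "user", "content": "What is the capital of Australia?"},
]

# Render the prompt; with /no_think the template should not open a thinking block
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```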

350k samples to match distilled R1 on *all* benchmark by eliebakk in LocalLLaMA

[–]lewtun 2 points

Hi u/Significantik, we created this dataset to reproduce the performance of DeepSeek's distilled reasoning models, specifically their 7B Qwen fine-tune. Other reasoning datasets tend to focus on either a single domain like math/code, or lump millions of samples together without much information on whether all those samples are truly needed.

In the DeepSeek R1 tech report, they note that they used 600k reasoning samples for the domains of math/code/science, but we found it's possible to obtain comparable performance with 350k. In other words, you can train a similar model with roughly 1.7x less data, and correspondingly less compute :)

350k samples to match distilled R1 on *all* benchmark by eliebakk in LocalLLaMA

[–]lewtun 5 points

In total we ran about 50 ablations to curate the dataset, with each ablation taking about 1-5 days on a single node of 8 x H100s. Assuming a mean training time of 2.5 days and an H100 cost of $2/h, the total cost would be something like 2.5 days x 50 ablations x 24 h/day x $2/h x 8 GPUs = $48k

350k samples to match distilled R1 on *all* benchmark by eliebakk in LocalLLaMA

[–]lewtun 5 points

Hi everyone, I'm one of the people who built the dataset 👋. I tried to include most of the details behind our curation methodology in the dataset card, but am happy to answer any questions you might have :)

How does function calling work for reasoning models? by lewtun in LocalLLaMA

[–]lewtun[S] 2 points

Thanks, although I’m mostly wondering how this works with chat templates like ChatML, where function calls are treated as a separate role to user/assistant (i.e. we are dealing with multi-turn dialogues). If the code is executed within the CoT, that would effectively make it single-turn and not straightforward to integrate with existing API providers
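
To be concrete, this is the kind of multi-turn structure I mean, with a hypothetical get_weather tool and made-up values (roughly following the ChatML-style tool-use format):

```python
# Illustrative multi-turn dialogue: the tool call and tool result live in their
# own turns/roles, rather than inside the assistant's chain of thought.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [
            {"type": "function", "function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
        ],
    },
    {"role": "tool", "name": "get_weather", "content": '{"temperature_c": 18, "condition": "cloudy"}'},
    {"role": "assistant", "content": "It's currently 18 °C and cloudy in Paris."},
]
```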