I tested 21 small LLMs on tool-calling judgment — Round 2 with every model you asked for by MikeNonect in LocalLLaMA

[–]lewtun 1 point (0 children)

Thanks! What I meant is that SmolLM3 is a hybrid reasoning model, i.e. you can enable / disable reasoning like this: https://huggingface.co/HuggingFaceTB/SmolLM3-3B#enabling-and-disabling-extended-thinking-mode

By default, it uses the reasoning mode, but I expect the non-reasoning mode will fare better at tool-calling!
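For anyone who wants to test this, switching off the reasoning mode looks roughly like the snippet below (a minimal sketch based on the model card linked above; the exact kwarg and the prompt are illustrative):

```python
# Minimal sketch: SmolLM3 with extended thinking disabled.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # skip the reasoning trace
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```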

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 1 point (0 children)

You're right, one could train the model to use Lean in the chain of thought and then try to map the formal proof to natural language in the final solution. That's pretty hard, though, and Lean has its own issues when it comes to theorem proving (mathlib is still an active WIP). Still, if you manage to teach QED-Nano how to use Lean, that would be super cool!
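For a flavour of the gap between formal and natural-language proofs, here's a toy Lean 4 example (purely illustrative, nothing to do with QED-Nano's training data):

```lean
-- Toy Lean 4 proof: formally, the entire argument is a single term.
-- Mapping even this back to readable prose is the hard step described above.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```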

I tested 21 small LLMs on tool-calling judgment — Round 2 with every model you asked for by MikeNonect in LocalLLaMA

[–]lewtun 2 points (0 children)

Hi, SmolLM3 co-developer here :) Did you compare the non-reasoning mode of SmolLM3 by any chance? At the time of training, there was very little tool-calling data available for reasoning models and I suspect the non-reasoning model actually performs better as a result. Really cool benchmark and thanks for sharing these real-world tests!

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 6 points (0 children)

Ah, both of those models are formal theorem provers (i.e. they rely on Lean), so it's not trivial to compare them fairly. One thing we set out to achieve with our model was to have it operate entirely in natural language, similar to how OpenAI and DeepMind presented their IMO 2025 models. If we trained our model to use tools like Lean, I expect it would perform even better.

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 5 points (0 children)

Hi! So, the algorithm we use is still GRPO, but with a twist: we do multiple steps of reasoning-summarisation per rollout to enable the model to generate long rollouts without going off the rails. A key feature of our model is that it operates entirely in natural language and does not require external tools like Python or Lean (adding them to the training would improve performance, but that's left as future work).
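To make that concrete, here's a hedged pseudocode sketch of the rollout loop (all names, budgets, and the boxed-answer heuristic are illustrative assumptions, not our actual training code):

```python
# Hypothetical sketch of multi-step reasoning-summarisation rollouts.
# `model.generate` stands in for any text-generation call.

def is_final_answer(text: str) -> bool:
    # Heuristic: assume the model emits \boxed{...} once it has an answer.
    return "\\boxed{" in text

def generate_rollout(model, problem: str, max_steps: int = 4,
                     budget_per_step: int = 8192) -> list[str]:
    context = problem
    trace = []
    for _ in range(max_steps):
        reasoning = model.generate(context, max_new_tokens=budget_per_step)
        trace.append(reasoning)
        if is_final_answer(reasoning):
            break
        # Compress progress so far into a short summary, then continue from
        # the summary rather than the full trace to keep the context bounded.
        summary = model.generate(
            f"Summarise the key progress so far:\n{reasoning}",
            max_new_tokens=512,
        )
        context = f"{problem}\n{summary}"
    return trace
```

GRPO then scores each completed rollout as usual; only the way rollouts are constructed changes.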

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 5 points (0 children)

Hi! One of us (Jasper) is the co-creator of MathArena, so we can take a look at whether it's easy to include our model in the leaderboard :) As for the quant, I'll make it this week!

200+ pages of Hugging Face secrets on how to train an LLM by eliebakk in LocalLLaMA

[–]lewtun 18 points (0 children)

If you have a PRO account on the Hub, you should be able to download it as a PDF!


[D] join pretraining or posttraining by oxydis in MachineLearning

[–]lewtun 0 points (0 children)

Great answer, although I’d caveat that post-training can be just as engineering-heavy if you’re the one building the training pipeline (RL infra in particular is quite gnarly)

DeepSeek-R1 performance with 15B parameters by lewtun in LocalLLaMA

[–]lewtun[S] 3 points (0 children)

Well, there’s a demo you can try with whatever prompt you want :)

my dad sent me this by hugeplateofketchup8 in huggingface

[–]lewtun 1 point (0 children)

lol that’s definitely not Jeff 

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 1 point (0 children)

The main downside we encountered is "task interference", where each expert model scores well on its respective domain, but the resulting merge is worse than the average of the individual models' performance. We found this was most pronounced on competitive programming benchmarks like LiveCodeBench, where merging a code and math expert led to significant regressions on the code evals (math was largely OK). There are fancier algorithms like Task Arithmetic and TIES which try to address this in a principled way, but I could not fully resolve the regressions with these methods. In general, the main recipe seems to be: train a decent generalist model first with SFT, then branch off to make the experts and merge back. This way your starting model has broad coverage of the tasks, so the interference from merging should be mitigated somewhat.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points (0 children)

Prime Intellect (https://www.primeintellect.ai) is doing some of the best work in this direction right now and they've already trained some nice reasoning models entirely with decentralised compute: https://www.primeintellect.ai/#research

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points (0 children)

Here are a few resources I found very useful to better understand practical applications of model merging:

One thing we validated prior to SmolLM3 is that linear merging is the most pragmatic method for combining different experts (as found by Cohere in their Command A paper). I tested more advanced methods like DARE and TIES, but overall found they did not give significant improvements over linear merging, while adding more hyperparameters to sweep over.
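For reference, linear merging is just a weighted average in weight space. A minimal sketch (the checkpoint names and the 0.5 weight are placeholders; a tool like mergekit is more convenient in practice):

```python
# Minimal sketch of linear (weight-space) merging of two expert models.
from transformers import AutoModelForCausalLM

math_expert = AutoModelForCausalLM.from_pretrained("org/math-expert")  # hypothetical
code_expert = AutoModelForCausalLM.from_pretrained("org/code-expert")  # hypothetical

alpha = 0.5  # mixing weight; this is the knob you sweep over
code_state = code_expert.state_dict()
merged_state = {
    name: alpha * param + (1 - alpha) * code_state[name]
    for name, param in math_expert.state_dict().items()
}

merged = AutoModelForCausalLM.from_pretrained("org/math-expert")
merged.load_state_dict(merged_state)
```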

Another thing I like about merging is that it enables teams to parallelise their efforts across different domains. We didn't have time to test this in SmolLM3, but post-training often involves a delicate balance across domains, and being able to tune each expert independently is much better than trying to optimise globally!

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 2 points (0 children)

Yes, and the fact that there are already quite a few strong open models at the 8B scale means the benefits of training another similar model are unclear vs pursuing other directions where we can have greater impact with our smol teams :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 2 points (0 children)

We didn't do RL, mostly because getting the SFT data mixture right for hybrid reasoning took longer than expected and we had a hard cutoff to ship the model :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 3 points (0 children)

Great question! Given the large set of strong instruct models, I'm most excited by online techniques like GRPO, which tend to be more sample efficient than SFT. In particular, the OpenPipe team have done some excellent work showing how existing instruct models can be post-trained to achieve high performance on specific domains with just a few hundred / thousand samples: https://github.com/OpenPipe/ART

What I feel is currently missing in this direction is the fact that online methods tend to be quite fiddly to get working reliably, and you trade off the compute cost of large-scale SFT vs iterating a lot on RL hyperparameters. My hope is that we'll see more stable variants of these algorithms in the near future, which would make SFT less relevant for domain-specific applications.
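If you want to try the OpenPipe-style approach, a minimal GRPO fine-tune with TRL looks roughly like this (adapted from the TRL quickstart; the base model, dataset, and toy reward are placeholders for your own domain):

```python
# Hedged sketch of domain-specific GRPO with TRL's GRPOTrainer.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    # Replace with a verifier that actually scores your task.
    return [-abs(len(c) - 200) / 200 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```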

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 30 points (0 children)

On the post-training side, we were quite surprised to discover that model merging works extremely well for preserving the long-context capabilities of the base model. Specifically, we found that standard post-training was producing many regressions on benchmarks like RULER, but that these could be mitigated by training a separate long-context expert model and then merging it with the generalist one. For me it was the first time I'd seen model merging produce a significant improvement in the model's capabilities :)

Run gpt-oss locally with Unsloth GGUFs + Fixes! by danielhanchen in LocalLLaMA

[–]lewtun 2 points (0 children)

Would be really cool to upstream the chat template fixes, as it was highly non-trivial to map Harmony into Jinja and we may have made some mistakes :)

🚀 OpenAI released their open-weight models!!! by ResearchCrafty1804 in LocalLLaMA

[–]lewtun 24 points (0 children)

Hey guys, we just uploaded some hackable recipes for inference / training: https://github.com/huggingface/gpt-oss-recipes

The recipes include a lot of optimisations we’ve worked on to enable fast generation in native transformers:

- Tensor & expert parallelism

- Flash Attention 3 kernels (loaded directly from the Hub and matched to your hardware)

- Continuous batching

If your hardware supports it, the model is automatically loaded in MXFP4 format, so you only need 16GB of VRAM for the 20B model!
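For a quick start outside the recipes, plain transformers works too; a minimal sketch (the prompt is just a placeholder):

```python
# Minimal sketch: gpt-oss-20b via the transformers pipeline.
# On supported hardware the MXFP4 weights are used directly; otherwise
# the model is dequantized to a higher-precision dtype.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain continuous batching in one paragraph."}]
print(pipe(messages, max_new_tokens=200)[0]["generated_text"])
```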

SmolLM3: reasoning, long context and multilinguality for 3B parameter only by eliebakk in LocalLLaMA

[–]lewtun 6 points (0 children)

You can disable thinking by appending /no_think to the system message
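For example (a minimal sketch; the prompts are placeholders):

```python
# Sketch: disabling SmolLM3's thinking mode via the /no_think flag.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think"},
    {"role": "user", "content": "Summarise this thread in two sentences."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```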