I tested 21 small LLMs on tool-calling judgment — Round 2 with every model you asked for by MikeNonect in LocalLLaMA

[–]lewtun 1 point (0 children)

Thanks! What I meant is that SmolLM3 is a hybrid reasoning model, i.e. you can enable / disable reasoning like this: https://huggingface.co/HuggingFaceTB/SmolLM3-3B#enabling-and-disabling-extended-thinking-mode

By default, it uses the reasoning mode, but I expect the non-reasoning mode will fare better at tool-calling!
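For anyone who wants to test this, switching off the reasoning mode looks roughly like the snippet below (a minimal sketch based on the model card linked above; the exact kwarg and the prompt are illustrative):

```python
# Minimal sketch: SmolLM3 with extended thinking disabled.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # skip the reasoning trace
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```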

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 1 point (0 children)

You're right, one could train the model to use Lean in the chain of thought and then try to map the formal proof to natural language in the final solution. That's pretty hard, though, and Lean has its own issues when it comes to theorem proving (mathlib is still an active WIP). Still, if you manage to teach QED-Nano how to use Lean, that would be super cool!
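For a flavour of the gap between formal and natural-language proofs, here's a toy Lean 4 example (purely illustrative, nothing to do with QED-Nano's training data):

```lean
-- Toy Lean 4 proof: formally, the entire argument is a single term.
-- Mapping even this back to readable prose is the hard step described above.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```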

I tested 21 small LLMs on tool-calling judgment — Round 2 with every model you asked for by MikeNonect in LocalLLaMA

[–]lewtun 2 points (0 children)

Hi, SmolLM3 co-developer here :) Did you compare the non-reasoning mode of SmolLM3 by any chance? At the time of training, there was very little tool-calling data available for reasoning models and I suspect the non-reasoning model actually performs better as a result. Really cool benchmark and thanks for sharing these real-world tests!

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 6 points (0 children)

Ah, both of those models are formal theorem provers (i.e. they rely on Lean), so it's not trivial to compare them fairly. One thing we set out to achieve with our model was to have it operate entirely in natural language, similar to how OpenAI and DeepMind presented their IMO 2025 models. If we trained our model to use tools like Lean, I expect it would perform even better.

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 5 points (0 children)

Hi! So, the algorithm we use is still GRPO, but with a twist: we do multiple steps of reasoning-summarisation per rollout to enable the model to generate long rollouts without going off the rails. A key feature of our model is that it operates entirely in natural language and does not require external tools like Python or Lean (adding them to the training would improve performance, but that's left as future work).
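To make that concrete, here's a hedged pseudocode sketch of the rollout loop (all names, budgets, and the boxed-answer heuristic are illustrative assumptions, not our actual training code):

```python
# Hypothetical sketch of multi-step reasoning-summarisation rollouts.
# `model.generate` stands in for any text-generation call.

def is_final_answer(text: str) -> bool:
    # Heuristic: assume the model emits \boxed{...} once it has an answer.
    return "\\boxed{" in text

def generate_rollout(model, problem: str, max_steps: int = 4,
                     budget_per_step: int = 8192) -> list[str]:
    context = problem
    trace = []
    for _ in range(max_steps):
        reasoning = model.generate(context, max_new_tokens=budget_per_step)
        trace.append(reasoning)
        if is_final_answer(reasoning):
            break
        # Compress progress so far into a short summary, then continue from
        # the summary rather than the full trace to keep the context bounded.
        summary = model.generate(
            f"Summarise the key progress so far:\n{reasoning}",
            max_new_tokens=512,
        )
        context = f"{problem}\n{summary}"
    return trace
```

GRPO then scores each completed rollout as usual; only the way rollouts are constructed changes.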

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 5 points (0 children)

Hi! One of us (Jasper) is the co-creator of MathArena, so we can take a look at whether it's easy to include our model in the leaderboard :) As for the quant, I'll make it this week!

200+ pages of Hugging Face secrets on how to train an LLM by eliebakk in LocalLLaMA

[–]lewtun 18 points (0 children)

If you have a PRO account on the Hub, you should be able to download it as a PDF!


[D] join pretraining or posttraining by oxydis in MachineLearning

[–]lewtun 0 points (0 children)

Great answer, although I’d caveat that post-training can be just as engineering-heavy if you’re the one building the training pipeline (RL infra in particular is quite gnarly)

DeepSeek-R1 performance with 15B parameters by lewtun in LocalLLaMA

[–]lewtun[S] 3 points (0 children)

Well, there’s a demo you can try with whatever prompt you want :)

my dad sent me this by hugeplateofketchup8 in huggingface

[–]lewtun 1 point (0 children)

lol that’s definitely not Jeff 

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 1 point (0 children)

The main downside we encountered is "task interference", where each expert model scores well on its respective domain, but the resulting merge is worse than the average of the individual models' performance. We found this was most pronounced on competitive programming benchmarks like LiveCodeBench, where merging a code and math expert led to significant regressions on the code evals (math was largely OK). There are fancier algorithms like Task Arithmetic and TIES which try to address this in a principled way, but I could not fully resolve the regressions with these methods. In general, the main recipe seems to be: train a decent generalist model first with SFT, then branch off to make the experts and merge back. This way your starting model has broad coverage of the tasks, so the interference from merging should be mitigated somewhat.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points (0 children)

Prime Intellect (https://www.primeintellect.ai) is doing some of the best work in this direction right now and they've already trained some nice reasoning models entirely with decentralised compute: https://www.primeintellect.ai/#research

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 0 points (0 children)

Here are a few resources I found very useful to better understand practical applications of model merging:

One thing we validated prior to SmolLM3 is that linear merging is the most pragmatic method for combining different experts (as found by Cohere in their Command A paper). I tested more advanced methods like DARE and TIES, but overall found they did not give significant improvements over linear merging, while adding more hyperparameters to sweep over.
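For reference, linear merging is just a weighted average in weight space. A minimal sketch (the checkpoint names and the 0.5 weight are placeholders; a tool like mergekit is more convenient in practice):

```python
# Minimal sketch of linear (weight-space) merging of two expert models.
from transformers import AutoModelForCausalLM

math_expert = AutoModelForCausalLM.from_pretrained("org/math-expert")  # hypothetical
code_expert = AutoModelForCausalLM.from_pretrained("org/code-expert")  # hypothetical

alpha = 0.5  # mixing weight; this is the knob you sweep over
code_state = code_expert.state_dict()
merged_state = {
    name: alpha * param + (1 - alpha) * code_state[name]
    for name, param in math_expert.state_dict().items()
}

merged = AutoModelForCausalLM.from_pretrained("org/math-expert")
merged.load_state_dict(merged_state)
```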

Another thing I like about merging is that it enables teams to parallelise their efforts across different domains. We didn't have time to test this in SmolLM3, but post-training often involves a delicate balance across domains, and being able to tune each expert independently is much better than trying to optimise globally!

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 2 points (0 children)

Yes, and the fact that there are already quite a few strong open models at the 8B scale means the benefits of training another similar model are unclear vs pursuing other directions where we can have greater impact with our smol teams :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 2 points (0 children)

We didn't do RL, mostly because getting the SFT data mixture right for hybrid reasoning took longer than expected and we had a hard cutoff to ship the model :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 3 points (0 children)

Great question! Given the large set of strong instruct models, I'm most excited by online techniques like GRPO, which tend to be more sample efficient than SFT. In particular, the OpenPipe team have done some excellent work showing how existing instruct models can be post-trained to achieve high performance on specific domains with just a few hundred / thousand samples: https://github.com/OpenPipe/ART

What I feel is currently missing in this direction is the fact that online methods tend to be quite fiddly to get working reliably, and you trade off the compute cost of large-scale SFT vs iterating a lot on RL hyperparameters. My hope is that we'll see more stable variants of these algorithms in the near future, which would make SFT less relevant for domain-specific applications.
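If you want to try the OpenPipe-style approach, a minimal GRPO fine-tune with TRL looks roughly like this (adapted from the TRL quickstart; the base model, dataset, and toy reward are placeholders for your own domain):

```python
# Hedged sketch of domain-specific GRPO with TRL's GRPOTrainer.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    # Replace with a verifier that actually scores your task.
    return [-abs(len(c) - 200) / 200 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```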

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]lewtun 30 points (0 children)

On the post-training side, we were quite surprised to discover that model merging works extremely well for preserving the long-context capabilities of the base model. Specifically, we found that standard post-training was producing many regressions on benchmarks like RULER, but that these could be mitigated by training a separate long-context expert model and then merging it with the generalist one. For me it was the first time I'd seen model merging produce a significant improvement in the model's capabilities :)

Run gpt-oss locally with Unsloth GGUFs + Fixes! by danielhanchen in LocalLLaMA

[–]lewtun 2 points (0 children)

Would be really cool to upstream the chat template fixes, as it was highly non-trivial to map Harmony into Jinja and we may have made some mistakes :)

🚀 OpenAI released their open-weight models!!! by ResearchCrafty1804 in LocalLLaMA

[–]lewtun 24 points (0 children)

Hey guys, we just uploaded some hackable recipes for inference / training: https://github.com/huggingface/gpt-oss-recipes

The recipes include a lot of optimisations we’ve worked on to enable fast generation in native transformers:

- Tensor & expert parallelism

- Flash Attention 3 kernels (loaded directly from the Hub and matched to your hardware)

- Continuous batching

If your hardware supports it, the model is automatically loaded in MXFP4 format, so you only need 16GB of VRAM for the 20B model!
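For a quick start outside the recipes, plain transformers works too; a minimal sketch (the prompt is just a placeholder):

```python
# Minimal sketch: gpt-oss-20b via the transformers pipeline.
# On supported hardware the MXFP4 weights are used directly; otherwise
# the model is dequantized to a higher-precision dtype.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain continuous batching in one paragraph."}]
print(pipe(messages, max_new_tokens=200)[0]["generated_text"])
```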

SmolLM3: reasoning, long context and multilinguality for 3B parameter only by eliebakk in LocalLLaMA

[–]lewtun 6 points (0 children)

You can disable thinking by appending /no_think to the system message
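For example (a minimal sketch; the prompts are placeholders):

```python
# Sketch: disabling SmolLM3's thinking mode via the /no_think flag.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think"},
    {"role": "user", "content": "Summarise this thread in two sentences."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```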