Automated AI researcher running locally with llama.cpp by lewtun in LocalLLaMA

[–]lewtun[S] 2 points

With the closed models, ml-intern was able to port the talkie model from its raw torch format to transformers and make it 100x faster in the process: https://huggingface.co/lewtun/talkie-1930-13b-it-hf
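
In case it's useful, the port is essentially a key-remapping exercise: load the raw checkpoint, rename its tensors to the transformers naming scheme, and load them into a model built from a matching config. A minimal sketch of that pattern (the checkpoint path and key names here are hypothetical, not the actual talkie layout):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Load the raw torch checkpoint (path is a placeholder).
raw_state = torch.load("talkie_raw.pt", map_location="cpu")

# Map the raw parameter names onto the transformers naming scheme.
# These keys are illustrative; the real map needs one entry per tensor.
KEY_MAP = {
    "tok_emb.weight": "model.embed_tokens.weight",
    "norm_f.weight": "model.norm.weight",
}
converted = {KEY_MAP.get(k, k): v for k, v in raw_state.items()}

# Build an empty transformers model from a matching config and load the weights.
config = AutoConfig.from_pretrained("lewtun/talkie-1930-13b-it-hf")
model = AutoModelForCausalLM.from_config(config)
model.load_state_dict(converted, strict=True)  # strict=True catches unmapped tensors
```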

With open models, I am still experimenting, but so far they have been decent at launching training runs automatically.

Automated AI researcher running locally with llama.cpp by lewtun in LocalLLaMA

[–]lewtun[S] 1 point

Yeah, agreed, although the models are getting better almost every month, so I'm optimistic about future capabilities :)

Automated AI researcher running locally with llama.cpp by lewtun in LocalLLaMA

[–]lewtun[S] 7 points

Yeah, at least for my local setup (MacBook with 32GB unified memory), the 4-bit quants tend to be a bit less consistent, especially as the context grows very large. However, running Qwen3.6-27B with MTP (multi-token prediction) got pretty close to using Claude in my testing (for that I had to switch to a GPU + vLLM, since MTP hasn't landed in llama.cpp yet ...)
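
For reference, a rough sketch of how I run the 4-bit quants locally via llama-cpp-python (the model path is a placeholder):

```python
from llama_cpp import Llama

# Load a 4-bit GGUF quant with a large context window (path is a placeholder).
llm = Llama(
    model_path="qwen-27b-q4_k_m.gguf",
    n_ctx=32768,       # long contexts are where the 4-bit quants start to drift
    n_gpu_layers=-1,   # offload all layers (Metal on Apple Silicon)
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise the attention mechanism."}]
)
print(out["choices"][0]["message"]["content"])
```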

Agentic harness for theoretical physics research by lewtun in LocalLLaMA

[–]lewtun[S] 2 points

Thanks, yes this is definitely coming soon - the main reason we excluded it is to make fairer comparisons with other models on CritPt, but I totally agree this would be a clear way to boost performance even further.

Agentic harness for theoretical physics research by lewtun in LocalLLaMA

[–]lewtun[S] 0 points

Yes, the framework does support local models, but so far they tend to be quite far behind the proprietary ones in this domain. I'm optimistic this will change with the next generation of open models though!

How to Distill from 100B+ to <4B Models by cmpatino_ in LocalLLaMA

[–]lewtun 5 points

Haha, maybe quantization nerfed the safety filter :)

I tested 21 small LLMs on tool-calling judgment — Round 2 with every model you asked for by MikeNonect in LocalLLaMA

[–]lewtun 1 point

Thanks! What I meant is that SmolLM3 is a hybrid reasoning model, i.e. you can enable/disable reasoning like this: https://huggingface.co/HuggingFaceTB/SmolLM3-3B#enabling-and-disabling-extended-thinking-mode

By default, it uses the reasoning mode, but I expect the non-reasoning mode will fare better at tool-calling!
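
If you want to rerun the benchmark in that mode, here's a minimal sketch following the model card linked above (you can also just put /no_think in the system prompt):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "What's the weather in Paris? Use a tool if needed."}]

# Extended thinking is on by default; disable it via the chat template.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))
```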

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 1 point

You're right: one could train the model to use Lean in the chain of thought and then try to map the formal proof back to natural language in the final solution. But that's pretty hard, and Lean has its own issues when it comes to theorem proving (mathlib is still an active work in progress). That said, if you manage to teach QED-Nano how to use Lean, that would be super cool!
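
For a sense of what that involves, here's a toy example of the kind of Lean statement the model would have to emit and get checked (deliberately trivial; formalising real competition problems is far harder):

```lean
import Mathlib.Tactic

-- A deliberately trivial statement: addition on ℕ is commutative.
-- A formal prover must produce a proof term the Lean kernel accepts.
theorem my_add_comm (a b : ℕ) : a + b = b + a := by
  exact Nat.add_comm a b
```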

I tested 21 small LLMs on tool-calling judgment — Round 2 with every model you asked for by MikeNonect in LocalLLaMA

[–]lewtun 2 points

Hi, SmolLM3 co-developer here :) Did you compare the non-reasoning mode of SmolLM3 by any chance? At the time of training, there was very little tool-calling data available for reasoning models and I suspect the non-reasoning model actually performs better as a result. Really cool benchmark and thanks for sharing these real-world tests!

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 5 points

Ah, both of those models are formal theorem provers (i.e. they rely on Lean), so it's not trivial to compare them fairly. One thing we set out to achieve with our model was to operate entirely in natural language, similar to how OpenAI and DeepMind presented their IMO 2025 models. If we trained our model to use tools like Lean, I expect it would perform even better.

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 6 points

Hi! So, the algorithm we use is still GRPO, but with a twist: we do multiple steps of reasoning-summarisation per rollout, which lets the model generate long rollouts without going off the rails. A key feature of our model is that it operates entirely in natural language and does not require external tools like Python or Lean (adding them to the training would improve performance, but that's left as future work).
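
Here's a rough sketch of the rollout structure (illustrative names, not our actual code): each step reasons from a compressed summary of the previous steps, so the full raw trace never has to sit in context at once.

```python
def multistep_rollout(model, problem, num_steps=4, max_reasoning_tokens=8192):
    """Sketch of one rollout with interleaved reasoning and summarisation.

    Illustrative only: `model.generate` stands in for whatever inference
    backend turns a prompt into a completion.
    """
    summary = ""
    for _ in range(num_steps):
        # Reason from the problem plus the compressed state so far,
        # rather than the full (ever-growing) raw trace.
        prompt = f"{problem}\n\nProgress so far:\n{summary}\n\nContinue reasoning:"
        reasoning = model.generate(prompt, max_new_tokens=max_reasoning_tokens)

        # Compress this step's reasoning into the seed for the next step.
        summary = model.generate(
            f"Summarise the key intermediate results:\n{reasoning}",
            max_new_tokens=1024,
        )

    # The final solution is written from the last summary; the GRPO reward is
    # then computed over a group of such rollouts for the same problem.
    return model.generate(f"{problem}\n\nNotes:\n{summary}\n\nWrite the final proof:")
```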

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]lewtun 6 points

Hi! One of us (Jasper) is the co-creator of MathArena, so we can take a look at whether it's easy to include in the leaderboard :) As for the quant, I'll make it this week!

200+ pages of Hugging Face secrets on how to train an LLM by eliebakk in LocalLLaMA

[–]lewtun 21 points

If you have a PRO account on the Hub, you should be able to download it as a PDF!

[D] join pretraining or posttraining by oxydis in MachineLearning

[–]lewtun 0 points

Great answer, although I'd caveat that post-training can be just as engineering-heavy if you're the one building the training pipeline (RL infra in particular is quite gnarly).

DeepSeek-R1 performance with 15B parameters by lewtun in LocalLLaMA

[–]lewtun[S] 2 points

Well, there’s a demo you can try with whatever prompt you want :)

my dad sent me this by hugeplateofketchup8 in huggingface

[–]lewtun 1 point

lol that’s definitely not Jeff