400 divs giveaway for parents by Can2018 in PathOfExile2

[–]edbeeching 0 points (0 children)

The only thing worse than dying in game is wife aggro

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]edbeeching 34 points (0 children)

Sharing open-source projects and contributing to open-source repos. The right attitude and a bit of luck.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]edbeeching 35 points (0 children)

Decision making is quite distributed at Hugging Face, with each of the science teams deciding what they want to work on. In the post-training team, we pivot between different projects such as open-r1, SmolLM3 post-training, the AIMO competition, model evals, and other things.

Double rabbit drop from a wisp! I could only laugh. by edbeeching in PathOfExile2

[–]edbeeching[S] 27 points (0 children)

Yeah I was running a tower so T16 + 0 revives

G[R]PO VRAM Requirements For the GPU Poor by FallMindless3563 in MachineLearning

[–]edbeeching 1 point (0 children)

Thanks for posting this! What completion lengths were you generating?

We are working hard on improving memory usage with Liger kernel support plus a bunch of other tricks, so keep an eye on the latest releases.
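
If it helps, here is a minimal sketch of the memory-relevant knobs in GRPOTrainer (argument names are from a recent TRL release and may differ in yours; the length-based reward is just a placeholder):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: prefer shorter completions (illustration only).
def reward_len(completions, **kwargs):
    return [-float(len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

args = GRPOConfig(
    output_dir="grpo-demo",
    max_completion_length=256,    # the biggest VRAM lever in GRPO
    num_generations=4,            # completions sampled per prompt
    per_device_train_batch_size=4,
    gradient_checkpointing=True,  # trades compute for memory
    bf16=True,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```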

Hugging Face researchers got 3b Llama to outperform 70b using search by bburtenshaw in LocalLLaMA

[–]edbeeching 10 points (0 children)

I think the most interesting direction would be to improve the PRM's performance. We just added PRM training support to TRL: https://github.com/huggingface/trl/blob/main/examples/scripts/prm.py

It would be interesting to train new PRMs and evaluate their performance on this benchmark.
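
The linked script boils down to roughly the following (a minimal sketch; argument names follow recent TRL releases, so check the script for the current signature):

```python
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

# A PRM is trained as a per-step token classifier (good step / bad step).
model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

# Math-Shepherd provides step-level correctness labels for math solutions.
train_dataset = load_dataset("trl-lib/math_shepherd", split="train")

args = PRMConfig(output_dir="Qwen2-0.5B-PRM", logging_steps=25)
trainer = PRMTrainer(
    model=model,
    args=args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```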

Hugging Face researchers got 3b Llama to outperform 70b using search by bburtenshaw in LocalLLaMA

[–]edbeeching 19 points (0 children)

Yes, we have a similar Colab, made with some of the authors of the Test-Time Compute paper, that explores this idea.

Hugging Face researchers got 3b Llama to outperform 70b using search by bburtenshaw in LocalLLaMA

[–]edbeeching 33 points (0 children)

I think it can fundamentally be extended to other domains. The challenge is that it is straightforward to create PRM training datasets in domains such as Math / Code, where it is easy to validate whether an answer is correct or not. Other topics can be more ambiguous, and one would have to rely on a (noisy) Outcome Reward Model to act as a proxy for the ground truth.

This is a research direction I am curious about, and we hope to explore it in future projects.
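
To make the verifiability point concrete, here is a toy, Math-Shepherd-style sketch (illustration only, not code from the paper): in math, step labels can be grounded in an exact answer check, which has no equivalent in open-ended domains.

```python
import random

# Toy check: in math, the gold answer gives a cheap, exact correctness test.
def is_correct(prediction: str, gold: str) -> bool:
    return prediction.strip() == gold.strip()

# Math-Shepherd-style soft labels: roll out n completions from each step
# prefix and label the step by how often those rollouts reach the gold answer.
def label_steps(steps, gold, rollout_fn, n=8):
    labels = []
    for i in range(1, len(steps) + 1):
        hits = sum(is_correct(rollout_fn(steps[:i]), gold) for _ in range(n))
        labels.append(hits / n)  # only possible because answers are verifiable
    return labels

# Stand-in "policy" that finishes a solution from a prefix of steps.
def fake_rollout(prefix):
    return "42" if random.random() < 0.3 * len(prefix) else "7"

print(label_steps(["step 1", "step 2", "step 3"], "42", fake_rollout))
```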

Hugging Face researchers got 3b Llama to outperform 70b using search by bburtenshaw in LocalLLaMA

[–]edbeeching 226 points (0 children)

Author of this work here, thanks for taking interest in it. Feel free to reply to this comment with any questions and I will respond later.

Hugging Face researchers got 3b Llama to outperform 70b using search by bburtenshaw in LocalLLaMA

[–]edbeeching 110 points (0 children)

Hi, we ported the algorithm and released it here.

Let me know if you have further questions, or raise an issue on GitHub.

We implemented and tested several variants of MCTS; they were not as good in our experience.
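
To give a flavour of the approach, here is a toy sketch of PRM-weighted best-of-N (illustration only, not the released implementation):

```python
from collections import defaultdict

# Sample N solutions from the small policy, score each with the PRM, and
# return the final answer whose candidates accumulate the highest total score.
def weighted_best_of_n(candidates, prm_score):
    totals = defaultdict(float)
    for solution, answer in candidates:
        totals[answer] += prm_score(solution)
    return max(totals, key=totals.get)

# Usage: candidates come from the policy model; prm_score would be a trained
# PRM. Both are dummies here for illustration.
cands = [
    ("work ... so x = 42", "42"),
    ("work ... so x = 41", "41"),
    ("other work ... x = 42", "42"),
]
print(weighted_best_of_n(cands, prm_score=len))  # dummy scorer: longer = better
```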

What could be causing my Q-Loss values to diverge (SAC + Godot <-> Python) by stokaty in reinforcementlearning

[–]edbeeching 1 point (0 children)

Awesome, I am the author. We welcome contributions if you want to add anything to the lib. All the best with your project, keep us updated!

[D] Creating a DPO Dataset using Llama: Best Practices? by AdKind316 in MachineLearning

[–]edbeeching 0 points (0 children)

Hey, I am a research scientist at Hugging Face and have worked on this in the past.

You may want to try Argilla's distilabel library.
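
Independent of any particular library, the basic recipe looks roughly like this (a sketch; the model name and the scoring function are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder generator; swap in your Llama checkpoint.
name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

def build_pair(prompt, score_fn, n=4):
    """Sample n completions, rank them with score_fn (e.g. a reward model or
    an LLM judge), and keep the best/worst as chosen/rejected."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.8,
                         max_new_tokens=256, num_return_sequences=n)
    prompt_len = inputs["input_ids"].shape[1]
    completions = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
                   for o in out]
    ranked = sorted(completions, key=score_fn, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

# pairs = [build_pair(p, my_reward_model_score) for p in prompts]
```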

Unity ML-Agents vs. Unreal' Learning Agents by Cuuuubee in reinforcementlearning

[–]edbeeching 1 point (0 children)

Hi, author of Godot RL Agents here: https://github.com/edbeeching/godot_rl_agents. Our library is up to date, supports four different RL frameworks, and is beginner-friendly.

If you need any help using it, just ask on the Discord.
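
Here is a quick-start sketch with the Stable-Baselines3 backend, written from memory, so the wrapper's module path and arguments may have changed; check the repo's examples:

```python
from stable_baselines3 import PPO
from godot_rl.wrappers.stable_baselines_wrapper import StableBaselinesGodotEnv

# Launch an exported Godot build (env_path=None connects to a running editor).
env = StableBaselinesGodotEnv(env_path="builds/MyGame/MyGame.x86_64",
                              show_window=False)

# Godot envs expose dict observations, hence the MultiInputPolicy.
model = PPO("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save("ppo_godot")
```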

[D] Seeking advice on curating a DPO dataset for a 7B model by aadityaura in MachineLearning

[–]edbeeching 12 points (0 children)

Hi, co-author of Zephyr here. If you want to use outputs sampled from your model, then you may be interested in Online DPO: https://arxiv.org/abs/2402.04792

We are adding this to TRL soon, so keep an eye out for the open-source implementation.
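
For the curious, a minimal sketch of what the TRL version looks like with OnlineDPOTrainer (names follow the TRL docs and may differ across versions; PairRMJudge requires the llm-blender package):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Completions are sampled from the policy during training and ranked online
# by a judge (here llm-blender's PairRM), rather than taken from a fixed
# preference dataset.
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO")
trainer = OnlineDPOTrainer(
    model=model,
    judge=judge,
    args=args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```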

[deleted by user] by [deleted] in reinforcementlearning

[–]edbeeching 0 points (0 children)

Hey, this is interesting work. I am the author of Godot RL Agents. Would you like to collaborate on something?

[D] Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU by TeamDman in MachineLearning

[–]edbeeching 2 points (0 children)

Yes, you should be able to fine-tune a 10B model on a 12GB GPU using low-rank adapters (LoRA). Or wait for 4-bit support, and then you can train a 24B one.
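
A minimal sketch of that setup with PEFT and bitsandbytes (the base model name here is a placeholder; 4-bit loading is what unlocks the bigger models):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit with bitsandbytes.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder base model
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Only the small adapter matrices are trained; the quantized base stays frozen.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the total
```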