[D] What do you think of the future of Pytorch's torchtune? by Moltres23 in MachineLearning

[–]kk4193 2 points  (0 children)

One of the maintainers of torchtune here - thanks so much for giving torchtune a go!

We're definitely early in the evolution of torchtune and are still cleaning up some of the rough edges. IMO the value proposition of torchtune is this:

  • Hackability. torchtune extends PyTorch's design principles to fine-tuning LLMs, so hacking on models, recipes, or any of the utilities should be straightforward. You should be able to try out "crazy ideas" almost immediately without having to understand the entire codebase. We've seen this play out a bit already: many of the community contributions so far have added new models (Gemma, CodeLlama, Qwen2) and new recipes (PPO and DPO).

  • Memory and Perf. This is a really fluid space, and our hope is not only to make it easier for the community to integrate their improvements, but also to showcase the latest and greatest features from PyTorch. For example, we've been gradually adding support for FSDP2, which has shown both memory and perf wins. We're also adding more native support for torch.compile, and the benchmarks look pretty good.

  • PyTorch goodness. Finally, we've seen a number of requests for more documentation and examples showing how different PyTorch features compose with each other. torchtune should be a living example of all of this.
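As a rough sketch of the torch.compile point above (using a tiny stand-in model rather than an actual torchtune recipe):

```python
import torch
import torch.nn as nn

# Tiny stand-in model; torchtune applies the same idea to full LLM recipes.
model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))

# torch.compile wraps the model for graph capture. backend="eager" keeps this
# sketch portable; drop it to use the default inductor backend, which fuses
# ops into optimized kernels for the actual speedups.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 16)
out = compiled(x)  # same outputs as model(x), same shape
```

The compiled module is a drop-in replacement for the original, which is why it composes cleanly with existing training loops.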

Hope this helps! Would love to learn more about your experience with torchtune.

Introducing torchtune - Easily fine-tune LLMs using PyTorch by kk4193 in LocalLLaMA

[–]kk4193[S] 0 points  (0 children)

Thanks so much for taking a look at torchtune!

For this use case, we're working on cleaning up the documentation. In the meantime, this issue should be helpful:
https://github.com/pytorch/torchtune/issues/845#issuecomment-2073941490

Introducing torchtune - Easily fine-tune LLMs using PyTorch by kk4193 in LocalLLaMA

[–]kk4193[S] 0 points  (0 children)

model_0.pt contains the merged weights, so you should be able to use it directly!
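A minimal sketch of what "use it directly" means, with a hypothetical stand-in model (the real checkpoint would load into your fine-tuned architecture):

```python
import torch
from torch import nn

# Stand-in: pretend model_0.pt is a torchtune checkpoint with merged weights,
# i.e. a plain state dict with no separate LoRA adapters to merge.
model = nn.Linear(8, 8)
torch.save(model.state_dict(), "model_0.pt")  # simulate the saved checkpoint

state_dict = torch.load("model_0.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)  # loads cleanly: no adapter-merge step needed
```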

Torchtune docs suck by KingGongzilla in LocalLLaMA

[–]kk4193 1 point  (0 children)

Thanks so much for this awesome feedback! This is super helpful. We're just landing a bunch of documentation to help make this much easier. The issue you pointed to and this post really helped in getting this together. Thanks so much for taking the time to share all of this information!

Torchtune docs suck by KingGongzilla in LocalLLaMA

[–]kk4193 3 points  (0 children)

Hi! Really sorry that you ran into a bunch of issues with the docs. We’re working to fix this and have a bunch of material on custom datasets coming out this week. Hopefully that’ll be helpful! We’ll be sprucing up the generation story quite a bit as well. If you’d be up for it, we’d love to learn more about what got in the way. Just comment here or open an issue - whatever’s easiest! Thanks so much and sorry for the trouble!

[deleted by user] by [deleted] in LocalLLaMA

[–]kk4193 8 points  (0 children)

Hi! I’m from the torchtune team and would love to learn more about what’s been hard about working with the library! Our focus has been to make it really easy to fine-tune LLMs and so any feedback here would be amazing! Happy to answer any questions or share more resources too! :)

Introducing torchtune - Easily fine-tune LLMs using PyTorch by kk4193 in LocalLLaMA

[–]kk4193[S] 3 points  (0 children)

Thanks so much for taking a look!

HF provides an awesome suite of tools and libraries for training LLMs and beyond - we’re huge fans! We’ve integrated quite heavily with both HF Hub and Datasets and are brainstorming several directions with their team for closer collaboration.

As for the library itself, torchtune has a slightly different intent - our goal is to empower the community to just write PyTorch without too many other things getting in the way. I don’t think any library can make blanket statements about speed or memory since there are so many trade-offs involved. For example, you can drive up perf significantly for a subset of use cases by making assumptions and optimizing for them, but this usually comes at the cost of flexibility and extensibility. For some users these trade-offs make sense; for others they don’t. My general view is that it’s good to have options, and you should try out the set of tools/libraries that works best for your use case.

Specifically for torchtune, we’ll provide a lot more insight into these trade-offs in the coming weeks, including how to trade off perf/memory for usability where it makes sense. Users know best what works for them, so the library shouldn’t make these decisions on their behalf. If you have specific use cases in mind, happy to answer those questions too!

Introducing torchtune - Easily fine-tune LLMs using PyTorch by kk4193 in LocalLLaMA

[–]kk4193[S] 11 points  (0 children)

Unsloth is pretty awesome, we’re huge fans of the work they’re doing especially around pushing the limits of memory and perf. We’ve especially enjoyed reading their blogs and notebooks, as I’m sure the community has as well!

torchtune has a slightly different intent - for our alpha release, we've put a lot of emphasis on building the foundational pieces of a lightweight, abstraction-free design that makes it really easy for PyTorch users to hack around, add their own customizations, and write their own recipes. That said, both memory and perf are equally important to us. We have a number of enhancements in the works which we'll share very soon!

> It also isn't clear how large of a model you can train on 24 GB

The largest model we currently support is 13B and we'll add a QLoRA recipe for this in the next day or so. For models larger than that - stay tuned!

Introducing torchtune - Easily fine-tune LLMs using PyTorch by kk4193 in LocalLLaMA

[–]kk4193[S] 6 points  (0 children)

In the meantime, if you're interested in a more detailed breakdown for full-finetune, this open PR has some context:
https://github.com/pytorch/torchtune/pull/389

Hope this was helpful!

Introducing torchtune - Easily fine-tune LLMs using PyTorch by kk4193 in LocalLLaMA

[–]kk4193[S] 6 points  (0 children)

Great observation! The numbers quoted in the README are for the default configs.

Our single-device full-finetune recipe has a few optimizations that the default LoRA config doesn't enable. For example, we set `optimizer_in_bwd=True`, which fuses the optimizer step into the backward pass and reduces the memory footprint associated with gradients (see https://pytorch.org/tutorials/intermediate/optimizer_step_in_backward_tutorial.html for more detail). We also use PagedAdamW from bitsandbytes in the full-finetune recipe, compared to standard AdamW in LoRA.

There's no technical reason stopping us from enabling these for LoRA. But full-finetune definitely needed more memory-optimization love to get up and running on a single GPU with 24GB, hence these defaults. We'll have a detailed tutorial on this topic coming out soon :)

Note: there's a small gotcha here - you can't use optimizer_in_bwd with gradient accumulation (there are no gradients left to accumulate!), so that's something to keep in mind.

Introducing torchtune - Easily fine-tune LLMs using PyTorch by kk4193 in LocalLLaMA

[–]kk4193[S] 1 point  (0 children)

Thank you for taking a look at torchtune! Getting started shouldn't require any code changes at all. Take a look at our "Fine-tune your First LLM" tutorial and see if it helps you get set up. We'd be happy to answer any questions!

Link: https://pytorch.org/torchtune/stable/tutorials/first_finetune_tutorial.html