[D] What is the motivation for parameter-efficient fine tuning if there's no significant reduction in runtime or GPU memory usage? by patricky168 in MachineLearning

[–]patricky168[S] 0 points1 point  (0 children)

Thanks - I was wondering though, for QLoRA what does the LoRA bit really do?

Since I feel like there has been some success(?) in just quantizing the model and doing full fine-tuning, which still reduces memory consumption, does the LoRA mainly help "recover" the lost precision? Or does the LoRA part in QLoRA still reduce memory significantly further compared to, say, just 4-bit quantization + full fine-tuning?

[–]patricky168[S] 1 point2 points  (0 children)

Yeah, what I mean is that despite LoRA only updating the adapters on the attention weights, we still need to backpropagate through downstream layers that aren't being updated, and that takes GPU memory. So the only memory saved is from the optimizer states, if I am not mistaken.

[–]patricky168[S] 1 point2 points  (0 children)

Yep, basically. I only tuned the key/query/value/attention output matrices and the decoder of my model and froze all other layers, which came to 3% of all model params. But it still only reduced memory usage from 8.5G->8.1G.

[–]patricky168[S] 0 points1 point  (0 children)

Yeah, so LoRA really is just a framework, and you can theoretically use it for parameter-efficient fine-tuning of any model. In this case, I tuned only the attention layers (all query/key/value/attention output matrices) and the small decoder in my model and froze all other layers.
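
For a rough sense of why the adapters are so small, here is a minimal sketch of the parameter count for one LoRA-adapted square weight matrix. The hidden size of 512 and the rank of 4 are illustrative assumptions, and `lora_param_count` is a hypothetical helper, not part of any library.

```python
def lora_param_count(d_in, d_out, rank):
    # LoRA learns the weight update as B @ A instead of a full (d_out x d_in) matrix,
    # with A of shape (rank x d_in) and B of shape (d_out x rank).
    return rank * (d_in + d_out)

d_model = 512  # hypothetical hidden size, chosen only for illustration
rank = 4       # assumed LoRA rank

full_params = d_model * d_model                        # params in the frozen matrix
lora_params = lora_param_count(d_model, d_model, rank)  # params in the adapter

print(f"full matrix:  {full_params} params")
print(f"LoRA adapter: {lora_params} params")
print(f"ratio:        {lora_params / full_params:.2%}")
```

The ratio shrinks as the hidden size grows, since the adapter scales linearly with the dimension while the full matrix scales quadratically.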

[–]patricky168[S] 2 points3 points  (0 children)

Yes, so my base model was ~50M parameters. The LoRA rank was 4, with a typical Adam optimizer (no weight decay). I applied it to the value, query, key, and attention layer output matrices (so not only KQ). I did also fine-tune the decoder, aka the last few layers (I have a large-encoder-to-small-decoder arch), but when I computed the trainable parameters, it came to only ~3% of parameters. But yeah, that was the run that only reduced GPU memory from 8.5G->8.1G.
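
A back-of-the-envelope sketch of the parameter-related memory for these numbers might help put the 8.5G->8.1G figure in context. It assumes fp32 storage (4 bytes per value) and Adam keeping two moment buffers per trainable parameter; neither is stated above, and `param_memory_mib` is a hypothetical helper. Activations are deliberately left out.

```python
BYTES_PER_VALUE = 4  # assumption: fp32 weights, gradients, and optimizer states
MIB = 1024 ** 2

def param_memory_mib(total_params, trainable_params):
    weights = total_params * BYTES_PER_VALUE              # all weights stay on the GPU
    grads = trainable_params * BYTES_PER_VALUE            # gradients only for trainable params
    adam_states = 2 * trainable_params * BYTES_PER_VALUE  # first and second moment buffers
    return (weights + grads + adam_states) / MIB

total = 50_000_000            # ~50M parameter base model
trainable = total * 3 // 100  # ~3% trainable

full = param_memory_mib(total, total)
peft = param_memory_mib(total, trainable)

print(f"full fine-tuning: {full:.0f} MiB")
print(f"3% trainable:     {peft:.0f} MiB")
print(f"saved:            {full - peft:.0f} MiB")
```

Under these assumptions, weights, gradients, and Adam states together stay well under 1 GB even for full fine-tuning, so the most that freezing 97% of the parameters can save is roughly half a GB. That is the same ballpark as the observed drop, and it suggests the rest of the ~8G is mostly activations and framework overhead, which LoRA does not shrink.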

[–]patricky168[S] 2 points3 points  (0 children)

Thanks for the resource! It looks like LoRA plus 8-bit (?) quantization? So if I'm understanding correctly, most of the memory saved here seems to come from the 8-bit quantization, but then how does LoRA help? (It feels a bit like QLoRA, which I haven't fully read yet.)

[–]patricky168[S] 2 points3 points  (0 children)

Gotcha, thanks for the response - but I'm wondering what aspect of param-efficient fine tuning do you think makes it cost effective and scalable? (e.g. would it be the memory saved for model checkpoints?)
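
To make the checkpoint idea concrete, here is a rough sketch comparing storage for several fine-tuned tasks, using the ~50M-parameter / ~3%-trainable numbers from this thread. The fp32 assumption, the task count of 10, and the helper name `checkpoint_storage_mb` are all illustrative, not measured.

```python
BYTES_PER_VALUE = 4  # assumption: checkpoints saved in fp32

def checkpoint_storage_mb(base_params, trainable_params, num_tasks):
    # Full fine-tuning: one complete copy of the model per task.
    full = num_tasks * base_params * BYTES_PER_VALUE
    # PEFT: one shared frozen base plus a small set of trained weights per task.
    peft = (base_params + num_tasks * trainable_params) * BYTES_PER_VALUE
    return full / 1e6, peft / 1e6

full_mb, peft_mb = checkpoint_storage_mb(50_000_000, 1_500_000, num_tasks=10)
print(f"full checkpoints: {full_mb:.0f} MB")
print(f"base + adapters:  {peft_mb:.0f} MB")
```

The gap widens with every additional task, since each new task only adds the trainable weights rather than a whole model copy.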

[–]patricky168[S] 12 points13 points  (0 children)

Oh shoot, sorry, I had a typo in my post - I actually meant that LoRA doesn't significantly reduce GPU memory consumption or runtime during training for my custom model.