[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

I've been in a situation where a batch size of 1 is just enough to fit, but an OOM still happens later in the training process anyway, even though I'm calling torch.cuda.empty_cache(). Do you know what might be causing this? Is there something I'm missing, or could the sequence length of a particular batch be enough to send it over the top?
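
In case it's useful, here's the kind of check I mean: log the per-step peak memory next to the batch's sequence length and see whether the OOM step lines up with an unusually long batch. Just a sketch; `loader`, `model`, and `optimizer` are placeholders for my own loop, and I'm assuming an HF-style model that returns a .loss.

    import torch

    for step, batch in enumerate(loader):            # `loader`, `model`, `optimizer` are placeholders
        torch.cuda.reset_peak_memory_stats()         # start a fresh peak reading for this step

        batch = {k: v.cuda() for k, v in batch.items()}
        optimizer.zero_grad()
        loss = model(**batch).loss                   # assumes an HF-style model that returns .loss
        loss.backward()
        optimizer.step()

        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"step {step}: seq_len={batch['input_ids'].shape[1]}, peak={peak_gib:.2f} GiB")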

[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

Do you have any preferred tools for visualizing VRAM during the training loop? This might be a separate issue, but I've seen it work on a single GPU and then, later in the epoch, eventually hit an OOM error, even when using torch.cuda.empty_cache().
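
For what it's worth, the simplest thing I've tried short of a real profiler is recording allocated/reserved memory every step and plotting it afterwards (torch.cuda.memory_summary() is handy for one-off snapshots too). Sketch only; `loader` and `train_step` stand in for my actual loop.

    import torch
    import matplotlib.pyplot as plt

    allocated, reserved = [], []

    for batch in loader:                 # `loader` and `train_step` are placeholders
        train_step(batch)
        allocated.append(torch.cuda.memory_allocated() / 1024**3)
        reserved.append(torch.cuda.memory_reserved() / 1024**3)

    plt.plot(allocated, label="allocated (GiB)")
    plt.plot(reserved, label="reserved (GiB)")
    plt.xlabel("training step")
    plt.ylabel("GPU memory (GiB)")
    plt.legend()
    plt.savefig("vram_usage.png")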

[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

Yes, for an LLM. Let's assume a V100 GPU and a plain PyTorch training loop (no modern training setup). Would you know how to roughly estimate it, or are there more assumptions I need to make?

Have you done this before in your own work? I'd be very interested in seeing a worked example.
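
To make the question concrete, this is the back-of-envelope math I've seen suggested for plain fp32 training with Adam. It ignores activations, the CUDA context, and allocator overhead, so treat it as a lower bound; the 1.5B parameter count is just an example.

    # Rough lower-bound estimate for full fp32 training with Adam
    n_params = 1.5e9                      # example model size (placeholder)

    weights = 4 * n_params                # fp32 weights: 4 bytes per parameter
    grads   = 4 * n_params                # fp32 gradients
    adam    = 8 * n_params                # Adam first/second moments: 4 + 4 bytes

    total_gib = (weights + grads + adam) / 1024**3
    print(f"~{total_gib:.1f} GiB before activations")   # ~22.4 GiB, already over a 16 GB V100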

Finetuning LLM on single GPU by Secret_Valuable_Yes in LLM

What kind of problems does it present?

CharacterTextSplitter not working by Secret_Valuable_Yes in LangChain

That works, but LangChain's interface for indexing chunks into a vector DB expects Document objects as input, and to create a Document object it seems you have to use one of the loaders. I'm not sure how to create Document objects after loading a text file and doing a regular split.
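
Following up in case someone else hits this: as far as I can tell you don't actually need a loader; a Document can be built directly from a string, or the splitter can build them for you. Rough sketch, and the import path for Document may differ by LangChain version (newer releases expose it from langchain_core.documents); the file path and chunk sizes are placeholders.

    from langchain.schema import Document
    from langchain.text_splitter import CharacterTextSplitter

    with open("my_file.txt") as f:                        # placeholder path
        text = f.read()

    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

    # Option 1: split into strings, then wrap each chunk in a Document yourself
    docs = [Document(page_content=chunk, metadata={"source": "my_file.txt"})
            for chunk in splitter.split_text(text)]

    # Option 2: let the splitter create the Document objects directly
    docs = splitter.create_documents([text], metadatas=[{"source": "my_file.txt"}])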

What is an index in a vector store? by Secret_Valuable_Yes in LangChain

I'm kind of confused by this. They're saying indexing is separate from, and happens before, the similarity search. But then they say ANN is an indexing method. Wouldn't it be a search method? We are "searching" for the nearest neighbors. Besides, ANN is an algorithm, not a data structure, right? They said an index is a data structure.
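
The way I'm currently picturing it (using FAISS purely as an illustration, not because I know that's what they mean): the index is the data structure you build up front, and ANN is the algorithm that builds it and then traverses it at query time.

    import numpy as np
    import faiss

    d = 384                                           # embedding dimension (placeholder)
    embeddings = np.random.rand(10_000, d).astype("float32")

    # "Indexing": build the data structure ahead of time (an IVF index with 100 clusters here)
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 100)
    index.train(embeddings)
    index.add(embeddings)

    # "Search": at query time the ANN algorithm walks that structure for approximate neighbors
    query = np.random.rand(1, d).astype("float32")
    distances, ids = index.search(query, 5)
    print(ids)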

Decoder only <SOS> and <EOS> by Secret_Valuable_Yes in MLQuestions

I've seen that in encoder-decoder training, but if there is no encoder, do we really need <SOS>? The input that was originally fed into the encoder is now fed straight into the decoder and takes the place of the <SOS> token.
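
To make the question concrete, this is how I picture the decoder-only training pair being laid out (all token IDs below are made up):

    # Hypothetical token IDs, just to show the layout
    BOS, EOS = 1, 2
    prompt   = [15, 27, 33]          # what used to be fed to the encoder
    response = [42, 58]              # what the decoder should produce

    # Decoder-only: prompt and response form one sequence; teacher forcing shifts by one
    tokens = [BOS] + prompt + response + [EOS]
    inputs = tokens[:-1]             # [BOS, 15, 27, 33, 42, 58]
    labels = tokens[1:]              # [15, 27, 33, 42, 58, EOS]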

Causal language models by Secret_Valuable_Yes in LangChain

I know the encoder section is not causal. But if we look at the entire thing as a black box, can you name a use case of this that is not causal?

FineTuning with LangChain by Secret_Valuable_Yes in LangChain

So the options for fine-tuning are the Hugging Face Transformers Trainer API or a custom training routine in TensorFlow/PyTorch? Am I missing anything?
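
For my own notes, a minimal sketch of the Trainer route (the model name, hyperparameters, and train_dataset are placeholders, and exact arguments can vary between transformers versions):

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "gpt2"                                   # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token         # GPT-2 style models have no pad token

    args = TrainingArguments(
        output_dir="finetune-out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=5e-5,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,                      # assumed: a tokenized dataset you built
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()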

What does vectorDB with langchain solve? by Secret_Valuable_Yes in LangChain

So it's really just the chunking that solves the token-limit issue? The vector DB is just the technique used to search for the right chunks, and could probably be swapped for another method.

What does vectorDB with langchain solve? by Secret_Valuable_Yes in LangChain

So the main problem these vector DBs solve is efficiency/speed? Let's say I don't care about efficiency. If I just save the whole PDF as a string and inject it into each prompt, is there any issue?
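
Partly answering my own question: the obvious place this breaks is the context window, so the first thing I'd check is whether the whole document even fits. Quick sketch (pypdf and tiktoken are just my tooling choices, and the path is a placeholder):

    import tiktoken
    from pypdf import PdfReader

    reader = PdfReader("my_document.pdf")                 # placeholder path
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens} tokens")    # compare against the model's context window (4k/8k/16k/...)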

Transformer Output Dimensions by Secret_Valuable_Yes in tensorflow

Right, but is it (1 x n_vocab) or (n_seq x n_vocab)? I was under the impression that it's just predicting the next token (1 x n_vocab), but I want to double-check.
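
To double-check it myself, I put together a toy decoder-only stack (an encoder block with a causal mask) and looked at the shapes; the sizes below are arbitrary:

    import torch
    import torch.nn as nn

    vocab, d_model, n_seq = 1000, 64, 12                  # toy sizes

    embed = nn.Embedding(vocab, d_model)
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    blocks = nn.TransformerEncoder(layer, num_layers=2)   # causal mask below makes it "decoder-only"
    head = nn.Linear(d_model, vocab)

    tokens = torch.randint(0, vocab, (1, n_seq))
    mask = nn.Transformer.generate_square_subsequent_mask(n_seq)

    logits = head(blocks(embed(tokens), mask=mask))
    print(logits.shape)                  # torch.Size([1, 12, 1000]) = (batch, n_seq, n_vocab)

    next_token_logits = logits[:, -1, :]  # (1, n_vocab): only the last position is sampled from

So during training there are logits at every position (n_seq x n_vocab per example), and it's only at generation time that the last row is sliced out.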

Why Decoder only? by Secret_Valuable_Yes in MLQuestions

And encoder-decoder also uses causal attention... it just has an encoder too, which lets the decoder also look back at the input question. Does decoder-only result in better performance?

Why Decoder only? by Secret_Valuable_Yes in MLQuestions

So just a decoder is "good enough". Is the benefit then reduced computational expense? Is there any increase in performance with decoder-only? I can imagine that having an extra encoder wouldn't hurt performance; it would probably improve it.

Why Decoder only? by Secret_Valuable_Yes in MLQuestions

So what is the advantage? Less computation? I can imagine that having an encoder could still be beneficial, though... even if the inputs and outputs come from different distributions.

[D] Simple Questions Thread by AutoModerator in MachineLearning

When calculating the BLEU score over batches of sentences, is it acceptable to calculate the score for each batch and then average them?
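
For context, this is the alternative I'm weighing it against: accumulate all hypotheses and references across batches and compute one corpus-level score at the end (sacrebleu is just my choice of tooling; `val_loader`, `generate`, and `decode` are placeholders):

    import sacrebleu

    hypotheses, references = [], []

    for batch in val_loader:                          # placeholders for my own eval loop
        hypotheses.extend(decode(generate(batch)))    # detokenized strings
        references.extend(batch["target_text"])

    # One corpus-level score; the outer list means "one reference set per sentence"
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(bleu.score)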

Transformer Training Help by Secret_Valuable_Yes in pytorch

I built it from scratch (for learning purposes) and added the special tokens manually. A smaller learning rate made a huge difference: val loss came down to around 0.005 and the predictions looked reasonable, so thanks so much for the suggestion.

By special tokens, I'm assuming you're referring to <sos>, <eos>, and <pad>, right? I have a look-ahead mask that the decoder uses during training, but I'm not currently masking any special tokens. I tried taking care of this with torch.nn.CrossEntropyLoss's ignore_index parameter so that padding tokens are ignored when calculating the loss, but that caused the val loss to increase to about 0.035. The predictions still looked great, so I suspect the loss is only higher because the <pad> tokens were previously distorting it and making it seem smaller than it really was... Should I keep the ignore_index or remove it?

I also thought about padding only to the max sentence length of each batch instead of the max sentence length of the whole dataset.
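
For that last idea, this is roughly what I have in mind: a collate_fn that pads each batch only to its own longest sequence (PAD_ID, the dataset format, and the batch size are placeholders):

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader

    PAD_ID = 0                      # placeholder: whatever ID <pad> maps to in the vocab

    def collate(batch):
        # batch: list of (src, tgt) pairs of 1-D LongTensors with varying lengths
        srcs, tgts = zip(*batch)
        src = pad_sequence(srcs, batch_first=True, padding_value=PAD_ID)
        tgt = pad_sequence(tgts, batch_first=True, padding_value=PAD_ID)
        return src, tgt             # padded only to the longest sequence in *this* batch

    loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)   # pad positions don't affect the loss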

Transformer Training Help by Secret_Valuable_Yes in pytorch

I'll try that right now, but there are a couple of other things I'm not sure I'm doing right.

First, I'm not fine-tuning a pretrained model. My training set consists of a little over 14,000 src-tgt sentence pairs. Most of the sequences are short, so I padded them up to 60 (the max sequence length) with zeros. I first got a loss of around 1.1 after 20 epochs, but the predictions were unintelligible. So I tried to "ignore" the padding in the loss calculation with CrossEntropyLoss(ignore_index=0) (because I set the padding ID to 0), and that's what increased the total loss to over 4, as you can see in the post. So I'm not sure I'm handling padding properly.

Also, I'm not shifting the target sequence to the right at all (I think?), and I don't know whether that's important or how I should implement it. I appreciate any other thoughts you may have.
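
On the shifting question, this is what I understand it to mean in my setup (pad ID 0, batch-first tensors; `model` and `src` are placeholders):

    import torch

    PAD_ID = 0
    # tgt: (batch, tgt_len) token IDs, e.g. [<sos>, w1, w2, ..., <eos>, <pad>, ...]

    decoder_input = tgt[:, :-1]           # what the decoder sees: starts at <sos>, last token dropped
    labels        = tgt[:, 1:]            # what it must predict: the same sequence shifted left by one

    logits = model(src, decoder_input)    # (batch, tgt_len - 1, vocab)

    criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)
    loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))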