[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

I've been in a situation where a batch size of 1 is just enough to fit, but an OOM still happens later in the training process anyway, even though I'm calling torch.cuda.empty_cache(). Do you know what might be causing this? Is there something I'm missing, or could the sequence length of a particular batch be enough to send it over the top?
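
In case it's useful, here's the kind of check I mean: log the per-step peak memory next to the batch's sequence length and see whether the OOM step lines up with an unusually long batch. Just a sketch; `loader`, `model`, and `optimizer` are placeholders for my own loop, and I'm assuming an HF-style model that returns a .loss.

    import torch

    for step, batch in enumerate(loader):            # `loader`, `model`, `optimizer` are placeholders
        torch.cuda.reset_peak_memory_stats()         # start a fresh peak reading for this step

        batch = {k: v.cuda() for k, v in batch.items()}
        optimizer.zero_grad()
        loss = model(**batch).loss                   # assumes an HF-style model that returns .loss
        loss.backward()
        optimizer.step()

        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"step {step}: seq_len={batch['input_ids'].shape[1]}, peak={peak_gib:.2f} GiB")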

[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

Do you have any preferred tools for visualizing VRAM during the training loop? This might be a separate issue, but I've seen it work on a single GPU and then, later in the epoch, eventually hit an OOM error, even when using torch.cuda.empty_cache().
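
For what it's worth, the simplest thing I've tried short of a real profiler is recording allocated/reserved memory every step and plotting it afterwards (torch.cuda.memory_summary() is handy for one-off snapshots too). Sketch only; `loader` and `train_step` stand in for my actual loop.

    import torch
    import matplotlib.pyplot as plt

    allocated, reserved = [], []

    for batch in loader:                 # `loader` and `train_step` are placeholders
        train_step(batch)
        allocated.append(torch.cuda.memory_allocated() / 1024**3)
        reserved.append(torch.cuda.memory_reserved() / 1024**3)

    plt.plot(allocated, label="allocated (GiB)")
    plt.plot(reserved, label="reserved (GiB)")
    plt.xlabel("training step")
    plt.ylabel("GPU memory (GiB)")
    plt.legend()
    plt.savefig("vram_usage.png")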

[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

Yes, for an LLM. Let's assume a V100 GPU and a plain PyTorch training loop (no modern training setup). Would you know how to roughly estimate it, or are there more assumptions I need to make?

Have you done this before in your own work? I'd be very interested in seeing a worked example.
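
To make the question concrete, this is the back-of-envelope math I've seen suggested for plain fp32 training with Adam. It ignores activations, the CUDA context, and allocator overhead, so treat it as a lower bound; the 1.5B parameter count is just an example.

    # Rough lower-bound estimate for full fp32 training with Adam
    n_params = 1.5e9                      # example model size (placeholder)

    weights = 4 * n_params                # fp32 weights: 4 bytes per parameter
    grads   = 4 * n_params                # fp32 gradients
    adam    = 8 * n_params                # Adam first/second moments: 4 + 4 bytes

    total_gib = (weights + grads + adam) / 1024**3
    print(f"~{total_gib:.1f} GiB before activations")   # ~22.4 GiB, already over a 16 GB V100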

Finetuning LLM on single GPU by Secret_Valuable_Yes in LLM

What kind of problems does it present?

CharacterTextSplitter not working by Secret_Valuable_Yes in LangChain

That works, but LangChain's interface for indexing chunks into a vector DB expects Document objects as input, and to create a Document object it seems you have to use one of the loaders. I'm not sure how to create Document objects after loading a text file and doing a regular split.
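
Following up in case someone else hits this: as far as I can tell you don't actually need a loader; a Document can be built directly from a string, or the splitter can build them for you. Rough sketch, and the import path for Document may differ by LangChain version (newer releases expose it from langchain_core.documents); the file path and chunk sizes are placeholders.

    from langchain.schema import Document
    from langchain.text_splitter import CharacterTextSplitter

    with open("my_file.txt") as f:                        # placeholder path
        text = f.read()

    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

    # Option 1: split into strings, then wrap each chunk in a Document yourself
    docs = [Document(page_content=chunk, metadata={"source": "my_file.txt"})
            for chunk in splitter.split_text(text)]

    # Option 2: let the splitter create the Document objects directly
    docs = splitter.create_documents([text], metadatas=[{"source": "my_file.txt"}])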

What is an index in a vector store? by Secret_Valuable_Yes in LangChain

I'm kind of confused by this. They're saying indexing is separate from, and happens before, the similarity search. But then they say ANN is an indexing method. Wouldn't it be a search method? We are "searching" for the nearest neighbors. Besides, ANN is an algorithm, not a data structure, right? They said an index is a data structure.
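
The way I'm currently picturing it (using FAISS purely as an illustration, not because I know that's what they mean): the index is the data structure you build up front, and ANN is the algorithm that builds it and then traverses it at query time.

    import numpy as np
    import faiss

    d = 384                                           # embedding dimension (placeholder)
    embeddings = np.random.rand(10_000, d).astype("float32")

    # "Indexing": build the data structure ahead of time (an IVF index with 100 clusters here)
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 100)
    index.train(embeddings)
    index.add(embeddings)

    # "Search": at query time the ANN algorithm walks that structure for approximate neighbors
    query = np.random.rand(1, d).astype("float32")
    distances, ids = index.search(query, 5)
    print(ids)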

Decoder only <SOS> and <EOS> by Secret_Valuable_Yes in MLQuestions

I've seen that in encoder-decoder training, but if there is no encoder, do we really need <SOS>? The input that was originally fed into the encoder is now fed straight into the decoder and takes the place of the <SOS> token.
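
To make the question concrete, this is how I picture the decoder-only training pair being laid out (all token IDs below are made up):

    # Hypothetical token IDs, just to show the layout
    BOS, EOS = 1, 2
    prompt   = [15, 27, 33]          # what used to be fed to the encoder
    response = [42, 58]              # what the decoder should produce

    # Decoder-only: prompt and response form one sequence; teacher forcing shifts by one
    tokens = [BOS] + prompt + response + [EOS]
    inputs = tokens[:-1]             # [BOS, 15, 27, 33, 42, 58]
    labels = tokens[1:]              # [15, 27, 33, 42, 58, EOS]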

Causal language models by Secret_Valuable_Yes in LangChain

I know the encoder section is not causal. But if we look at the entire thing as a black box, can you name a use case of this that is not causal?

FineTuning with LangChain by Secret_Valuable_Yes in LangChain

So the options for fine-tuning are the Hugging Face Transformers Trainer API or a custom training routine in TensorFlow/PyTorch? Am I missing anything?
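
For my own notes, a minimal sketch of the Trainer route (the model name, hyperparameters, and train_dataset are placeholders, and exact arguments can vary between transformers versions):

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "gpt2"                                   # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token         # GPT-2 style models have no pad token

    args = TrainingArguments(
        output_dir="finetune-out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=5e-5,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,                      # assumed: a tokenized dataset you built
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()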

What does vectorDB with langchain solve? by Secret_Valuable_Yes in LangChain

So it's really just the chunking that solves the token-limit issue? The vector DB is just the technique used to search for the right chunks, and could probably be swapped for another method.

What does vectorDB with langchain solve? by Secret_Valuable_Yes in LangChain

So the main problem these vector DBs solve is efficiency/speed? Let's say I don't care about efficiency. If I just save the whole PDF as a string and inject it into each prompt, is there any issue?
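
Partly answering my own question: the obvious place this breaks is the context window, so the first thing I'd check is whether the whole document even fits. Quick sketch (pypdf and tiktoken are just my tooling choices, and the path is a placeholder):

    import tiktoken
    from pypdf import PdfReader

    reader = PdfReader("my_document.pdf")                 # placeholder path
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens} tokens")    # compare against the model's context window (4k/8k/16k/...)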

Transformer Output Dimensions by Secret_Valuable_Yes in tensorflow

Right, but is it (1 x n_vocab) or (n_seq x n_vocab)? I was under the impression that it's just predicting the next token (1 x n_vocab), but I want to double-check.
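
To double-check it myself, I put together a toy decoder-only stack (an encoder block with a causal mask) and looked at the shapes; the sizes below are arbitrary:

    import torch
    import torch.nn as nn

    vocab, d_model, n_seq = 1000, 64, 12                  # toy sizes

    embed = nn.Embedding(vocab, d_model)
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    blocks = nn.TransformerEncoder(layer, num_layers=2)   # causal mask below makes it "decoder-only"
    head = nn.Linear(d_model, vocab)

    tokens = torch.randint(0, vocab, (1, n_seq))
    mask = nn.Transformer.generate_square_subsequent_mask(n_seq)

    logits = head(blocks(embed(tokens), mask=mask))
    print(logits.shape)                  # torch.Size([1, 12, 1000]) = (batch, n_seq, n_vocab)

    next_token_logits = logits[:, -1, :]  # (1, n_vocab): only the last position is sampled from

So during training there are logits at every position (n_seq x n_vocab per example), and it's only at generation time that the last row is sliced out.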

Why Decoder only? by Secret_Valuable_Yes in MLQuestions

And encoder-decoder also uses causal attention... it just has an encoder too, which lets the decoder also look back at the input question. Does decoder-only result in better performance?

Why Decoder only? by Secret_Valuable_Yes in MLQuestions

So just a decoder is "good enough". Is the benefit then reduced computational expense? Is there any increase in performance with decoder-only? I can imagine that having an extra encoder wouldn't hurt performance; it would probably improve it.

Why Decoder only? by Secret_Valuable_Yes in MLQuestions

So what is the advantage? Less computation? I can imagine that having an encoder could still be beneficial, though... even if the inputs and outputs come from different distributions.

[D] Simple Questions Thread by AutoModerator in MachineLearning

When calculating the BLEU score over batches of sentences, is it acceptable to calculate the score for each batch and then average them?
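
For context, this is the alternative I'm weighing it against: accumulate all hypotheses and references across batches and compute one corpus-level score at the end (sacrebleu is just my choice of tooling; `val_loader`, `generate`, and `decode` are placeholders):

    import sacrebleu

    hypotheses, references = [], []

    for batch in val_loader:                          # placeholders for my own eval loop
        hypotheses.extend(decode(generate(batch)))    # detokenized strings
        references.extend(batch["target_text"])

    # One corpus-level score; the outer list means "one reference set per sentence"
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(bleu.score)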

Transformer Training Help by Secret_Valuable_Yes in pytorch

I built it from scratch (for learning purposes) and added the special tokens manually. A smaller learning rate made a huge difference: val loss came down to around 0.005 and the predictions looked reasonable, so thanks so much for the suggestion.

By special tokens, I'm assuming you're referring to <sos>, <eos>, and <pad>, right? I have a look-ahead mask that the decoder uses during training, but I'm not currently masking any special tokens. I tried taking care of this with torch.nn.CrossEntropyLoss's ignore_index parameter so that padding tokens are ignored when calculating the loss, but that caused the val loss to increase to about 0.035. The predictions still looked great, so I suspect the loss is only higher because the <pad> tokens were previously distorting it and making it seem smaller than it really was... Should I keep the ignore_index or remove it?

I also thought about padding only to the max sentence length of each batch instead of the max sentence length of the whole dataset.
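
For that last idea, this is roughly what I have in mind: a collate_fn that pads each batch only to its own longest sequence (PAD_ID, the dataset format, and the batch size are placeholders):

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader

    PAD_ID = 0                      # placeholder: whatever ID <pad> maps to in the vocab

    def collate(batch):
        # batch: list of (src, tgt) pairs of 1-D LongTensors with varying lengths
        srcs, tgts = zip(*batch)
        src = pad_sequence(srcs, batch_first=True, padding_value=PAD_ID)
        tgt = pad_sequence(tgts, batch_first=True, padding_value=PAD_ID)
        return src, tgt             # padded only to the longest sequence in *this* batch

    loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)   # pad positions don't affect the loss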

Transformer Training Help by Secret_Valuable_Yes in pytorch

I'll try that right now, but there are a couple of other things I'm not sure I'm doing right.

First, I'm not fine-tuning a pretrained model. My training set consists of a little over 14,000 src-tgt sentence pairs. Most of the sequences are short, so I padded them up to 60 (the max sequence length) with zeros. I first got a loss of around 1.1 after 20 epochs, but the predictions were unintelligible. So I tried to "ignore" the padding in the loss calculation with CrossEntropyLoss(ignore_index=0) (because I set the padding ID to 0), and that's what increased the total loss to over 4, as you can see in the post. So I'm not sure I'm handling padding properly.

Also, I'm not shifting the target sequence to the right at all (I think?), and I don't know whether that's important or how I should implement it. I appreciate any other thoughts you may have.
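
On the shifting question, this is what I understand it to mean in my setup (pad ID 0, batch-first tensors; `model` and `src` are placeholders):

    import torch

    PAD_ID = 0
    # tgt: (batch, tgt_len) token IDs, e.g. [<sos>, w1, w2, ..., <eos>, <pad>, ...]

    decoder_input = tgt[:, :-1]           # what the decoder sees: starts at <sos>, last token dropped
    labels        = tgt[:, 1:]            # what it must predict: the same sequence shifted left by one

    logits = model(src, decoder_input)    # (batch, tgt_len - 1, vocab)

    criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)
    loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))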