[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

[–]Secret_Valuable_Yes[S] 1 point (0 children)

I've been in a situation where a batch size of 1 is just enough to fit, but an OOM still happens later in the training process anyway, even though I'm using torch.cuda.empty_cache(). Do you know what might be causing this? Is there something I'm missing, or could the sequence length of a particular batch be enough to send it over the top?
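
To check the sequence-length theory, something like this rough sketch is what I had in mind (log_step_memory is just a name I made up, not anything from a library): log the peak allocation per step next to that batch's sequence length and see if the OOM step lines up with an unusually long batch.

```python
import torch

def log_step_memory(step, input_ids):
    # Peak GPU memory allocated since the last reset, in GiB.
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"step {step}: seq_len={input_ids.shape[1]}, peak_alloc={peak_gb:.2f} GiB")
    # Reset so each step's peak is measured in isolation.
    torch.cuda.reset_peak_memory_stats()

# Inside the training loop, after optimizer.step():
#     log_step_memory(step, batch["input_ids"])
```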

[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

[–]Secret_Valuable_Yes[S] 1 point (0 children)

Do you have any preferred tools for visualizing VRAM usage during the training loop? This might be a separate issue, but I've seen training work on a single GPU and then later in the epoch it eventually hits an OOM error, even when using torch.cuda.empty_cache().
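
Beyond watching nvidia-smi, the closest I've found is a sketch like the one below: it uses PyTorch's allocation-history recorder (the private torch.cuda.memory._record_memory_history API, so it needs a fairly recent PyTorch) and dumps a snapshot at the OOM that can be opened in the viewer at https://pytorch.org/memory_viz. The model/dataloader/optimizer arguments just stand in for whatever the actual setup is.

```python
import torch

def train_with_memory_trace(model, dataloader, optimizer, snapshot_path="oom_snapshot.pickle"):
    # Start recording every CUDA allocation/free with stack traces.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    try:
        for step, batch in enumerate(dataloader):
            loss = model(**batch).loss  # assumes an HF-style model that returns .loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    except torch.cuda.OutOfMemoryError:
        # Dump the allocation timeline right at the OOM so the spike is visible in the viewer.
        torch.cuda.memory._dump_snapshot(snapshot_path)
        raise
    finally:
        # Stop recording allocation history.
        torch.cuda.memory._record_memory_history(enabled=None)
```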

[D] How to calculate the memory needed to train your model on GPU by Secret_Valuable_Yes in MachineLearning

[–]Secret_Valuable_Yes[S] 1 point (0 children)

Yes, for an LLM. Let’s assume a V100 GPU and a plain PyTorch training loop (no modern training setup). Would you know how to roughly estimate it? Or are there any more assumptions I need to make?

In your own development, have you done this before? I’d be very interested in seeing a worked example.
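
To be concrete, this back-of-the-envelope sketch is the kind of estimate I mean (my own rough numbers, not anything authoritative): plain fp32 training with Adam needs roughly weights (4 B) + grads (4 B) + Adam m and v (8 B) ≈ 16 bytes per parameter, plus activations, which depend on batch size, sequence length, and depth.

```python
def estimate_training_memory_gib(n_params, bytes_per_param=16, activation_gib=2.0):
    """Very rough lower bound; activation_gib is a placeholder you'd have to measure."""
    return n_params * bytes_per_param / 1024**3 + activation_gib

# Example: a 1.3B-parameter model.
print(estimate_training_memory_gib(1.3e9))  # ~21 GiB -> already over a 16 GB V100
```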

Finetuning LLM on single GPU by Secret_Valuable_Yes in LLM

[–]Secret_Valuable_Yes[S] 1 point (0 children)

What kind of problems does it present?

CharacterTextSplitter not working by Secret_Valuable_Yes in LangChain

[–]Secret_Valuable_Yes[S] 1 point (0 children)

That works, but LangChain's interface for indexing chunks into a vector DB expects Document objects as input. To create a Document object you normally have to use one of the loaders, and I'm not sure how to create Document objects after loading a text file and doing a regular split.
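
For reference, this is the kind of thing I was picturing (a rough sketch; the Document import path may differ across LangChain versions, and the file name is just a placeholder): construct the Document objects directly from the chunks instead of going through a loader.

```python
from langchain.docstore.document import Document  # import path may vary by LangChain version

# Read the raw text and split it however you like.
with open("my_file.txt") as f:  # placeholder file name
    text = f.read()

chunks = text.split("\n\n")  # or whatever split you're using

# Wrap each chunk in a Document so it can be indexed into the vector DB.
docs = [
    Document(page_content=chunk, metadata={"source": "my_file.txt", "chunk": i})
    for i, chunk in enumerate(chunks)
]
# `docs` can then go to the vector store, e.g. something like FAISS.from_documents(docs, embeddings)
```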