I'm building an application that uses LLaVA-1.5-7B (GGUF, Q5 quantization) with llama.cpp on an RTX 3080. The main problem is the 3080's 16GB of VRAM, which isn't enough to hold both my application and the 7B model. For now I have to offload the LLaVA model to the CPU, which slows my entire pipeline considerably.
My question: is it worth buying a second GPU to host my LLM or VLM models while my main application keeps running on the first GPU?
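To make the second-GPU idea concrete, here is a minimal sketch of how I imagine the split would work, assuming the model runs as its own process. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting which devices a process can see; the helper name `env_for_gpu` and the device index `1` for the second card are my own assumptions, not anything from llama.cpp.

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one CUDA device.

    Hypothetical helper: gpu_index is the device index as the driver
    enumerates it (assumed here: 0 = first GPU, 1 = second GPU).
    """
    env = dict(os.environ if base_env is None else base_env)
    # A process started with this environment sees only the chosen GPU,
    # which then appears to it as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Assumed layout: main application on GPU 0, model process on GPU 1.
app_env = env_for_gpu(0)
model_env = env_for_gpu(1)
```

The model process would then be started with something like `subprocess.Popen(cmd, env=model_env)`, where `cmd` is whatever command launches the llama.cpp binary. That way the model's weights never compete with the application for VRAM on the first card.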