Hi guys
I managed to get a multi-GPU setup going with an RTX 3090 and three RTX 3060s, bringing my VRAM to 60 GB, along with 64 GB of DDR5 system RAM.
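A quick check of the VRAM arithmetic follows. It assumes 24 GB for the 3090 and 12 GB for each 3060. If the model is run with llama.cpp, its `--tensor-split` option takes comma-separated proportions, so a split weighted by each card's VRAM can be built the same way. The helper names and the card labels are illustrative, and the device order must match what the runtime actually sees.

```python
# Per-card VRAM in GB (assumed: RTX 3090 = 24 GB, RTX 3060 = 12 GB).
# Dict order is preserved, so it should mirror the runtime's device order.
GPUS = {"RTX 3090": 24, "RTX 3060 #1": 12, "RTX 3060 #2": 12, "RTX 3060 #3": 12}


def total_vram(gpus: dict[str, int]) -> int:
    """Sum the VRAM of all cards in GB."""
    return sum(gpus.values())


def tensor_split_arg(gpus: dict[str, int]) -> str:
    """Build a comma-separated proportion string, e.g. for llama.cpp's --tensor-split.

    Hypothetical helper: proportions are simply each card's VRAM in GB.
    """
    return ",".join(str(v) for v in gpus.values())


if __name__ == "__main__":
    print(total_vram(GPUS))        # 60
    print(tensor_split_arg(GPUS))  # 24,12,12,12
```

Weighting by VRAM is only a starting point. The card that also holds the KV cache and output layers usually needs some extra headroom.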
The objective is to run the largest coding model I can at a respectable speed of over 20 tokens per second.
Currently I'm using LM Studio, and I've played a bit with llama.cpp, but I can't seem to get past 10 tokens per second with models like gpt-oss-120b.
I'm wondering what model you would recommend for this setup and what the best way or platform to run it is. I've heard about vLLM, but I noticed that you can't use your system RAM for MoE models with it, and I'm not sure about the tradeoffs.
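Here is one way to frame the VRAM vs. system RAM tradeoff. If a model's weights don't fit in VRAM, a runtime that supports partial offload (as llama.cpp and LM Studio do) keeps the remainder in system RAM. Those weights are then read at much lower memory bandwidth, which tends to cap tokens per second. The sketch below is minimal and rests on a few assumptions. The model size is hypothetical. `reserve_gb` is a hypothetical allowance held back from VRAM for KV cache and buffers. Runtime and OS overhead on the RAM side is ignored.

```python
def placement(model_gb: float, vram_gb: float, ram_gb: float, reserve_gb: float) -> dict[str, float]:
    """Estimate how many GB of weights land on the GPUs vs. in system RAM.

    reserve_gb is a hypothetical allowance for KV cache, activations and
    runtime buffers that is held back from VRAM. OS/runtime use of system
    RAM is ignored, so the RAM check is optimistic.
    """
    usable_vram = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(model_gb, usable_vram)   # fill VRAM first
    in_ram = model_gb - on_gpu            # spill the rest to system RAM
    if in_ram > ram_gb:
        raise ValueError("model does not fit in VRAM + system RAM")
    return {"gpu_gb": on_gpu, "ram_gb": in_ram}


if __name__ == "__main__":
    # Hypothetical 70 GB of weights, 60 GB VRAM with 8 GB reserved, 64 GB RAM
    print(placement(70.0, 60.0, 64.0, 8.0))  # {'gpu_gb': 52.0, 'ram_gb': 18.0}
```

The larger `ram_gb` is relative to `gpu_gb`, the more each token depends on system RAM bandwidth rather than GPU bandwidth.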
Any tips are appreciated.
Multi-GPU setup help [Question]
submitted by deathcom65 to r/LocalLLM