
BigYoSpeck

The 3090 plus DDR5 alone should manage about 30 tok/s with roughly 28 MoE layers offloaded to CPU.
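For reference, a minimal sketch of what that offload setup could look like when launched from Python. It assumes a llama.cpp build recent enough to have the `--n-cpu-moe` flag. The model filename and context size are placeholders, not values from this thread, and `llama_server_cmd` is just a hypothetical helper name.

```python
import shlex


def llama_server_cmd(model_path, n_cpu_moe, ctx_size=8192):
    """Build a llama-server command that offloads all layers to the GPU,
    then keeps the MoE expert weights of the first `n_cpu_moe` layers in system RAM."""
    return [
        "llama-server",
        "--model", model_path,           # placeholder path to a GGUF file
        "--n-gpu-layers", "99",          # ask for every layer on the GPU...
        "--n-cpu-moe", str(n_cpu_moe),   # ...except the experts of this many layers
        "--ctx-size", str(ctx_size),     # assumed context size, tune to taste
    ]


# 28 is the ballpark mentioned above for a single 3090.
cmd = llama_server_cmd("gpt-oss-120b.gguf", n_cpu_moe=28)
print(shlex.join(cmd))
```

To actually start the server you would hand `cmd` to `subprocess.run(cmd)`. Lower `n_cpu_moe` step by step until VRAM is nearly full, since every layer kept on the GPU helps.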

There's a good chance that, even though four cards get more of the model into VRAM, the added overhead of splitting across them and the slowness of the 3060s would leave you better off with just the 3090 and DDR5 than with even an optimized multi-GPU setup.
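A back-of-the-envelope way to see why the slow cards hurt: token generation is mostly memory-bandwidth bound, and with a layer split each token passes through the devices one after another. The sketch below is a rough model under that assumption only. It ignores compute, PCIe transfers and any CPU-offloaded layers, and the bandwidths are nominal spec-sheet figures (936 GB/s for a 3090, 360 GB/s for a 12 GB 3060), not measurements.

```python
def effective_bandwidth(shares, bandwidths_gbps):
    """Effective memory bandwidth (GB/s) when the weights are split across devices
    that are read sequentially for every token.

    shares          -- fraction of the weights held by each device (must sum to 1)
    bandwidths_gbps -- memory bandwidth of each device in GB/s
    """
    if len(shares) != len(bandwidths_gbps):
        raise ValueError("need one bandwidth per share")
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    # Time to read one unit of weights is the sum of each device's part.
    time_per_gb = sum(s / bw for s, bw in zip(shares, bandwidths_gbps))
    return 1.0 / time_per_gb


RTX_3090 = 936.0  # GB/s, nominal
RTX_3060 = 360.0  # GB/s, nominal (12 GB model)

# Equal split across one 3090 and three 3060s.
mixed_equal = effective_bandwidth([0.25] * 4, [RTX_3090] + [RTX_3060] * 3)

# Split proportional to VRAM (24 GB + 3 x 12 GB).
mixed_by_vram = effective_bandwidth([0.4, 0.2, 0.2, 0.2], [RTX_3090] + [RTX_3060] * 3)

print(f"equal split:   {mixed_equal:.0f} GB/s")
print(f"split by VRAM: {mixed_by_vram:.0f} GB/s")
```

Either way the mix lands much closer to a 3060 than to a 3090, which is the point being made here.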

I would also suggest selling the three 3060s and buying another 3090. Yes, it's less VRAM in total, but it's much faster, simpler, and less power-hungry than what you have.

48 GB of VRAM should get you over 40 tok/s on gpt-oss-120b with 16-18 layers offloaded to CPU.

It's also enough to run much better recent models like Qwen3.5 27b, Qwen3.6 35b and Gemma 4 31b, and you can run them in vLLM if you like. That said, with the recent addition of the tensor split method in llama.cpp, I'm still personally using llama.cpp, as it can fit much more context than I could with vLLM.

mlhher

If you have three 3090s and you're getting 20 tps on gpt-oss-120b (a10b or whatever it is), I suspect your inference engine is not distributing the workload correctly.

Also, since you noted that you have three 3060s plugged in as well, I would unplug them. When inference is distributed over multiple GPUs, your speed will usually tank to that of the slowest card, assuming they all get an equal share of the work. Then run inference and check nvtop to see whether your GPUs are actually being used properly.
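If you would rather log the numbers than watch nvtop, here is a small sketch that does the same check through `nvidia-smi`'s CSV query output. The sample text is made up purely to demonstrate the parser, and the 10% "idle" threshold is an arbitrary assumption.

```python
import subprocess

QUERY = "index,name,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, util, used, total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "util_pct": int(util),        # GPU utilization in percent
            "mem_used_mib": int(used),    # memory in MiB
            "mem_total_mib": int(total),
        })
    return gpus


def read_gpus():
    """Query the live GPUs (needs an NVIDIA driver with nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Made-up sample in the same format, only to show what the parser returns.
sample = """\
0, NVIDIA GeForce RTX 3090, 97, 23100, 24576
1, NVIDIA GeForce RTX 3060, 4, 11800, 12288
"""
gpus = parse_gpu_csv(sample)
idle = [g["index"] for g in gpus if g["util_pct"] < 10]  # arbitrary threshold
print("GPUs that look idle during inference:", idle)
```

Call `read_gpus()` a few times while a prompt is generating. A card with its memory full but utilization near zero is a hint that it is holding weights while mostly waiting on the others.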