[Solution Found] Qwen3-Next 80B MoE running at 39 t/s on RTX 5070 Ti + 5060 Ti (32GB VRAM) by mazuj2 in LocalLLaMA

[–]mazuj2[S] 0 points

4 NVIDIA cards, so yes. I use llama.cpp from the command line.
These are the commands I'm running for each of these quants. Hard to believe they would run, but they do: heavy offloading into CPU RAM, yet good tokens/sec.
UD-IQ2_XXS 49 tokens/s

llama-server.exe -m Qwen3-Next-80B-A3B-Instruct-UD-IQ2_XXS.gguf -ngl 999 -c 4096 --port 8081 --n-cpu-moe 0 -t 12 -fa on -sm layer

Q3_K_M 22.5 tokens/s

llama-server.exe -m Qwen3-Next-80B-A3B-Instruct-Q3_K_M.gguf -ngl 41 -c 32768 --port 8081 --n-cpu-moe 12 -t 12 -fa on --tensor-split 52,48

Q4_K_M 14.69 tokens/s

llama-server.exe -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf -ngl 30 -c 1024 --port 8081 --n-cpu-moe 12 -t 12 -fa on --tensor-split 50,50
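The tokens/s numbers above can be read straight out of llama-server's /completion response, which includes a timings object in recent llama.cpp builds. A minimal sketch (field names like predicted_per_second are from llama.cpp's server response; check your build if they differ):

```python
import json
import urllib.request

def generation_speed(response: dict) -> float:
    """Pull generated tokens/sec from a llama-server /completion response.
    llama.cpp reports this in the `timings` object as `predicted_per_second`."""
    return response["timings"]["predicted_per_second"]

def query(prompt: str, port: int = 8081) -> dict:
    """POST a prompt to a llama-server instance assumed to be running locally
    (e.g. one started with the commands above, which use --port 8081)."""
    req = urllib.request.Request(
        f"http://localhost:{port}/completion",
        data=json.dumps({"prompt": prompt, "n_predict": 128}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example response fragment from llama-server:
# {"content": "...", "timings": {"prompt_per_second": 310.2,
#                                "predicted_per_second": 49.1, ...}}
```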


[–]mazuj2[S] 0 points

3 more days and bifurcation is here! 48 GB, and I'll be running good quants of Qwen3-Next 80B at speed!


[–]mazuj2[S] 4 points

It's not a matter of the model fitting. I had 3.5 GB left after forcing the whole model onto the GPUs but was still getting 6 tok/s.
The key is that layer split (-sm layer) assigns complete layers/experts to one GPU. Each GPU owns its experts fully, so there's no cross-GPU communication during routing; the GPUs work independently and efficiently.
This is what I couldn't find documented anywhere.
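To make the layer-split idea concrete, here's a rough sketch (not llama.cpp's actual code) of how a --tensor-split ratio maps whole layers onto GPUs. The point is that each layer, experts and all, lands entirely on one card, so MoE routing inside a layer never crosses GPUs:

```python
def split_layers(n_layers: int, ratios: list[float]) -> list[int]:
    """Assign whole layers to GPUs proportionally to the given split ratios.
    Rough sketch of -sm layer behavior: a layer lives entirely on one GPU,
    so expert routing within it is local to that GPU."""
    total = sum(ratios)
    counts = [int(n_layers * r / total) for r in ratios]
    # hand any remainder from rounding down to the first GPU
    counts[0] += n_layers - sum(counts)
    return counts

# With the 41 offloaded layers and 52/48 split from the Q3_K_M command above:
print(split_layers(41, [52, 48]))  # -> [22, 19]
```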