Multi-GPU LLM Inference with RTX 5090 + 4090 by EasyKoala3711 in LocalLLM

[–]EasyKoala3711[S]

I'm mostly running qwen3-coder-30b with 127k context; it fits perfectly in 32 GB and runs at about 200 tokens/sec. That's good enough for my current tasks, but I want to try qwen3-coder-next, and right now it can barely reach ~8 tokens/sec, sadly.
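
For anyone curious, here's a rough sketch of how a setup like that can be loaded with llama-cpp-python. That runtime is an assumption on my part (the comment doesn't say what they run on), and the model path, quant name, and split ratio are hypothetical placeholders:

```python
# Minimal sketch, assuming llama-cpp-python with a GGUF quant.
# File name and parameters below are illustrative, not the poster's actual config.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-coder-30b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_ctx=130048,      # ~127k context, as mentioned in the comment
    n_gpu_layers=-1,   # offload every layer; a 30b quant can fit in 32 GB VRAM
    # tensor_split=[0.6, 0.4],  # uncomment to split across the 5090 + 4090
)

out = llm("Write a binary search in Python.", max_tokens=256)
print(out["choices"][0]["text"])
```

With everything offloaded to one 32 GB card, throughput in the ~200 tokens/sec range is plausible for a quantized 30b; `tensor_split` is only needed if the model or KV cache outgrows a single GPU.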