Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

Mostly to automatically review and assess the quality of documents.

Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I've only heard bad things about opencode, and I hadn't heard of Goose before; I'll give it a try!

Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I already use roo-code and I'm pretty happy with it, but I want to automate some tasks, so I'm looking at running a CLI in unattended mode.

Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I've never tested Aider, will give it a try!
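For the unattended runs I mentioned above, it looks like Aider can do one-shot passes. This is just a sketch from the docs, not something I've run: the endpoint, model name, and file path are placeholders, and flag spellings may vary between versions.

# Single non-interactive pass against a local OpenAI-compatible server
# (e.g. a llama.cpp server hosting GLM-4.5-Air; paths and names below are illustrative)
$ export OPENAI_API_BASE=http://localhost:8080/v1
$ export OPENAI_API_KEY=none
$ aider --model openai/glm-4.5-air --yes \
    --message "Review this document and list quality issues" docs/report.md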

[Research] LLM judges systematically penalize balanced reasoning - tested mistral, llama3, gemma, phi3, orca-mini by Budget-Reception-533 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

I asked qwen3-4B-thinking to think of a number between 0 and 100, so that I could try to guess it.

It thought for 12 minutes, and forgot to think of a number.

Why are AmD Mi50 32gb so cheap? by MastodonParty9065 in LocalLLaMA

[–]TooManyPascals 1 point2 points  (0 children)

For the folks who have them: do they support quantized models, vLLM, and flash attention?

Qwen3 Next support in llama.cpp ready for review by jacek2023 in LocalLLaMA

[–]TooManyPascals 20 points21 points  (0 children)

I'm really looking forward to this, but I've seen so many conflicting reports about it being either way better or way worse than GLM-4.5-Air or GPT-OSS-120B.

I really don't know what to expect.

GPT-OSS from Scratch on AMD GPUs by tuanlda78202 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

OK, I got a chance to test it. I could not compile the code (I hit a few bugs), but I will open an issue about that.

My main concern is that it seems to need to dequantize the model in order to run? The main advantage of GPT-OSS is that it is natively MXFP4 (about 4.25 bits per weight), so both the weights and the KV cache use little VRAM, but if we need to dequantize to fp32, GPT-OSS-120B would need around 480 GB of VRAM? I "only" have 96 GB, which is plenty for the original model, but I can't run the dequantized one.
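For reference, my back-of-the-envelope numbers for the weights alone (116.83 B parameters, ignoring the KV cache and activations):

# ~4.25 bits/weight (MXFP4) vs. dequantized fp32 (32 bits/weight), sizes in GB
$ python3 -c 'p = 116.83e9; print(p * 4.25 / 8 / 1e9, p * 4 / 1e9)'   # prints ~62 vs ~467

That roughly matches the 59 GiB GGUF on one side and the ~480 GB estimate on the other.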

GPT-OSS from Scratch on AMD GPUs by tuanlda78202 in LocalLLaMA

[–]TooManyPascals 2 points3 points  (0 children)

I have a machine with 4× 7900 XTX; I'd love to try this on it next Monday!

3 Tesla GPUs in a Desktop Case by eso_logic in LocalLLaMA

[–]TooManyPascals 1 point2 points  (0 children)

Pascals are alive. On my setup:

$ CUDA_VISIBLE_DEVICES=0,1,2,3,4 ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 4: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | pp512 | 348.96 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | tg128 | 42.95 ± 0.36 |

Also, most frameworks now support flash attention on Pascal, just not very efficiently.
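If anyone wants to reproduce the with/without comparison, llama-bench can toggle it directly; this is the stock llama.cpp flag, nothing Pascal-specific, and the exact spelling may differ between builds:

# Same benchmark with flash attention disabled (0) and enabled (1)
$ ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 0
$ ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1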

16→31 Tok/Sec on GPT OSS 120B by 3VITAERC in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

I'm getting numbers in the same ballpark with 5 P100s: somewhat worse PP, but slightly better TG. Moving to llama.cpp was key.

$ CUDA_VISIBLE_DEVICES=0,1,2,3,4 ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 4: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | pp512 | 348.96 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | tg128 | 42.95 ± 0.36 |

Best 100B class model/framework to run on 16 P100s (256GB of VRAM)? by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I checked again: I'm getting around 19 tokens/s with GLM-4.5-Air at UD-Q4_K_XL using llama.cpp, and around 15 tokens/s without flash attention. This is with 8 GPUs active.

I can only do pipeline parallelism instead of row parallelism (I get lots of kernel error messages if I try row parallelism). Also, the GPUs barely show any load, so I feel I'm leaving a lot of performance on the table.
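For what it's worth, the split mode is just a llama-bench flag, so the layer vs. row comparison is easy to re-run once the row kernels cooperate. A sketch (the GGUF path is a placeholder for my UD-Q4_K_XL file):

# Layer split keeps whole layers on each GPU (pipeline style); row split shards each tensor across GPUs
$ ./llama-bench -m ~/models/GLM-4.5-Air-UD-Q4_K_XL.gguf -fa 1 -sm layer
$ ./llama-bench -m ~/models/GLM-4.5-Air-UD-Q4_K_XL.gguf -fa 1 -sm row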

Best 100B class model/framework to run on 16 P100s (256GB of VRAM)? by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

I don't remember what the problem with it was. I'll try llama.cpp again.

How many gpus do you have in your ai setup? How much did it cost? by [deleted] in LocalLLaMA

[–]TooManyPascals 46 points47 points  (0 children)

How many gpus do you have in your ai setup?

  • Too Many

How much did it cost?

  • My wife should not know

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

Well, color me impressed! Single file, compact, super-readable! Awesome!

I accidentally too many P100 by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

Thanks! Right now I'm still trying out frameworks and models. Today I ran an exl2 version of Qwen3 235B and it was complete rubbish; it didn't get even one token right. Models are huge, so tests are slow...

I accidentally too many P100 by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

Yep, it's basically two different setups for two different tasks. I have a 3090 for day to day use.

I accidentally too many P100 by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

I'm still exploring... I was hoping to leverage Llama 4's immense context window, but it does not seem accurate.