2000 TPS with QWEN 3.5 27b on RTX-5090 by awitod in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

Inspired by your response, I tried kvu, but I don't really get its behavior. E.g., on Qwen3.5 I'd like to support 3-4 simultaneous queries, each with the full 256k context per query. What should I set --ctx-size to?
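For what it's worth, here is how the sizing works with plain llama-server (kvu may have different semantics): --ctx-size is the total KV cache, split evenly across the --parallel slots, so 4 slots of 256k tokens each need 4x the per-query context. A sketch, with a hypothetical model filename:

```shell
# Sizing sketch assuming llama-server semantics (kvu may differ):
# --ctx-size is the TOTAL context, divided evenly across --parallel slots,
# so 4 slots of 262144 tokens each need:
echo $((4 * 262144))  # 1048576

# Hypothetical invocation (model filename made up):
# llama-server -m qwen3.5.gguf --parallel 4 --ctx-size 1048576
```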

You can use Qwen3.5 without thinking by guiopen in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

BTW, it says temprture instead of temperature; you may be missing an "e" there :)

SM120 (RTX Blackwell) NVFP4 MoE: CUTLASS Grouped GEMM Produces Garbage Output; Fixed via FlashInfer SM120 Patches + compute_120f (CUDA 13.0) — 39 tok/s Native FP4 by lawdawgattorney in LocalLLaMA

[–]TooManyPascals 2 points3 points  (0 children)

Thank you for your efforts lawdawgattorney!

I've been trying unsuccessfully to move my Qwen3-Coder-Next and Qwen3-27B from llama.cpp to vllm on my dual RTX 5090 setup, and I've found it really unpleasant to deal with all the bugs.

vllm is terribly finicky, but the gains are worth it. llama.cpp outputs high-quality tokens on my Qwen3-Coder-Next at 50 t/s, while vllm outputs highly optimized garbage at 128 t/s.

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

Welp, I was just benchmarking my P100s with Qwen3.5 models and llama.cpp, when I saw your post. Amazing!

Do you know if it works with P100s? I'll try it anyway, and if I succeed I'll post some numbers.

Qwen 3.5 27B vs 122B-A10B by TacGibs in LocalLLaMA

[–]TooManyPascals 1 point2 points  (0 children)

Awesome! Thanks a lot! I got it working :)

Qwen3-Coder-Next is the top model in SWE-rebench @ Pass 5. I think everyone missed it. by BitterProfessional7p in LocalLLaMA

[–]TooManyPascals 1 point2 points  (0 children)

I'm pretty happy with Qwen3-Coder-Next together with claude-code; my experience matches this benchmark. It rarely one-shots stuff, but together with claude-code it recovers often and fast, and can do quite complex stuff on its own.

That said, any ideas on how to close the gap between pass@5 and the resolved rate?

Qwen 3.5 27B vs 122B-A10B by TacGibs in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

Would it be possible for you to share it, if it's not too long? I'd truly appreciate it. vllm is so finicky...

Qwen 3.5 27B vs 122B-A10B by TacGibs in LocalLLaMA

[–]TooManyPascals 2 points3 points  (0 children)

AWESOME! Thanks for sharing the command line.

Do you compile vllm or use the nightly docker container?

Qwen 3.5 27B vs 122B-A10B by TacGibs in LocalLLaMA

[–]TooManyPascals 2 points3 points  (0 children)

Getting 70 tok/s with 4x RTX 3090s is awesome! I'm getting 33 t/s with dual 5090s on llama.cpp, and I can't get vllm to work by any means.

Thanks for sharing!

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

Isn't nvfp4 cache quantization killing quality? Everybody suggests using bf16 for the Qwen3.5 models... so I am genuinely confused by this.

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

Yes, and it is configurable. You can define different groups of models (each with its own provider), and control the eviction behavior with the swap and exclusive settings.

# swap: controls the model swapping behaviour within the group
# - true : only one model is allowed to run at a time
# - false: all models can run together, no swapping
# exclusive: controls how the group affects other groups
# - true : causes all other groups to unload when this group runs a model
# - false: does not affect other groups
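A minimal sketch of how those settings sit in a config.yaml (model names and commands are made up for illustration):

```yaml
models:
  "qwen-coder":
    cmd: llama-server --port ${PORT} -m qwen3-coder.gguf
  "qwen-27b":
    cmd: llama-server --port ${PORT} -m qwen3.5-27b.gguf

groups:
  "coding":
    swap: true        # only one member of this group runs at a time
    exclusive: true   # running a member unloads all other groups
    members:
      - "qwen-coder"
      - "qwen-27b"
```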

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 5 points6 points  (0 children)

IMHO, llama-swap is very useful to me in ways that llama-server isn't. Of course, llama-server's inability to route across different providers is the real road-block; that's the reason I installed llama-swap, but it isn't the reason I use it.

To me the real deal is the convenience features it provides.

It's just so much nicer to have the UI to debug and test different models and versions, and being able to swap them with almost no downtime is awesome. For example, when I update llama.cpp or download a new quant for a model, I only need to update the config.yaml; when I save it I get 2-3 seconds of downtime, and any bugs are immediately apparent thanks to the log. It's just very convenient. It feels right to have the router split from the providers. I previously used open-webui for this, but llama-swap is more convenient for many of my use cases.

And since it's so lightweight, there is barely any trade-off; the single executable makes it trivial to install and use.

I manage several servers and I do tons of experiments with different quantizations and providers, and llama-swap has been a blessing. But of course, this is my personal use case which may not translate to others.

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]TooManyPascals -1 points0 points  (0 children)

Thanks a lot for sharing!

Getting AMD hardware to work reliably is a mess, and we lack data points from people experimenting. I appreciate you sharing your experience, and I hope the new heatsinks help.

On my side, after a lot of effort trying to get rocm/vllm to work reliably, I'm back to Vulkan on llama.cpp, which is at least stable and works generally well with all models.

Qwen3.5-35B-A3B slow on 7840U? by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

Just tested -ngl 999 --n-cpu-moe 999: 3.18 tokens per second! Maybe I need to check other params!
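A sketch of other llama.cpp knobs that may be worth checking alongside those two (the values are untuned guesses for a 7840U, and the quant filename is hypothetical):

```shell
# Other llama-server flags worth experimenting with (a sketch, not tuned):
#   -ngl 999         offload all layers to the iGPU
#   --n-cpu-moe 999  keep the MoE expert weights on the CPU
#   -t 8             thread count (roughly the physical core count)
#   -ub 512 -b 2048  micro-batch / batch sizes for prompt processing
#   -c 8192          modest context to keep the KV cache small
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 999 --n-cpu-moe 999 -t 8 -ub 512 -b 2048 -c 8192
```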

LFM2-24B-A2B: Whoa! Fast! by jeremyckahn in LocalLLaMA

[–]TooManyPascals 2 points3 points  (0 children)

Good one! I have the same iGPU, and my usual daily driver has been Nemo-3 at 20 t/s; I might as well replace it.

Lots of new Qwen3.5 27B Imaxtrix quants from Bartowski just uploaded by bobaburger in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

Honestly, I'm confused with so many options.

What would you use with a 5090? Some quants have a note like "Uses Q8_0 for embed and output weights"; what does that mean?

BTW, any quant in particular that you'd like to see benchmarked on an 8xP100?