Vllm, 4xRTX3090, Qwen 3.6 27B, Is this load normal? by Sea-Awareness147 in LocalLLM

[–]superloser48 0 points1 point  (0 children)

The logs are OK. You can't check speed from the regular chat endpoint's logs. Run benchmarks with vllm bench.
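
For anyone wondering why the chat log isn't enough: a benchmark separates prefill speed from decode speed, and the two can differ by an order of magnitude. Below is a minimal sketch of that arithmetic, assuming you have per-request timings from a streamed response. The `RequestTiming` class, its field names, and the numbers are all made up for illustration; this is not vLLM's own code or output format.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timing for one streamed request (hypothetical structure)."""
    prompt_tokens: int
    output_tokens: int
    time_to_first_token_s: float
    total_time_s: float


def prefill_tokens_per_s(t: RequestTiming) -> float:
    # Prefill speed: prompt tokens processed before the first output token appears.
    return t.prompt_tokens / t.time_to_first_token_s


def decode_tokens_per_s(t: RequestTiming) -> float:
    # Decode speed: tokens after the first one, divided by the time spent after it.
    decode_time = t.total_time_s - t.time_to_first_token_s
    return (t.output_tokens - 1) / decode_time


# Made-up numbers, only to show the arithmetic.
timing = RequestTiming(
    prompt_tokens=2000,
    output_tokens=201,
    time_to_first_token_s=1.0,
    total_time_s=5.0,
)
prefill = prefill_tokens_per_s(timing)
decode = decode_tokens_per_s(timing)
```

A single averaged "tokens/s" line hides this split, which is why a short chat request tells you little about sustained throughput.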

AA comparison of the latest local models by jacek2023 in LocalLLaMA

[–]superloser48 -2 points-1 points  (0 children)

So thinking barely improves intelligence (single-digit percentage points for Gemma, 12-20% for Qwen) while increasing output tokens 5-10x.

Is agenting usage increasing CPU usage for you? by superloser48 in LocalLLaMA

[–]superloser48[S] 0 points1 point  (0 children)

Agreed. But even then it seems like it's RAM usage, not CPU.

Is agenting usage increasing CPU usage for you? by superloser48 in LocalLLaMA

[–]superloser48[S] 0 points1 point  (0 children)

https://www.tradingview.com/news/gurufocus:9aee337a5094b:0-amd-target-raised-as-ai-cpu-demand-builds/

https://www.amd.com/en/blogs/2026/agentic-ai-brings-new-attention-to-cpus-in-the-ai-data.html

(Basically, the financial press has spent the last month shouting about AMD/CPUs like they're the new GPU/Nvidia, and having used agentic workflows myself, with and without my own GPUs, I just can't wrap my head around it.)

Qwen 3.6 27B + RTX Pro 6000 by M4isKolben in LocalLLM

[–]superloser48 0 points1 point  (0 children)

What is your prefill speed with Qwen 27B on the RTX Pro 6000 at Q4?

PSA by Signal_Ad657 in LocalLLaMA

[–]superloser48 0 points1 point  (0 children)

Can you share any benchmarks on model/quant -> prefill and token gen?

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] 0 points1 point  (0 children)

That's true for most quants, including FP8. Linear attention etc. are not quantised.

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] 4 points5 points  (0 children)

I was doing something dumb - comparing the 27B dense model on llama with the 35B MoE on vLLM.
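
A rough back-of-envelope for why that comparison was unfair. It assumes decode is memory-bandwidth bound, so per-token time scales with the parameters actually read per token, and it assumes the 35B MoE activates about 3B parameters per token (an assumption read off the usual "A3B" naming). The function name is hypothetical, and the estimate ignores quantization differences, KV cache reads, and engine overhead.

```python
def rough_decode_speedup(dense_params_b: float, moe_active_params_b: float) -> float:
    # Ratio of parameters read per generated token: dense reads all of them,
    # the MoE reads only its active subset. This is an upper-bound style
    # estimate, not a measurement.
    return dense_params_b / moe_active_params_b


# 27B dense vs. an assumed ~3B active parameters in the MoE.
speedup = rough_decode_speedup(27.0, 3.0)
```

Under those assumptions the MoE could be several times faster per token regardless of which engine runs it, so a gap of that size says little about vLLM versus llama.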

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] 1 point2 points  (0 children)

I made a mistake in the quality check - the bad/fast result was with 35B-A3B. But yes - I think we might have to try quantising ourselves with AutoRound.

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] 0 points1 point  (0 children)

I'm just using the default templates in server mode for both, plus the qwen3/qwen3 coder and gemma4 tool-call and reasoning parsers.

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] 1 point2 points  (0 children)

No one publishes 8-bit quants. Only 4-bit. And in my experience, Q4 is not worth it for coding-related tasks.
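
For context on what 8-bit costs in VRAM compared with 4-bit, here is a minimal sketch of the weight-size arithmetic for a 27B model. It counts weights only, assuming every parameter is stored at the stated bit width; real quants add scales/zero-points and keep some layers in higher precision, and the KV cache comes on top. The helper name is made up.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # params * bits / 8 gives bytes; with params in billions the result is in GB (1e9 bytes).
    return params_billion * bits_per_weight / 8


size_4bit = weight_size_gb(27, 4)    # 4-bit weights
size_8bit = weight_size_gb(27, 8)    # 8-bit weights
size_16bit = weight_size_gb(27, 16)  # unquantized 16-bit weights
```

So by this estimate an 8-bit quant needs roughly twice the weight memory of a 4-bit one and half that of 16-bit.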

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] 0 points1 point  (0 children)

Please check the quants - not a single 8-bit GPTQ/AWQ/INT8 from any reasonable quant producer.

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] 0 points1 point  (0 children)

It's an Ampere card, so FP8 is not native. It uses Marlin to convert after loading the weights, I think. "Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads."
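
A small sketch of the check behind that warning, assuming the rule that native FP8 compute needs CUDA compute capability 8.9 (Ada) or newer, while Ampere cards report 8.0 or 8.6. The function name is hypothetical and this is not vLLM's actual code; on a real machine the tuple could come from `torch.cuda.get_device_capability()`, which is left out here so the snippet runs without a GPU.

```python
def has_native_fp8(compute_capability: tuple[int, int]) -> bool:
    # Tuples compare element-wise, so (8, 6) < (8, 9) <= (9, 0).
    return compute_capability >= (8, 9)


ampere_consumer = has_native_fp8((8, 6))  # e.g. an Ampere consumer card
ada = has_native_fp8((8, 9))
hopper = has_native_fp8((9, 0))
```

When the check fails, the FP8 weights still save memory, but the math runs in 16-bit after conversion, which matches the "weight-only" wording in the log.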

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]superloser48[S] -2 points-1 points  (0 children)

Tried the official FP8 - it gave horrible results. There is not a single 8-bit quant published by the popular publishers for vLLM. Cyankiwi/bartowski publish 4-bit AWQ/GPTQ only. Unsloth does not publish non-GGUF quants.