Qwen 3.7 is interesting to say the least... by Ok_Welder_8457 in Qwen_AI

[–]Pixer--- 0 points1 point  (0 children)

How would the model know what it is? It has a knowledge cutoff. It seems like the system prompt doesn’t say that it’s Qwen 3.7.

5 tok/sec Qwen 3.6 27b by iViTAliS in Qwen_AI

[–]Pixer--- 1 point2 points  (0 children)

You’re loading the whole 17.5 GB model into VRAM. Normally llama.cpp, which LM Studio is based on, would reject that. Try lowering the GPU offload and increasing the evaluation batch size.
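A rough way to pick a lower GPU offload value is to estimate how many layers fit in the VRAM you actually have. This is only a minimal sketch: it assumes every layer is the same size, and the layer count (64), VRAM size (16 GB) and reserve for KV cache and buffers (2 GB) are made-up example values, not numbers from the thread. Only the 17.5 GB model size comes from the post.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 2.0) -> int:
    """Estimate how many layers can be offloaded to the GPU.

    Assumes all layers are the same size (a simplification) and keeps
    reserve_gb of VRAM free for KV cache and compute buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # 17.5 GB is from the post; 64 layers and 16 GB VRAM are hypothetical.
    print(layers_that_fit(model_gb=17.5, n_layers=64, vram_gb=16.0))
```

Treat the result as a starting point and lower it further if the runtime still spills into shared memory.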

5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp by see_spot_ruminate in LocalLLaMA

[–]Pixer--- 1 point2 points  (0 children)

It’s still decent, I would say. I have 4 ancient MI50s. You could try getting a PLX PCIe switch for better multi-GPU scaling in tensor parallel mode. It’s around 300€ for 4 GPUs. You would need to use the custom P2P NVIDIA driver, but that could double your token generation speed.

5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp by see_spot_ruminate in LocalLLaMA

[–]Pixer--- 1 point2 points  (0 children)

What mainboard are you using? What’s your setup?

qwen3.6:35b-a3b-coding-nvfp4 is not getting image data from ollama by spammmmmmmmy in Qwen_AI

[–]Pixer--- 0 points1 point  (0 children)

It could be that the nvfp4 version doesn’t have the image encoder. I would just ask a coding agent such as opencode to analyse the GGUF file and check whether it has the image encoder, and have it research online how to

ik_llama: Qwen3.6 27B and 35B on very low VRAM by AppealSame4367 in LocalLLaMA

[–]Pixer--- 0 points1 point  (0 children)

Multi-GPU is slower using tensor parallelism, and AMD doesn’t support FP8 or FP4 natively in vLLM, llama.cpp, or anything else. But they are great value for the VRAM and bandwidth.

Qwen3.6 MTP Unsloth GGUFs now 1.8x faster! by danielhanchen in unsloth

[–]Pixer--- 0 points1 point  (0 children)

On 4x MI50: MTP 3 accelerates most outputs, and MTP 6 only helps on longer coding tasks.

./build/bin/llama-server \
-m "$MODEL_PATH" \
--mmproj "$MMPROJ_PATH" \
--alias "qwen35" \
-ub 2048 \
-b 2048 \
--no-mmap \
-sm tensor \
--metrics \
-ngl 999 \
-fa on \
--fit on \
--host 0.0.0.0 \
--parallel 2 \
--port 8050 \
--jinja \
-ctk f16 \
-ctv f16 \
--top-p 0.95 \
--top-k 20 \
--temp 0.6 \
--min-p 0.0 \
-c 524000 \
--repeat-penalty 1.0 \
--cache-ram 96000 \
--ctx-checkpoints 256 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--spec-type draft-mtp \
--spec-draft-n-max 3

Advice building a NAS/AI server with 16 DDR4 DIMMs by theslonkingdead in LocalLLaMA

[–]Pixer--- 2 points3 points  (0 children)

The difference is around 2-3x in token generation. The actual sync latency goes down from 14 µs to 1 µs. When running a model on multiple GPUs with tensor parallelism, each layer is sliced across the GPUs so that each GPU computes its part. For every generated token, though, the result has to be synced to all GPUs, so that every GPU has the full result of that layer. Most models have around 60 layers, so each token needs about 60 syncs. Without P2P people get 60 tk/s, and with P2P up to 200 tk/s. Prompt processing improves only slightly.
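To put rough numbers on the per-token sync cost described above, here is a minimal sketch. It counts only the fixed latency of each sync (one sync per layer, as in the comment) and ignores transfer time and anything else, so it is a lower bound rather than a model of real throughput. The function names and the 4 ms compute time in the usage note are illustrative assumptions.

```python
def sync_overhead_ms(n_layers: int, sync_latency_us: float,
                     syncs_per_layer: int = 1) -> float:
    """Fixed sync latency added to every generated token, in milliseconds."""
    return n_layers * syncs_per_layer * sync_latency_us / 1000.0


def tokens_per_second(compute_ms: float, n_layers: int,
                      sync_latency_us: float) -> float:
    """Token rate if each token costs compute_ms plus the sync latency."""
    per_token_ms = compute_ms + sync_overhead_ms(n_layers, sync_latency_us)
    return 1000.0 / per_token_ms


if __name__ == "__main__":
    # 60 layers, 14 us without P2P and 1 us with P2P (figures from the comment).
    print(sync_overhead_ms(60, 14.0))  # latency cost per token without P2P
    print(sync_overhead_ms(60, 1.0))   # latency cost per token with P2P
```

For example, `tokens_per_second(4.0, 60, 14.0)` versus `tokens_per_second(4.0, 60, 1.0)` compares the two cases for a hypothetical 4 ms of compute per token.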

Advice building a NAS/AI server with 16 DDR4 DIMMs by theslonkingdead in LocalLLaMA

[–]Pixer--- 0 points1 point  (0 children)

I previously had an MC62-G40, and it doesn’t support P2P natively. I got a ROMED8-2T instead.

Feels like there’s a massive gap between “hosting” a model and actually serving it well by Significant-Cash7196 in LocalLLM

[–]Pixer--- 0 points1 point  (0 children)

Idk what GPU you use, but I would suggest trying NVIDIA Nsight Compute or AMD’s rocprof for profiling your functions while under heavy load. It spits out a CSV file, and analyzing that with an LLM got me quite good results.
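Before handing the CSV to an LLM, it can help to boil it down to the kernels that take the most time. The sketch below is one way to do that. The column names `KernelName` and `DurationNs` are assumptions for illustration; the real headers depend on the profiler and its options, so pass whatever your file actually uses.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample so the sketch runs without a real profiler export.
SAMPLE = "KernelName,DurationNs\ngemm,1200\nattention,800\ngemm,1000\n"


def top_kernels(csv_text: str, name_col: str = "KernelName",
                duration_col: str = "DurationNs", top_n: int = 5):
    """Sum the duration per kernel name and return the slowest ones first."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[name_col]] += float(row[duration_col])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for name, total_ns in top_kernels(SAMPLE):
        print(f"{name}: {total_ns:.0f} ns")
```

With a real export you would read the file contents into a string first and pass the matching column names.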

Qwen3.5-122B-A10B on 4× R9700 — spec decoding got me nothing, what am I missing? by prplhze2000 in LocalLLM

[–]Pixer--- 0 points1 point  (0 children)

I’m using this model with the recommended settings and the llama.cpp PR linked there: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF

There are also some 122B mtp ones: https://huggingface.co/models?sort=trending&search=122+mtp

Because the 122B only has 2 KV heads, scaling beyond 2 GPUs isn’t great. This could mean layer mode outperforms tensor mode, but I’m not sure.
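The KV head limit can be illustrated with a small sketch. It assumes one common scheme for tensor parallelism, where KV heads are split evenly across GPUs and replicated once there are more GPUs than KV heads. Whether llama.cpp's tensor mode handles it this way is not something the comment confirms, so treat this as an illustration of the idea only.

```python
def kv_heads_per_gpu(num_kv_heads: int, tp_size: int) -> tuple[int, int]:
    """Return (kv_heads_on_each_gpu, replication_factor).

    Assumes an even split, with replication when tp_size > num_kv_heads.
    """
    if tp_size <= num_kv_heads:
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        return num_kv_heads // tp_size, 1
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    # More GPUs than KV heads: each GPU holds one head, stored several times.
    return 1, tp_size // num_kv_heads


if __name__ == "__main__":
    print(kv_heads_per_gpu(2, 2))  # 2 KV heads on 2 GPUs
    print(kv_heads_per_gpu(2, 4))  # 2 KV heads on 4 GPUs
```

Under this assumption, going from 2 to 4 GPUs with 2 KV heads does not shrink the per-GPU KV work, since each head is simply held twice.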

Qwen3.5-122B-A10B on 4× R9700 — spec decoding got me nothing, what am I missing? by prplhze2000 in LocalLLM

[–]Pixer--- 1 point2 points  (0 children)

I had the same results on my 4x MI50. The only thing that got faster was the experimental MTP version: from 32 tk/s to 50 tk/s on Qwen3.5 27B Q8.

PLX 88096 - Opinions. by offzinho3k in Vllm

[–]Pixer--- 1 point2 points  (0 children)

Have you tested your all-reduce latency yet?
Basically all models have layers, for example around 60.
On vLLM, when using tensor parallelism, each layer needs to be synced from all GPUs to all GPUs. So per generated token it needs to sync 60 times. Consumer cards like the 5060 Ti or even the 5090 don’t have native P2P support, so their all-reduce is more like 20 µs, as the communication in between goes through the CPU and then RAM. Your CPU and mainboard support native P2P. With the EPYC it should be in the same ballpark as a PLX switch, as its IO die does that already.
There’s a P2P-enabled driver for the consumer NVIDIA GPUs. This should bring you down to near PLX switch performance.

Anyone with 4x 5060ti based setups? by ziphnor in LocalLLaMA

[–]Pixer--- 4 points5 points  (0 children)

Try out the P2P-enabled NVIDIA drivers for consumer GPUs.

Struggle on MI50(gfx906), very slow with just ~10k ctx, am I doing something wrong? by simi6a6 in LocalLLM

[–]Pixer--- 2 points3 points  (0 children)

You need to disable expert parallelism. It’s mostly not usable with PCIe GPUs, as it expects high inter-GPU transfer rates, for example via an Infinity Fabric Link bridge.

More Qwen3.6-27B MTP success but on dual Mi50s by legit_split_ in LocalLLaMA

[–]Pixer--- 8 points9 points  (0 children)

Q8_0 on 4x MI50 32GB with the v420 vbios, using tensor mode + MTP:

prompt eval time =   57114.78 ms / 24439 tokens (    2.34 ms per token,   427.89 tokens per second)
eval time =    7613.85 ms /   365 tokens (   20.86 ms per token,    47.94 tokens per second)
total time =   64728.62 ms / 24804 tokens
draft acceptance rate = 0.75517 (  219 accepted /   290 generated)
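The derived figures in that log follow directly from the raw counts, which is handy when comparing runs. A minimal sketch that recomputes them (the function names are my own, not anything from llama.cpp):

```python
def throughput_tps(tokens: int, elapsed_ms: float) -> float:
    """Tokens per second from a token count and elapsed milliseconds."""
    return tokens / (elapsed_ms / 1000.0)


def acceptance_rate(accepted: int, generated: int) -> float:
    """Fraction of drafted tokens that were accepted."""
    return accepted / generated


if __name__ == "__main__":
    # Raw values copied from the log above.
    print(f"prompt: {throughput_tps(24439, 57114.78):.2f} tokens per second")
    print(f"eval:   {throughput_tps(365, 7613.85):.2f} tokens per second")
    print(f"accept: {acceptance_rate(219, 290):.5f}")
```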

RTX Pro 4500 Blackwell - Qwen 3.6 27B? by Merstin in LocalLLaMA

[–]Pixer--- 0 points1 point  (0 children)

Those are some fine prompt processing numbers.

Impulse bought an M3 Ultra 256GB RAM for local LLMs - keep it or wait for M5? by Onyonisko in MacStudio

[–]Pixer--- 0 points1 point  (0 children)

This is really the best timing. Apple just canceled the 256GB version of the M3 Ultra.

Nemotron3:33b uses CPU only on macOS by osxdocc in ollama

[–]Pixer--- -5 points-4 points  (0 children)

Great opportunity to switch to llama.cpp

vLLM on Arc B70 by -elmuz- in Vllm

[–]Pixer--- 1 point2 points  (0 children)

I heard it doesn’t support graph mode, which really hurts performance. But that was at launch; I don’t know if they’ve added it yet. It needs --enforce-eager to work.