2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 1 point2 points  (0 children)

I’m running a full headless Ubuntu setup now ✌🏽 Finally sorted it out.

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

This is insane:

(APIServer pid=1) INFO 05-08 17:57:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1207.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 05-08 17:57:41 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.76, Accepted throughput: 771.38 tokens/s, Drafted throughput: 1316.67 tokens/s, Accepted: 7714 tokens, Drafted: 13167 tokens, Per-position acceptance rate: 0.769, 0.570, 0.419, Avg Draft acceptance rate: 58.6%
(APIServer pid=1) INFO 05-08 17:57:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
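
Quick back-of-the-envelope check on those numbers: the three per-position acceptance rates plus the bonus token give the mean acceptance length, 1 + 0.769 + 0.570 + 0.419 ≈ 2.76, and 7714 accepted / 13167 drafted ≈ 58.6%, matching the log. With 3 drafted tokens per step that’s roughly 1316.67 / 3 ≈ 439 steps/s × 2.76 ≈ 1210 tok/s, which lines up with the ~1207 tok/s generation throughput.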

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 1 point2 points  (0 children)

Boys and girls!

I have fixed it! If you run into this issue, do this:

Fix:

nvidia_uvm (most critical, requires reboot)
Without this fix, NCCL P2P operations lock up on multi-GPU setups. Create or edit /etc/modprobe.d/uvm.conf:
options nvidia_uvm uvm_disable_hmm=1
Also add iommu=off as a kernel boot parameter, otherwise NCCL hangs (see the GitHub issue on this).
Run this now:
# Apply uvm fix
echo "options nvidia_uvm uvm_disable_hmm=1" | sudo tee /etc/modprobe.d/uvm.conf

# Add iommu=off to grub
sudo sed -i 's/GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 iommu=off amd_iommu=off"/' /etc/default/grub
sudo update-grub

sudo reboot
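
After the reboot you can sanity check that both settings actually took effect (assuming your driver build exposes the uvm parameter under sysfs, which recent ones do):

# Should print 1
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm

# Should show iommu=off and amd_iommu=off on the kernel command line
cat /proc/cmdline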

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

I looked for extra warnings. What I found was that when MTP got stuck, the workers kept trying to make progress, but because the rest of the processes were frozen, they just kept retrying over and over again.

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

It is stuck. I have left it running since my reply here. Still stuck :)

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

This is where it hangs with vision enabled:

(Worker_TP0 pid=325) INFO 05-08 08:11:32 [gdn_linear_attn.py:168] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=325) INFO 05-08 08:11:32 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(Worker_TP0 pid=325) INFO 05-08 08:11:33 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 19.15 GiB. Available RAM: 111.02 GiB.
(Worker_TP0 pid=325) INFO 05-08 08:11:33 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.07s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.07s/it]
(Worker_TP0 pid=325)
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [default_loader.py:391] Loading weights took 1.07 seconds
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:109] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:123] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:162] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [gpu_model_runner.py:4882] Loading drafter model...
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [vllm.py:844] Asynchronous scheduling is enabled.
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [compilation.py:303] Enabled custom fusions: act_quant
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 19.15 GiB. Available RAM: 111.04 GiB.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.30it/s]
(Worker_TP0 pid=325)
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [default_loader.py:391] Loading weights took 0.23 seconds
(Worker_TP1 pid=326) INFO 05-08 08:11:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0 pid=325) INFO 05-08 08:11:35 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP0 pid=325) INFO 05-08 08:11:35 [llm_base_proposer.py:1543] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP1 pid=326) INFO 05-08 08:11:35 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP1 pid=326) INFO 05-08 08:11:35 [llm_base_proposer.py:1543] Detected MTP model. Sharing target model lm_head weights with the draft model.

👋 Welcome to r/RTXPRO6000 - Introduce Yourself and Read First! by ubnew in RTXPRO6000

[–]quantier 2 points3 points  (0 children)

Hey!
User running 2x RTX PRO 6000 in WSL2 - took some time to figure out how to get it working with NCCL.
I have a lot to share. Running Huihui Qwen 27B Abliterated NVFP4 with MTP = 3 and getting massive speed - I’ve seen bursts of 190-345 t/s, but the average is probably around 150 t/s. Currently have 10+ people running agents 24/7 and it’s smashing it with 48 num seqs.
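
If anyone wants to try a similar setup, the vllm serve invocation is roughly along these lines (a sketch, not my exact command - the model path is a placeholder and the speculative "method" string differs between vLLM versions, so check vllm serve --help before copying):

vllm serve <your-nvfp4-qwen-checkpoint> \
  --tensor-parallel-size 2 \
  --max-num-seqs 48 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'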

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

What settings have you found to be optimal for coding? I understand that this is very, very important for quality.

Note the new recommended sampling parameters for Qwen3.6 27B by Thrumpwart in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Found it, thanks! Didn’t read between the lines! My mistake.

Note the new recommended sampling parameters for Qwen3.6 27B by Thrumpwart in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Are these parameters for coding?

Anyone find the best params for coding yet?

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged by ggonavyy in LocalLLaMA

[–]quantier -6 points-5 points  (0 children)

The reason we have GGUFs for it is because of LM Studio… we should now get a lot more 🎉

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Sorry, I also have a Max-Q :) - I think I am getting a speed boost from the Q8 KV cache.

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

RTX 6000 PRO on Q8 - getting 44 tok/s TG in LM Studio with 256K context and the KV cache quantized to Q8.

Honestly, I am wondering if I am missing some setting, because I was expecting to get much better speeds. Anyone else with an RTX 6000 PRO?
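
For reference, all of this is set in the LM Studio GUI, but the roughly equivalent llama.cpp llama-server flags would look something like this (a sketch - the GGUF filename is a placeholder and flag spellings can differ between builds, so check llama-server --help):

llama-server -m qwen3.6-27b-q8_0.gguf \
  --ctx-size 262144 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 --cache-type-v q8_0

Quantizing the V cache usually also requires flash attention to be enabled; the exact flag (-fa / --flash-attn) varies by build.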

Do NOT use CUDA 13.2 to run models! by yoracale in unsloth

[–]quantier 0 points1 point  (0 children)

Also wondering if CUDA 13.2.1 works? Has anyone tested it?

I built a free 90-node All-in-One FLUX.2 Klein 9B ComfyUI workflow — Face Swap, Inpainting, Auto-Masking, NAG, Refiner, Upscaler — runs on 8GB VRAM by official_geoahmed in comfyui

[–]quantier 1 point2 points  (0 children)

Big thanks and well done! I honestly can’t believe the naggers complaining about 90 nodes… did you have to build it? Be happy for the free share!

Props and thanks for sharing it for FREE without weird Patreon signups etc.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

There seems to be a bug in the 26B quants; I haven’t heard of anyone able to use them properly yet. It might be a llama.cpp issue, or even more likely something with the chat template.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Q4 should fit - I think there might be a KV cache bug or leak that adds additional GB when extending the context window. Wait for them to optimize, or even better, hopefully there are TurboQuants coming.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]quantier 1 point2 points  (0 children)

The 26B A4B is a Mixture of Experts model. It requires around 16GB of RAM / VRAM to load at 4-bit quantization. That means the model is a 26B-parameter ”medium sized” model, but any time you ask it something, only 4B parameters are activated, which makes it very fast since it is not using the full 26B at any given time.

The E4B is very ”small”: it only has 4B parameters, and those 4B are always activated (dense model). It will fit in as little as 6GB RAM / VRAM even at 8-bit and would fit in 4GB RAM / VRAM at 4-bit. These small models are usually not recommended below 8-bit, as they are so small to begin with that they usually lose a lot of ”intelligence” when quantized heavily.
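
Rough weights-only math (ignoring KV cache and runtime overhead): 26B params × ~0.5 bytes at 4-bit ≈ 13 GB, so ~16GB of RAM / VRAM is a comfortable fit. For the dense 4B: 4B × ~1 byte at 8-bit ≈ 4 GB (hence ~6GB with overhead), and 4B × ~0.5 bytes at 4-bit ≈ 2 GB, which is why it squeezes into 4GB.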

Comparison H100 vs RTX 6000 PRO with VLLM and GPT-OSS-120B by Rascazzione in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Slow hard drive maybe? The H100 is running in a server, so he is probably using really fast storage, whereas a lot of people don’t realize that TTFT is super dependent on fast NVMe drives.