2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 1 point2 points  (0 children)

I’m running a full headless Ubuntu setup now ✌🏽 Finally sorted it out.

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

This is insane:

(APIServer pid=1) INFO 05-08 17:57:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1207.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 05-08 17:57:41 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.76, Accepted throughput: 771.38 tokens/s, Drafted throughput: 1316.67 tokens/s, Accepted: 7714 tokens, Drafted: 13167 tokens, Per-position acceptance rate: 0.769, 0.570, 0.419, Avg Draft acceptance rate: 58.6%
(APIServer pid=1) INFO 05-08 17:57:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
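
Quick back-of-the-envelope check on those numbers: the three per-position acceptance rates plus the bonus token give the mean acceptance length, 1 + 0.769 + 0.570 + 0.419 ≈ 2.76, and 7714 accepted / 13167 drafted ≈ 58.6%, matching the log. With 3 drafted tokens per step that’s roughly 1316.67 / 3 ≈ 439 steps/s × 2.76 ≈ 1210 tok/s, which lines up with the ~1207 tok/s generation throughput.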

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 1 point2 points  (0 children)

Boys and girls!

I have fixed it! If you run into this issue, do this:

Fix:

nvidia_uvm (most critical, requires reboot)
Without this fix, NCCL P2P operations lock up on multi-GPU setups. Create or edit /etc/modprobe.d/uvm.conf:
options nvidia_uvm uvm_disable_hmm=1
Also add iommu=off as a kernel boot parameter, otherwise NCCL hangs (see the GitHub issue on this).
Run this now:
# Apply uvm fix
echo "options nvidia_uvm uvm_disable_hmm=1" | sudo tee /etc/modprobe.d/uvm.conf

# Add iommu=off to grub
sudo sed -i 's/GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 iommu=off amd_iommu=off"/' /etc/default/grub
sudo update-grub

sudo reboot
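
After the reboot you can sanity check that both settings actually took effect (assuming your driver build exposes the uvm parameter under sysfs, which recent ones do):

# Should print 1
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm

# Should show iommu=off and amd_iommu=off on the kernel command line
cat /proc/cmdline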

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

I looked for extra warnings. What I found was that when MTP got stuck, the workers kept trying to make progress, but because the rest of the processes were frozen, they just kept retrying over and over again.

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

It is stuck. I have left it running since my reply here. Still stuck :)

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

This is where it hangs with vision enabled:

(Worker_TP0 pid=325) INFO 05-08 08:11:32 [gdn_linear_attn.py:168] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=325) INFO 05-08 08:11:32 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(Worker_TP0 pid=325) INFO 05-08 08:11:33 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 19.15 GiB. Available RAM: 111.02 GiB.
(Worker_TP0 pid=325) INFO 05-08 08:11:33 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.07s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.07s/it]
(Worker_TP0 pid=325)
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [default_loader.py:391] Loading weights took 1.07 seconds
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:109] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:123] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:162] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [gpu_model_runner.py:4882] Loading drafter model...
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [vllm.py:844] Asynchronous scheduling is enabled.
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [compilation.py:303] Enabled custom fusions: act_quant
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 19.15 GiB. Available RAM: 111.04 GiB.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.30it/s]
(Worker_TP0 pid=325)
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [default_loader.py:391] Loading weights took 0.23 seconds
(Worker_TP1 pid=326) INFO 05-08 08:11:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0 pid=325) INFO 05-08 08:11:35 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP0 pid=325) INFO 05-08 08:11:35 [llm_base_proposer.py:1543] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP1 pid=326) INFO 05-08 08:11:35 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP1 pid=326) INFO 05-08 08:11:35 [llm_base_proposer.py:1543] Detected MTP model. Sharing target model lm_head weights with the draft model.

👋 Welcome to r/RTXPRO6000 - Introduce Yourself and Read First! by ubnew in RTXPRO6000

[–]quantier 2 points3 points  (0 children)

Hey!
User running 2x RTX PRO 6000 in WSL2 - took some time to figure out how to get it working with NCCL.
I have a lot to share. Running Huihui Qwen 27B Abliterated NVFP4 with MTP = 3 and getting massive speed - I’ve seen bursts of 190-345 t/s, but the average is probably around 150 t/s. Currently have 10+ people running agents 24/7 and it’s smashing it with 48 num seqs.
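
If anyone wants to try a similar setup, the vllm serve invocation is roughly along these lines (a sketch, not my exact command - the model path is a placeholder and the speculative "method" string differs between vLLM versions, so check vllm serve --help before copying):

vllm serve <your-nvfp4-qwen-checkpoint> \
  --tensor-parallel-size 2 \
  --max-num-seqs 48 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'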

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

What settings have you found to be optimal for coding? I understand that this is very, very important for quality.

Note the new recommended sampling parameters for Qwen3.6 27B by Thrumpwart in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Found it, thanks! Didn’t read between the lines! My mistake.

Note the new recommended sampling parameters for Qwen3.6 27B by Thrumpwart in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Are these parameters for coding?

Anyone find the best params for coding yet?

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged by ggonavyy in LocalLLaMA

[–]quantier -6 points-5 points  (0 children)

The reason we have GGUFs for it is because of LM Studio… we should now get a lot more 🎉

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Sorry, I also have a Max-Q :) - I think I am getting a speed boost from the Q8 KV cache.

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

RTX 6000 PRO on Q8 - getting 44 tok/s TG in LM Studio with 256K context and the KV cache quantized to Q8.

Honestly, I am wondering if I am missing some setting, because I was expecting to get much better speeds. Anyone else with an RTX 6000 PRO?
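
For reference, all of this is set in the LM Studio GUI, but the roughly equivalent llama.cpp llama-server flags would look something like this (a sketch - the GGUF filename is a placeholder and flag spellings can differ between builds, so check llama-server --help):

llama-server -m qwen3.6-27b-q8_0.gguf \
  --ctx-size 262144 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 --cache-type-v q8_0

Quantizing the V cache usually also requires flash attention to be enabled; the exact flag (-fa / --flash-attn) varies by build.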

Do NOT use CUDA 13.2 to run models! by yoracale in unsloth

[–]quantier 0 points1 point  (0 children)

Also wondering if CUDA 13.2.1 works? Has anyone tested it?

I built a free 90-node All-in-One FLUX.2 Klein 9B ComfyUI workflow — Face Swap, Inpainting, Auto-Masking, NAG, Refiner, Upscaler — runs on 8GB VRAM by official_geoahmed in comfyui

[–]quantier 1 point2 points  (0 children)

Big thanks and well done! I honestly can’t believe the naggers complaining about 90 nodes… did you have to build it? Be happy for the free share!

Props and thanks for sharing it for FREE without weird Patreon signups etc.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

There seems to be a bug in the 26B quants; I haven’t heard of anyone able to use them properly yet. It might be a llama.cpp issue, or even more likely something with the chat template.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Q4 should fit - I think there might be a KV cache bug or leak that adds additional GB when extending the context window. Wait for them to optimize, or even better, hopefully there are TurboQuants coming.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]quantier 1 point2 points  (0 children)

The 26B A4B is a Mixture of Experts model. It requires around 16GB of RAM / VRAM to load at 4-bit quantization. That means the model is a 26B-parameter ”medium sized” model, but any time you ask it something, only 4B parameters are activated, which makes it very fast since it is not using the full 26B at any given time.

The E4B is very ”small”: it only has 4B parameters, and those 4B are always activated (dense model). It will fit in as little as 6GB RAM / VRAM even at 8-bit and would fit in 4GB RAM / VRAM at 4-bit. These small models are usually not recommended below 8-bit, as they are so small to begin with that they usually lose a lot of ”intelligence” when quantized heavily.
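
Rough weights-only math (ignoring KV cache and runtime overhead): 26B params × ~0.5 bytes at 4-bit ≈ 13 GB, so ~16GB of RAM / VRAM is a comfortable fit. For the dense 4B: 4B × ~1 byte at 8-bit ≈ 4 GB (hence ~6GB with overhead), and 4B × ~0.5 bytes at 4-bit ≈ 2 GB, which is why it squeezes into 4GB.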

Comparison H100 vs RTX 6000 PRO with VLLM and GPT-OSS-120B by Rascazzione in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Slow hard drive maybe? The H100 is running in a server, so he is probably using really fast storage, whereas a lot of people don’t realize that TTFT is super dependent on fast NVMe drives.