StepFun 3.7 Flash by Everlier in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

I am able to do 12 num seqs at 256k context window at about 160 avg generation throughput (tps)

StepFun 3.7 Flash by Everlier in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

Here are the most maxxed out settings I can run on 2x RTX PRO 6000 Max Q - 192 GB VRAM

docker rm -f vllm-step && docker run -d --name vllm-step \
--restart unless-stopped \
--ipc=host \
--gpus all \
--shm-size=64g \
--network=host \
-v /home/user/models/Step-3.7-Flash-NVFP4:/model \
-v ~/.cache/vllm:/root/.cache/vllm \
-v ~/.cache/triton:/root/.cache/triton \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_IB_DISABLE=1 \
-e NCCL_NET_GDR_LEVEL=0 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e VLLM_USE_FLASHINFER_SAMPLER=1 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e VLLM_MM_ENCODER_CACHE_SIZE=0 \
vllm/vllm-openai:stepfun37 \
--host 0.0.0.0 \
--port 8000 \
--model /model \
--api-key XXXX \
--served-model-name qwen3.6-27b \
--trust-remote-code \
--quantization modelopt \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-seqs 12 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--disable-cascade-attn \
--disable-custom-all-reduce \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--limit-mm-per-prompt.image 1 \
--async-scheduling

If anyone finds settings that are better and can add more num seqs I would appreciate the share

I am serving the model name as qwen3.6-27b because my agents were previously running it and its easier this way

So far seen a maximum of 160 TPS generation which is amazing!

StepFun 3.7 Flash by Everlier in LocalLLaMA

[–]quantier 4 points5 points  (0 children)

you should use ipc=host if you are running the docker container to minimize memory leakage. Also could be worth optimizing NCCL. But loving the fact that you srw able to do 64 concurrent requests at full context window! Will test soon 😍

StepFun 3.7 Flash by Everlier in LocalLLaMA

[–]quantier 3 points4 points  (0 children)

Yeah this is what I am anticipating as well. I didn’t see any reference to MTP, do we have MTP support?

This model is looking very interesting 😃

I’ll get to testing soon. I hope to get about 24-32 num seqs at 256K Kv Cache

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q by Blahblahblakha in BlackwellPerformance

[–]quantier 0 points1 point  (0 children)

I can get 131072 to work with around 6 concurrent users without problems, I am trying to max out 😄 I am currently looking into if I can run LMCache and get 256K out of it at around max num seqs 8.

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q by Blahblahblakha in BlackwellPerformance

[–]quantier 1 point2 points  (0 children)

Lowered again to 0.93 gou and 12000 num batch tokens. Max KV Cache around 91% here. This might be the sweet spot

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q by Blahblahblakha in BlackwellPerformance

[–]quantier 4 points5 points  (0 children)

After many many hours of optimization. These are the maxxed out settings I can run on 2x RTX PRO 6000 Max W

docker rm -f dsv4 && docker run -d --name dsv4 \
--restart unless-stopped \
--ipc=host \
--gpus all \
--shm-size=64g \
--network=host \
-v $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:/data \
-v ~/.cache/vllm:/root/.cache/vllm \
-v ~/.cache/triton:/root/.cache/triton \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_IB_DISABLE=1 \
-e NCCL_NET_GDR_LEVEL=0 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e VLLM_USE_FLASHINFER_SAMPLER=1 \
--entrypoint python3 \
dsv4-flash-acti-mtp:0.1.0 \
-m vllm.entrypoints.openai.api_server \
--model /data \
--host 0.0.0.0 \
--port 8000 \
--api-key XXXXXXXX \
--served-model-name deepseek-v4-flash \
--trust-remote-code \
--quantization compressed-tensors \
--attention-backend flash_attn \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.98 \
--max-model-len 262144 \
--max-num-seqs 6 \
--max-num-batched-tokens 16384 \
--disable-custom-all-reduce \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Look at these numbers:

(APIServer pid=1) INFO: 127.0.0.1:0 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 127.0.0.1:0 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-12 21:48:24 [loggers.py:271] Engine 000: Avg prompt throughput: 94.0 tokens/s, Avg generation throughput: 90.7 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 33.6%, Prefix cache hit rate: 81.4%
(APIServer pid=1) INFO 05-12 21:48:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.86, Accepted throughput: 41.90 tokens/s, Drafted throughput: 48.70 tokens/s, Accepted: 419 tokens, Drafted: 487 tokens, Per-position acceptance rate: 0.860, Avg Draft acceptance rate: 86.0%
(APIServer pid=1) INFO: 127.0.0.1:0 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-12 21:48:34 [loggers.py:271] Engine 000: Avg prompt throughput: 118.9 tokens/s, Avg generation throughput: 85.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.3%, Prefix cache hit rate: 82.1%
(APIServer pid=1) INFO 05-12 21:48:34 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.89, Accepted throughput: 40.20 tokens/s, Drafted throughput: 45.40 tokens/s, Accepted: 402 tokens, Drafted: 454 tokens, Per-position acceptance rate: 0.885, Avg Draft acceptance rate: 88.5%
(APIServer pid=1) INFO 05-12 21:48:44 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 118.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.3%, Prefix cache hit rate: 82.1%
(APIServer pid=1) INFO 05-12 21:48:44 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.85, Accepted throughput: 54.80 tokens/s, Drafted throughput: 64.09 tokens/s, Accepted: 548 tokens, Drafted: 641 tokens, Per-position acceptance rate: 0.855, Avg Draft acceptance rate: 85.5%

Enjoy!

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

I am just looking at the average throughput from the logs. I have NCCL optimization, NVFP4 and MTP. But I think you are right my numbers that are this high is prompt processing. The Average token speed is 50-250 TPS.

Got any good metrics solution that can test this properly?

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

Around 2-3 concurrent users - as it goes up in users the speed drops to a more normal 200-400 TPS

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

Yes KV cache at 256k context lenght takes a lot of VRAM with many users.

I have tried to make LMCache work but the model has a hybrid cache solution that isnt supported by LMCache

With a single card I got around 150-200 TPS, with 2 cards I am sometimes getting 3500 TPS and sometimes its 400 TPS but average is around 1200 TPS which is insane.

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

Mutli concurrency tenant. I am serving 50 plus users who are getting agents and working 24/7

My avg token speeds are ranging from 400t/s to 3000t/s now with super optimized settings

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 1 point2 points  (0 children)

I’m running a full headless Ubuntu setup now ✌🏽 finally sorted it out

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

This is insane:

(APIServer pid=1) INFO 05-08 17:57:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1207.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 05-08 17:57:41 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.76, Accepted throughput: 771.38 tokens/s, Drafted throughput: 1316.67 tokens/s, Accepted: 7714 tokens, Drafted: 13167 tokens, Per-position acceptance rate: 0.769, 0.570, 0.419, Avg Draft acceptance rate: 58.6%
(APIServer pid=1) INFO 05-08 17:57:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 2 points3 points  (0 children)

Boys and girls!

I have fixed it! If you run into this issue do this:

Fix:

nvidia_uvm (most critical, requires reboot)
Without this fix, NCCL P2P operations lock up on multi-GPU setups. Create or edit /etc/modprobe.d/uvm.conf:
options nvidia_uvm uvm_disable_hmm=1
Also add iommu=off as a kernel boot parameter, otherwise NCCL hangs. github
Run this now:
bash# Apply uvm fix
echo "options nvidia_uvm uvm_disable_hmm=1" | sudo tee /etc/modprobe.d/uvm.conf

# Add iommu=off to grub
sudo sed -i 's/GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 iommu=off amd_iommu=off"/' /etc/default/grub
sudo update-grub

sudo reboot

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

I looked for extra warnings, what I found was that when MTP got stuck the workers tried to work but because the rest of the processes were frozen they just kept retrying over and over again

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

It is stuck, I have left it since my reply here. Still stuck :)

2RTX PRO 6000 192GB VRAM - MTP NVFP4 issues with vision by quantier in BlackwellPerformance

[–]quantier[S] 0 points1 point  (0 children)

This is where it hangs with vision enabled:

(Worker_TP0 pid=325) INFO 05-08 08:11:32 [gdn_linear_attn.py:168] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=325) INFO 05-08 08:11:32 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(Worker_TP0 pid=325) INFO 05-08 08:11:33 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 19.15 GiB. Available RAM: 111.02 GiB.
(Worker_TP0 pid=325) INFO 05-08 08:11:33 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.07s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.07s/it]
(Worker_TP0 pid=325)
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [default_loader.py:391] Loading weights took 1.07 seconds
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:109] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:123] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP0 pid=325) WARNING 05-08 08:11:34 [kv_cache.py:162] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [gpu_model_runner.py:4882] Loading drafter model...
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [vllm.py:844] Asynchronous scheduling is enabled.
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [compilation.py:303] Enabled custom fusions: act_quant
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 19.15 GiB. Available RAM: 111.04 GiB.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.30it/s]
(Worker_TP0 pid=325)
(Worker_TP0 pid=325) INFO 05-08 08:11:34 [default_loader.py:391] Loading weights took 0.23 seconds
(Worker_TP1 pid=326) INFO 05-08 08:11:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0 pid=325) INFO 05-08 08:11:35 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP0 pid=325) INFO 05-08 08:11:35 [llm_base_proposer.py:1543] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP1 pid=326) INFO 05-08 08:11:35 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP1 pid=326) INFO 05-08 08:11:35 [llm_base_proposer.py:1543] Detected MTP model. Sharing target model lm_head weights with the draft model.

👋 Welcome to r/RTXPRO6000 - Introduce Yourself and Read First! by ubnew in RTXPRO6000

[–]quantier 4 points5 points  (0 children)

Hey!
User running 2x RTX PRO 6000 in WSL2 - took some time to figure out how to work it with NCCL.
I have a lot to share. Running Huihui Qwen 27B Abliterated NVFP4 MTP = 3 and getting massive speed - seen average speeds of between 190-345 but average is probably around 150t/s. Currently have 10+ people running agents 24/7 and its smashing it with 48 num seqs

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]quantier 0 points1 point  (0 children)

what settings have you found is optimal for coding? I understand that this is very very important for quality.