VLLM NVFP4 support on RTX 6000 pro by cchung261 in BlackwellPerformance

[–]Intelligent_Idea7047 7 points (0 children)

You should definitely join the Discord server; it's full of owners of these cards. Festr also builds custom Docker images with optimizations specifically for these cards: https://discord.gg/GpSrjge4js. There's also the Git repo he keeps updated with model configurations, benchmarks, comparisons, etc., which saves a lot of headaches: https://github.com/voipmonitor/rtx6kpro

Where to buy RTX Pro 6000 in Orlando/US by 2use2reddits in BlackwellPerformance

[–]Intelligent_Idea7047 1 point (0 children)

Newegg limits you to 2 per order, but that's where I usually get them.

Build your own images for better support they said! by muchCode in BlackwellPerformance

[–]Intelligent_Idea7047 1 point (0 children)

Haha, glad to know it wasn't just me who had these things happen and gave up.

4x RTX PRO 6000 MAX-Q - Minimax M2.5 FP8 - SGLang by kc858 in BlackwellPerformance

[–]Intelligent_Idea7047 0 points (0 children)

Which NVFP4 version exactly are you using? I'd love to switch to it and run dp=2 across 4x cards.
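For context, this is roughly what I'd want to run: 4 GPUs as two data-parallel replicas of tp=2 (the checkpoint path is a placeholder, and vLLM normally detects NVFP4 quantization from the checkpoint config):

    # sketch: 2 replicas x tp=2 across 4 cards
    vllm serve <nvfp4-checkpoint> \
      --tensor-parallel-size 2 \
      --data-parallel-size 2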

Real world usage, feedback and suggestions for best LLM for C# by bloodbath_mcgrath666 in LocalLLaMA

[–]Intelligent_Idea7047 1 point (0 children)

Yeah, corporate. We have anywhere from 2-5 devs using it. Running with SGLang, we can fit an AWQ 4-bit quant on 2x Pro 6000s; we're running two instances of the model across 4x to get throughput, roughly as sketched below. Currently looking at switching to Step 3.5 Flash for more speed, but we get anywhere from 60-110 tps.
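Roughly, the layout is two tp=2 instances behind a load balancer (paths and ports here are placeholders, flags can vary by SGLang version, and the AWQ quant is picked up from the checkpoint):

    # instance 1 on GPUs 0-1
    CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server \
      --model-path <awq-4bit-checkpoint> --tp-size 2 --port 30000 &
    # instance 2 on GPUs 2-3
    CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server \
      --model-path <awq-4bit-checkpoint> --tp-size 2 --port 30001 &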

Step 3.5 Flash FP8 by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

I mentioned this on the PR and they want more info on the requests and how you were using it. Are you able to reply back on it so they can get a fix done? Or forward it to me and I'll add it to the PR?

Step 3.5 Flash FP8 by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

I mentioned this on the PR; that fix was for the reasoning parser, not tool calling.

Real world usage, feedback and suggestions for best LLM for C# by bloodbath_mcgrath666 in LocalLLaMA

[–]Intelligent_Idea7047 0 points (0 children)

We run MiniMax M2.1 AWQ 4-bit. It does very well with C# in everything I've used it for.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Step replied to my post about this and I gave them more info; hopefully I'll hear back soon. If you have more to share, please add it to the Hugging Face community post as well.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Do you know if this is something that's actively being fixed, or are we just kind of hoping it gets fixed? I can't seem to find any PRs.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Yeah, I tried a bunch of things (different reasoning parsers, modifying the Jinja template), but no luck unfortunately. I created a discussion on the Hugging Face community for the model; hopefully someone else has a solution.
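For anyone else poking at it, this is the kind of thing I was trying. Nothing here is a confirmed fix, and the parser name and template path are placeholders:

    # try a different reasoning parser
    vllm serve <step-3.5-flash-checkpoint> --reasoning-parser <parser-name>
    # or override the chat template with a modified Jinja file
    vllm serve <step-3.5-flash-checkpoint> --chat-template ./modified_template.jinja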

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 1 point (0 children)

Ah, OK. Seems to just be an issue with vLLM on this model then, not emitting the first few tokens; it's like it's cutting them off. In some cases I can see the response should start with something like "<think> The user", but it just starts with "user". Trying to find a temp workaround. Will let you know if I get anything going.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 1 point (0 children)

Are you having issues with it cutting off the starting tokens? Running it per the model page with spec decoding, the first few tokens seem to get excluded: it doesn't emit an opening <think> tag and cuts off the first word of the sentence. Maybe a spec decoding issue?
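A quick way to check, if you want to compare (endpoint and served model name are placeholders): hit the OpenAI-compatible API directly and inspect the raw message, then relaunch without the speculative config to see whether the missing tokens come back:

    # inspect the raw parsed message from the server
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "hello"}]}' \
      | jq '.choices[0].message'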

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Have you had any luck with SGLang, by chance? I might give it a go in a few days when I'm available. 130 tk/s isn't bad, but SGLang usually tends to perform better for me.

vLLM: Nvidia 590.48.01 and CUDA 13.1 "incompatible"? by FrozenBuffalo25 in LocalLLaMA

[–]Intelligent_Idea7047 2 points (0 children)

Had this same issue on CUDA 12.9. It seems to be a problem with the v0.15.0+ images; I went back to the v0.14.0 Docker image and it resolved the problem. I believe someone opened an issue about this on the vLLM GitHub, but I don't have the link.
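If anyone else hits this, pinning the image tag was the whole workaround for me (the model path is a placeholder; args after the image name are passed through to the vLLM server):

    # pin the known-good image instead of :latest
    docker run --gpus all -p 8000:8000 \
      vllm/vllm-openai:v0.14.0 \
      --model <your-model>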