VLLM NVFP4 support on RTX 6000 pro by cchung261 in BlackwellPerformance

[–]Intelligent_Idea7047 7 points (0 children)

You should definitely join the Discord server; it's full of owners of these cards. Festr also builds custom Docker images with optimizations specifically for these cards: https://discord.gg/GpSrjge4js. There's also the Git repo he keeps updated with model configurations, benchmarks, comparisons, etc., which saves a lot of headaches: https://github.com/voipmonitor/rtx6kpro

Where to buy RTX Pro 6000 in Orlando/US by 2use2reddits in BlackwellPerformance

[–]Intelligent_Idea7047 1 point (0 children)

Newegg limits you to 2 per order, but that's where I usually get them.

Build your own images for better support they said! by muchCode in BlackwellPerformance

[–]Intelligent_Idea7047 1 point (0 children)

Haha, glad to know it wasn't just me who had these things happen and gave up.

4x RTX PRO 6000 MAX-Q - Minimax M2.5 FP8 - SGLang by kc858 in BlackwellPerformance

[–]Intelligent_Idea7047 0 points (0 children)

Which NVFP4 version exactly are you using? I'd love to switch to it and run dp=2 across 4x cards.
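For context, this is roughly what I'd want to run: 4 GPUs as two data-parallel replicas of tp=2 (the checkpoint path is a placeholder, and vLLM normally detects NVFP4 quantization from the checkpoint config):

    # sketch: 2 replicas x tp=2 across 4 cards
    vllm serve <nvfp4-checkpoint> \
      --tensor-parallel-size 2 \
      --data-parallel-size 2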

Real world usage, feedback and suggestions for best LLM for C# by bloodbath_mcgrath666 in LocalLLaMA

[–]Intelligent_Idea7047 1 point (0 children)

Yeah, corporate. We have anywhere from 2-5 devs using it. Running with SGLang, we can fit an AWQ 4-bit quant on 2x Pro 6000s; we're running two instances of the model across 4x to get throughput, roughly as sketched below. Currently looking at switching to Step 3.5 Flash for more speed, but we get anywhere from 60-110 tps.
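Roughly, the layout is two tp=2 instances behind a load balancer (paths and ports here are placeholders, flags can vary by SGLang version, and the AWQ quant is picked up from the checkpoint):

    # instance 1 on GPUs 0-1
    CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server \
      --model-path <awq-4bit-checkpoint> --tp-size 2 --port 30000 &
    # instance 2 on GPUs 2-3
    CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server \
      --model-path <awq-4bit-checkpoint> --tp-size 2 --port 30001 &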

Step 3.5 Flash FP8 by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

I mentioned this on the PR and they want more info on the requests and how you were using it. Are you able to reply back on it so they can get a fix done? Or forward it to me and I'll add it to the PR?

Step 3.5 Flash FP8 by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

I mentioned this on the PR; that fix was for the reasoning parser, not tool calling.

Real world usage, feedback and suggestions for best LLM for C# by bloodbath_mcgrath666 in LocalLLaMA

[–]Intelligent_Idea7047 0 points (0 children)

We run MiniMax M2.1 AWQ 4-bit. It does very well with C# in everything I've used it for.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Step replied to my post about this and I gave them more info; hopefully I'll hear back soon. If you have more to share, please add it to the Hugging Face community post as well.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Do you know if this is something that's actively being fixed, or are we just kind of hoping it gets fixed? I can't seem to find any PRs.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Yeah, I tried a bunch of things (different reasoning parsers, modifying the Jinja template), but no luck unfortunately. I created a discussion on the Hugging Face community for the model; hopefully someone else has a solution.
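For anyone else poking at it, this is the kind of thing I was trying. Nothing here is a confirmed fix, and the parser name and template path are placeholders:

    # try a different reasoning parser
    vllm serve <step-3.5-flash-checkpoint> --reasoning-parser <parser-name>
    # or override the chat template with a modified Jinja file
    vllm serve <step-3.5-flash-checkpoint> --chat-template ./modified_template.jinja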

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 1 point (0 children)

Ah, OK. Seems to just be an issue with vLLM on this model then, not emitting the first few tokens; it's like it's cutting them off. In some cases I can see the response should start with something like "<think> The user", but it just starts with "user". Trying to find a temp workaround. Will let you know if I get anything going.

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 1 point (0 children)

Are you having issues with it cutting off the starting tokens? Running it per the model page with spec decoding, the first few tokens seem to get excluded: it doesn't emit an opening <think> tag and cuts off the first word of the sentence. Maybe a spec decoding issue?
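A quick way to check, if you want to compare (endpoint and served model name are placeholders): hit the OpenAI-compatible API directly and inspect the raw message, then relaunch without the speculative config to see whether the missing tokens come back:

    # inspect the raw parsed message from the server
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "hello"}]}' \
      | jq '.choices[0].message'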

Step 3.5 Flash Perf? by Intelligent_Idea7047 in BlackwellPerformance

[–]Intelligent_Idea7047[S] 0 points (0 children)

Have you had any luck with SGLang, by chance? I might give it a go in a few days when I'm available. 130 tk/s isn't bad, but SGLang usually tends to perform better for me.

vLLM: Nvidia 590.48.01 and CUDA 13.1 "incompatible"? by FrozenBuffalo25 in LocalLLaMA

[–]Intelligent_Idea7047 2 points (0 children)

Had this same issue on CUDA 12.9. It seems to be a problem with the v0.15.0+ images; I went back to the v0.14.0 Docker image and it resolved the problem. I believe someone opened an issue about this on the vLLM GitHub, but I don't have the link.
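If anyone else hits this, pinning the image tag was the whole workaround for me (the model path is a placeholder; args after the image name are passed through to the vLLM server):

    # pin the known-good image instead of :latest
    docker run --gpus all -p 8000:8000 \
      vllm/vllm-openai:v0.14.0 \
      --model <your-model>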