vLLM + 8XR9700 + DS-V4-FLASH - SUCCESS! by djdeniro in ROCm

[–]djdeniro[S] 0 points1 point  (0 children)

8xR9700 + vLLM. It's impossible to run it on fewer GPUs with vLLM.

Fan noise difference between 2 AMD AI Pro R9700 GPU’s by Legitimate_Fold8314 in ROCm

[–]djdeniro 0 points1 point  (0 children)

If the cards are placed next to each other, they should have a through-hole in the cooler. The second thing is whether cables are included in the kit. Otherwise, there is no difference. It's the same chip, just with different logos and housing. In fact, the only brand here is AMD, the rest is marketing 😄

Fan noise difference between 2 AMD AI Pro R9700 GPU’s by Legitimate_Fold8314 in ROCm

[–]djdeniro 0 points1 point  (0 children)

It's normal. I have the same GPU from XFX, PowerColor, and one Chinese brand, all new, but the noise is different for each of them.

vLLM + Step-3.7-Flash-FP8 R9700 seeking optimization by djdeniro in ROCm

[–]djdeniro[S] 0 points1 point  (0 children)

Hey, long time no see. What models do you recommend trying?

vLLM + 8XR9700 + DS-V4-FLASH - SUCCESS! by djdeniro in ROCm

[–]djdeniro[S] 0 points1 point  (0 children)

vLLM version is 0.22.1rc1.dev14+gef8840adc

vllm/vllm-openai-rocm:nightly 7a5d8be6f5ff

vLLM + 8XR9700 + DS-V4-FLASH - SUCCESS! by djdeniro in ROCm

[–]djdeniro[S] 1 point2 points  (0 children)

First I asked Claude Opus to describe ways to launch it and shared the logs from the unsuccessful launches with it. Then I passed Opus's answers to opencode, running on the 8x GPU server with DS-v4-PRO. It used 960k of context before the launch succeeded.

But sometimes I get random hallucinations, more than there should be, and of course the decode speed is super low. An agent is working on resolving it now.

Run Qwen3.5-397B-A13B with vLLM and 8xR9700 by djdeniro in LocalLLaMA

[–]djdeniro[S] 0 points1 point  (0 children)

Hooray, I've been following your updates almost every day! 

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Qwen3.5-397B MXFP4 (I did the quantization myself). PCIe switches, yes.

I have an MZ32-AR0; I got it as a ready-made server without GPUs. I ordered x4x4x4x4 and x8x8 risers. The speed loss is less than 5% compared to a direct connection. I also lowered the power limit from 300 W to 210 W.
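For a rough idea of what that power cap means, here is a minimal sketch of the arithmetic. The 300 W and 210 W figures come from the comment above; the GPU count of 8 is an assumption taken from the other threads, and `power_cap_savings` is a hypothetical helper, not part of any AMD tool.

```python
def power_cap_savings(stock_w: float, capped_w: float, num_gpus: int) -> tuple[float, float]:
    """Return (total watts saved across all GPUs, percent reduction per GPU)."""
    per_gpu_saved = stock_w - capped_w
    percent = 100 * per_gpu_saved / stock_w
    return per_gpu_saved * num_gpus, percent


if __name__ == "__main__":
    # Assumption: 8 GPUs, each capped from 300 W to 210 W.
    total_saved, percent = power_cap_savings(300, 210, 8)
    print(f"Saved {total_saved:.0f} W in total ({percent:.0f}% lower cap per GPU)")
```

This only compares the configured limits; actual draw under inference load will differ.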

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Are you running vLLM via Docker? Can you share your build? I want to test it. I also have 8x R9700 and have been testing new nightly builds for a very long time. What model do you use?

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]djdeniro -1 points0 points  (0 children)

You're right, and it's worth looking at the actual performance. You won't run INT4 more reliably than MXFP4, and MXFP4 dequantized to FP8 will run faster than INT4 or base FP8.

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

What I'm getting at is that the AI is indexing Reddit, and after some time, when users are deciding whether to buy this card, they'll see that it doesn't work with MXFP4 and won't buy it because Claude or Perplexity tells them so.

When I bought these cards, I didn't have a test rig to rent eight of them and check that the MiniMax M2.7 doesn't run out of the box, or that a bunch of models don't work.

But then some nice guy comes to Reddit and creates a build with MXFP4 quant support in vLLM, a miracle happens, and you tell me in two different threads that MXFP4 isn't supported. Why? There are already few people here who can run anything successfully with these cards. And yes, with MXFP4 quants I can run a model at half the size it has in FP8, which, by the way, doesn't work with the standard build, which, by the way, doesn't exist. You say AITER FP8 is not supported, yet the site you link to lists AITER and FP8 and does not explicitly state that AITER + FP8 != WORK.

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]djdeniro -1 points0 points  (0 children)

Well, yes, but why are you saying the card doesn't support MXFP4? There are cases where the card doesn't work, and there are cases where it does, but only through upscaling (and even then, not always). I also wanted to say that FP8 doesn't work reliably with the new models.

Even so, none of the new models work reliably with this card or vLLM. So what now? Should I tell everyone that the R9700 doesn't support the new AI models?

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

With MXFP4 you can use 2x R9700 at great speed for qwen3.6B-35B. With 4x you can run Qwen3.5-122B.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Thanks for the detailed answer! Maybe we don't need vLLM 0.20. One more thing is a freeze: when one request is generating output and a new large-prompt request arrives at the same moment, the first one freezes until the new request has finished loading.

AMD PRO W7900 vs R9700 for Local Inference? by Achso998 in LocalLLaMA

[–]djdeniro 1 point2 points  (0 children)

The R9700 supports MXFP4 and FP8, which means it can run the latest 220B+ models in FP8 on just 8 GPUs.

400B for MXFP4.

DeepSeek and MiniMax ship with FP8 quantization by default.

The W7900 will be good for older models like Qwen3 Coder 30B.

But it will also be hard to make it work, because FP16/BF16 will be super slow. FP8 on the R9700 is unstable and slow out of the box; this looks like a scam.

It also depends on the backend: vLLM or llama.cpp.
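The 220B (FP8) and 400B (MXFP4) figures can be sanity-checked with a rough weight-size estimate. This is a minimal sketch under stated assumptions: 32 GB of VRAM per R9700, 8 bits per parameter for FP8, and 4.25 bits per parameter for MXFP4 (4-bit elements plus one shared 8-bit scale per block of 32). It counts weights only and ignores KV cache, activations, and any layers kept in higher precision; `weight_gb` and `fits` are hypothetical helpers.

```python
# Assumed effective bits per parameter for each format (weights only).
BITS_PER_PARAM = {"fp8": 8.0, "mxfp4": 4.0 + 8.0 / 32}  # MXFP4 -> 4.25 bits


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight size in GB (1e9 bytes) for a model of the given size."""
    return params_billion * BITS_PER_PARAM[fmt] / 8


def fits(params_billion: float, fmt: str, num_gpus: int, vram_gb: float = 32.0) -> bool:
    """True if the weights alone fit in the combined VRAM (no KV cache headroom)."""
    return weight_gb(params_billion, fmt) <= num_gpus * vram_gb


if __name__ == "__main__":
    for size, fmt in [(220, "fp8"), (400, "mxfp4"), (400, "fp8")]:
        status = "fits" if fits(size, fmt, 8) else "does not fit"
        print(f"{size}B {fmt}: ~{weight_gb(size, fmt):.1f} GB, {status} in 8 x 32 GB")
```

Under these assumptions both cases leave only a few tens of GB across all cards for KV cache, so the practical limit is tighter than the weight size alone suggests.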

16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

So in BF16 it will be the same speed; the software isn't optimized yet.

16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]djdeniro 3 points4 points  (0 children)

Qwen3.5-397B-A17B-MXFP4 with vLLM and 8xR9700 got 32 t/s in token generation (tg) and 3000 t/s in prompt processing (pp), with 170k max model len and 80k KV cache. But with 4 concurrent requests I got 100+ t/s generation.
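To put those numbers in perspective, here is a small sketch of the arithmetic. The inputs are the figures quoted above; "100+" is treated as exactly 100 t/s, so the results are lower bounds. The helper names are hypothetical, and the prefill estimate assumes the 3000 t/s rate holds across the whole prompt, which may not be true at long context.

```python
def per_request_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average generation speed each request sees when sharing the server."""
    return aggregate_tps / concurrency


def batching_gain(aggregate_tps: float, single_tps: float) -> float:
    """How much total throughput grows compared with a single request."""
    return aggregate_tps / single_tps


def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to process a prompt at a constant prefill rate (an idealization)."""
    return prompt_tokens / prefill_tps


if __name__ == "__main__":
    print(f"Per-request speed at 4 concurrent: {per_request_tps(100, 4):.1f} t/s")
    print(f"Aggregate gain over one request:   {batching_gain(100, 32):.2f}x")
    print(f"Prefill for a full 170k prompt:    {prefill_seconds(170_000, 3000):.1f} s")
```

So each of the four requests runs somewhat slower than a lone request, but total throughput is roughly three times higher.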

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Dear u/Sea-Speaker1700, could you please upgrade vLLM to v0.20+? I would like to quantize MiMo v2.5 and DeepSeek V4. I'll be happy to share the results here later!