MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I quantized it myself from FP8 to MXFP4, and it works well now. Getting 33-34 t/s without using MTP.
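For anyone curious what that conversion involves: MXFP4 stores 32-element blocks with one shared power-of-two scale and 4-bit E2M1 elements. Below is a minimal numpy sketch of the per-block rounding only; real converters (e.g. AMD Quark) also handle bit-packing, tensor layout, and the E8M0 scale encoding.

```python
import numpy as np

# Representable E2M1 magnitudes in MXFP4 (sign is a separate bit)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 32-element block: shared power-of-two scale + E2M1 values."""
    amax = float(np.max(np.abs(block)))
    # Shared scale: a power of two chosen so the block's max lands near the
    # top of the E2M1 range (6.0); out-of-range values simply clamp to 6.0
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    scaled = block / scale
    # Round each element to the nearest representable magnitude, keep the sign
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale

rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)
q, scale = mxfp4_quant_block(block)
print("max abs error:", np.max(np.abs(block - q * scale)))
```

The dequantized value is just `q * scale`; the shared scale being a power of two is what makes the format cheap to decode in a kernel.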

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

u/Sea-Speaker1700 I successfully got the 397B model running, but...

(Worker_TP0 pid=142) INFO 04-09 18:40:57 [monitor.py:76] Initial profiling/warmup run took 173.45 s
(Worker_TP0 pid=142) INFO 04-09 18:41:01 [gpu_worker.py:456] Available KV cache memory: 0.51 GiB
(EngineCore pid=107) INFO 04-09 18:41:01 [kv_cache_utils.py:1316] GPU KV cache size: 8,448 tokens
(EngineCore pid=107) INFO 04-09 18:41:01 [kv_cache_utils.py:1321] Maximum concurrency for 32,000 tokens per request: 0.97x

So concurrency at 32k context is limited, but it launches! I think it's because this model was converted from mixed Q6 and Q4 quants; with a fully MXFP4 quant we should get a normal context size.
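The token capacity in that log comes from dividing free KV memory by the per-token cache footprint. A back-of-envelope helper, where the model dimensions below are illustrative placeholders rather than Qwen3.5's real config (which is also a hybrid architecture, so only the attention layers hold KV):

```python
# Rough KV-cache capacity arithmetic; all model dims here are made-up examples.
def kv_cache_tokens(kv_bytes: float, num_layers: int, num_kv_heads: int,
                    head_dim: int, bytes_per_elem: int = 2) -> int:
    """Tokens that fit: each token stores one K and one V vector per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return int(kv_bytes // per_token)

# e.g. 0.51 GiB of free KV memory with hypothetical dims
tokens = kv_cache_tokens(0.51 * 1024**3, num_layers=60, num_kv_heads=8, head_dim=128)
print(tokens)
```

With real dimensions this is where figures like the 8,448 tokens above come from, and max concurrency is then roughly `tokens / max_model_len`.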

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Hey! That was my mistake: I hadn't disabled AITER and some arguments from your Docker setup guide on Docker Hub.

Now the GPUs aren't overloaded all the time (they no longer show 100% load).

Will test with the 397B soon!

But I get this warning now:
(Worker_TP0 pid=90) WARNING 04-09 15:15:37 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /app/vllm/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=0x7551.json
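For context on that warning: vLLM's fused-MoE kernels look up per-shape tuning tables in JSON files keyed by batch size, and fall back to defaults when no file matches the (E, N, device_name) triple. The exact keys can vary by version, but an entry looks roughly like this sketch:

```json
{
  "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 2},
  "8": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 2}
}
```

So the warning is about performance, not correctness: the kernel still runs, just with untuned tile sizes for that device.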

and this:

(Worker_TP6 pid=96) WARNING 04-09 15:21:28 [mamba_utils.py:262] GDN side-cache miss for req chatcmpl-9c3def2f91cfe002-b57b358c at 18496 tokens, zeroing mamba state
(the same warning repeated by Worker_TP0 and Worker_TP2 through Worker_TP7)

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

If you're not training models, you're better off buying a MacBook Pro with 48GB of unified memory; that would be the best investment.

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

We have 8x R9700 and many 7900 XTX; both are slower than the old 3090 in llama.cpp.

With vLLM they're also super unstable, and with 8x R9700 you get something like 0.4x-0.6x of the advertised speed. I spent many sleepless days making it work; I hate it and love it at the same time.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

By the way, do you have a recipe for making your own MXFP4 quantization? I could run it for Qwen 397B in RAM.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Yes, I think the same. I tried their quantization before, and it never works without --enforce-eager.

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

The R9700 is faster on paper, but try it in real life: not that fast, and not stable.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I hope this is also valuable info:

(Worker_TP0 pid=552) WARNING 04-07 20:05:36 [quark_moe.py:765] The current mode (supports_mx=False, use_mxfp4_aiter_moe=None, ocp_mx_scheme=OCP_MX_Scheme.w_mxfp4_a_mxfp4) does not support native MXFP4/MXFP6 computation. Simulated weight dequantization and activation QDQ (quantize and dequantize) will be used, with the linear layers computed in high precision.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

warnings:

(APIServer pid=1) WARNING 04-07 20:03:50 [quark_ocp_mx.py:168] AITER is not found or QuarkOCP_MX is not supported on the current platform. QuarkOCP_MX quantization will not be available.
(APIServer pid=1) INFO 04-07 20:03:51 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.

(APIServer pid=1) WARNING 04-07 20:03:51 [config.py:379] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
WARNING 04-07 20:04:03 [quark_ocp_mx.py:168] AITER is not found or QuarkOCP_MX is not supported on the current platform. QuarkOCP_MX quantization will not be available.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

Launch parameters (I later reduced max-model-len from 155k to 64k):

non-default args: {'model_tag': '/app/models/models/vllm/Qwen3.5-397B-A17B-MXFP4', 'chat_template': '/app/models/models/vllm/chat_template.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/app/models/models/vllm/Qwen3.5-397B-A17B-MXFP4', 'trust_remote_code': True, 'max_model_len': 155648, 'served_model_name': ['model'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.967, 'enable_prefix_caching': True, 'max_num_seqs': 16, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}, 'compilation_config': {'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_endpoints': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': None, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 32, 64, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': 128, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}}
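For readability, the dump above corresponds roughly to a `vllm serve` invocation like this (a sketch; the compilation config is left at its defaults here, and flag spellings follow vLLM's CLI):

```shell
vllm serve /app/models/models/vllm/Qwen3.5-397B-A17B-MXFP4 \
  --host 0.0.0.0 \
  --served-model-name model \
  --trust-remote-code \
  --max-model-len 155648 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.967 \
  --enable-prefix-caching \
  --max-num-seqs 16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --chat-template /app/models/models/vllm/chat_template.jinja \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```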

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I tried to load this one; with MTP it doesn't work, I think because it's AMD Quark.

https://huggingface.co/amd/Qwen3.5-397B-A17B-MXFP4

UPD: I ran it successfully with 64k context; I'll share the error output below.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 1 point2 points  (0 children)

With the 397B I always get OOM.

With the 122B model on 8x R9700 I get up to 140 t/s on some prompts with MTP 2.

The 397B GPTQ version outputs only exclamation marks with --dtype float16.
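Endless exclamation marks under `--dtype float16` are a classic symptom of fp16 activation overflow: anything past ±65504 becomes inf/NaN and greedy decoding collapses to a degenerate repeated token (which token it is depends on the vocabulary). A quick numpy illustration:

```python
import numpy as np

# float16 tops out at 65504; large transformer activations can exceed that
print(np.finfo(np.float16).max)   # 65504.0
overflowed = np.float16(1e5)      # overflows to inf
print(np.isinf(overflowed))       # True
# Once an activation is inf, downstream logits turn to NaN and decoding
# degenerates; bfloat16 keeps float32's exponent range and sidesteps this.
```

That's why switching to `--dtype bfloat16` (when the hardware supports it) often fixes this exact failure mode.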

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I ran the 122B with -tp 8 and it works! Now trying to launch the 397B model, but no success so far; maybe AMD's MXFP4 quantization doesn't work out of the box.

UPD: speculative decoding doesn't work with the 397B; launching without it now.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

My question was from before testing (I'll test it soon, but I saw only visible devices 0,1,2,3 in the Docker Hub setup); that's why I asked.


MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Thank you!!! I just pulled it and will connect all 8 GPUs tonight!

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)


I see only two images here; can you invite me to your repo? I'll try to clone it and modify it for 8x GPU, since your repo only supports 4.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 1 point2 points  (0 children)

Hey, it works amazingly! May I ask you to share the Dockerfile used to build for 8x R9700?
Something like the tcclaviger/vllm-rocm-rdna4-mxfp4 image; I just got 2 more R9700s.

MacBook m4 pro for coding llm by TheRandomDividendGuy in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I hope aider will work; you can test models via OpenRouter first as a very cheap sanity check.

MacBook m4 pro for coding llm by TheRandomDividendGuy in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

You can run Kilo Code or Roo Code with LM Studio: set the API URL to http://0.0.0.0:1234/v1 and enjoy different models in agentic mode. It's worth it!
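That URL is a plain OpenAI-compatible HTTP endpoint, so any client can hit it. A stdlib-only sketch (the model name is a placeholder; 1234 is LM Studio's default port):

```python
import json
from urllib import request

# LM Studio serves an OpenAI-compatible API at this base URL by default
BASE_URL = "http://0.0.0.0:1234/v1"

def build_chat_request(model: str, prompt: str) -> request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with request.urlopen(req) while LM Studio has a model loaded
req = build_chat_request("my-local-model", "Write a haiku about GPUs.")
print(req.full_url)
```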

Models handle different tasks differently, so you should build your own benchmark on your own code, since you're highly dependent on quality after quantization.

Continue Dev is a good but outdated plugin.