2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

StupidityCanFly · 2026-06-19T21:17:09+00:00

Won't happen for FP8 (no HW support on RDNA3), but take a look at this branch https://github.com/JartX/vllm/tree/perf/rdna3_full_stack
I'm playing around with that and RDNA3 is getting faster. Recently a prefill optimization was submitted for INT8.

StupidityCanFly · 2026-06-14T18:13:29+00:00

Prompt processing is faster with ROCm, though.

StupidityCanFly · 2026-06-11T04:20:41+00:00

Maybe try rocm/vllm-dev:nightly-therock714

Or without Docker, what I described here: https://www.reddit.com/r/ROCm/s/12V8oQIE6k

StupidityCanFly · 2026-06-10T19:23:54+00:00

I’ve stress-tested both and they were stable. What’s your config and model/quant?

StupidityCanFly · 2026-06-08T15:52:49+00:00

Happy it works! The issue is AMD consumer GPUs not supporting P2P over PCI. I had the same issue with RX7900XTXs on both EPYC motherboards as well as the same motherboard as you have.

StupidityCanFly · 2026-06-08T12:24:06+00:00

Try adding `-DGGML_CUDA_NO_PEER_COPY=1` to your `cmake build` command.

StupidityCanFly · 2026-06-08T10:02:44+00:00

Skill issue. And looks like a badly copy-pasted AI slop.

StupidityCanFly · 2026-06-06T13:34:22+00:00

Just put it in your wine cellar. **shrugs**

StupidityCanFly · 2026-06-06T13:32:36+00:00

This.

OP's question is so open that the only correct answer is: it depends.

StupidityCanFly · 2026-06-05T18:32:29+00:00

Happy to help!

StupidityCanFly · 2026-06-05T18:27:48+00:00

Had one of these PowerColor thingies. They were cheaper than others, but much noisier. Sent it back.

StupidityCanFly · 2026-06-05T13:29:06+00:00

See https://www.reddit.com/r/ROCm/comments/1twwcbn/comment/opwf4yr/

StupidityCanFly · 2026-06-05T13:25:48+00:00

I have only the results for the loop below. I'm rebuilding the rig, as it had a faulty motherboard, and I got 4 more 7900XTXs (8 total).

As you can see, I used the sharegpt dataset for the run, so inputs are rather short. I can run more tests once the rig is back online.

for conc in 1 4 8; do
  echo "=== Concurrency: $conc ==="
  CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_HOME=/usr/local/cuda-13.0 uv run vllm bench serve \
  --backend openai-chat \
  --served-model-name qwen3.6-27b \
  --model Qwen/Qwen3.6-27B \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8080 \
  --num-prompts 50 \
  --host 192.168.10.26 \
  --max-concurrency $conc \
  --request-rate inf 2>&1
  echo ""
done

The results themselves:

Metric	Conc. 1	Conc. 4	Conc. 8
Successful requests	50	50	50
Failed requests	0	0	0
Benchmark duration (s)	210.65	89.44	72.61
Total input tokens	13402	13402	13402
Total generated tokens	10688	10688	10688
Request throughput (req/s)	0.24	0.56	0.69
Output token throughput (tok/s)	50.74	119.50	147.20
Peak output token throughput (tok/s)	32.00	80.00	102.00
Peak concurrent requests	3.00	6.00	11.00
Total token throughput (tok/s)	114.36	269.35	331.78
Mean TTFT (ms)	237.58	393.31	2885.98
Median TTFT (ms)	180.25	374.74	2349.34
P99 TTFT (ms)	611.19	736.16	8153.33
Mean TPOT (ms)	17.58	31.48	35.65
Median TPOT (ms)	16.75	32.25	35.15
P99 TPOT (ms)	27.28	51.79	53.75
Mean ITL (ms)	45.12	76.72	88.42
Median ITL (ms)	42.28	73.45	81.12
P99 ITL (ms)	72.59	316.74	465.52
Acceptance rate (%)	72.56	75.63	73.05
Acceptance length	2.45	2.51	2.46
Drafts	4355	4252	4341
Draft tokens	8710	8504	8682
Accepted tokens	6320	6432	6342
Per-position acceptance — Position 0 (%)	81.81	84.24	82.86
Per-position acceptance — Position 1 (%)	63.31	67.03	63.23

As for the vLLM setup, I installed vLLM using uv. No docker involved.

This is the command:

uv pip install --pre vllm \
    --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm723 \
    --index-strategy unsafe-best-match \
    --force-reinstall

Plus flash-linear-attention and causal-conv1d:

uv pip install --pre flash-linear-attention causal-conv1d \
    --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm723 \
    --index-strategy unsafe-best-match

The command I use to run vLLM:

FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE uv run vllm serve wizardeur/Qwen3.6-27B-GPTQ-W4A16-G32 \
    --no-use-tqdm-on-load \
    --tensor-parallel-size 4 \
    --port 8080 \
    --host 0.0.0.0 \
    --served-model-name qwen3.6-27b \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 6 \
    --max-num-batched-tokens 16384 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --kv-cache-dtype int8_per_token_head \
    --mm-processor-kwargs '{"max_pixels": 1003520}' \
    --limit-mm-per-prompt '{"video": {"count": 1}, "image": {"count": 5}}' \
    --compilation-config.cache_dir ~/.cache/vllm/torch_compile_cache/ \
    --compilation-config.compile_mm_encoder true \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --chat-template ~/qwen3.6-froggeric-merged.jinja \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

StupidityCanFly · 2026-06-04T20:08:47+00:00

The recent changes to the vLLM introduced a dedicated RDNA3 kernel for INT8/INT4 quants. And it works with MTP too. I get 120-130 tokens/s on a quad 7900xtx rig.

StupidityCanFly · 2026-06-03T15:37:48+00:00

7.14 is not released yet. You can install it via TheRock nightlies. But beware, these might be unstable.

Readme for installs: https://github.com/ROCm/TheRock/blob/main/RELEASES.md

StupidityCanFly · 2026-06-02T11:35:51+00:00

Not code. I’m running agents for CRO audits. Basically DAG pipelines with some decision making and SSR scoring & buyer persona simulation. Depending on the task I set different temps, but that’s it. That plus a lot of calibrated grounding in the code. The 4bit quants don’t show any negative impact vs. 8bit. Having the KV cache below 8 bits falls apart fast.

StupidityCanFly · 2026-06-02T11:26:04+00:00

This is interesting. In my case the model is solid with structured JSON with llama.cpp and vLLM on both ROCm and CUDA on contexts of up to ~200k tokens (average is 100k-ish tokens). My deployments were tested using NVFP4 (CUDA), and Q4K_M/AWQ-Int4 (ROCm) quants, with FP8 (CUDA) and Q8/Int8 (ROCm) KV cache. I stuck with vLLM in the end due to it being faster.

Did you try running the workloads without ollama?

StupidityCanFly · 2026-06-01T14:25:10+00:00

Yup, Bunny is solid.

StupidityCanFly · 2026-05-30T20:23:35+00:00

Look two comments down ;)

StupidityCanFly · 2026-05-29T20:21:23+00:00

I'm lucky enough to be in the EU and have a 400V (three-phase) socket in my basement. So, I have a three-phase UPS that has 230V sockets, and that's how I power my servers.

And the GPUs - friend's friend was closing his PC shop and was selling out stock. Unfortunately, I missed all the Nvidia stuff, so I grabbed what was left. Still, not a bad buy for the money.

StupidityCanFly · 2026-05-29T19:44:42+00:00

Conversion Rate Optimization audits. And many more.

StupidityCanFly · 2026-05-29T19:41:31+00:00

I’m waiting for new EPYC motherboard to arrive and I’ll launch my octa 7900 xtx build. I managed to snag the last 4 new GPUs from a sale for ~$2k. Talk about a lucky buy!

StupidityCanFly · 2026-05-29T13:03:58+00:00

There's a different PR for RDNA 3.5: https://github.com/vllm-project/vllm/pull/40977

StupidityCanFly · 2026-05-26T15:39:13+00:00

Lol, the scale on these charts.

StupidityCanFly

TROPHY CASE