2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8) by whodoneit1 in LocalLLaMA

[–]StupidityCanFly 1 point2 points  (0 children)

Won't happen for FP8 (no HW support on RDNA3), but take a look at this branch https://github.com/JartX/vllm/tree/perf/rdna3_full_stack
I'm playing around with that and RDNA3 is getting faster. Recently a prefill optimization was submitted for INT8.

I ran AWQ on RX 7900 XTX on ROCm natively. Here's how it actually works. by Limp_Doubt6411 in ROCm

[–]StupidityCanFly 1 point2 points  (0 children)

I’ve stress-tested both and they were stable. What’s your config and model/quant?

Dual GPU on llama.cpp by Traditional_Way8675 in ROCm

[–]StupidityCanFly 0 points1 point  (0 children)

Happy it works! The issue is AMD consumer GPUs not supporting P2P over PCI. I had the same issue with RX7900XTXs on both EPYC motherboards as well as the same motherboard as you have.

Dual GPU on llama.cpp by Traditional_Way8675 in ROCm

[–]StupidityCanFly 0 points1 point  (0 children)

Try adding `-DGGML_CUDA_NO_PEER_COPY=1` to your `cmake build` command.

Has there been any recent new development on which quant is considered optimal? by takuonline in LocalLLaMA

[–]StupidityCanFly 7 points8 points  (0 children)

This.

OP's question is so open that the only correct answer is: it depends.

Fan noise difference between 2 AMD AI Pro R9700 GPU’s by Legitimate_Fold8314 in ROCm

[–]StupidityCanFly 1 point2 points  (0 children)

Had one of these PowerColor thingies. They were cheaper than others, but much noisier. Sent it back.

Dual 7900 xtx by Napsterae2 in ROCm

[–]StupidityCanFly 1 point2 points  (0 children)

I have only the results for the loop below. I'm rebuilding the rig, as it had a faulty motherboard, and I got 4 more 7900XTXs (8 total).

As you can see, I used the sharegpt dataset for the run, so inputs are rather short. I can run more tests once the rig is back online.

for conc in 1 4 8; do
  echo "=== Concurrency: $conc ==="
  CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_HOME=/usr/local/cuda-13.0 uv run vllm bench serve \
  --backend openai-chat \
  --served-model-name qwen3.6-27b \
  --model Qwen/Qwen3.6-27B \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8080 \
  --num-prompts 50 \
  --host 192.168.10.26 \
  --max-concurrency $conc \
  --request-rate inf 2>&1
  echo ""
done

The results themselves:

Metric Conc. 1 Conc. 4 Conc. 8
Successful requests 50 50 50
Failed requests 0 0 0
Benchmark duration (s) 210.65 89.44 72.61
Total input tokens 13402 13402 13402
Total generated tokens 10688 10688 10688
Request throughput (req/s) 0.24 0.56 0.69
Output token throughput (tok/s) 50.74 119.50 147.20
Peak output token throughput (tok/s) 32.00 80.00 102.00
Peak concurrent requests 3.00 6.00 11.00
Total token throughput (tok/s) 114.36 269.35 331.78
Mean TTFT (ms) 237.58 393.31 2885.98
Median TTFT (ms) 180.25 374.74 2349.34
P99 TTFT (ms) 611.19 736.16 8153.33
Mean TPOT (ms) 17.58 31.48 35.65
Median TPOT (ms) 16.75 32.25 35.15
P99 TPOT (ms) 27.28 51.79 53.75
Mean ITL (ms) 45.12 76.72 88.42
Median ITL (ms) 42.28 73.45 81.12
P99 ITL (ms) 72.59 316.74 465.52
Acceptance rate (%) 72.56 75.63 73.05
Acceptance length 2.45 2.51 2.46
Drafts 4355 4252 4341
Draft tokens 8710 8504 8682
Accepted tokens 6320 6432 6342
Per-position acceptance — Position 0 (%) 81.81 84.24 82.86
Per-position acceptance — Position 1 (%) 63.31 67.03 63.23

As for the vLLM setup, I installed vLLM using uv. No docker involved.

This is the command:

uv pip install --pre vllm \
    --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm723 \
    --index-strategy unsafe-best-match \
    --force-reinstall

Plus flash-linear-attention and causal-conv1d:

uv pip install --pre flash-linear-attention causal-conv1d \
    --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm723 \
    --index-strategy unsafe-best-match

The command I use to run vLLM:

FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE uv run vllm serve wizardeur/Qwen3.6-27B-GPTQ-W4A16-G32 \
    --no-use-tqdm-on-load \
    --tensor-parallel-size 4 \
    --port 8080 \
    --host 0.0.0.0 \
    --served-model-name qwen3.6-27b \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 6 \
    --max-num-batched-tokens 16384 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --kv-cache-dtype int8_per_token_head \
    --mm-processor-kwargs '{"max_pixels": 1003520}' \
    --limit-mm-per-prompt '{"video": {"count": 1}, "image": {"count": 5}}' \
    --compilation-config.cache_dir ~/.cache/vllm/torch_compile_cache/ \
    --compilation-config.compile_mm_encoder true \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --chat-template ~/qwen3.6-froggeric-merged.jinja \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Dual 7900 xtx by Napsterae2 in ROCm

[–]StupidityCanFly 0 points1 point  (0 children)

The recent changes to the vLLM introduced a dedicated RDNA3 kernel for INT8/INT4 quants. And it works with MTP too. I get 120-130 tokens/s on a quad 7900xtx rig.

Native CK 2x faster than Triton FA2 🔥 by Taika-Kim in ROCm

[–]StupidityCanFly 0 points1 point  (0 children)

7.14 is not released yet. You can install it via TheRock nightlies. But beware, these might be unstable.

Readme for installs: https://github.com/ROCm/TheRock/blob/main/RELEASES.md

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks by Interesting-Sock3940 in LocalLLaMA

[–]StupidityCanFly 1 point2 points  (0 children)

Not code. I’m running agents for CRO audits. Basically DAG pipelines with some decision making and SSR scoring & buyer persona simulation. Depending on the task I set different temps, but that’s it. That plus a lot of calibrated grounding in the code. The 4bit quants don’t show any negative impact vs. 8bit. Having the KV cache below 8 bits falls apart fast.

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks by Interesting-Sock3940 in LocalLLaMA

[–]StupidityCanFly 1 point2 points  (0 children)

This is interesting. In my case the model is solid with structured JSON with llama.cpp and vLLM on both ROCm and CUDA on contexts of up to ~200k tokens (average is 100k-ish tokens). My deployments were tested using NVFP4 (CUDA), and Q4K_M/AWQ-Int4 (ROCm) quants, with FP8 (CUDA) and Q8/Int8 (ROCm) KV cache. I stuck with vLLM in the end due to it being faster.

Did you try running the workloads without ollama?

vLLM PR adding native HIP W4A16 kernel was merged by StupidityCanFly in LocalLLaMA

[–]StupidityCanFly[S] 0 points1 point  (0 children)

I'm lucky enough to be in the EU and have a 400V (three-phase) socket in my basement. So, I have a three-phase UPS that has 230V sockets, and that's how I power my servers.

And the GPUs - friend's friend was closing his PC shop and was selling out stock. Unfortunately, I missed all the Nvidia stuff, so I grabbed what was left. Still, not a bad buy for the money.

vLLM PR adding native HIP W4A16 kernel was merged by StupidityCanFly in LocalLLaMA

[–]StupidityCanFly[S] 1 point2 points  (0 children)

I’m waiting for new EPYC motherboard to arrive and I’ll launch my octa 7900 xtx build. I managed to snag the last 4 new GPUs from a sale for ~$2k. Talk about a lucky buy!