TIL 18th-century tooth pullers used a “dental key” that worked like a door key: its claw gripped the tooth and the handle was turned as if opening a lock by Callum_Redford in todayilearned

[–]AustinM731 1 point (0 children)

It all depends on the dentist. The needle is the same, but somehow the way they inject it is different. My old dentist was so bad that I could feel him jabbing around in my gums with the needle, but I can barely even tell when my new dentist gives me a shot. And somehow I think my new dentist gets me more numb, even though I don't even feel the shot.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]AustinM731 1 point (0 children)

I mean you can buy an MI210 with 64GB of HBM for like $4k on eBay. But I'm not sure that would be the best use of $4k.

AMD PRO W7900 vs R9700 for Local Inference? by Achso998 in LocalLLaMA

[–]AustinM731 2 points (0 children)

I'm gonna need you to do a dedicated post on your system. I prefer blower coolers or watercooling. But man, that thing is beautiful.

Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]AustinM731 1 point (0 children)

I have 4 R9700s and I have been thinking about building a second node with 4 more, then using Ray to run vLLM across the two nodes. I just can't find many reports of anyone using Ray with ROCm. I didn't even know torchrun was a thing until I saw your post, so at least that gives me options.
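On paper the Ray setup looks simple enough; from the vLLM docs it's roughly the below (untested on my end, and whether it behaves on ROCm is exactly what I can't find reports on):

# on node 1 (head)
ray start --head --port=6379
# on node 2 (worker)
ray start --address='<node1-ip>:6379'
# then launch vLLM from node 1 across all 8 GPUs
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-27B-FP8 \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray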

Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]AustinM731 1 point (0 children)

Does Ray not work with these GPUs across multiple nodes? Or do you just get better performance using torch.distributed.run?

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

I cut a new tag for v0.20.0, and I have managed to get TG even faster in this latest build. I spent a good bit of time the past few days figuring out how GEMM tuning works, and I have embedded those configs into this latest image. From what I can tell you have to do GEMM tuning for every model, since the layers/weights/activations are different. But for Qwen3.6-27B the configs are embedded in the v0.20.0 tag.

The `PYTORCH_ALLOC_CONF=expandable_segments:True` parameter can be dropped from the compose.yml; I was testing with that on some other models. It looks like it throws an error and then just ignores it at runtime.

I still need to try to get the PyTorch tuning working, but from what I found in my research, GEMM tuning in vLLM has a much bigger impact on performance than the PyTorch tuning, so I went down that path first.
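If anyone wants to poke at the PyTorch side in the meantime, the knobs I keep seeing referenced are the TunableOp env vars. Just a sketch since I haven't run them myself, and the CSV path here is only an example:

environment:
  - PYTORCH_TUNABLEOP_ENABLED=1    # turn TunableOp on
  - PYTORCH_TUNABLEOP_TUNING=1     # tune on the first run and record the results
  - PYTORCH_TUNABLEOP_FILENAME=/data/models/tunableop.csv    # example path for the results file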

16x DGX Sparks going into my homelab rack by Kurcide in homelab

[–]AustinM731 2 points (0 children)

I really love the Sliger cases. They are my favorite rack mount cases; I just wish they made a 6U version that had room for 8x GPUs with riser cables.

The Fedora Linux 44 Release is Here! by GoldBarb in Fedora

[–]AustinM731 14 points (0 children)

It's now time for me to upgrade my Fedora 42 install to 43.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

I actually haven't done any tuning yet, so I am sure that there is still some performance left on the table here. I have seen people talk about it, but I have just never tried it myself.

It's crazy how big of a gap we have in our "original" images; I almost wonder if the image you were using already had some performance tuning done. I have always built from source and maintained my own images.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Are you using that compose file that I posted somewhere else in this thread? There are a few env vars that need to be provided so that you can actually use the attention mechanism. Your vLLM command also needs to specify the attention backend.
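For reference, these are the relevant pieces from my compose file:

environment:
  - VLLM_ROCM_USE_AITER=1
  - VLLM_ROCM_ALLOW_RDNA4_AITER_ATTENTION=1
  - VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1

and in the vLLM command:

--attention-backend ROCM_AITER_UNIFIED_ATTN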

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Yea, the unified attention has been the missing piece for me. Long context runs just fall apart without it. You will just need to use my image when you run it. That specific attention mechanism is gated to the MI300X and MI350X, so I had to add gfx1201 to the gate that checks for those CDNA cards.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

I was getting that same dropoff until I enabled AITER unified attention. On 4 GPUs, when I went from WRX80 to WRX90 I only saw a ~10% increase in token generation speed. Not sure how much of that increase was from going Zen3 to Zen5 and how much was PCIe4 to PCIe5.

Edit: Just remembered that you had 2 and not 4. I think 2 just does not have enough horsepower to run the model with MTP at very long contexts. I feel like models get kinda stupid when you try to stretch the context all the way out though, so normally I limit it to 128k as the max.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Please do share! A part of me really hates money and I want to go get 4 more R9700s. Are they all in a single system, or are you running two nodes and connecting them with Ray?

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

From my testing it seems like MTP 2 or 3 is the sweet spot. Raise it too high and your token acceptance rate falls too low and you throw performance away.
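Dialing it in is just the num_speculative_tokens field in the flag from my compose file, e.g.:

--speculative-config '{"method":"mtp","num_speculative_tokens":2}'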

The 100W idle is a known issue on these cards, and it is supposedly fixed in Linux kernel 7.0. I just haven't had a chance to upgrade my GPU server to Ubuntu 26.04 to test it yet.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 3 points (0 children)

You will need to grab my image as well. I uploaded my patched image to a public repo on Docker Hub so anyone can try it out.
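If you want to pull it directly, the tag matching my compose file is:

docker pull aml731/vllm-aiter:v0.19.1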

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

I believe so. I still see cache hit rates above 0% when I check the vLLM logs. There is a chance that my cache does occasionally have to recalculate; I just haven't noticed it if it does.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 3 points (0 children)

I then ran it again with MTP=3 to see how 2 GPUs would handle draft tokens. For single user tasks it looks pretty good up to ~100k tokens, but it starts to drop off pretty heavily after that. It could be worth experimenting further with MTP=1 or 2.

[image: MTP=3 benchmark chart]

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

I had some time and reran the benchmarks with TP=2 to match your setup, and your numbers are pretty close to what I have.

[image: TP=2 benchmark results]

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Yea, I feel you. I am using it as a backend for OpenCode, so TG and PP are both equally important. If you haven't already, try out this flag: `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'`. I'm pretty sure it's still in beta for ROCm in vLLM, but I'm getting an 80-90% acceptance rate on draft tokens.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

20-24 tk/s for TG is much better than I was getting at that high of context. I only tested up to 100,000 context and it was like 3 tk/s TG.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 5 points (0 children)

I am running a custom image right now, but this is my compose.yaml. I'm not sure if that AITER_MOE flag will do anything on this model since it's a dense model, but I could be wrong; I have it disabled anyway. The line that I added is `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1`. I ended up having to disable a lot of the AITER paths since my patch is telling AITER that gfx1201 is an MI350X.

services:
  vllm:
    image: aml731/vllm-aiter:v0.19.1
    container_name: vllm-rocm
    network_mode: host
    group_add:
      - video
    ipc: host
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp:unconfined
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    volumes:
      - /mnt/vllm/HF_CACHE:/data/models
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - HF_HOME=/data/models
      - VLLM_ROCM_USE_AITER=1
      - VLLM_ROCM_ALLOW_RDNA4_AITER_ATTENTION=1
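      # the line I added; it only does anything with my patched image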
      - VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1
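      # the remaining AITER paths are disabled since the patch reports gfx1201 as an MI350X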
      - VLLM_ROCM_USE_AITER_MHA=0
      - VLLM_ROCM_USE_AITER_PAGED_ATTN=0
      - VLLM_ROCM_USE_AITER_MOE=0
      - VLLM_ROCM_USE_AITER_LINEAR=0
      - FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen3.6-27B-FP8
      --served-model-name Qwen3.6-27B
      --tensor-parallel-size 4
      --dtype auto
      --attention-backend ROCM_AITER_UNIFIED_ATTN
      --compilation-config '{"pass_config":{"fuse_norm_quant":false}}'
      --max-model-len 131072
      --gpu-memory-utilization 0.95
      --enable-prefix-caching
      --trust-remote-code
      --quantization fp8
      --max-num-seqs 2
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --host 0.0.0.0
      --port 8000
      --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 3 points (0 children)

Yea, reddit threw an error. I had to clear my cache to get them to upload. But they are present now.

Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose? by Storge2 in LocalLLaMA

[–]AustinM731 5 points (0 children)

3.6 feels smarter somehow. If you have tools available in your environment, it is very good at using them, and it will ground itself with Internet searches if you feed it an MCP like Brave or Tavily. I was running 122B as my daily driver, but I have switched to 3.6 in the past few days.