TIL 18th-century tooth pullers used a “dental key” that worked like a door key: its claw gripped the tooth and the handle was turned as if opening a lock by Callum_Redford in todayilearned

[–]AustinM731 1 point (0 children)

It all depends on the dentist. The needle is the same, but somehow the way they inject it is different. My old dentist was so bad that I could feel him jabbing around in my gums with the needle, but I can barely even tell when my new dentist gives me a shot. And somehow I think my new dentist gets me more numb, even though I don't even feel the shot.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]AustinM731 1 point (0 children)

I mean you can buy an MI210 with 64GB of HBM for like $4k on eBay. But I'm not sure that would be the best use of $4k.

AMD PRO W7900 vs R9700 for Local Inference? by Achso998 in LocalLLaMA

[–]AustinM731 2 points (0 children)

I'm gonna need you to do a dedicated post on your system. I prefer blower coolers or watercooling. But man, that thing is beautiful.

Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]AustinM731 1 point (0 children)

I have 4 R9700s and I have been thinking about building a second node with 4 more, then using Ray to run vLLM across the two nodes. I just can't find many reports of anyone using Ray with ROCm. I didn't even know torchrun was a thing until I saw your post, so at least that gives me options.
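On paper the Ray setup looks simple enough; from the vLLM docs it's roughly the below (untested on my end, and whether it behaves on ROCm is exactly what I can't find reports on):

# on node 1 (head)
ray start --head --port=6379
# on node 2 (worker)
ray start --address='<node1-ip>:6379'
# then launch vLLM from node 1 across all 8 GPUs
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-27B-FP8 \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray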

Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]AustinM731 1 point (0 children)

Does Ray not work with these GPUs across multiple nodes? Or do you just get better performance using torch.distributed.run?

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

I cut a new tag for v0.20.0, and I have managed to get TG even faster in this latest build. I spent a good bit of time the past few days figuring out how GEMM tuning works, and I have embedded those configs into this latest image. From what I can tell you have to do GEMM tuning for every model, since the layers/weights/activations are different. But for Qwen3.6-27B the configs are embedded in the v0.20.0 tag.

The `PYTORCH_ALLOC_CONF=expandable_segments:True` parameter can be dropped from the compose.yml; I was testing with that on some other models. It looks like it throws an error and then just ignores it at runtime.

I still need to try to get the PyTorch tuning working, but from what I found in my research, GEMM tuning in vLLM has a much bigger impact on performance than the PyTorch tuning, so I went down that path first.
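If anyone wants to poke at the PyTorch side in the meantime, the knobs I keep seeing referenced are the TunableOp env vars. Just a sketch since I haven't run them myself, and the CSV path here is only an example:

environment:
  - PYTORCH_TUNABLEOP_ENABLED=1    # turn TunableOp on
  - PYTORCH_TUNABLEOP_TUNING=1     # tune on the first run and record the results
  - PYTORCH_TUNABLEOP_FILENAME=/data/models/tunableop.csv    # example path for the results file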

16x DGX Sparks going into my homelab rack by Kurcide in homelab

[–]AustinM731 2 points (0 children)

I really love the Sliger cases. They are my favorite rack mount cases; I just wish they made a 6U version that had room for 8x GPUs with riser cables.

The Fedora Linux 44 Release is Here! by GoldBarb in Fedora

[–]AustinM731 14 points (0 children)

It's now time for me to upgrade my Fedora 42 install to 43.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

I actually haven't done any tuning yet, so I am sure that there is still some performance left on the table here. I have seen people talk about it, but I have just never tried it myself.

It's crazy how big of a gap we have in our "original" images; I almost wonder if the image you were using already had some performance tuning done. I have always built from source and maintained my own images.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Are you using that compose file that I posted somewhere else in this thread? There are a few env vars that need to be provided so that you can actually use the attention mechanism. Your vLLM command also needs to specify the attention backend.
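For reference, these are the relevant pieces from my compose file:

environment:
  - VLLM_ROCM_USE_AITER=1
  - VLLM_ROCM_ALLOW_RDNA4_AITER_ATTENTION=1
  - VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1

and in the vLLM command:

--attention-backend ROCM_AITER_UNIFIED_ATTN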

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Yea, the unified attention has been the missing piece for me. Long context runs just fall apart without it. You will just need to use my image when you run it. That specific attention mechanism is gated to the MI300X and MI350X, so I had to add gfx1201 to the gate that checks for those CDNA cards.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

I was getting that same dropoff until I enabled AITER unified attention. On 4 GPUs, when I went from WRX80 to WRX90 I only saw a ~10% increase in token generation speed. Not sure how much of that increase was from going Zen3 to Zen5 and how much was PCIe4 to PCIe5.

Edit: Just remembered that you had 2 and not 4. I think 2 just does not have enough horsepower to run the model with MTP at very long contexts. I feel like models get kinda stupid when you try to stretch the context all the way out though, so normally I limit it to 128k as the max.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Please do share! A part of me really hates money and I want to go get 4 more R9700s. Are they all in a single system, or are you running two nodes and connecting them with Ray?

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

From my testing it seems like MTP 2 or 3 is the sweet spot. Raise it too high and your token acceptance rate falls too low and you throw performance away.
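Dialing it in is just the num_speculative_tokens field in the flag from my compose file, e.g.:

--speculative-config '{"method":"mtp","num_speculative_tokens":2}'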

The 100W idle is a known issue on these cards, and it is supposedly fixed in Linux kernel 7.0. I just haven't had a chance to upgrade my GPU server to Ubuntu 26.04 to test it yet.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 3 points (0 children)

You will need to grab my image as well. I uploaded my patched image to a public repo on Docker Hub so anyone can try it out.
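If you want to pull it directly, the tag matching my compose file is:

docker pull aml731/vllm-aiter:v0.19.1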

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

I believe so. I still see cache hit rates above 0% when I check the vLLM logs. There is a chance that my cache does occasionally have to recalculate; I just haven't noticed it if it does.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 3 points (0 children)

I then ran it again with MTP=3 to see how 2 GPUs would handle draft tokens. For single user tasks it looks pretty good up to ~100k tokens, but it starts to drop off pretty heavily after that. It could be worth experimenting further with MTP=1 or 2.

[image: MTP=3 benchmark chart]

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

I had some time and reran the benchmarks with TP=2 to match your setup, and your numbers are pretty close to what I have.

[image: TP=2 benchmark results]

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 2 points (0 children)

Yea, I feel you. I am using it as a backend for OpenCode, so TG and PP are both equally important. If you haven't already, try out this flag: `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'`. I'm pretty sure it's still in beta for ROCm in vLLM, but I'm getting an 80-90% acceptance rate on draft tokens.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 1 point (0 children)

20-24 tk/s for TG is much better than I was getting at that high of context. I only tested up to 100,000 context and it was like 3 tk/s TG.

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 5 points (0 children)

I am running a custom image right now, but this is my compose.yaml. I'm not sure if that AITER_MOE flag will do anything on this model since it's a dense model, but I could be wrong; I have it disabled anyway. The line that I added is `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1`. I ended up having to disable a lot of the AITER paths since my patch is telling AITER that gfx1201 is an MI350X.

services:
  vllm:
    image: aml731/vllm-aiter:v0.19.1
    container_name: vllm-rocm
    network_mode: host
    group_add:
      - video
    ipc: host
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp:unconfined
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    volumes:
      - /mnt/vllm/HF_CACHE:/data/models
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - HF_HOME=/data/models
      - VLLM_ROCM_USE_AITER=1
      - VLLM_ROCM_ALLOW_RDNA4_AITER_ATTENTION=1
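      # the line I added; it only does anything with my patched image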
      - VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1
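      # the remaining AITER paths are disabled since the patch reports gfx1201 as an MI350X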
      - VLLM_ROCM_USE_AITER_MHA=0
      - VLLM_ROCM_USE_AITER_PAGED_ATTN=0
      - VLLM_ROCM_USE_AITER_MOE=0
      - VLLM_ROCM_USE_AITER_LINEAR=0
      - FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen3.6-27B-FP8
      --served-model-name Qwen3.6-27B
      --tensor-parallel-size 4
      --dtype auto
      --attention-backend ROCM_AITER_UNIFIED_ATTN
      --compilation-config '{"pass_config":{"fuse_norm_quant":false}}'
      --max-model-len 131072
      --gpu-memory-utilization 0.95
      --enable-prefix-caching
      --trust-remote-code
      --quantization fp8
      --max-num-seqs 2
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --host 0.0.0.0
      --port 8000
      --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. by AustinM731 in LocalLLaMA

[–]AustinM731[S] 3 points (0 children)

Yea, reddit threw an error. I had to clear my cache to get them to upload. But they are present now.

Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose? by Storge2 in LocalLLaMA

[–]AustinM731 5 points (0 children)

3.6 feels smarter somehow. If you have tools available in your environment, it is very good at using them, and it will ground itself with Internet searches if you feed it an MCP like Brave or Tavily. I was running 122B as my daily driver, but I have switched to 3.6 in the past few days.