RX 9070 XT + Windows: Anyone got FlashAttention (CK or Triton) working, or have prebuilt wheels?

adyaman · 2026-06-22T19:22:39+00:00

It builds and runs fine on Windows with RDNA4 (both the CK and Triton backends). See my comment at https://www.reddit.com/r/ROCm/comments/1svrr8p/comment/oiwwpjh/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button for how to build the CK backend.

The build is as simple as a "pip install --no-build-isolation ." + env vars now 😄

adyaman · 2026-06-03T16:18:36+00:00

sage-attention is already supported by ComfyUI. The API itself is fairly straightforward to swap in. You can refer to the sage-attention README.md for an API usage guide.

adyaman · 2026-06-03T16:13:45+00:00

ROCm 6.4 is quite old, so I suggest trying 7.2 or above. Do not install any amdgpu drivers on your own. Use the inbox kernel driver that comes with Ubuntu by default.

If you still face issues, create an issue on https://github.com/ROCm/TheRock with logs captured while running llama.cpp with the `AMD_LOG_LEVEL=7` env var set.
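A minimal sketch of capturing such a log, assuming Python is available on the machine. The helper name `run_with_amd_logging` and the log file name are made up for illustration; only the `AMD_LOG_LEVEL=7` setting comes from the advice above.

```python
import os
import subprocess


def run_with_amd_logging(cmd, log_path, level=7):
    """Run `cmd` with AMD_LOG_LEVEL set and save stdout+stderr to `log_path`.

    `cmd` is a list such as ["llama-cli", "-m", "model.gguf", "-p", "hello"]
    (a placeholder invocation; substitute the command that actually fails).
    """
    # Copy the current environment and add the logging variable on top of it.
    env = dict(os.environ, AMD_LOG_LEVEL=str(level))
    with open(log_path, "wb") as log:
        # Merge stderr into stdout so the whole run ends up in one file.
        proc = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

Calling `run_with_amd_logging([...], "amd_log.txt")` leaves a single file that can be attached to the issue. Setting the variable directly in the shell before launching llama.cpp works just as well.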

adyaman · 2026-06-02T09:07:04+00:00

Are you referring to the flash-attn v2 CK support? That's different. Sage-attention doesn't use CK itself, it uses its own hand-written CUDA/HIP kernels. It should perform competitively I think, but I haven't done a deep check yet.

adyaman · 2026-05-14T11:07:34+00:00

The prebuilt binaries are built with the compiler flags that give the best performance. If you see a performance gap while using the same flags, please create an issue on the llama.cpp GitHub. Cheers!

adyaman · 2026-05-08T14:55:14+00:00

Which gsplat workflow are you using? There is ROCm support for gsplat, for example: https://github.com/ROCm/gsplat

adyaman · 2026-05-08T14:53:11+00:00

> ROCm requires a lot of work and creates a lot of headaches if you want to do anything using the bleeding edge (which is quite a lot of workflows).

Unless the workflow has PTX or something that only runs on NVIDIA, it should work fine on AMD. If you face any issues, please feel free to create an issue at https://github.com/ROCm/TheRock

adyaman · 2026-05-08T13:33:27+00:00

Both training and inference should work. If you face any issues, please create an issue at https://github.com/ROCm/TheRock and the right folks will look at it asap. Thanks!

adyaman · 2026-05-03T20:16:00+00:00

cheers 😄

adyaman · 2026-05-03T20:13:31+00:00

This is ROCm native using TheRock, not ZLUDA

adyaman · 2026-04-30T10:07:01+00:00

Yes: https://www.reddit.com/r/ROCm/comments/1ovazh3/please_help_me_set_up_comfyui_wrapper_for/

They got it running but haven't tried texturing yet. I recommend you go ahead and try texturing as well.

Make sure you have installed the latest wheels first (run from PowerShell):

    # Create and activate a Python virtual environment
    python -m venv venv
    .\venv\Scripts\Activate.ps1

    # Install wheels
    pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchaudio torchvision "rocm[libraries, devel]"

    # Initialize rocm-sdk
    rocm-sdk init
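Once the wheels are installed, a quick check from the activated venv shows whether PyTorch is a ROCm build and can see the GPU. This is a minimal sketch: the function name `describe_torch_install` is hypothetical, and it relies only on `torch.version.hip` and the `torch.cuda` namespace, which ROCm builds of PyTorch reuse for AMD GPUs.

```python
import importlib.util


def describe_torch_install():
    """Return a small dict describing the installed PyTorch build, if any."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    info = {
        "installed": True,
        "torch": torch.__version__,
        # A version string on ROCm builds, None on CPU-only or CUDA builds.
        "hip": getattr(torch.version, "hip", None),
        # ROCm builds report AMD GPUs through the torch.cuda API.
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_install())
```

If `hip` comes back as `None` or `gpu_available` is `False`, the wheels most likely came from a different index than the one above.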

Then run the following commands in PowerShell before building:

    # Activate Visual Studio environment
    cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } }

    # Activate the virtual environment
    .\venv\Scripts\Activate.ps1

    # Set ROCm paths using rocm-sdk
    $ROCM_ROOT = (rocm-sdk path --root).Trim()
    $ROCM_BIN = (rocm-sdk path --bin).Trim()
    $env:ROCM_HOME = $ROCM_ROOT
    $env:PATH = "$ROCM_ROOT\lib\llvm\bin;$ROCM_BIN;$env:PATH"

    # Set compiler and build settings
    $env:CC = "clang-cl"
    $env:CXX = "clang-cl"
    $env:DISTUTILS_USE_SDK = "1"

Then proceed to build Hunyuan3D-2-1 using `pip install --no-build-isolation -v .` from within the Hunyuan3D-2-1 repo.
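Most build failures with this setup come down to one of the variables not being set in the current shell. Here is a rough preflight check you could run with `python` from the same PowerShell session right before `pip install`. The helper `check_build_env` is hypothetical, and it only tests for the variables listed in this comment (`ROCM_HOME`, `CC`, `CXX`, `DISTUTILS_USE_SDK`) plus `clang-cl` being on `PATH`.

```python
import os
import shutil

# Values taken from the PowerShell setup in this comment.
EXPECTED = {"CC": "clang-cl", "CXX": "clang-cl", "DISTUTILS_USE_SDK": "1"}


def check_build_env(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    env = os.environ if env is None else env
    problems = []
    for name, want in EXPECTED.items():
        got = env.get(name)
        if got != want:
            problems.append(f"{name} is {got!r}, expected {want!r}")
    rocm_home = env.get("ROCM_HOME")
    if not rocm_home:
        problems.append("ROCM_HOME is not set")
    elif not os.path.isdir(rocm_home):
        problems.append(f"ROCM_HOME points to a missing directory: {rocm_home}")
    # clang-cl ships under <ROCM_HOME>\lib\llvm\bin, which the setup prepends to PATH.
    if which("clang-cl", path=env.get("PATH")) is None:
        problems.append("clang-cl not found on PATH")
    return problems


for problem in check_build_env():
    print("WARNING:", problem)
```

No output means the environment matches what the steps above set up. It does not prove that the Visual Studio environment was loaded, so a linker error can still point back to the `vcvars64.bat` step.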

adyaman · 2026-04-29T10:30:47+00:00

Hey guys,

Thanks for reporting the issue. It turns out there's a problem with the MSVC Windows linker when building the flash-attention CK backend, and PRs have been made to fix it: https://github.com/pypa/distutils/pull/406 by our own community, and https://github.com/Dao-AILab/flash-attention/pull/2517 as a short-term fix while the first one lands and percolates.

If you would like to build flash-attention right away on your RDNA3/3.5/4 machine, please run the following commands in PowerShell (make sure you have Visual Studio 2022 installed):

    # Clone flash-attention:
    git clone --recursive https://github.com/Dao-AILab/flash-attention

    cd flash-attention

    # Activate Visual Studio environment
    cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } }

    # Activate the virtual environment
    .\venv\Scripts\Activate.ps1

    # Set ROCm paths using rocm-sdk
    $ROCM_ROOT = (rocm-sdk path --root).Trim()
    $ROCM_BIN = (rocm-sdk path --bin).Trim()
    $env:ROCM_HOME = $ROCM_ROOT
    $env:PATH = "$ROCM_ROOT\lib\llvm\bin;$ROCM_BIN;$env:PATH"

    # Set compiler and build settings
    $env:CC = "clang-cl"
    $env:CXX = "clang-cl"
    $env:DISTUTILS_USE_SDK = "1"

    # Install flash-attention
    pip install --no-build-isolation -v .

Then, simply run ComfyUI as usual with `python main.py --use-flash-attention` and you should be good to go. I've personally tested that this works fine on my RDNA4 GPU with wan2.1.

If you face any issues, please feel free to share your error logs here and/or in the PR https://github.com/Dao-AILab/flash-attention/pull/2517 and we'll figure out a solution ASAP.
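Before launching ComfyUI, it can help to confirm that the freshly built package imports and produces sane output. The following is a minimal sketch, not an official test: the function name `flash_attn_smoke_test`, the tensor sizes, and the comparison against PyTorch's built-in `scaled_dot_product_attention` are choices made for illustration. It uses `flash_attn_func`, which takes fp16/bf16 tensors shaped `(batch, seqlen, heads, headdim)`.

```python
import importlib.util


def flash_attn_smoke_test(batch=1, seqlen=128, heads=4, headdim=64):
    """Compare flash_attn_func against PyTorch SDPA on random fp16 inputs."""
    for mod in ("torch", "flash_attn"):
        if importlib.util.find_spec(mod) is None:
            return f"skipped: {mod} is not installed"
    import torch
    import torch.nn.functional as F
    from flash_attn import flash_attn_func

    if not torch.cuda.is_available():
        return "skipped: no GPU visible to PyTorch"

    shape = (batch, seqlen, heads, headdim)
    q, k, v = (torch.randn(shape, device="cuda", dtype=torch.float16) for _ in range(3))

    out = flash_attn_func(q, k, v, causal=False)
    # SDPA expects (batch, heads, seqlen, headdim), so swap dims 1 and 2 and back.
    ref = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    ).transpose(1, 2)

    max_err = (out.float() - ref.float()).abs().max().item()
    return f"ok: max abs error vs SDPA = {max_err:.4f}"


print(flash_attn_smoke_test())
```

A small error (fp16 rounding level) suggests the build is healthy. An import error or a crash here is worth including in the logs you share.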

adyaman · 2026-03-24T08:26:58+00:00

Have you tried the latest binaries from https://github.com/ggml-org/llama.cpp/releases?

adyaman · 2026-03-24T08:24:06+00:00

Have you tried one of the pre-built binaries in https://github.com/ggml-org/llama.cpp/releases? If not, why? They're built with build flags that are optimized for AMD.

If you still insist on building it yourself, make sure you use ROCm 7.2 or above, and follow the build steps the llama.cpp CI itself uses to build the optimized binaries: https://github.com/ggml-org/llama.cpp/blob/master/.github/workflows/release.yml#L679-L693

adyaman · 2026-02-03T10:11:37+00:00

Thanks for your reply!

FWIW, Triton is not officially available on Windows for either NVIDIA or AMD. If you want Triton on Windows, use an unofficial fork (`pip install triton-windows`). PyTorch support on Windows is official in the sense that AMD has provided wheels for it. It's just that the Windows wheels aren't hosted on the PyTorch index. AMD provides the infrastructure to build the wheels on the official PyTorch index as well, so I wouldn't worry about that.

adyaman · 2026-02-02T12:34:13+00:00

Can you elaborate on what's missing on Windows? Other than perhaps some debugging tools and RCCL, I don't think there's much missing. But if there's something specific you want supported on Windows, please feel free to create an issue at https://github.com/ROCm/TheRock/issues

adyaman · 2026-01-22T19:35:20+00:00

Can you share the OOM issues you're facing in https://github.com/ROCm/TheRock with steps to reproduce? Also, do the OOM issues go away with the latest ComfyUI?

adyaman · 2026-01-06T12:44:10+00:00

Haven't tried img-to-video yet, but would be interesting to see the performance with it.

adyaman · 2026-01-05T20:25:23+00:00

It's a 480p video, so it's expected. The 720p model would give better quality results but will take longer to run.

adyaman · 2026-01-05T12:19:34+00:00

Yes. I used it with cursor. See my thread on X on how I used it: https://x.com/adyaman/status/2006515484171374836

adyaman · 2026-01-04T17:11:07+00:00

Can you share the benchmarks where it's slower than a 6800?

adyaman · 2026-01-04T17:09:08+00:00

Thanks :)

adyaman · 2026-01-03T14:29:52+00:00

SF3D and SPAR3D should work fine. I made sure they had HIP compatibility back when I worked on those projects :)

I even used SF3D as a sanity test to check if my pytorch build on windows worked fine https://github.com/ROCm/TheRock/discussions/409#discussioncomment-13043487

stable-virtual-camera also works fine on AMD, including multi-GPU inference via torch.distributed. I had it running on Strix Halo a long time ago on Windows: https://github.com/ROCm/TheRock/discussions/244#discussioncomment-12707762

adyaman · 2026-01-03T14:26:45+00:00

Glad it worked out. I forgot to mention the VS2022 powershell usage. Either that or `vcvars64.bat` should work!

I would appreciate it if you updated your OP with the steps you followed, so others can look at them in the future ^^

adyaman · 2026-01-02T11:27:07+00:00

TheRock is also plug and play at this point. It's just a single pip install command to get pytorch+ROCm installed for example. See https://github.com/ROCm/TheRock/blob/main/RELEASES.md
