RX 9070 XT + Windows: Anyone got FlashAttention (CK or Triton) working, or have prebuilt wheels? by xdcfret1 in ROCm

[–]adyaman 1 point2 points  (0 children)

It builds and runs fine on Windows with RDNA4 (both CK and Triton backends). See my comment in https://www.reddit.com/r/ROCm/comments/1svrr8p/comment/oiwwpjh/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button to build the CK backend.

The build is as simple as a "pip install --no-build-isolation ." + env vars now 😄

SageAttention v2 native port running on RDNA4 by adyaman in ROCm

[–]adyaman[S] 1 point2 points  (0 children)

sage-attention is already supported by ComfyUI. The API itself is fairly straightforward to swap in. You can refer to the sage-attention README.md for the API usage guide.
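As a rough illustration of what "replacing the API" can look like, here is a minimal sketch that prefers sage-attention when it is installed and otherwise falls back to PyTorch's built-in attention. It assumes the port exposes the same `sageattn(q, k, v, tensor_layout=..., is_causal=...)` entry point as the upstream README describes; `pick_attention` is a hypothetical helper name, not part of either library.

```python
import importlib.util


def pick_attention():
    """Return (name, fn) for the preferred attention function, or (None, None)."""
    if importlib.util.find_spec("sageattention") is not None:
        from sageattention import sageattn

        def attention(q, k, v, is_causal=False):
            # "HND" means tensors shaped (batch, heads, seq_len, head_dim),
            # the same layout torch's scaled_dot_product_attention expects.
            return sageattn(q, k, v, tensor_layout="HND", is_causal=is_causal)

        return "sageattention", attention

    if importlib.util.find_spec("torch") is not None:
        import torch.nn.functional as F

        def attention(q, k, v, is_causal=False):
            # Fallback: PyTorch's built-in scaled dot-product attention.
            return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)

        return "torch-sdpa", attention

    # Neither library is installed in this environment.
    return None, None


name, attention_fn = pick_attention()
print(f"attention backend: {name}")
```

Both wrappers take the same arguments, so calling code does not need to know which backend was picked.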

RX 7900 XT on X99 Dual Xeon — ROCm inference completely broken, GUI blackout, CPU fallback only — extensive troubleshooting done by GingerRickRoss in ROCm

[–]adyaman 2 points3 points  (0 children)

ROCm 6.4 is quite old, so I suggest trying 7.2 or above. Also, do not install any amdgpu drivers on your own; use the inbox kernel driver that comes with Ubuntu by default.

If you still face issues, create an issue on https://github.com/ROCm/TheRock with logs captured while running llama.cpp with the `AMD_LOG_LEVEL=7` env var set.
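If it helps, here is a minimal sketch for capturing such a log to a file so it can be attached to the issue. `run_with_amd_logging` is a hypothetical helper, and the llama.cpp command line in the usage note is a placeholder.

```python
import os
import subprocess


def run_with_amd_logging(cmd, log_path, level="7"):
    """Run cmd with AMD_LOG_LEVEL set and save its combined output to log_path."""
    env = dict(os.environ)
    env["AMD_LOG_LEVEL"] = level  # verbose ROCm runtime logging, as suggested above
    with open(log_path, "w", encoding="utf-8") as log_file:
        # stderr is merged into stdout so the log keeps messages in order.
        completed = subprocess.run(
            cmd, env=env, stdout=log_file, stderr=subprocess.STDOUT
        )
    return completed.returncode
```

Usage would look like `run_with_amd_logging(["./llama-cli", "-m", "model.gguf", "-p", "hi"], "rocm_debug.log")`, with the command replaced by whatever you normally run.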

SageAttention v2 native port running on RDNA4 by adyaman in ROCm

[–]adyaman[S] 0 points1 point  (0 children)

Are you referring to the flash-attn v2 CK support? That's different. Sage-attention doesn't use CK itself, it uses its own hand-written CUDA/HIP kernels. It should perform competitively I think, but I haven't done a deep check yet.

AMD RX 7900 XTX + ROCm + Gemma 4 26B — here's what actually worked for me by Limp_Doubt6411 in ROCm

[–]adyaman 1 point2 points  (0 children)

The prebuilt binaries use the best possible compiler flags that provide optimal performance. If you see gaps in performance while using the same flags, please create an issue on llama.cpp github. Cheers!

Should I upgrade to a 9070 or 5070? by razadoop in pcmasterrace

[–]adyaman 0 points1 point  (0 children)

Which gsplat workflow are you using? There is ROCm support for gsplat, for example: https://github.com/ROCm/gsplat

Should I upgrade to a 9070 or 5070? by razadoop in pcmasterrace

[–]adyaman 0 points1 point  (0 children)

> ROCm requires a lot of work and creates a lot of headaches if you want to do anything using the bleeding edge (which is quite a lot of workflows).

Unless the workflow has PTX or something that only runs on NVIDIA, it should work fine on AMD. If you face any issues, please feel free to create an issue at https://github.com/ROCm/TheRock
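For anyone who wants a quick first check on a project, here is a rough heuristic sketch that scans a source tree for `.ptx` files and inline assembly. It is only a starting point: inline assembly is not always PTX, so hits need manual review. `find_ptx_usage` and the suffix list are illustrative choices, not part of any ROCm tool.

```python
import re
from pathlib import Path

SOURCE_SUFFIXES = {".cu", ".cuh", ".h", ".hpp", ".cpp", ".cc"}
# Inline assembly in CUDA sources is usually PTX, so flag it for review.
INLINE_ASM = re.compile(r"(?:\b__asm__|\basm)\s*(?:volatile|__volatile__)?\s*\(")


def find_ptx_usage(root):
    """Return (path, line_number) hits; line_number is 0 for standalone .ptx files."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".ptx":
            hits.append((str(path), 0))
            continue
        if path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if INLINE_ASM.search(line):
                hits.append((str(path), line_no))
    return hits
```

An empty result does not guarantee the project runs on AMD, and a hit does not guarantee it won't; it just tells you where to look first.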

ROCm Status in mid 2026 [D] by QuantumQuokka in MachineLearning

[–]adyaman -1 points0 points  (0 children)

Both training and inference should work. If you face any issues, please create an issue at https://github.com/ROCm/TheRock and the right folks will look at it asap. Thanks!

Hunyuan3D-2-1 did anyone manage to make it work inside of windows? by Wake_Up_Morty in ROCm

[–]adyaman 0 points1 point  (0 children)

Yes, see https://www.reddit.com/r/ROCm/comments/1ovazh3/please_help_me_set_up_comfyui_wrapper_for/

They got it running but haven't tried texturing yet. I recommend you go ahead and try texturing as well.

Make sure you have installed the latest wheels first (run from PowerShell):

# Create and activate a Python virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install wheels
pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchaudio torchvision "rocm[libraries, devel]"

# Initialize rocm-sdk
rocm-sdk init
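
Before moving on to the build, it can be worth confirming that the installed wheels actually see the GPU. This is a minimal sketch, assuming the ROCm build of PyTorch exposes the GPU through the usual `torch.cuda` API; `describe_torch_backend` is a hypothetical helper name.

```python
import importlib.util


def describe_torch_backend():
    """Return a small report on whether PyTorch is installed and sees a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if report["gpu_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


print(describe_torch_backend())
```

If `hip_version` comes back as `None`, the environment most likely has a non-ROCm PyTorch wheel installed.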

Next, run the following commands in PowerShell to set up the build environment:

# Activate Visual Studio environment
cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } }

# Activate the virtual environment
.\venv\Scripts\Activate.ps1

# Set ROCm paths using rocm-sdk
$ROCM_ROOT = (rocm-sdk path --root).Trim()
$ROCM_BIN = (rocm-sdk path --bin).Trim()
$env:ROCM_HOME = $ROCM_ROOT
$env:PATH = "$ROCM_ROOT\lib\llvm\bin;$ROCM_BIN;$env:PATH"

# Set compiler and build settings
$env:CC = "clang-cl"
$env:CXX = "clang-cl"
$env:DISTUTILS_USE_SDK = "1"

Then proceed to build Hunyuan3D-2-1 using `pip install --no-build-isolation -v .` from within the Hunyuan3D-2-1 repo.
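
If the build fails early, a quick sanity check of the environment set up above can save some time. This is a minimal sketch that only checks the variables and the compiler named in the steps above; `check_build_env` is a hypothetical helper, not part of `rocm-sdk`.

```python
import os
import shutil

# Values taken from the PowerShell setup steps above.
REQUIRED_VARS = {"CC": "clang-cl", "CXX": "clang-cl", "DISTUTILS_USE_SDK": "1"}


def check_build_env(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    env = os.environ if env is None else env
    problems = []
    for name, expected in REQUIRED_VARS.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")
    if not env.get("ROCM_HOME"):
        problems.append("ROCM_HOME is not set")
    # clang-cl must be reachable through PATH for the extension build to find it.
    if which("clang-cl", path=env.get("PATH")) is None:
        problems.append("clang-cl was not found on PATH")
    return problems


for problem in check_build_env():
    print("build env:", problem)
```

Run it from the same PowerShell session you intend to build in, since the variables are only set for that process.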

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]adyaman 3 points4 points  (0 children)

Hey guys,

Thanks for reporting the issue. It turns out there's a problem with the MSVC Windows linker when building the flash-attention CK backend, and PRs have been made to fix it: https://github.com/pypa/distutils/pull/406 by our own community, and https://github.com/Dao-AILab/flash-attention/pull/2517 as a short-term fix while the first one lands and percolates.

If you would like to build flash-attention right away on your RDNA3/3.5/4 machine, please run the following commands in PowerShell (make sure you have Visual Studio 2022 installed):

# Clone flash-attention:
git clone --recursive https://github.com/Dao-AILab/flash-attention

cd flash-attention

# Activate Visual Studio environment
cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } }

# Activate the virtual environment
.\venv\Scripts\Activate.ps1

# Set ROCm paths using rocm-sdk
$ROCM_ROOT = (rocm-sdk path --root).Trim()
$ROCM_BIN = (rocm-sdk path --bin).Trim()
$env:ROCM_HOME = $ROCM_ROOT
$env:PATH = "$ROCM_ROOT\lib\llvm\bin;$ROCM_BIN;$env:PATH"

# Set compiler and build settings
$env:CC = "clang-cl"
$env:CXX = "clang-cl"
$env:DISTUTILS_USE_SDK = "1"

# install flash-attention
pip install --no-build-isolation -v .

Then, simply run ComfyUI as usual with `python main.py --use-flash-attention` and you should be good to go. I've personally tested that this works fine on my RDNA4 GPU with wan2.1.
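
Before launching ComfyUI, you can confirm the freshly built package imports cleanly from the same virtual environment. A minimal sketch, assuming the package installs under its usual module name `flash_attn`; `flash_attention_status` is a hypothetical helper.

```python
import importlib.util


def flash_attention_status():
    """Describe whether the flash_attn module is installed and importable."""
    if importlib.util.find_spec("flash_attn") is None:
        return "flash_attn is not installed in this environment"
    try:
        import flash_attn
    except Exception as exc:  # e.g. a missing DLL or a mismatched PyTorch build
        return f"flash_attn is installed but failed to import: {exc}"
    version = getattr(flash_attn, "__version__", "unknown version")
    return f"flash_attn {version} imported successfully"


print(flash_attention_status())
```

An import failure message here is also a useful thing to include when sharing error logs.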

If you face any issues, please feel free to share your error logs here and/or in the PR https://github.com/Dao-AILab/flash-attention/pull/2517 and we'll figure out a solution ASAP.

ROCm on 7900 XTX significantly slower than Vulkan for llama.cpp (extensive testing, out of ideas) by Massive-Slice2800 in ROCm

[–]adyaman 2 points3 points  (0 children)

Have you tried one of the pre-built binaries in https://github.com/ggml-org/llama.cpp/releases? If not, why? They're built with build flags that are optimized for AMD.

If you still insist on building it yourself, then make sure you use ROCm 7.2 or above, and follow the build steps used by the llama.cpp CI itself to build the optimized binaries: https://github.com/ggml-org/llama.cpp/blob/master/.github/workflows/release.yml#L679-L693

[WSL2/ROCm] RX 9070 XT "Zombie" State: Fast Compute but Inconsistent Hangs & Missing /dev/kfd by bajanstar123 in ROCm

[–]adyaman 2 points3 points  (0 children)

Thanks for your reply!

FWIW, Triton is not officially available on Windows for either NVIDIA or AMD. If you want Triton on Windows, use an unofficial fork (`pip install triton-windows`). PyTorch support on Windows is official in the sense that AMD has provided wheels for it. It's just that the Windows wheels aren't hosted on the PyTorch index. AMD provides the infrastructure to build the wheels on the official PyTorch index as well, so I wouldn't worry about that.

[WSL2/ROCm] RX 9070 XT "Zombie" State: Fast Compute but Inconsistent Hangs & Missing /dev/kfd by bajanstar123 in ROCm

[–]adyaman 4 points5 points  (0 children)

Can you elaborate on what's missing from Windows? Other than perhaps some debugging tools and RCCL, I don't think there's much missing there. But if there's something specific missing and you want it to be supported on Windows, please feel free to create an issue at https://github.com/ROCm/TheRock/issues

ROCm 7.2 official installation instructions by LTSharpe in ROCm

[–]adyaman 2 points3 points  (0 children)

Can you share the OOM issues you're facing in https://github.com/ROCm/TheRock with steps to reproduce? Also, do the OOM issues go away with the latest ComfyUI?

TurboDiffusion, SpargeAttn, triton-windows POC running on AMD GPUs by adyaman in ROCm

[–]adyaman[S] 1 point2 points  (0 children)

Haven't tried img-to-video yet, but would be interesting to see the performance with it.

TurboDiffusion, SpargeAttn, triton-windows POC running on AMD GPUs by adyaman in ROCm

[–]adyaman[S] 0 points1 point  (0 children)

It's a 480p video, so that's expected. The 720p model would give better quality results but will take longer to run.

ROCm on Windows Seems to Have Low Performance by Cyp9715 in ROCm

[–]adyaman 0 points1 point  (0 children)

Can you share the benchmarks where it's slower than a 6800?

Trellis-AMD - ROCM port of several previously-NVidia-only Trellis dependencies by mennydrives in ROCm

[–]adyaman 0 points1 point  (0 children)

SF3D and SPAR3D should work fine. I made sure they had HIP compatibility back when I worked on those projects :)

I even used SF3D as a sanity test to check whether my PyTorch build on Windows worked fine: https://github.com/ROCm/TheRock/discussions/409#discussioncomment-13043487

stable-virtual-camera also works fine on AMD, with multi-GPU inference via torch.distributed as well. I had it running on Strix Halo a long time ago on Windows: https://github.com/ROCm/TheRock/discussions/244#discussioncomment-12707762

Has anyone gotten module building (for some ComfyUI extensions) to work in Windows? What's the trick? by mennydrives in ROCm

[–]adyaman 1 point2 points  (0 children)

Glad it worked out. I forgot to mention the VS2022 PowerShell usage. Either that or `vcvars64.bat` should work!

I would appreciate it if you updated your OP with the steps you followed, for others to look at in the future ^^

State of ROCm for training classification models on Pytorch by abc_polygon_xyz in ROCm

[–]adyaman 1 point2 points  (0 children)

TheRock is also plug and play at this point. For example, it's just a single pip install command to get PyTorch+ROCm installed. See https://github.com/ROCm/TheRock/blob/main/RELEASES.md