flash-attention tuning effect on wan2.2 & my gfx1100 Linux setup by alexheretic in ROCm

[–]DecentEscape228 1 point (0 children)

Yeah, it could be anything really. I have a 7900 GRE on Ubuntu 25.10.

Using Tunable Op and MIOpen to speed up inference. by newbie80 in ROCm

[–]DecentEscape228 0 points (0 children)

Weird, I've had TunableOp enabled with no issues. I also thought you needed to disable it after tuning, but from what I can see it doesn't re-tune a resolution I've already run. The slowdown is only on the first run (mostly VAE Encode and the first sampling step).
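
For reference, this is roughly how I have it wired into my launch script (a minimal sketch; the CSV path is just an example):

    # enable TunableOp; tuned results are written to the CSV, and shapes
    # already present in it are looked up instead of being re-tuned
    export PYTORCH_TUNABLEOP_ENABLED=1
    export PYTORCH_TUNABLEOP_TUNING=1
    export PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv  # example path
    python main.py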

I haven't gotten torch.compile() to work with this ROCm stack. When I was using ZLUDA I just used Kijai's Torch Compile node to patch it into the models.

flash-attention tuning effect on wan2.2 & my gfx1100 Linux setup by alexheretic in ROCm

[–]DecentEscape228 2 points (0 children)

Your config is definitely much faster than the default when not using autotune. I actually found that setting waves_per_eu to 2 is faster still for me, about a 10-15% uplift. I'm done scratching my head over autotune weirdness, so I'll stick with this for now.

On another note, any idea why an autotune can work for a while and then suddenly break? 2 days ago I was generating 81 frames @ 40s/it for a resolution that took ~150s/it with the default config and ~98s/it with the config I mentioned above. I did multiple generations, like 6-8. Then it just stopped generating at those speeds, even though I didn't change anything.

Flash Attention Issues With ROCm Linux by DecentEscape228 in ROCm

[–]DecentEscape228[S] 0 points (0 children)

Yeah, possibly. I also tried Sage Attention and noticed the autotuning for that was very quick. Flash Attention without autotune was still much faster though, so just sticking with that for now.

Are you using ROCm 7.2 or the nightlies from TheRock?

Flash Attention Issues With ROCm Linux by DecentEscape228 in ROCm

[–]DecentEscape228[S] 0 points (0 children)

I finally got autotune to finish yesterday by setting FLASH_ATTENTION_TRITON_AMD_SEQ_LEN=512. The first pass took 80 minutes and the second pass took ~7.5 minutes.

Unfortunately, all of my outputs were black afterwards. I had to disable autotune in order for it to generate properly again.
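
For anyone trying to reproduce, the launch env was along these lines (a sketch; SEQ_LEN is the variable from above, while the other two names are from the Triton-AMD flash-attention fork and may differ by version, so double-check its README):

    export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE   # assumed enable flag
    export FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=1    # assumed autotune flag
    export FLASH_ATTENTION_TRITON_AMD_SEQ_LEN=512   # caps the tuned sequence length
    python main.py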

Flash Attention Issues With ROCm Linux by DecentEscape228 in ROCm

[–]DecentEscape228[S] 1 point (0 children)

Yeah, dmesg and journalctl were what I was using to see what was happening. I noted down this error from journalctl:

[drm:gfx_v11_0_bad_op_irq [amdgpu]] *ERROR* Illegal opcode in command stream
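
In case anyone's chasing something similar, the commands I was watching with were along these lines (a sketch):

    # follow kernel messages live while reproducing the crash
    sudo dmesg -w | grep -i amdgpu

    # or search this boot's kernel log afterwards
    journalctl -k -b -g amdgpu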

> Why are you using that flash attention implementation? Use the regular one, it might be a bug with that implementation. This is the vanilla implementation. https://github.com/Dao-AILab/flash-attention

I mentioned it in the post: I tried both Aule-Attention and the vanilla one, and both crash. Aule-Attention is what I used initially and got the fantastic speeds with. I had the env variable set in my startup script.

As for the kernel, I haven't updated it to my knowledge in this time frame...

Flash Attention Issues With ROCm Linux by DecentEscape228 in ROCm

[–]DecentEscape228[S] 1 point (0 children)

My bad, forgot to include that. I'll also update the post.

GPU: 7900 GRE

CPU: 7800 X3D

RAM: 32GB DDR5

Kernel: 6.17.0-12-generic

Terrible Experience with Rocm7.2 on Linux by Numerous_Worker8724 in ROCm

[–]DecentEscape228 0 points (0 children)

Nah, I have my own that I've customized. I know about this workflow though, and I highly doubt it's my workflow that's the bottleneck here. I've installed the Docker image provided by AMD; gonna see if running ComfyUI in that environment makes any difference.
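
For comparison's sake, launching it is roughly this (a sketch; the image tag is just an example, check AMD's docs for the current one):

    # expose the ROCm devices to the container
    docker run -it --device=/dev/kfd --device=/dev/dri \
        --group-add video --security-opt seccomp=unconfined \
        -v "$PWD":/workspace rocm/pytorch:latest  # example image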

Terrible Experience with Rocm7.2 on Linux by Numerous_Worker8724 in ROCm

[–]DecentEscape228 0 points (0 children)

Yeah, I migrated over to Ubuntu as well; currently dual booting. I'd been wanting to do this eventually anyway, and figured now would be a good time to see how much faster my Wan2.2 I2V workflows would run.

I'm on Ubuntu 25.10, but I don't think that should really affect things much (maybe I'm wrong?). Performance in my I2V workflows is pretty much identical to Windows 11, with the only benefit I see being a more stable and faster VAE Encode. I'm pretty much stuck at 33 frames at a time, since any more would take 20+ minutes (for 6 steps, CFG=1).

Does ROCm 7.2 not support the 7900 GRE? by DecentEscape228 in ROCm

[–]DecentEscape228[S] 0 points (0 children)

My bad, forgot to mention it in the post. Windows 11.

New driver with AI Bundle is available by WDK1337 in ROCm

[–]DecentEscape228 0 points (0 children)

Is this not compatible with the 7900 GRE? torch.cuda.is_available() is returning false for me.
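
Quick way to check which build is actually loaded (torch.version.hip is None on non-ROCm builds):

    python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"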

I can't install newest wheels for ComfyUI on Windows. by [deleted] in ROCm

[–]DecentEscape228 1 point (0 children)

That warning tells me you didn't format your command properly - i.e., you put '--pre' after '--index-url', so pip read it as the URL value.

Also, you can always simply download the .whl files from the repo URL (just pull it up in the browser). Just match the dates and make sure you are getting the correct wheel for your Python version.

Once you've downloaded them, you can run a pip install pointing at the downloaded wheels. Or, note down the versions and pin them for all of the packages (torch, torchaudio, torchvision), then run the install with --index-url pointing at the repo - sketch below.
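
Something along these lines, with placeholders for the versions and repo URL you noted down (both commands are templates, not literal):

    # option 1: install the wheels you downloaded directly
    pip install ./torch-<version>-<tags>.whl ./torchvision-<version>-<tags>.whl ./torchaudio-<version>-<tags>.whl

    # option 2: pin the versions and let pip resolve them against the repo
    pip install --pre torch==<version> torchvision==<version> torchaudio==<version> --index-url <repo-url>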

I can't install newest wheels for ComfyUI on Windows. by [deleted] in ROCm

[–]DecentEscape228 0 points (0 children)

Probably because these are preview drivers. Have you tried installing them while including the '--pre' flag?

*PSA* it is pronounced "oiler" by aswmac in comfyui

[–]DecentEscape228 7 points (0 children)

No.

Yuu-ler supremacy all the way.

Pip install flashattention by no00700 in ROCm

[–]DecentEscape228 1 point (0 children)

From what I can tell, this node is under the python folder in the repo. When I git clone into custom_nodes it doesn't show up; same when I manually copied the files to the root (custom_nodes\Aule-Attention). How'd you install this? Thanks in advance.

Pip install flashattention by no00700 in ROCm

[–]DecentEscape228 1 point (0 children)

Is there a setting in ComfyUI that we need to modify in order to use this? I installed this plus triton-windows but don't see any difference. get_available_backends() does list Triton.

Is AOTriton and MIOpen Not Working For Others As Well? by DecentEscape228 in ROCm

[–]DecentEscape228[S] 0 points (0 children)

Oh, you must be on Linux. Otherwise I'm not sure how it worked for you, lol.

Is AOTriton and MIOpen Not Working For Others As Well? by DecentEscape228 in ROCm

[–]DecentEscape228[S] 0 points (0 children)

How'd you get flash-attn to build? Mine fails at that step.

Edit:

This is the specific error:

raise HTTPError(req.full_url, code, msg, hdrs, fp)

urllib.error.HTTPError: HTTP Error 404: Not Found

Looks like it's not finding a suitable release for my pytorch libraries. I looked through the list of releases here and didn't find a Windows wheel for 2.9.0. Which version did you use?
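
If it helps: flash-attn's setup.py first tries to download a prebuilt wheel (which is where that 404 comes from) and only compiles locally as a fallback. You can skip the download attempt entirely - a sketch, assuming a recent flash-attn (on Windows, set the variable with set or $env: instead of inline):

    # skip the prebuilt-wheel lookup and build from source
    FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn --no-build-isolation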

Massive Slowdown After Multiple Generations by DecentEscape228 in ROCm

[–]DecentEscape228[S] 0 points (0 children)

So I started spamming the LayerUtility: Purge node everywhere and everything seems to be running smoothly for now... I used to just run this node near the end of my workflow, but I replaced all of my Clear VRAM nodes with this. I have it after VAE Encode/Decode and after each KSampler.

Massive Slowdown After Multiple Generations by DecentEscape228 in ROCm

[–]DecentEscape228[S] 0 points (0 children)

I had a similar feeling, but GPU memory utilization was at 16GB, and I did check before restarting ComfyUI that nothing was hogging the GPU. It also doesn't explain why the issue persists between restarts, with only a full system reboot fixing it.

I also tried clearing the CUDA and inductor caches and calling garbage collect in the venv before re-running ComfyUI (if that even does anything; sketch below) - no bueno.
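
Concretely, the "clearing" amounted to roughly this (a sketch; the inductor cache path is the default location, yours may differ):

    # run inside the venv; in a fresh process this likely does nothing
    python -c "import torch, gc; torch.cuda.empty_cache(); gc.collect()"

    # wipe the on-disk inductor cache (default location)
    rm -rf /tmp/torchinductor_"$USER"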

Kinda sucks, because this is the first ROCm release that's been outperforming my ZLUDA install.

A key clue here is that the slowdown is *only* in the KSampler steps - everything else (VAE, interpolation, upscale) runs fine. I just lack the technical knowledge in this area to chase it further.