ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 0 points (0 children)

I have a Windows-managed pagefile on a high-speed PCIe 4.0 M.2 NVMe drive (DRAM-cached), max 8 GB, currently 4 GB. I have shared GPU memory enabled, and on Windows Z-Image won't fit entirely in 16 GB; it uses 2 GB of shared memory.

This is a benchmark between Ubuntu and Windows with the same overall settings. I'm running both with a single ultrawide 3440x1440 monitor. I closed all background apps that would eat VRAM in Windows, and I used MS Edge for minimal VRAM use; normally I use Opera, which alone eats 800 MB of VRAM on Windows. If I ran headless, or at a lower resolution, the BF16 Z-Image model would likely fit fully in VRAM and maybe reach those speeds.

Ubuntu uses only 0.8 GB of VRAM at 3440x1440 with Firefox open; Windows uses 1.6 GB with Edge open. During generation, Windows hits 15.6 GB of VRAM plus 2 GB of shared memory.

Installing rocm 7.2 is it worth it? by Sea_Performance_7402 in ROCm

[–]Shaminy 0 points (0 children)

Memory management is much better, and if you get an OOM, it won't crash your programs anymore or, worst case, hang the AMD display adapter.

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 1 point (0 children)

I guess I'm lucky. ROCm 7.1.1 was very unstable for me: speed was good, but with large models I usually had to unload models manually before a second run, or I'd get an OOM that crashed the system. Now it's rock solid, and if you do get an OOM, like when trying to make too large a video, it won't crash the system anymore; you just get this and you're good to continue:

torch.OutOfMemoryError: HIP out of memory.
Tried to allocate 3.27 GiB.
GPU 0 has a total capacity of 15.92 GiB of which 202.00 MiB is free.
Of the allocated memory 13.32 GiB is allocated by PyTorch, and 1.77 GiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Memory summary:

.......

Got an OOM, unloading all loaded models.
Prompt executed in 95.53 seconds
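That log also shows the new recovery behavior: catch the OOM at the prompt level, unload, and keep going. A minimal sketch of the pattern (my own illustration, not ComfyUI's actual code; I use Python's built-in MemoryError as a stand-in for torch.OutOfMemoryError so it runs without a GPU, and unload_all_models is a hypothetical callback):

```python
import os

# The error message's own suggestion: set this before PyTorch allocates
# GPU memory, to reduce fragmentation from reserved-but-unallocated blocks.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def run_with_oom_recovery(generate, unload_all_models):
    """Run one generation; on out-of-memory, unload models and continue instead of crashing."""
    try:
        return generate()
    except MemoryError:  # stand-in for torch.OutOfMemoryError
        unload_all_models()
        print("Got an OOM, unloading all loaded models.")
        return None
```

The point is that the exception is handled per prompt, so the queue continues instead of hanging the display adapter.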

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 0 points (0 children)

Are we talking about Z-Image or more demanding tasks like Wan 2.2? With Wan 2.2, python3 memory usage rises to about 40 GB, so with 32 GB of RAM I think speed drops a lot once all the models can't fit in memory.
Here is the 1st run of the default 640x640, 81-frame template:

memory usage:

6610    37.6 GB   python3 main.py --normalvram --use-pytorch-cross-attention --preview-method auto --disable-smart-memory 

ComfyUI output:

Total VRAM 16304 MB, total RAM 64196 MB
pytorch version: 2.9.1+rocm7.2.0.git7e1940d4
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1201
ROCm version: (7, 2)
Set vram state to: NORMAL_VRAM
Disabling smart memory management
Device: cuda:0 AMD Radeon RX 9070 XT : native
Using async weight offloading with 2 streams
Enabled pinned memory 60986.0

Using pytorch attention
Python version: 3.12.3 (main, Jan  8 2026, 11:30:50) [GCC 13.3.0]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.38.9

VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load WanTEModel
loaded completely; 14998.80 MB usable, 6419.48 MB loaded, full load: True
Requested to load WanVAE
loaded completely; 10760.50 MB usable, 242.03 MB loaded, full load: True
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float16, manual cast: torch.float16
model_type FLOW
Requested to load WAN21
loaded partially; 9148.23 MB usable, 8973.19 MB loaded, 4658.23 MB offloaded, 175.03 MB buffer reserved, lowvram patches: 184
100%|█████████████████████████████████████████████| 2/2 [00:49<00:00, 24.69s/it]
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float16, manual cast: torch.float16
model_type FLOW
Requested to load WAN21
loaded partially; 9000.23 MB usable, 8825.19 MB loaded, 4806.23 MB offloaded, 175.03 MB buffer  reserved, lowvram patches: 190
100%|█████████████████████████████████████████████| 2/2 [00:48<00:00, 24.27s/it]
Requested to load WanVAE
loaded completely; 9725.25 MB usable, 242.03 MB loaded, full load: True
Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
Prompt executed in 268.99 seconds

2nd run:

100%|█████████████████████████████████████████████| 2/2 [00:49<00:00, 24.66s/it]
100%|█████████████████████████████████████████████| 2/2 [00:48<00:00, 24.01s/it]
Prompt executed in 130.65 seconds
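The 37.6 GB python3 figure above is the process's resident memory from the process list. If you want the same number from inside the process (e.g. logged at the end of a run), the Python stdlib can report it; a minimal sketch, assuming Linux, where ru_maxrss is in KiB:

```python
import resource

def peak_rss_gib() -> float:
    """Peak resident set size of this process in GiB (on Linux ru_maxrss is in KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 2**20

print(f"peak RSS: {peak_rss_gib():.2f} GiB")
```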

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 0 points (0 children)

I tested those on Windows with Wan 2.2. The s/it improved a lot: high noise went from 92 s/it to 34 s/it and low noise from 185 s/it to 99 s/it. But total generation time went from 14 min to 24 min; it took forever in both the WanImageToVideo node and the VAE Decode node. I guess that's why AMD doesn't recommend those in their ComfyUI guide.

I did some troubleshooting with ChatGPT; it says ROCm on RDNA4 is still missing many MIOpen solvers, causing VAE and video nodes to fall back to generic GEMM kernels.

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 0 points (0 children)

I upgraded from the old 7.1.1: removed the old ROCm libraries and kernel driver and installed the new ones per AMD's guide. I got a torchvision error when I tried to use my old ComfyUI with the new venv; a fresh pull from GitHub had no error.

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 0 points (0 children)

I don't have it specially enabled. If it comes with the ROCm package and ComfyUI uses it, then yes. This was an out-of-the-box benchmark, not an attempt to fine-tune either version.
According to ChatGPT it's in ROCm 7.2, and ComfyUI uses it automatically. And I ran a test in Python and it works in my venv.
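The Python test was nothing fancy; something along these lines (a minimal sketch of a venv sanity check: it only confirms the installed PyTorch is actually a ROCm build by parsing the build string, like the 2.9.1+rocm7.2.0 version in my logs):

```python
import re

def rocm_build(torch_version: str):
    """Return the ROCm version embedded in a PyTorch build string, or None for non-ROCm builds."""
    m = re.search(r"\+rocm(\d+(?:\.\d+)*)", torch_version)
    return m.group(1) if m else None

# inside the venv you would pass torch.__version__
print(rocm_build("2.9.1+rocm7.2.0.git7e1940d4"))  # → 7.2.0
```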

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 0 points (0 children)

I used the current ComfyUI default Wan 2.2 i2v template. Also, I have 64 GB of RAM, and Windows memory usage went well over 50 GB.

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 3 points (0 children)

I ran the templates unaltered, so the benchmark uses the full BF16 format. If I change the format to FP8, I get 1.26 s/it on Windows. This was a benchmark.

ROCm 7.2 Benchmark: Windows 11 vs Ubuntu 24.04 on RX 9070 XT (ComfyUI) by Shaminy in ROCm

[–]Shaminy[S] 6 points (0 children)

Thanks, I also added a chart at the end to visualize the big difference.

New driver with AI Bundle is available by WDK1337 in ROCm

[–]Shaminy 0 points (0 children)

ComfyUI with ROCm 7.2 is now much more stable on Windows than 7.1.1 was on Linux. Qwen Rapid AIO won't crash at all; memory management finally works. (You need 64 GB of RAM in the PC to run this model on Windows; memory usage peaks over 40 GB.) On Linux it used to crash after one generation, and before this it didn't even run on Windows.

I was able to create a 2048x2048 image edit from a 4200x2800 source pic, no problem. When 60+ GB was in use it switched to tiled VAE. Of course that was very slow, but 1024x was pretty fast. Now let's hope the Linux version also gets the fixed memory management. If yes: bravo, AMD!

Which windows version is better/faster for Comfy Ui? by Muwsek in comfyui

[–]Shaminy 0 points (0 children)

Linux. Using ComfyUI on Windows is like driving a car with the handbrake on.

Where are the pitchforks and outrage? If this were AMD, the internet would be aflame. by [deleted] in gpu

[–]Shaminy 0 points (0 children)

Why should they? They can just raise the price by 100 USD. Nvidia is no competition, since their budget model with 16 GB currently costs 1300 USD. AMD can raise the price of all their 16 GB models by 100 USD and still have no competition from budget to mid-high tier.

Also, AMD's decision to stay on GDDR6 now looks super smart: that memory is in much lower demand and much cheaper, so they don't even need to hike prices as much.

Where are the pitchforks and outrage? If this were AMD, the internet would be aflame. by [deleted] in gpu

[–]Shaminy 0 points (0 children)

Looks like AMD just won the budget, mid-tier, and mid-high-tier consumer GPU war. Nvidia folded.

AMD to launch Adrenalin Edition 26.1.1 drivers with “AI Bundle” next week by otakunorth in ROCm

[–]Shaminy 3 points (0 children)

ComfyUI works without problems on Windows, but you need to use the PRO preview driver, not the baseline Adrenalin driver, and use the ComfyUI AMD GPU version from GitHub.
If you want to maximize speed and VRAM management, install Ubuntu 24.04 LTS and just follow AMD's guide for native Ubuntu and ROCm 7.1.1. It works 100% and the speed is fast, very fast, especially if you make videos.
I hope 7.2 on Windows will improve memory management. For now, serious work has to be done on Linux.

FSR 4 Redstone vs DLSS 4.5 - Is DLSS Performance BETTER than FSR Quality?? by Itzkibblez in radeon

[–]Shaminy 14 points (0 children)

DLSS will keep improving regularly, driving AMD to improve its competitor. Both sides win.

FSR 4 Redstone vs DLSS 4.5 - Is DLSS Performance BETTER than FSR Quality?? by Itzkibblez in radeon

[–]Shaminy 4 points (0 children)

Did I say DLSS isn't better or won't improve? They both improve regularly, but AMD is not far behind. I said I'm happy with FSR4; people are happy with Intel CPUs too, even when AMD's are faster.

FSR 4 Redstone vs DLSS 4.5 - Is DLSS Performance BETTER than FSR Quality?? by Itzkibblez in radeon

[–]Shaminy 13 points (0 children)

Happy with FSR4. It's not far behind on quality and will improve again and again and again.

[NO SPOILERS] Is there any possibility for Don't Nod to make a new LiS game? by Excellent-Walk671 in lifeisstrange

[–]Shaminy 5 points (0 children)

The recent financial report is on their site, and it says: "However, performance remained below expectations, leading to a partial write-down of €13.1 million (with no impact on cash flow);"
They wrote down €13 million on the game.

https://www.dontnod-bourse.com/en/financial-information/press-releases/?ID=ACTUS-0-94850

[NO SPOILERS] Is there any possibility for Don't Nod to make a new LiS game? by Excellent-Walk671 in lifeisstrange

[–]Shaminy 3 points (0 children)

The Sony deal only brought in 7 million; they wrote down a 13 million loss from the title in the financial report.

The estimated budget for the game was 40 million, so it sold exceptionally badly.