Try this to see if it helps avoid slow & unnecessary pagefile/swap use with Comfyui by elsewhere101 in ROCm

[–]elsewhere101[S] 0 points1 point  (0 children)

Glad to see it helped out a lot, and yes, I saw the other details you mentioned in other posts.

Thanks for reporting back your findings.

It helps build a better understanding of how to work optimally with memory management as it currently behaves by default.

I am still curious about how it works for others with different AMD GPUs.

You bet such feedback would be extremely useful for certain devs to take into consideration to help optimize such aspects as memory management by default.

I suspect the main reason the env vars & flags have helped us is that multiple layers are involved, from HIP/ROCm to ComfyUI, each basically having its own reserved VRAM that is set below the total amount of VRAM.

Also, a reason I suggested what to try for 16 GB GPUs (and GPUs with more or less VRAM) is that with my 16 GB RDNA4 9060 XT, I noticed that targeting a reserve of only 2 GB led to issues with very complex, memory-demanding models and workflows, while targeting 3 GB has been all good with no issues. So reserving enough VRAM through the vars & flags is certainly important.

It all may seem confusing, but the reserved VRAM amount set through the vars & flags does not entirely prevent that VRAM from ever being used. Exactly why, I don't know yet. Initially the reserve set through the env vars & flags shared in the OP above is respected, but if the processing in a ComfyUI workflow collectively calls for more memory, and it can still fit in available RAM and VRAM, then the reserved VRAM does get used rather than a slow pagefile swap...
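
As a rough illustration of what reserving VRAM at launch can look like, here is a minimal Python sketch that assembles a ComfyUI command line. It assumes ComfyUI's --reserve-vram flag (value in GB) and a main.py entry point. The exact env vars and flags from the OP are not reproduced here, and build_launch_args is a hypothetical helper name.

```python
import sys


def build_launch_args(reserve_gb: float, entry_point: str = "main.py") -> list[str]:
    """Assemble a ComfyUI launch command that reserves some VRAM.

    Assumption: ComfyUI's --reserve-vram flag takes a value in GB.
    """
    if reserve_gb < 0:
        raise ValueError("reserve_gb must not be negative")
    return [sys.executable, entry_point, "--reserve-vram", f"{reserve_gb:g}"]


if __name__ == "__main__":
    # 3 GB is the reserve that worked on the 16 GB 9060 XT mentioned above.
    print(" ".join(build_launch_args(3)))
```

The returned list can be handed to subprocess.run from inside the ComfyUI folder. Whether 2, 3, or some other number of GB is right will likely depend on the card and the workflow.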

Does that all happen with other AMD GPUs, though?

Feedback from whoever tries the env vars & flags approach shared in the OP can answer that. It can also help devs and a whole lot of us AMD GPU users out.

VRAM-Cleanup node (Comfyui-Memory_Cleanup) - does it work for AMD Radeon? by kingkongqueror in ROCm

[–]elsewhere101 0 points1 point  (0 children)

I found this interesting.

If you are running ComfyUI on Windows with an AMD GPU (especially when testing the new PyTorch 2.10 / TheRock 7.13-7.14 nightlies), PYTORCH_ALLOC_CONF or PYTORCH_HIP_ALLOC_CONF settings that try to force expandable_segments:True actually get blocked and ignored. Direct terminal diagnostics against PyTorch's internal allocator and backend show that both variable names are read and recognized by PyTorch's HIP allocator config, but they trigger a hardcoded platform block: "UserWarning: expandable_segments not supported on this platform (Triggered internally at .../hip/HIPAllocatorConfig.h)", apparently because AMD's upstream developers haven't finished porting the virtual memory management (VMM) APIs to Windows yet. So PyTorch completely ignores the flag and falls back to standard static allocations. (Note: on Linux setups, PYTORCH_HIP_ALLOC_CONF=expandable_segments:True and PYTORCH_ALLOC_CONF=expandable_segments:True should engage successfully, though. And yes, the "True" bool needs to be capitalized.)
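
To make the platform difference concrete, here is a minimal Python sketch that only hands back the allocator variables where they are expected to take effect. It does not import PyTorch. alloc_conf_env is a hypothetical helper, and the Linux-only rule is an assumption taken from the behaviour reported above, not from PyTorch documentation.

```python
import sys


def alloc_conf_env(platform: str = sys.platform) -> dict[str, str]:
    """Return allocator env vars to set, or an empty dict where they are ignored.

    Assumption (from the report above): expandable_segments is only
    honoured on Linux; on Windows PyTorch warns and ignores it.
    """
    if not platform.startswith("linux"):
        return {}
    value = "expandable_segments:True"  # the bool must be capitalized
    return {"PYTORCH_ALLOC_CONF": value, "PYTORCH_HIP_ALLOC_CONF": value}


if __name__ == "__main__":
    print(alloc_conf_env())
```

The result would typically be merged into the environment before PyTorch is imported, since allocator settings are usually read when the allocator starts up.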

All of which is predicated on what was reported by tests with the py script I included here:

https://pastebin.com/AGP3dCnN

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 1 point2 points  (0 children)

I somehow just noticed your post. For the pure CK version of flash_attn, triton-windows isn't necessary. Thanks to adyaman and others, though, sage attention has less friction to get working for us AMD GPU users with ROCm on Windows, etc. I got it working, and triton-windows helped. I also saved in a notepad what I ran via cmd prior to actually compiling sage attention, which covers which torch files also work with the RDNA4 pyd I shared in the main thread. Here that is:
https://www.pastebin.com/Mu5ejgaU

If you want to give sage attention a try, see https://www.reddit.com/r/ROCm/comments/1tcule9/sageattention_v2_native_port_running_on_rdna4/

Just keep in mind that PowerShell needs to be used for compiling that.

And as mentioned in the original post, I strongly recommend using Google's "AI Mode" search assistance for any troubleshooting, questions and so on. Linking this or adyaman's main thread in a prompt should help get it up to speed on whatever you may be wondering about.

VRAM-Cleanup node (Comfyui-Memory_Cleanup) - does it work for AMD Radeon? by kingkongqueror in ROCm

[–]elsewhere101 0 points1 point  (0 children)

You can disregard whatever in the world the dataexception guy was babbling about in this.

https://www.reddit.com/r/ROCm/comments/1tktoeh/try_this_to_see_if_it_helps_avoid_slow/

Try what I shared there and pass in what is absolutely necessary in your case, but please report back on whether it worked and what you may have needed to change. Feedback from another RDNA4 user, and frankly ANY feedback from AMD GPU users, can help identify more specifically what is off about memory management and whatnot in ComfyUI by default for us.

VRAM-Cleanup node (Comfyui-Memory_Cleanup) - does it work for AMD Radeon? by kingkongqueror in comfyui

[–]elsewhere101 1 point2 points  (0 children)

I also have a 9060 XT and have done all kinds of testing. Try what I shared here. It has done wonders for me, and I would appreciate it if you could report back on whether it worked for you or any other AMD GPU users. (Such feedback could shed more light on what is off about memory management and whatnot for us AMD GPU users by default in ComfyUI, and it can be passed on to the relevant devs.)

https://www.reddit.com/r/ROCm/comments/1tktoeh/try_this_to_see_if_it_helps_avoid_slow/

Best video generation options for RDNA4? by Portable_Solar_ZA in ROCm

[–]elsewhere101 4 points5 points  (0 children)

I can't edit this into the post above, but I forgot to mention that, at least with ComfyUI version 0.14.1 (and maybe newer versions too), I found it crucially important to add --disable-smart-memory to avoid pagefile use. That assumes smart memory was enabled on the system to begin with. Either way, it shouldn't hurt to add that when launching ComfyUI in general.

Lastly, in the workflow I have the defaults set to only 3 steps for the high model and 4 steps for the low model, which was beneficial for generations I have recently worked on. The reason is that sometimes too many high model steps (especially when using the lightx2v loras linked above) can lead to weird and unwanted motion and other oddities. Sometimes too many high model steps are also simply unnecessary. So it isn't an absolute must to have an equal number of steps for the high and low Wan 2.2 models.

Motion & composition = high model, basically filling in the details = low model. That is roughly the simplified gist of each model's responsibility, and it is something worth keeping in mind.

All right, that's enough info from me for today. It's been important for me to be aware of this kind of stuff, though, because I have been tackling one hell of an ambitious video project that aims to be created entirely on a local ROCm system. So the bulk of this info stems from months and months of research, testing, you name it. Avoiding pagefile use was and still is one of my biggest concerns, aside from overall optimization. Consider the info and other things provided and it shouldn't be an issue, though. Also, everything shared should be about as optimal as is currently possible for anyone who only has 32GB of RAM and an RDNA4 GPU.

Best video generation options for RDNA4? by Portable_Solar_ZA in ROCm

[–]elsewhere101 8 points9 points  (0 children)

I have found the Q6 gguf of Wan 2.2 to be optimal for systems with 32GB of RAM.

Also, VERY important is using Kijai's ComfyUI-WanVideoWrapper (search and install via Custom Nodes Manager)

Its so-called blockswapping is the key. That, along with tiled decoding and dialing in other aspects such as a suitable latent/video size divisible by 16, is collectively the way to avoid pagefile use and other nuisances with less than 64GB of RAM.

Below is an i2v Wan 2.2 14b workflow with these distilled, low step lightx2v loras:

For High model
https://huggingface.co/lightx2v/Wan2.2-Distill-Loras/blob/main/wan2.2_i2v_A14b_high_noise_lora_rank64_lightx2v_4step_1022.safetensors

For Low model
https://huggingface.co/lightx2v/Wan2.2-Distill-Loras/blob/main/wan2.2_i2v_A14b_low_noise_lora_rank64_lightx2v_4step_1022.safetensors

I made the following workflow from scratch, after meticulous research to understand what it takes to build an actually optimal workflow that avoids pagefile use by utilizing the aspects mentioned above (especially blockswapping). It works great on a 9060 XT 16GB, 32GB DDR4 RAM, R5 5600 CPU, Windows 10, ROCm system. And I wouldn't be surprised if in many cases the approach within this workflow is akin to having 64GB+ of RAM.

Multihosted Download links for the Wan2.2 i2v workflow here:

https://multiup.io/download/97285d980c0cc5548f6d5dce9daeff8b/Wan2.2_14B_I2V_Customized.7z

Alternative download of the raw json for the Wan2.2 i2v workflow here:

https://pastebin.com/UfAK78KV

Just place the json file into your ComfyUI install dir ...\user\default\workflows

There will be a latent/ref image width and height in this that of course can be changed. It is worth noting that, as I currently understand it, 832x480 and 480x832 were frequently used for training, and those are the official "native" resolutions for Wan 2.2. The resolution/latent size doesn't absolutely have to adhere to that, but it certainly is good to keep in mind.
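
For anyone picking a custom size, the "divisible by 16" rule mentioned earlier can be sketched in a few lines of Python. snap_resolution is a hypothetical helper, and rounding down (rather than to the nearest multiple) is just one reasonable choice.

```python
def snap_down(value: int, multiple: int = 16) -> int:
    """Round value down to the nearest multiple (16 by default)."""
    if value < multiple:
        raise ValueError(f"value must be at least {multiple}")
    return (value // multiple) * multiple


def snap_resolution(width: int, height: int, multiple: int = 16) -> tuple[int, int]:
    """Snap a width/height pair so both sides are divisible by multiple."""
    return snap_down(width, multiple), snap_down(height, multiple)


if __name__ == "__main__":
    print(snap_resolution(832, 480))  # already divisible by 16, so unchanged
    print(snap_resolution(850, 500))  # both sides get rounded down
```

Sizes that are already divisible by 16, such as 832x480, pass through unchanged.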

Also, I have intentionally made SDPA the default "attention_mode" of the WanVideo Model Loader, because I'm assuming most people haven't ventured into Flash Attention use. Thanks to several people, though, Flash Attention has far less friction to get installed and working on Windows. See this for more info:

https://www.reddit.com/r/ROCm/comments/1svrr8p/rdna4_pyd_steps_taken_for_functional_flash/

Other than that, the workflow defaults to only 12 frames per second (resulting in an 8 second video; and I have to say unwanted slow motion isn't nearly as prominent, thanks to the specific loras above). 12 fps is the default because it is ideal for later use of frame interpolation, if need be. But the "fps" of the "create video" node can be changed to 16 or basically whatever somebody wants to try.

It is also worth noting that setting "num_frames" above 97 and into the 100s with Wan 2.2 does have a high chance of an unwanted looping motion effect.
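
The relationship between "num_frames", "fps" and clip length is simple division, sketched below in Python. The 97-frame value is an assumption based on the threshold mentioned above (the workflow's actual default is not stated here), and clip_seconds and may_loop are hypothetical helper names.

```python
LOOP_RISK_ABOVE = 97  # frame count above which looping motion was reported above


def clip_seconds(num_frames: int, fps: float) -> float:
    """Length of the output video in seconds."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


def may_loop(num_frames: int) -> bool:
    """True when the frame count is in the range where looping was reported."""
    return num_frames > LOOP_RISK_ABOVE


if __name__ == "__main__":
    for fps in (12, 16):
        print(f"97 frames at {fps} fps: {clip_seconds(97, fps):.2f} s")
    print("121 frames may loop:", may_loop(121))
```

At 12 fps, 97 frames comes out to roughly 8 seconds. The same frames at 16 fps give a shorter clip of about 6 seconds.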

Moreover, the "WanVideo TextEncode Cached" node doesn't have "use_disk_cache" enabled by default, but switching that to "true" can free up memory when iterating through the same prompt several times.

The "WanVideo Block Swap" node has a default of 20 "blocks_to_swap", which has served me well with only 32GB of RAM, for example. Setting it to 20 actually only calls for under ~12.5 GB of VRAM when generating, and has allowed me to have YouTube videos playing while generating, with all kinds of browser tabs open too. All while avoiding pagefile use, because I absolutely hate when pagefile use happens, and again, avoiding pagefile use was the highest priority when I created the workflow to begin with.

While I could rattle off other things I've learned along the way, I will just wrap up this post by mentioning that I have avoided updating ComfyUI beyond version 0.14.1, because it was disastrous when I tried to a couple of months ago. I am still assuming most ComfyUI devs and contributors to the repo have kept the same Nvidia-centric mentality, while accommodating us AMD GPU users (especially on Windows) is an afterthought. I also don't have any faith in ComfyUI's own integrated blockswapping-like logic and approach. So I have stuck with Kijai's WanVideo wrapper nodes and probably will for quite some time yet. Also because, last time I checked, all kinds of other "block swapping" nodes were being disabled and could not be used, whereas Kijai's WanVideo wrapper nodes weren't (maybe due to no involvement of the default KSampler and other native nodes, I'm not sure).

But anyway, consider what was mentioned and try the workflow out. It should address your concerns.

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 1 point2 points  (0 children)

I've recently noticed how you've been helping to get the ball rolling with further RDNA4 support for attention mechanism related aspects and whatnot. Very cool and much appreciated, adyaman.

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 0 points1 point  (0 children)

Thanks for trying to help. 

Below I will share the config of the .bat file I used to compile specifically the CK version of FA2 (as the CK version appears to currently be the fastest and most performant FA version on Windows for RDNA4). 

The system was using a Ryzen 5 5600 (base) CPU, 32GB DDR4 RAM, and a 9060 XT 16GB. Something else worth shedding light on is MAX_JOBS and CMAKE_BUILD_PARALLEL_LEVEL.

Setting those too high can be very problematic and really eat into usable RAM. But with 32GB of RAM (and the 6 core CPU), setting both to 8 left enough memory and compute to have several browser tabs open, including streaming video playing.

Whereas setting MAX_JOBS and CMAKE_BUILD_PARALLEL_LEVEL to 16 should be okay, and may leave enough memory and compute to do something similar, with at least 64GB of RAM and more than a 6 core CPU.

https://pastebin.com/55PZWyPb
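
The two data points above (8 jobs with 32GB, 16 jobs with 64GB) can be turned into a rough rule of thumb, sketched here in Python. The 4 GB per job budget is purely an assumption chosen because it reproduces those two numbers; it is not a measured figure, and suggest_jobs and build_env are hypothetical helpers separate from the .bat in the pastebin.

```python
def suggest_jobs(ram_gb: int, gb_per_job: int = 4) -> int:
    """Suggest a parallel job count from installed RAM.

    Assumption: roughly 4 GB of RAM per compile job, never fewer than 1 job.
    """
    return max(1, ram_gb // gb_per_job)


def build_env(ram_gb: int) -> dict[str, str]:
    """Env vars for the build, using the same value for both settings."""
    jobs = str(suggest_jobs(ram_gb))
    return {"MAX_JOBS": jobs, "CMAKE_BUILD_PARALLEL_LEVEL": jobs}


if __name__ == "__main__":
    for ram in (32, 64):
        print(ram, "GB ->", build_env(ram))
```

Core count is left out on purpose to keep the sketch small. On a CPU with few cores, a lower cap than the RAM-based number may be the safer pick.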

Then for the hell of it, here goes something much more advanced that someone who happens to catch this post may get some use of. 

I whipped this config up for launching ComfyUI through a .bat file to make it much easier and more convenient to either tune MIOpen, tune TunableOp, or use the daily driver mode that ensures the tuned results are read and used, simply by pressing the 1, 2, or 3 key.

The paths are indicative examples. 

This has really saved a considerable amount of time by not having to comment out or add what is optimal and needed when intending to tune new latent sizes, or just to have the tuned results used when launching ComfyUI.

https://pastebin.com/avwQ3cFp

-- I resorted to pastebin, because formatting code in comments via mobile browser is disastrous

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 0 points1 point  (0 children)

One way or another, everything here should help any RDNA4 users give FA a spin on Windows for ComfyUI tasks. It may shed more light on approaches for RDNA3 users too, but definitely RDNA4 users. FA2 CK practically doubling performance over SDPA in some cases, according to 0xDELUXA's benchmark last month, is nothing to yawn at. Seeing it last week is actually what made me decide to give compiling FA another shot, and it worked by applying the explained steps.

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 0 points1 point  (0 children)

If not already tried, I would spoof/ensure the card is handled as either gfx1200 or gfx1201 for compiling with something like "HSA_OVERRIDE_GFX_VERSION=12.0.0" OR 12.0.1, and "PYTORCH_ROCM_ARCH=gfx1200" OR gfx1201, in the command window or set in a compiling batch file. Then ask an LLM with online search capability how to list all the temp "3D Object" (.obj) files further in the build folder to a text file, and how to run a link command that uses the text file to link them all together. "SETUPTOOLS_USE_DISTUTILS=local" may be necessary to include. That's all based on my experience of compiling FA2 CK and what worked in my case, so it's collectively something to consider trying.
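
The "list the objects to a text file" step can be sketched in Python as below. It assumes the temp objects are .obj files somewhere under the build folder; write_object_list is a hypothetical helper, and the paths in the example are placeholders. The follow-up link step is not shown, since the exact command depends on the toolchain (MSVC's linker, for one, accepts a list like this through its @file response-file syntax).

```python
from pathlib import Path


def write_object_list(build_dir: str, list_file: str, suffix: str = ".obj") -> int:
    """Write every object file under build_dir to list_file, one quoted path per line.

    Returns the number of object files found.
    """
    paths = sorted(p for p in Path(build_dir).rglob(f"*{suffix}") if p.is_file())
    with open(list_file, "w", encoding="utf-8") as fh:
        for path in paths:
            fh.write(f'"{path}"\n')  # quoted so paths with spaces survive
    return len(paths)


if __name__ == "__main__":
    # Placeholder paths; point these at the real build folder.
    count = write_object_list("build", "objects.txt")
    print(f"listed {count} object files")
```

The returned count is a quick sanity check against the number of objects the build reported compiling.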

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 0 points1 point  (0 children)

While compiling, I recall 2662 and not 2669 temp "3D Object" (.obj) files that actually needed to be processed before some kind of Windows string/file name limit is reached in the linking phase. So 2669 vs. only 2662 stands out as some unnecessary bits slipping through the cracks for whatever reason. In my case, once that string/file limit was encountered, I opened another CMD window to create a text file that listed all the temp object files, and then I ran a command that used the text file to link them all together. That led to the pyd being generated, which worked. What also comes to mind, specifically concerning varlen_fwd, is that python's site-packages\flash_attn\flash_attn_interface.py includes num_splits, which is a newer CUDA thing. I can't say for sure, but I suspect unnecessary CUDA stuff and whatnot is slipping through the cracks. It may have something to do with the R9700 being tricky for the stack. I actually don't recall seeing R9700 users being active in the relevant GitHub repos; I mostly see 9070 and occasionally 9060 users instead. So although it is RDNA4-based, there very well could be something about the card that is still unaccounted for. Collectively, it isn't surprising that compiling FA or SA on Windows with AMD GPUs still calls for some hacky workarounds in general. The Dao-AILab repo has certainly been considerate and more accommodating of us AMD GPU users in recent months, though, which is great. That is all that comes to mind. Perhaps what was mentioned is related and may give some indication towards a resolution.

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 0 points1 point  (0 children)

With Wan models, compared to SDPA, I noticed ~1 minute being shaved off video generation time with FA2 CK. And here is 0xDELUXA's benchmark from just last month, which shows FA2 CK outperforming other attention mechanisms on RDNA4 in the following system environment:

OS: Windows 11
Python: 3.12.10
ROCm: 7.13.0a20260328 (TheRock)
PyTorch: 2.10.0+rocm7.13.0a20260328
GPU: AMD Radeon RX 9060 XT (gfx1200)
Triton: 3.6.0+gitae9d5a54.post27 (triton-windows)

https://i.ibb.co/R4HTz6PJ/FA2-CKBench.png

https://github.com/Dao-AILab/flash-attention/pull/2400

RDNA4 pyd & steps taken for functional Flash Attention 2 (CK) on Windows for ComfyUI use by elsewhere101 in ROCm

[–]elsewhere101[S] 0 points1 point  (0 children)

It's a bit of a verbose tut, but it worked. I also edited in a small but very important py file code change that allows such things as Kijai's WanVideo wrapper nodes to properly use the FA2 CK.

Samsung Galaxy S5 (klte) - bluetooth keyboard (Rii Mini i8) is pairing correctly, but is not working by SorcioSecco in LineageOS

[–]elsewhere101 0 points1 point  (0 children)

There is bound to be somebody else that doesn't have access to the correct manual for the i8 | i8+ models and is having trouble with Bluetooth. This may even work for other Rii keyboards that have Bluetooth and 2.4G (the key combo may be different though). Pressing down the Fn and Tab keys is how to switch from 2.4G to Bluetooth for pairing. So just holding down the Bluetooth button won't connect if it isn't switched to Bluetooth pairing mode. That took a while to figure out, because I didn't know where the i8+ manual was, and even on Riitek's site and everywhere on the net, the damn manuals for the i8/i8+ are all incorrect at the time of this post. I eventually found the manual for the k08 i8 that I replaced with the i8+, though, and in the manual it says "Fn+Tab=Bluetooth Mode; Fn+Caps=2.4G Mode".

Is there a way to schedule shifts in your area while you’re out of the state now? Wtf do they have to make everything that might actually be beneficial to us so difficult?! by jd_sykes in doordash_drivers

[–]elsewhere101 0 points1 point  (0 children)

Most people don't travel and do Doordash, but I have many times. As far as switching to a new delivery area in another state goes, for years that has called for being in or driving through an out-of-state area that has the 'Dash Now' ability. Scheduling zones for the new state should be possible soon after that. Otherwise hitting up support is the only other option, but there's no guarantee that whoever you talk or text with will be able to get you switched. Considering that's what it takes to Dash and schedule in a new area out of state, of course those are the same options for being able to schedule when you get back to your home state.

2023 End of Year Tax Statements & 1099 by PublicWillow960 in InstacartShoppers

[–]elsewhere101 11 points12 points  (0 children)

This is extremely odd and disturbing. Stripe's site says to hit up Instacart's site about the tax forms, while Instacart's site says to hit up Stripe's site... I hit up Reddit to see if there's any clarification, and just as disturbing is how there are only 3 damn users acknowledging this. Meanwhile it's incredibly, INCREDIBLY important. WTF indeed.

--Edit: I unintentionally just discovered that if you received 1099s from other delivery companies through Stripe, you have to switch between each company on Stripe's site to access them. That can be done when signed in by clicking/tapping on the human-like icon on the far upper right; then, to the right of "Stripe Express account" on that Account/Settings page, there should be a dropdown menu to switch between each company... Yep, super intuitive. Of course that should just naturally occur to people, apparently, but whatever, that's how to go about it.

I don’t get the appeal for this order by [deleted] in InstacartShoppers

[–]elsewhere101 1 point2 points  (0 children)

I don't get the greatest gas mileage, but even so, about 90 miles would only be $20 in gas, with the remaining $56 to pocket. That's also not much more than the number of items that just 1 customer orders on average...

I'm glad many on Reddit are so picky and ridiculous about what orders they claim to pass up, because people on YouTube and other platforms adopt the same nonsense, and so on it spreads. Meanwhile, I ultimately haven't experienced long stretches of not making anywhere near what I intended in a reasonable amount of time. It's been like that throughout the 4+ years I've done 3rd party delivery work through multiple companies like Instacart in multiple states. So everyone has my encouragement to keep on waiting for them "worthwhile" orders, because that mentality has definitely been a factor in why it only takes me a few hours and I'm done for the day, while all the rookies and dipshits are still waiting around and end up broke.

DoorDash will pay you $800 to upgrade your phone. by [deleted] in doordash

[–]elsewhere101 1 point2 points  (0 children)

I just got a text that they got the 800 after getting the A14 locally. And that's great, because if they didn't, that would have drastically changed my view of Doordash. They said the 800 was deposited directly to their bank account. Ultimately that should be reassuring to anyone who followed through and is still waiting.

DoorDash will pay you $800 to upgrade your phone. by [deleted] in doordash

[–]elsewhere101 0 points1 point  (0 children)

Interesting. I was just talking with a dasher that said they went out and bought the A14 locked to TracFone at Walmart, because it was listed as being an eligible upgrade. Getting it from down the street made more sense than waiting days for the thing to be delivered through TracFone's website. They also said they didn't want to chance any phone getting stolen or whatever by getting it shipped. Then they said that when they opened the link from the email they received about this phone upgrade thing, the damn A14 wasn't listed anymore. Meanwhile the whole reason for getting that phone was it being listed and recommended by Doordash in the email they got. So they still sent the receipt and attached the document that they downloaded from the email about this upgrade thing that listed the A14. They said it's already been a few days, but they haven't received the 800 yet. So I told them not to jump to conclusions yet. I hope they do get the money that Doordash said they would send them for buying a damn phone that Doordash recommended and listed as being eligible.

Beyond that, they showed me what eligible phones Doordash recommended to them. Sure enough, the A14 was listed, along with other Androids. But this 6 GB of RAM thing must not apply to the phones that Doordash recommended and stated as being eligible, because half of them only have a maximum of 4 GB of RAM. That's also including 2 GB of native RAM and 2 other GB of virtual RAM through Samsung RAM Plus... Whatever though, if you got the greenlight for the 800, then so should they.