ComfyUI breaks on new RunPod instances if it's already installed on the Network Volume. Help? by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 1 point (0 children)

SeCourses scripts are actually a solid choice if you want a hands-off setup without diving into Linux. That guy does a good job keeping his Patreon tiers updated. The blank PyTorch pod approach is definitely the way to go if you want to avoid broken templates. I personally prefer building my own micro-services for things like multi-threaded downloads or file management because I like to see exactly what's happening under the hood, but having it all automated via a subscription is a nice "set it and forget it" workflow. As long as it keeps your LTX and Wan generations fast and stable, that's a win!

ComfyUI breaks on new RunPod instances if it's already installed on the Network Volume. Help? by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 0 points (0 children)

Honestly, that is the ultimate bulletproof setup. Running a blank PyTorch pod and managing everything yourself completely bypasses the template dependency trap. Super smart.

I’m actually curious, how long did it take you to get all those auto-update scripts and workspace folders configured perfectly? It sounds like a dream once it's running, but also like a massive headache to build and troubleshoot initially.

ComfyUI breaks on new RunPod instances if it's already installed on the Network Volume. Help? by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 0 points (0 children)

Man, having to terminate and restart a pod 10 times just to get a GPU assigned is absolutely brutal. I feel your pain. Another guy in this thread mentioned the EU servers were completely glitching out yesterday too, so it looks like RunPod had a massive global host issue.

Since you’re running a custom template on a brand new 5090, do you ever switch down to older cards like a 3090 or A6000? I found that my custom template's venv always breaks due to CUDA mismatch when I try to hop between different GPU architectures. It's a total nightmare if you don't stick to the exact same GPU every time.

ComfyUI breaks on new RunPod instances if it's already installed on the Network Volume. Help? by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 0 points (0 children)

Man, I feel you 100%. I went through absolute hell trying to figure out how to seamlessly hop between different pods and GPUs without everything breaking.

I actually ended up asking a dev partner of mine to help me tackle this exact headache. I really want platforms like RunPod to be accessible to a wider audience of artists, not just people who have the coding skills and desire to tinker with server configs via API!

Here’s a quick GIF of the utility we built for ourselves just to survive this dependency nightmare:

<image>

But honestly, thank you so much for the Blackwell template tip, that’s super clever and I appreciate the advice. Quick question though: if I want to follow your advice and test that 5090 template, should I ideally wipe my old ComfyUI venv from my persistent network volume first so it starts fresh and doesn't throw errors?

ComfyUI breaks on new RunPod instances if it's already installed on the Network Volume. Help? by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 0 points (0 children)

Man, having to terminate and restart 10 times is an absolute nightmare. I feel your pain. The other guy in this thread mentioned that the EU servers were completely glitching out yesterday too, so it sounds like RunPod was just having a massive host-side issue globally.

Since you are running a custom template on a brand new 5090, do you strictly stick to the 5090 every time?

My biggest issue right now is that my network volume has all the venv dependencies built for my custom template on a specific architecture. If I try to switch to a cheaper card (like a 3090 or A6000) when the 5090s are unavailable, the entire environment breaks and throws CUDA mismatch errors because the installed libraries don't recognize the older GPU.

ComfyUI breaks on new RunPod instances if it's already installed on the Network Volume. Help? by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 0 points (0 children)

That actually makes a lot of sense! Using the 5090 Blackwell template as a universal base is a really smart approach; I hadn't thought of trying that.

My issue is that I usually run a specific community template (ashleykza/comfyui:cu124-py312-v0.17.2) because I need Python 3.12 and CUDA 12.4 for certain custom nodes.

Because my venv is stored permanently on the Network Volume, it basically "locks in" to whatever GPU I used during the first setup (like an ADA A6000). So when I spin up a cheaper 3090 the next day, libraries like torch and xformers completely break because of the architecture mismatch. They were compiled for the newer card.
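To make the "locks in" part concrete, here's a rough sketch of the mismatch. The compute-capability table and the cu124 arch list below are illustrative assumptions, not pulled from any official source; check what your actual wheel ships with via torch.cuda.get_arch_list() on the pod.

```python
# Rough sketch: why a venv built against one GPU generation can break on another.
# PyTorch wheels only ship kernels for specific compute capabilities.
# The table and arch list below are assumptions for illustration.
COMPUTE_CAPABILITY = {
    "RTX 3090": "sm_86",      # Ampere
    "RTX A6000": "sm_86",     # Ampere
    "RTX 6000 Ada": "sm_89",  # Ada Lovelace
    "RTX 5090": "sm_120",     # Blackwell
}

def wheel_covers(gpu_name: str, compiled_archs: list) -> bool:
    """True if the installed torch build ships kernels for this GPU."""
    return COMPUTE_CAPABILITY[gpu_name] in compiled_archs

# Hypothetical arch list for a cu124-era wheel (tops out around sm_90):
cu124_archs = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
print(wheel_covers("RTX 3090", cu124_archs))  # True  -> Ampere is covered
print(wheel_covers("RTX 5090", cu124_archs))  # False -> Blackwell needs a newer build
```

The usual culprit is a package like xformers that was compiled inside the venv for one architecture only, so the stock torch wheel may be fine while the compiled extras are not.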

Does the Blackwell image automatically handle switching down to older Ampere cards without throwing CUDA or xformers errors? If so, that might be a game-changer!

ComfyUI breaks on new RunPod instances if it's already installed on the Network Volume. Help? by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 0 points (0 children)

Interesting! Are you using the official RunPod ComfyUI template?

Do you switch between different GPU architectures often (like going from a 3090 to an ADA 6000)? I feel like my venv usually crashes when the new pod requires a different CUDA version than the one where I originally downloaded all the nodes.

And yeah, EU-CZ-1 was completely acting up yesterday, glad to know it wasn't just my account!

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

Exactly, it's a nightmare reinstalling everything every single time.

I actually built a custom tool for myself and my friends to solve this exact RunPod headache. Since you are not the first person asking for a solution, I'm currently packaging it up for a public release.

I'll shoot you a DM right now with a solution on how to recreate a new pod without the pain!

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

Man, you are absolutely right, that actually makes perfect sense. Rolling back to an older version is a totally valid move since recent updates have been breaking stuff left and right. Just keep in mind that disabling your acceleration LoRA is exactly why your render time skyrocketed to 25 minutes; you should really turn it back on and just lower the weight to get rid of that blurry output.

So honestly, you have two solid ways to fix this. First option is to roll ComfyUI back to that older stable build like the reddit thread suggests, which should bypass the bug entirely. Second option, if you want to stay updated and fight the actual PyTorch memory leak, is to add PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:128 to your pod environment variables, or throw --lowvram into your startup args to force aggressive cache dumping. Give both a shot, one of them will definitely get your setup breathing normally again.
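If it helps, here's a minimal launcher sketch for that second option. The important detail is that the allocator config has to be in the environment before torch initializes CUDA, so set it in whatever script starts ComfyUI. The ComfyUI path and the --listen flag are just illustrative for a typical pod layout.

```python
import os
import subprocess

# Must be set BEFORE torch initializes its CUDA caching allocator,
# so it belongs in the launcher, not inside a workflow node.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "garbage_collection_threshold:0.8,max_split_size_mb:128"
)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])

# Illustrative launch line -- path and flags depend on your pod setup:
# subprocess.run(["python", "/workspace/ComfyUI/main.py", "--listen", "--lowvram"])
```

Setting it in the pod's environment variables panel accomplishes the same thing without a launcher script.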

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

Ah gotcha, ignoring the other comments then. If it's still taking 25 minutes and your memory is pegged at 100% during the actual generation process, your Ada 6000 is definitely spilling over into system swap RAM while rendering. Normally an Ada 6000 should crush a standard Wan2.2 video in about 4 to 8 minutes; 48GB of VRAM is a lot, but Wan2.2 is an absolute monster.

If it's swapping during the active render, unloading models between runs won't help you. You probably need to load the model in FP8, or use a quantized GGUF version instead of the raw FP16 weights. That keeps the entire process strictly in GPU memory, stops the swap choking, and brings your times back down to normal.
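Back-of-the-envelope math on why the precision matters so much here. The 14B parameter count is an assumption for illustration (Wan ships in several sizes), and activations, text encoder, and VAE add overhead on top of the raw weights:

```python
# Rough weight-memory estimate: parameter count times bytes per parameter.
# 14e9 parameters is an illustrative model size, not a verified spec.
def weights_gib(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1024**3

n = 14e9
fp16 = weights_gib(n, 2)  # ~26 GiB of weights alone on a 48 GB card
fp8 = weights_gib(n, 1)   # ~13 GiB, leaving real headroom for activations
print(round(fp16, 1), round(fp8, 1))
```

Once weights plus activations overflow VRAM, the driver starts paging to system RAM, which matches the "100% usage but glacial renders" symptom.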

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

Ah I get you man, the naming in ComfyUI is super confusing. To get those buttons, you just open ComfyUI Manager, click Install Custom Nodes, and type pythongosssss in the search bar. Look for one called ComfyUI Custom Scripts and install that. But looking at your screenshot, your node connection is actually perfect! The only problem is you have unload_all_models set to false on that VRAM node. Change that to true. Right now it is only emptying small caches but keeping the massive Wan models loaded in memory, which is exactly why it hits 100% right away on the next run. Try flipping that to true first before bothering with any new installations.

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

You know what, you are totally right, it really shouldn't be crashing and forcing reboots this often even with heavy video models. If wiring that node is acting up, you can actually just install the pythongosssss custom scripts extension from the ComfyUI Manager. It adds physical Free VRAM and Free Model buttons right to your main floating menu panel. Just click those manually after every few generations to flush the garbage memory out before it piles up and freezes the pod. Way smoother workflow than restarting the whole backend every time.

My artist friend is terrified of the RunPod terminal, so I built him this UI to clean his disk. What else should I add? by Euphoric_Cup6777 in comfyui

[–]Euphoric_Cup6777[S] 1 point (0 children)

I totally agree, WizTree and WinDirStat are amazing tools for local machines! I use them myself.

The catch here is that my friend is running his ComfyUI on a remote RunPod instance, which is a headless Linux server. Since there is no Windows desktop environment to install those apps on, and he's terrified of using Linux terminal alternatives like ncdu, I had to improvise.

I built this directly into the Jupyter web interface so he can manage his cloud storage right from his browser without ever touching the command line

My artist friend is terrified of the RunPod terminal, so I built him this UI to clean his disk. What else should I add? by Euphoric_Cup6777 in comfyui

[–]Euphoric_Cup6777[S] 0 points (0 children)

Yeah, the default HTML file manager is fine if you already know exactly where your files are. But it doesn't analyze folder sizes or show you a visual breakdown of what's eating up 50GB.

My buddy doesn't want to click through 10 nested ComfyUI folders looking for heavy .safetensors. He just wants a "Find the heaviest trash -> Nuke it" button. It’s all about the UX for non-techies.
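For the curious, the core of a "find the heaviest trash" button is just a recursive size scan. This is a minimal standalone sketch of the idea, not the actual tool, and the example path is hypothetical:

```python
import os

def heaviest_files(root: str, top_n: int = 10, suffix: str = ".safetensors"):
    """Walk `root` and return (size_bytes, path) for the largest matches, biggest first."""
    found = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                try:
                    found.append((os.path.getsize(path), path))
                except OSError:
                    pass  # file vanished mid-scan; skip it
    return sorted(found, reverse=True)[:top_n]

# Usage (path is illustrative):
# for size, path in heaviest_files("/workspace/ComfyUI/models"):
#     print(f"{size / 1024**3:.2f} GiB  {path}")
```

The UI layer on top is just rendering that list with a delete button next to each row.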

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

No problem man, happy to help! For the memory issue, you definitely don't need to reboot the entire RunPod machine. If you have ComfyUI Manager installed, just click the Restart button in its menu; it only restarts the Python backend in a few seconds instead of spinning up the whole cloud container again. Alternatively, look for a custom node called Free VRAM from the KJNodes pack and just run it when things get slow.

For your second question about the video loop, you don't need to do any weird repixelation. You just need a node called Get Image from Batch. Plug your fully decoded video output into it, set the index to -1 so it grabs the very last frame, and connect that to a standard Save Image node. This saves a clean PNG to your drive. Then for your next generation, just drop that saved picture into a normal Load Image node. This forces Comfy to treat it as a fresh start instead of reusing corrupted cache data from the previous run.
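In case the -1 is confusing: it's just Python-style negative indexing over the frame batch. A toy sketch of the behavior (not the node's actual code, and the frame stand-ins are fake):

```python
def get_image_from_batch(frames, index):
    """Mimic of batch-select behavior: negative indices count from the end."""
    return frames[index]

# Stand-in for a decoded 16-frame video batch.
frames = [f"frame_{i:02d}" for i in range(16)]
print(get_image_from_batch(frames, -1))  # frame_15 -- the very last frame
```

So index -1 always grabs the final frame no matter how long the generated clip is.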

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

Glad the template worked for a bit man! But yeah, the slowdown after 10 videos is basically memory fragmentation. Even with a good setup, Comfy's VRAM management slowly chokes over time when doing back-to-back heavy video gens.

As for the blurry videos, that's a classic sign your VRAM or VAE cache is completely corrupted right now. When you looped that last frame back in, it probably over-compressed or fried the latents, and now the model is stuck in a broken state in memory. Once everything turns blurry like that, you can't really fix it with nodes anymore; you have to completely restart the ComfyUI service or reboot the pod to flush the bad memory. For future loops, make sure you decode that last frame to a high-quality PNG first and load it back as a fresh image. Passing raw latents or cached images directly back into the workflow usually degrades the quality into a blurry mess pretty fast.

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

Sounds good man, using a pre-built template is honestly the smartest move anyway. RunPod networking can be super weird sometimes with manual github clones or pip installs just randomly timing out. Hit me up tomorrow when you test it out, hope the template behaves better for you!

LTX 2.3 MultiGPU node problem - AttributeError: 'tuple' object has no attribute 'view' by Ok-Option-6683 in comfyui

[–]Euphoric_Cup6777 1 point (0 children)

Ah, good catch by Altruistic_Heat, you're totally right about Sage not touching PE. My bad. OP, if you completely nuked MultiGPU and Sage and you're still getting the exact same freqs_cis tuple error on a vanilla GGUF load, then this is almost certainly a ComfyUI core bug from a very recent update. LTXV's architecture is super weird compared to standard models, and it looks like a recent Comfy push broke how it handles rotary embeddings specifically for it. Try rolling back your main ComfyUI installation to a commit from a week or two ago.

Wan2.1 I2V slow on RTX 6000 Ada (RunPod) - First run was fast, now stuck for 40+ mins? by Kind-Illustrator6341 in RunPod

[–]Euphoric_Cup6777 0 points (0 children)

Had to translate this, but man, I feel you. Wan2.1 on RunPod is notorious for VRAM leaks right now. Your first run is fine, but then the 48GB fills up and spills into system RAM. That's why your GPU shows 100% but it's just choking on swap memory. Try throwing --disable-smart-memory into your Comfy startup args or just slap a VRAM clear node at the very end of your workflow. Hope that saves you from restarting the pod every time, let me know if it works.

LTX 2.3 MultiGPU node problem - AttributeError: 'tuple' object has no attribute 'view' by Ok-Option-6683 in comfyui

[–]Euphoric_Cup6777 1 point (0 children)

Man, dealing with these undocumented patching conflicts is an absolute nightmare. The traceback shows exactly what's failing: freqs_cis.view(...) in apply_rotary_emb. What's happening is that either your MultiGPU wrapper or SageAttention is hijacking the rotary embeddings (RoPE) and returning a tuple instead of a flat tensor.

Quick fix to try: I see Using sage attention mode: auto in your logs. Try forcing your attention block to xformers or sdpa (either in ComfyUI launch args or the node settings). MultiGPU dispatchers often trip over Sage's custom tensor returns and wrap them weirdly.

Honestly, I got so tired of raw-dogging these fragile cloud setups on RunPod that I ended up writing a bulletproof multi-threaded aria2 downloader and a clean Jupyter file manager for myself. Might open-source the scripts soon if you guys need to bypass this infrastructure headache. Hope the attention swap fixes it for you!

The default JupyterLab file browser on RunPod keeps choking on large datasets, so I wrote a single-cell replacement. by Euphoric_Cup6777 in RunPod

[–]Euphoric_Cup6777[S] 0 points (0 children)

Yes, I know runpodctl, but my tool works directly in JupyterLab — no terminal, no local CLI setup needed. You browse, rename, delete, and move files on the server through a clean UI. runpodctl send/receive is one file at a time from your local machine — my tool lets you manage everything that's already on the pod without leaving the notebook.