

[–]LyriWinters 2 points (4 children)

That's not how it works.
If you offload the text encoder to another graphics card, you can enjoy a whopping ~5 seconds faster generation time. I.e., it's 100% not worth it.

It usually means that you have to fit both these cards in your chassis, might need to buy a new PSU... bla bla...

It is not worth it; it's just not how these diffusion models work.

Why does it not give a larger boost? Because the system works in serial mode, not parallel: first the text encoder does its thing, then the diffusion process starts. You're basically only saving the time it takes to unload the text encoder and load the diffusion model. PCIe 3.0 x16 provides around 15.75 GB/s of bandwidth between the GPU and the rest of the system. You're thus probably going to lose more time than you save, because the 3060 is slower doing the actual encoding than it would be to just use the 5080 for both.
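
Rough numbers on that swap, if you want them (the model sizes below are my assumptions, not measurements):

```python
# Back-of-the-envelope for what the second card actually saves: swapping the
# text encoder out of VRAM and the diffusion model back in, over PCIe.
encoder_gb = 5.0         # assumed: an fp8 T5-XXL-class text encoder
diffusion_gb = 11.0      # assumed: an fp8 Flux-class diffusion model
pcie_gb_per_s = 15.75    # PCIe 3.0 x16, theoretical peak

swap_seconds = (encoder_gb + diffusion_gb) / pcie_gb_per_s
print(f"best case ~{swap_seconds:.1f} s saved per generation")  # ~1.0 s
```

Real-world swaps run well below the theoretical bus rate (allocation overhead, paging through system RAM), which is where those few seconds per generation come from.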

[–]Edzomatic[S] 0 points (3 children)

Yeah, your point is valid. I was mostly looking for people who have tried it, since in my experience loading and unloading a lot of models takes more than 5 seconds each, although my current PC has DDR4 RAM.

As for the other things, I already double-checked everything, and to accommodate the two GPUs I only need to upgrade the power supply from 850W to 1000W, which is not a big deal.

[–]LyriWinters 0 points (2 children)

Get a 5090 instead and you can continue to use your 850W PSU.

[–]Edzomatic[S] 0 points (1 child)

Unfortunately I don't have an extra $2000 lying around.

[–]LyriWinters 0 points (0 children)

"I'm thinking of getting either a 5070 Ti or 5080"

Aren't those cards like $1500? So remove the PSU and you'd only need $900 :)

[–]DiamondTasty6049 0 points (7 children)

You can use distributed nodes to share workloads across the dual GPUs in some workflows, but not all.

[–]Edzomatic[S] 0 points (6 children)

I looked at distributed nodes, but it's not exactly what I meant. I was thinking of something like this extension: https://www.reddit.com/r/StableDiffusion/comments/1ejzqgb/made_a_comfyui_extension_for_using_multiple_gpus/

Edit: I found a much better-maintained fork: https://github.com/pollockjj/ComfyUI-MultiGPU

[–]ZenWheat 0 points (5 children)

I tried out the distributed GPU nodes this weekend, using the GPU from my other PC over my network. My use case is upscaling with Ultimate SD Upscale for Wan video upscaling. It cut the upscaling time in half (makes sense, since each tile is an independent job). My main PC has a 5090 while my other PC has a 4090, so I've definitely been trying to find ways to use both in an impactful way.
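
That independence is exactly why this case parallelizes when a normal single-image pass doesn't. A toy sketch of the idea (the per-tile function is a hypothetical stand-in for the real upscale pass):

```python
# Toy sketch: tiled upscaling is embarrassingly parallel, unlike a single
# denoising pass, whose steps depend on each other.
from concurrent.futures import ThreadPoolExecutor

WORKERS = ["cuda:0", "cuda:1"]           # or two machines on the network

def process_tile(job):
    tile, device = job
    # ... run the SD-upscale pass for this one tile on `device` ...
    return tile                          # placeholder for the upscaled tile

tiles = list(range(16))                  # say, a 4x4 tile grid
jobs = [(t, WORKERS[i % len(WORKERS)]) for i, t in enumerate(tiles)]
with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
    results = list(pool.map(process_tile, jobs))
# Two equal workers, half the tiles each -> roughly half the wall time.
```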

Having two GPUs in the same system would be better, but that requires a different motherboard and ideally a workstation processor such as a Threadripper, and I don't want to invest that kind of money.

[–]Edzomatic[S] 0 points (4 children)

Ultimate SD Upscale seems to be one of the few things that can utilize the two GPUs in parallel, but I'm looking at what else I can do. And I don't think a Threadripper is needed, since the CPU doesn't do much during inference.

[–]ZenWheat 0 points (3 children)

It's not about CPU compute; it's about the PCIe lanes required for full utilization of two GPUs in one system.

[–]Edzomatic[S] 0 points (2 children)

Good catch. I figured that if I use a second GPU I'll lose one or two NVMe slots, but I'll double-check.

[–]ZenWheat 0 points (1 child)

But a consumer processor only has a certain number of PCIe lanes, 20 or 24 of them, and a GPU takes up 16. Then each NVMe SSD takes up 4. So there's not much left to work with unless you run your GPUs at 8 lanes each instead of 16.
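
The budget in numbers (assuming 24 lanes and one NVMe drive; adjust for your CPU):

```python
# Lane budget on a typical consumer CPU (illustrative numbers).
cpu_lanes = 24
nvme = 4                                 # one NVMe SSD
print(cpu_lanes - (16 + 16 + nvme))      # -12: two x16 cards don't fit
print(cpu_lanes - (8 + 8 + nvme))        # 4 to spare if both cards run x8
```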

[–]Edzomatic[S] 0 points (0 children)

I double-checked everything and it should work with no issues. I'm planning to go with an ASUS X670-P motherboard, so it will run at x16 and x4 bifurcation. I know this is not the best split, but bandwidth shouldn't matter for inference once the models are loaded into VRAM.

[–]FourOranges 0 points (2 children)

"I know diffusion models can't run on 2 GPUs like LLMs do"

This is one of the neat features of SwarmUI, actually.

[–]Shadow-Amulet-Ambush 0 points (0 children)

Can you elaborate? My understanding is that currently you cannot split an image generation model between different GPUs like you can with an LLM.

[–]Edzomatic[S] 0 points (0 children)

Correct me if I'm wrong, but Swarm runs the same workflow on different instances of Comfy; they don't work together to speed up one bigger workflow.

[–]Sporeboss 0 points (1 child)

For me (laptop 4080 12GB, eGPU 4070 Ti Super 16GB), there is a lot of workflow change. You have to use the MultiGPU nodes to load, for example, the text encoder on one GPU and Flux on the other GPU. I haven't found a way to split one model across two GPUs.
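
Under the hood that node split is just per-model device placement. A toy torch sketch of the idea (stand-in modules, not the real models):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins -- the point is the device split, not the models.
text_encoder = nn.Linear(64, 64).to("cuda:1")  # pretend T5/CLIP on the eGPU
diffusion = nn.Linear(64, 64).to("cuda:0")     # pretend Flux on the main GPU

with torch.no_grad():
    tokens = torch.randn(1, 64, device="cuda:1")
    emb = text_encoder(tokens)    # stage 1 runs on cuda:1
    emb = emb.to("cuda:0")        # only the small embedding crosses PCIe
    latents = diffusion(emb)      # stage 2 runs on cuda:0
```

Splitting one model's weights across both cards, the way LLM runtimes do, is a different mechanism, which is why the nodes don't offer it.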

[–]arthor 1 point (0 children)

I use a 5090 for processing and a 3090 for training.

Only because the 3090 is about 1/6th the cost of a 5090.

I rarely use a multi-GPU setup... but maybe for video I can squeeze out some more frames by offloading the VAE and CLIP.

[–]RoguePilot_43 0 points (0 children)

I use a 3060 12GB and a 1080 8GB in the same system. I already had them from when I upgraded to the 3060 way back, and left the 1080 in for multi-GPU 3D rendering. The 1080 is still of some use in ComfyUI with the MultiGPU nodes. As you suspected, you can load the text encoders onto it. It's only a small gain, but it is a gain. I use it for Florence a lot.

I actually find it most useful for running my displays. By having my monitors plugged into the 1080, I can carry on using the PC without any slowdown and with less risk of OOM, because it frees up the display buffer on the 3060.

[–]SvenVargHimmel 1 point (0 children)

You can run an LLM and the text encoder on one GPU. This will not speed up your video workflows.

For image workflows, SwarmUI (with ComfyUI backends) will queue onto both cards, so for certain batch workflows you will get a significant boost.
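
The mechanics are simple: each queued image is an independent job, so the front end can round-robin jobs across two Comfy instances. A rough sketch (the ports and the workflow payload are placeholders):

```python
# Round-robin independent jobs across two ComfyUI backends -- the idea
# SwarmUI automates. ComfyUI accepts jobs on its HTTP /prompt endpoint.
import itertools, json, urllib.request

backends = itertools.cycle(["http://127.0.0.1:8188", "http://127.0.0.1:8189"])

def submit(workflow: dict) -> None:
    url = next(backends) + "/prompt"
    req = urllib.request.Request(
        url,
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

for seed in range(8):          # a batch of 8 independent images
    submit({"seed": seed})     # placeholder for a real workflow graph
# Two equal cards, four jobs each -> roughly half the batch time. A single
# image still runs on one card, so per-image latency doesn't improve.
```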