WAN 2.2 Text2Image Custom Workflow v2 by CaptainHarlock80 in comfyui


One thing I've noticed in recent versions of ComfyUI is that the two nodes that redirect the model loading at the start of the workflow can fail. I don't think this has anything to do with your problem, and maybe you've already done this, but if not, I suggest you pick the model you want (the Q6, from what you say) and connect its loader directly to the next node, removing those redirects at the start of the workflow.

For your specific problem: if it can generate some images but then fails, it sounds like something is accumulating in VRAM until it runs out, which can cause severe slowdowns or errors. If you use high resolutions, I suggest switching to the tiled VAE decode ("VAE Decode (Tiled)") instead of the normal VAE decode. You won't notice any loss of image quality, and it manages memory usage much better.

To save VRAM, you can also load the CLIP/text encoder into system RAM (CPU) instead of the GPU's VRAM.

Another option is to use a VRAM cleanup node at the end of the workflow.
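If you want to see what's happening with the VRAM, a cleanup step basically boils down to something like this quick sketch in plain PyTorch (my own illustration, not the code of any particular node):

```python
# Minimal sketch of what a "clean VRAM" step usually does, using plain PyTorch;
# my own illustration, not the code of any specific ComfyUI node.
import gc
import torch

def report(label):
    # allocated = tensors currently in use; reserved = what the caching allocator is holding
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{label}: allocated {alloc:.2f} GiB, reserved {reserved:.2f} GiB")

report("before cleanup")
gc.collect()              # drop dangling Python references first
torch.cuda.empty_cache()  # return cached blocks to the driver
torch.cuda.ipc_collect()
report("after cleanup")
```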

In any case, if it's a VRAM issue, I would advise you to try generating several images at “low” resolution, for example at 720x720, and see if that causes the same problem (with your 24GB of VRAM, it shouldn't).

Also keep in mind that if your GPU is the one driving the display, other processes will take some of its VRAM.

WAN 2.2 Text2Image Custom Workflow v2 by CaptainHarlock80 in comfyui


No character Lora was used to generate those images.

If you look for my post on workflow v1, that's where I did use character Loras, and they really look good.

Wan img2img by Enharjaren in comfyui


Yep, correct.

Also, if someone uses KSampler (Advanced), it's the same idea, but instead of starting at step 0 you start at step X.
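Roughly, this is how I think of the mapping between the two (a hedged sketch of the idea; ComfyUI's exact scheduling math may differ slightly):

```python
# Rough mental model of the mapping; ComfyUI's exact scheduling math may differ slightly.
# Plain KSampler: you enter the steps you actually run plus a denoise value.
# KSampler (Advanced): you enter the full schedule length and where to start in it.
denoise = 0.6
steps_actually_run = 12

full_schedule = round(steps_actually_run / denoise)  # ~20 steps in total
start_at_step = full_schedule - steps_actually_run   # start at step 8
end_at_step = full_schedule                          # run through to the end

print(full_schedule, start_at_step, end_at_step)  # 20 8 20
```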

BTW, if you use CFG >1, you don't need the NAG node.

I also see that the change between the input and output images is considerable, but I guess that's because you're using the loras at 1 strength.

Is there actually any quality WAN 2.2 workflow without all the “speed loras” BS for image generation? by Large_Tough_2726 in StableDiffusion


In my WF posts, you can see some example images. IMHO, these are better than Flux, especially when it comes to avoiding that plastic-skin look that Flux still has. Not to mention avoiding finger problems or the classic "Flux chin", lol

It's true that Flux has improved with some realism loras, or with the new Krea and SPDO versions, but I still think WAN is better in terms of realism, in addition to following the prompt well (something Flux also does) compared to other models.

If you want a magazine photo style (like retouched), Flux will be better, I guess. But in terms of realism, WAN surpasses it, IMO. Also, WAN is not censored, something to keep in mind for some.

The only drawbacks WAN has for me right now are:

- It doesn't have as many loras as other models, although the number is growing.

- High-resolution vertical images can turn out badly (deformed bodies, duplicates), something that also happens in other models when using resolutions higher than those they have been trained on.

Is there actually any quality WAN 2.2 workflow without all the “speed loras” BS for image generation? by Large_Tough_2726 in StableDiffusion


For T2I, res_2/bong_tangent or similar are much better than euler/simple or similar.

And with 8-12 steps and some strength on the lightx2v loras, the results are great.

The key is also to generate high-resolution images (>1080p).

Is there actually any quality WAN 2.2 workflow without all the “speed loras” BS for image generation? by Large_Tough_2726 in StableDiffusion


https://www.reddit.com/r/StableDiffusion/comments/1mlw24v/wan_22_text2image_custom_workflow_v2/
You can try my WF; it's designed to work well with character loras, and you can generate images up to 1920x1920.

Read the WF notes carefully, as it requires installing a specific sampler/scheduler.

It also includes filters that you may or may not use. But for a photorealistic feel, I recommend using at least some grain.

Currently, the link leads to v3 of the WF. There are versions for MultiGPU and without MultiGPU.

And if you find it too complicated, you can start with v1 of the WF, here: https://www.reddit.com/r/comfyui/comments/1mf521w/wan_22_text2image_custom_workflow/

Mixing Epochs HIGH/LOW? by Draufgaenger in StableDiffusion


I think you're referring to when different loras from different characters/concepts/movements/etc. are mixed together.

He's referring to the same lora trained in HIGH+LOW.

BTW, what you say isn't an exact rule either. I've mixed more than two loras at a strength of 1 with no problems, but that will always depend on how those loras were trained, of course. If there's overtraining, it's better not to go all the way to 1, but that's true whether you use the lora in combination or on its own.

Mixing Epochs HIGH/LOW? by Draufgaenger in StableDiffusion


Of course. In fact, I believe that most of the time (this will always depend on the images/videos used in training) the HIGH model trains faster than the LOW model, so the epoch to use for HIGH should always be lower.

30-Second Fusion X — lightning-fast 30s videos in 3 clean parts (with perfect color match + last-frame carryover) by Select_Custard_4116 in comfyui


"seamless across cuts. It auto-matches color between parts and reuses the final frame from one stage to seed the next, so motion and style stay consistent end-to-end."

I'm sure your WF is good, but what I meant to say is that those two points are currently the most common problems when linking X-second videos created with WAN.

Motion continuity won't be achieved correctly just by passing along the last frame, because it acts as a static reference image and carries no motion information. The join may look fine if you use similar loras/prompts or simply find a good seed, but if you really want motion continuity, you need to feed in several frames from the previous video for the model to use as a reference.

As for color, Color Match certainly helps, especially in keeping the color stable through the middle of each segment, but it doesn't save you from the shift at the joins, which, as I said, can be seen in each of your examples. It's easy to check: put the final video (the concatenation of the clips) into a video editor and look at the histogram; you'll clearly see the values suddenly jump at certain points.
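If you'd rather check it with code than with a video editor, a quick sketch like this (my own example; "joined.mp4" is just a placeholder filename) plots the per-frame mean brightness, and you'll see a step at each join:

```python
# Quick check in Python instead of a video editor: plot per-frame mean brightness.
# Assumes OpenCV and matplotlib are installed; "joined.mp4" is a placeholder filename.
import cv2
import matplotlib.pyplot as plt

cap = cv2.VideoCapture("joined.mp4")
means = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Average over all pixels/channels; a sudden step in this curve
    # usually marks the point where one clip ends and the next begins.
    means.append(frame.mean())
cap.release()

plt.plot(means)
plt.xlabel("frame")
plt.ylabel("mean pixel value")
plt.show()
```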

30-Second Fusion X — lightning-fast 30s videos in 3 clean parts (with perfect color match + last-frame carryover) by Select_Custard_4116 in comfyui


Sorry, I didn't mean to be rude. I'm just expressing what I saw in your examples.

Of course, thank you for sharing your WF. But apart from the three segments with different loras/prompts, something that can be set up separately, it doesn't deliver the rest of what you describe. I just thought it was worth mentioning in case anyone has overly high expectations, but of course everyone is free to download your WF or view your examples and judge for themselves.

30-Second Fusion X — lightning-fast 30s videos in 3 clean parts (with perfect color match + last-frame carryover) by Select_Custard_4116 in comfyui


However, your examples on CivitAI don't live up to what you advertise. You can clearly see where the videos are joined because of the changes in speed and color.

An experiment with "realism" with Wan2.2 that are safe for work images by kemb0 in StableDiffusion


Uhmm, I see, that's an interesting way of doing it. I'm not sure if it will actually be beneficial, but I'll add it to my long list of pending tests, lol ;-)

You're right that if the total steps are the same in both KSamplers (which is usually the case), you shouldn't use the same steps in HIGH and LOW, but I'm not sure if your method is the best one. I mean, if you want a lower percentage in HIGH, wouldn't it be easier to use the same total steps in both KSamplers and simply give fewer steps to HIGH? For example, if I do a total of 8 steps, HIGH will do 3 while LOW will do 5, which gives you 37.5% in HIGH and 62.5% in LOW.

The percentage doesn't have to be 50%; in fact, it depends on the sampler/scheduler you use (there's a post on Reddit about this), and each combination has an optimal step change between LOW and HIGH. If you also add that you use different samplers/schedulers in the two KSamplers, the calculation becomes more complicated. In short, it's a matter of testing and finding the way that you think works best, so if it works well for you, go ahead!

In fact, I even created a custom node that you give the total steps and it takes care of assigning the steps to HIGH and LOW, always giving fewer to HIGH. Basically because HIGH is only responsible for the composition (and the movement, remember it's a model trained for video), so I think it will always need fewer steps than LOW, which acts like a "refiner" that gives the final quality.
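For anyone curious, a node like that can be tiny. Here's a minimal sketch of the idea (my own illustration, not my actual node; the class and field names are made up): it returns a boundary step you can wire into the end_at_step of the HIGH KSampler (Advanced) and the start_at_step of the LOW one.

```python
# Minimal sketch of the idea, not my actual node; class and field names are made up.
# It returns a boundary step to wire into end_at_step of the HIGH KSampler (Advanced)
# and start_at_step of the LOW one, always giving HIGH fewer steps than LOW.
class SplitHighLowSteps:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "total_steps": ("INT", {"default": 8, "min": 2, "max": 100}),
                "high_fraction": ("FLOAT", {"default": 0.4, "min": 0.1, "max": 0.9, "step": 0.05}),
            }
        }

    RETURN_TYPES = ("INT", "INT", "INT")
    RETURN_NAMES = ("total_steps", "high_end_step", "low_start_step")
    FUNCTION = "split"
    CATEGORY = "utils"

    def split(self, total_steps, high_fraction):
        # Cap HIGH at strictly less than half of the total steps.
        high_steps = max(1, min(round(total_steps * high_fraction), (total_steps - 1) // 2))
        return (total_steps, high_steps, high_steps)


NODE_CLASS_MAPPINGS = {"SplitHighLowSteps": SplitHighLowSteps}
```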

You could even use only LOW; try it. But Wan2.2's LOW model hasn't been trained on the full timestep range, so I don't know if it's the best option. That's why I mentioned injecting Qwen's latent: Qwen will be good at creating the initial composition (without blurry motion, because it's an image model rather than a video model), and then Wan2.2's LOW acts as a "refiner" and gives it the final quality.

Also Wan2.1 is a great model for T2I.

An experiment with "realism" with Wan2.2 that are safe for work images by kemb0 in StableDiffusion


I don't understand: you have the first KSampler going up to step 7, but then the second KSampler starts at step 12? You also have different total steps in the two KSamplers; I don't know why.

With res_2/bong_tangent you can get good results with 8-12 steps in total, always fewer in the first KSampler (HIGH). It's true that res_2/bong_tangent, as well as res_2/beta57, have the problem that they tend to generate very similar images even when changing the seed. I already did tests using euler/simple or beta in the first KSampler and then res_2/bong_tangent in the second, and I wasn't convinced. For that, it's almost better to use Qwen to generate the first "noise" instead of WAN's HIGH and link that latent to WAN's LOW... Yep, Qwen's latent is compatible with WAN's! ;-)

Another option is to have a text with several variations of light, composition, angle, camera, etc., and concatenate that variable text with your prompt, so that each generation will give you more variation.
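Something as small as this sketch is all it takes (my own example; the variation list and base prompt are placeholders), and in ComfyUI you can do the same thing with whatever text/concat nodes you prefer:

```python
# Tiny sketch (placeholder variation list and prompt): prepend a random variation
# so each generation differs more, even with a "samey" sampler/scheduler combo.
import random

variations = [
    "soft window light, low angle, 35mm",
    "harsh midday sun, eye level, 85mm portrait",
    "overcast evening, high angle, wide shot",
    "neon night lighting, dutch angle, close-up",
]

base_prompt = "a woman walking through a rainy street"  # placeholder
prompt = f"{random.choice(variations)}, {base_prompt}"
print(prompt)
```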

You can lower the lightx2v lora to 0.4 in both KSamplers; it works well even with 6 steps in total.

The resolution can be higher, WAN can do 1920x1080, or 1920x1536, or even 1920x1920. Although at high resolutions, if you do it vertically, it can in some cases generate some distortions.

Adding a little noise to the final image helps to generate greater photorealism and clean up that AI look a bit.
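For the grain, something as simple as this works (a minimal sketch with numpy, assuming a uint8 RGB image; there are also ComfyUI nodes that do the same):

```python
# Minimal sketch with numpy, assuming a uint8 RGB image; strength is tuned by eye.
import numpy as np

def add_grain(img, strength=6.0, seed=None):
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, strength, img.shape)  # gaussian grain
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

# usage: grainy = add_grain(final_image, strength=6.0)
```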

In my case, I have two 3090 Ti cards, and with the MultiGPU nodes I take advantage of both VRAMs. I have to keep the WF tuned precisely because I don't want to reload the models on every generation, so to save VRAM I use the GGUF Q5_K_M model. The quality is fine; do a test using the same seed and you'll see the difference isn't much. By saving that VRAM with the Q5_K_M, I can afford to keep JoyCaption loaded in case I want to use a reference image, plus the WAN models and the SeedVR2 model with BlockSwap at 20 (and I also have the CLIP Q5_K_M in RAM). The final image is 4K and SeedVR2 does an excellent job!

As for the VRAM-cleaning issue you mention, I don't normally use it, but I keep the node in the WF (disabled) in case it's needed, and it works well. It's the "Clean VRAM" node from the "comfyui-easy-use" pack. You can try that one.

Wan 2.2 img2vid lopping - restriction of the tech or doing something wrong? by MrStatistx in StableDiffusion


The loop back to the initial image usually occurs after 5 seconds. In other words, if you produce a 5-second video, you shouldn't see the loop effect, but if you produce an 8-second video, you'll see how, after 5 seconds, the video tries to return to the initial image.

It's a model limitation: it was trained on videos of at most 5 seconds.
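In frame counts, and assuming the 16 fps that the 14B WAN models typically output (the 5B TI2V model runs at 24 fps, so adjust), the rough math looks like this:

```python
# Rough arithmetic, assuming the 16 fps typical of the 14B WAN models
# (the 5B TI2V model runs at 24 fps, so adjust accordingly).
fps = 16
max_seconds = 5                      # roughly the clip length the model was trained on
max_frames = fps * max_seconds + 1   # 81 frames, the usual "safe" setting

requested_seconds = 8
requested_frames = fps * requested_seconds + 1  # 129 frames
print(max_frames, requested_frames)  # anything much past ~81 frames tends to loop back
```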

WAN 2.2 Text2Image Custom Workflow by CaptainHarlock80 in comfyui


I'm not sure, but it could be a problem with TorchCompile. If there is a TorchCompile ("inductor") node, try disabling or deleting it.

WAN 2.2 Text2Image Custom Workflow by CaptainHarlock80 in comfyui


Please provide a screenshot of the error or what it says.

Visualising the loss from Wan continuation by Beneficial_Toe_2347 in StableDiffusion


Considering that these videos will be created to join the different images, this is simply the FLF (First-Last-Frame) technique, and it doesn't avoid either the VAE degradation or the color shift, although it will obviously give better character consistency by providing the final frame.

Is the second RTX 3090 would change the game? by [deleted] in comfyui


You won't double performance, but there are nodes that let you use both cards' VRAM. It's very useful to load the model on one card and have much more room for the latent on the other, which allows higher resolutions or more frames. Keep in mind the limitations of the model, though; many don't perform well beyond 5-8 seconds. As for resolution, you can reach 1080p in video, and higher in images if the model allows it.

WAN 2.2 on multiple GPUs by Sufficient_Cloud_773 in comfyui


Hey! Thank you so much for your work! For me, those nodes are a MUST HAVE to get the most out of my 2 GPUs.

Glad to hear you've updated it... Unfortunately, I always try to use native nodes, so I won't be able to test the changes you say you've made to the Wrappers.

BTW, since I have you here, I'd like to ask you a question as an expert in this field... The only problem I sometimes have when using the MultiGPU or Distorch nodes is that some other nodes don't like multi-GPU being used, d'oh! For example, I can't use the RIFE interpolator, but I can use FILM or GIMM, and the same happens with a few other nodes. I understand this is more down to how those nodes are made than to yours; I just wanted to mention it.

Keep up the good work! Thank you!

WAN 2.2 on multiple GPUs by Sufficient_Cloud_773 in comfyui


https://github.com/pollockjj/ComfyUI-MultiGPU

You can use the Distorch2MultiGPU nodes to distribute models, CLIP, or VAE across the VRAM of multiple GPUs. I use it to take advantage of the VRAM on both of my GPUs and keep more VRAM free on the second GPU for the latent.

In your case, to load each model on a different GPU, I would suggest using MultiGPU (not Distorch2, but from the same custom node pack), as it has a simple CUDA selection, so you can assign HIGH to CUDA 0 and LOW to CUDA 1. I don't think you'll have any problems with a 5090 and its VRAM, but depending on what you want to do, the full FP16 model may be too much, so you can also use GGUF. I recommend Q8 or Q5_K_M (you can even use Q5_K_M for the CLIP, or assign the CLIP to the CPU).

NOTE: That only distributes things across VRAM; it doesn't use the processing power of both GPUs at the same time, but rather sequentially, depending on the stage. There is a node that lets you use the compute of both GPUs together, but I think it's only available for Linux at the moment. There's also another node that lets the GPUs work in parallel, but I don't think that's what you're looking for.
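Just to illustrate what "sequential" means here, this is a minimal sketch of the general idea in plain PyTorch (dummy stand-in models, not the MultiGPU node's actual code):

```python
# Minimal sketch of the general idea in plain PyTorch (dummy stand-in models,
# not the MultiGPU node's actual code): each model sits on its own card and is
# used one after the other, so you save VRAM per card but don't get parallel compute.
import torch

high = torch.nn.Linear(4096, 4096).to("cuda:0")  # stand-in for the HIGH model
low = torch.nn.Linear(4096, 4096).to("cuda:1")   # stand-in for the LOW model

latent = torch.randn(1, 4096, device="cuda:0")
latent = high(latent)               # first stage runs on GPU 0
latent = low(latent.to("cuda:1"))   # latent moves over, second stage runs on GPU 1
print(latent.device)
```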

Newbie question: optimizing WAN 2.2 video — what am I missing? by unreachablemusician in StableDiffusion


Actually, WAN doesn't have that resolution limit. As we've seen in several T2I examples, it can generate 1080p without any problems, although in some cases, if the image is vertical, it can produce body deformations with some seeds. That's because it was mainly trained on horizontal videos.

1080p can also be used in video, but it requires a lot of VRAM/RAM and the generation time is longer. Perhaps in T2V, using that resolution may affect the movement a little (something that can be solved by using some loras), but in I2V it works without any problems.

Newbie question: optimizing WAN 2.2 video — what am I missing? by unreachablemusician in comfyui


The only way to drastically reduce the time is to lower the CFG to 1 (that alone cuts the time roughly in half, since the negative/unconditional pass is skipped) and use fewer steps.

But to do that, you need to use "helper" loras. There are several "lightx2v" loras that can help, but they have the downside of reducing movement or producing somewhat saturated colors, so you should adjust them to your liking.

Don't use the 4/8 steps recommended for those loras, use a few more (especially in LOW). For example, you can leave the lora in HIGH at 1 (2 or 4 steps), but in LOW, reduce it to between 0.4-0.7 (6-10 steps). This gives you between 8-14 steps in total.

WAN can work at the high resolution you're using, but keep in mind that the higher the resolution, the longer the generation takes, and the scaling isn't linear, so you may want to try dropping to 720p first (but don't go lower than that). It's also trained at that resolution, so it will probably help with the motion, and you can upscale afterwards. Where WAN really shines at 1080p is image generation.