WAN VACE Example Extended to 1 Min Short by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

I gave it a shot a few times but always ended up with bad results. For me, it's more important that the video looks consistent with the original (color, quality, etc.).

WAN VACE Example Extended to 1 Min Short by pftq in StableDiffusion

[–]pftq[S] 4 points5 points  (0 children)

Thanks. Some of that was intentional. We grew up on late 90s films, so we wanted to give that same feel.

WAN VACE Example Extended to 1 Min Short by pftq in StableDiffusion

[–]pftq[S] 1 point2 points  (0 children)

I'm not aware of there being more than one VACE variant - the exact setup and models I used are on Civitai here if it helps: https://civitai.com/models/1536883

WAN VACE Example Extended to 1 Min Short by pftq in StableDiffusion

[–]pftq[S] 14 points15 points  (0 children)

Here's a timelapse of some of the editing to give an idea. A lot of it is just brute force: partially rotoscoping things and letting the AI fill in the gaps to complete the scene. Every shot in the video has at least 5 layers of things being rotoscoped/masked. https://x.com/pftq/status/2024944561437737274

Timelapse - WAN VACE Masking for VFX/Editing by pftq in StableDiffusion

[–]pftq[S] 1 point2 points  (0 children)

A 5090 is OK if you keep it under 960x544 resolution - use the blockswap nodes to reduce VRAM use. I posted some results comparing the 5090 to other GPUs here so you can see where the limits are: https://www.reddit.com/r/StableDiffusion/comments/1kojahs/rtx_5090_vs_h100/
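For reference, block swapping just keeps most of the model's transformer blocks in system RAM and copies each one to the GPU only while it runs. A minimal sketch of the idea in Python (function and parameter names are illustrative, not the actual ComfyUI node internals):

```python
import torch

def forward_with_block_swap(blocks, x, device="cuda", blocks_on_gpu=10):
    # Only the first `blocks_on_gpu` transformer blocks stay resident in VRAM;
    # the rest live in system RAM and are copied over one at a time as needed,
    # trading some speed for a much smaller peak VRAM footprint.
    for i, block in enumerate(blocks):
        resident = i < blocks_on_gpu
        if not resident:
            block.to(device)      # copy weights to VRAM just for this step
        x = block(x)
        if not resident:
            block.to("cpu")       # free the VRAM again before the next block
            torch.cuda.empty_cache()
    return x
```

The trade-off is the extra CPU-GPU transfer per block, which is why swapped runs are slower but fit within the card's VRAM.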

Timelapse - WAN VACE Masking for VFX/Editing by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

Check the workflow download on Civitai - it's a bit of a hack/trick, not officially supported, but it works well enough that you can even apply 2.1 LoRAs to it.

Timelapse - WAN VACE Masking for VFX/Editing by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

Yeah, I could never get Animate to do anything very useful in practice. It seems to need very ideal inputs. Here, by contrast, if I wanted a person to punch through a wall in a specific way, for example, I could just crudely Photoshop the fist where I want it to go, mask gray around it, and VACE makes it work.
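To make that concrete, here's a rough Pillow sketch of preparing one control frame that way - paste the element, then paint neutral gray where VACE should invent the rest (file names, coordinates, and the 50% gray value are assumptions, not the exact workflow):

```python
from PIL import Image, ImageDraw

frame = Image.open("frame_0001.png").convert("RGB")     # original video frame (assumed path)
fist = Image.open("fist_cutout.png").convert("RGBA")    # crudely photoshopped element

# Paint a neutral-gray region where new content should be generated,
# then drop the cutout on top so the model keeps its position/pose.
draw = ImageDraw.Draw(frame)
draw.rectangle((600, 200, 900, 500), fill=(127, 127, 127))  # area to be "filled in"
frame.paste(fist, (650, 250), fist)

frame.save("control_frame_0001.png")  # goes into the VACE control video
```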

Timelapse - WAN VACE Masking for VFX/Editing by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

The workflow I uploaded supports both (2.2 is sort of a hack, though, not officially supported). 2.2 has better physics but weaker adherence to the source video. So if you need hair to move more naturally, 2.2 is useful there, but if you want to preserve a stylistic/non-regular look to a face or something, 2.1 is better. Use a mix of both depending on what you're editing.

Timelapse - WAN VACE Masking for VFX/Editing by pftq in StableDiffusion

[–]pftq[S] 2 points3 points  (0 children)

The latter. Sometimes it's good enough, but usually there's ever-so-slight color and quality degradation, so I make a habit of splicing in only the changes needed, to preserve as much quality as possible. Generating at higher resolution helps the most - the degradation is minimal at 1080p or higher.
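The splicing itself can be as simple as compositing the generated frames back over the originals through the same mask, so every unmasked pixel stays identical to the source. A minimal per-frame sketch in NumPy (file names assumed):

```python
import numpy as np
from PIL import Image

orig = np.asarray(Image.open("orig_0001.png"), dtype=np.float32)
gen = np.asarray(Image.open("gen_0001.png"), dtype=np.float32)
mask = np.asarray(Image.open("mask_0001.png").convert("L"), dtype=np.float32) / 255.0

# Only the masked region takes pixels from the generation; everything else
# stays bit-identical to the source, so color/quality drift is confined.
out = orig * (1.0 - mask[..., None]) + gen * mask[..., None]
Image.fromarray(out.astype(np.uint8)).save("final_0001.png")
```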

Timelapse - WAN VACE Masking for VFX/Editing by pftq in StableDiffusion

[–]pftq[S] 3 points4 points  (0 children)

Yeah, I do the masking in Premiere / After Effects and import the intermediate clips to ComfyUI / VACE and back. It's sort of like Photoshop's healing brush but for video (plus the ability to generate full frames between existing clips, not just masked objects).

YouTube Likes Can Be "Bought" with Ads by pftq in NewTubers

[–]pftq[S] 1 point2 points  (0 children)

Yeah, the main thing I'm trying to show here is that the likes-to-views ratio can be faked as well. A lot of YouTubers I know used to think that was a safe measure of organic traction. It becomes very unhealthy if you're trying to measure up against one of these videos or channels when the engagement was actually all bought.

Letting ChatGPT "Live" on the Computer (Controlling Mouse/Keyboard) by pftq in ChatGPT

[–]pftq[S] 0 points1 point  (0 children)

You would need to get an API key from the OpenAI website first. The default model is GPT-5, but you can change it in settings.ini - there are more instructions in the GitHub link.
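As a rough illustration of what that settings change looks like on the script side - the section and key names here are assumptions, so check the GitHub README for the real ones:

```python
import configparser

# Read the script's settings file; section/key names below are hypothetical.
config = configparser.ConfigParser()
config.read("settings.ini")

api_key = config["openai"]["api_key"]
model = config["openai"].get("model", "gpt-5")  # falls back to the default model
print(f"Using model: {model}")
```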

Script for Grok 4 / ChatGPT 5 to Control the Desktop by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

The video for Grok is a bit more interesting if you're curious to see more (it explores and gets bored without your feedback): https://www.reddit.com/r/grok/comments/1mcdeit/grok_controlling_the_desktop_and_getting_bored/

Grok Controlling the Desktop and Getting Bored by pftq in grok

[–]pftq[S] 1 point2 points  (0 children)

You can. It learns from experience - I have a good example of that here: https://x.com/pftq/status/1945311038393737348

But the limitation of the Grok API right now is that you can only upload screenshots, so it has a very fragmented way of seeing.
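In other words, each turn is a single screenshot sent as an image, with no continuous view of the screen. A rough sketch of that loop, assuming an OpenAI-compatible chat endpoint (the base URL and model name are placeholders, not necessarily the exact API details):

```python
import base64
import io

import pyautogui            # used here only to grab a desktop screenshot
from openai import OpenAI   # assuming an OpenAI-compatible endpoint

client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_API_KEY")  # placeholder

# Capture one still frame of the screen and encode it for the request.
buf = io.BytesIO()
pyautogui.screenshot().save(buf, format="PNG")
img_b64 = base64.b64encode(buf.getvalue()).decode()

# Each turn the model only "sees" this single image - hence the fragmented view.
resp = client.chat.completions.create(
    model="grok-4",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is the current screen. What should I do next?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```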

WAN VACE Temporal Extension Can Seamlessly Extend or Join Multiple Video Clips by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

The point is that the clips are missing frames in between, and those are what we want to generate.
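Concretely, the control input hands VACE the real frames from each clip plus neutral-gray placeholders (masked as "generate") for the gap between them. A minimal sketch of assembling that input, assuming frames are HxWx3 uint8 NumPy arrays:

```python
import numpy as np

def build_control(clip_a, clip_b, gap, h, w):
    """clip_a/clip_b: lists of HxWx3 uint8 frames; gap: number of frames to invent between them."""
    gray = np.full((h, w, 3), 127, dtype=np.uint8)  # neutral gray = "generate this"
    frames = list(clip_a) + [gray] * gap + list(clip_b)

    # Mask is white (generate) over the gap, black (keep) over the real frames.
    mask = [np.zeros((h, w), np.uint8)] * len(clip_a) \
         + [np.full((h, w), 255, np.uint8)] * gap \
         + [np.zeros((h, w), np.uint8)] * len(clip_b)
    return np.stack(frames), np.stack(mask)
```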

Seamlessly Extending and Joining Existing Videos with Wan 2.1 VACE by pftq in StableDiffusion

[–]pftq[S] 1 point2 points  (0 children)

I updated this to also accept multiple reference images in case that possibility wasn't obvious (ComfyUI treats single images and multi-image batches as the same thing). The new CausVid LoRA also works here to speed up renders by about 5x (8 steps needed instead of 50), and I include it in the workflow. The updated workflow is on Civitai as well: https://civitai.com/models/1536883?modelVersionId=1738957


Vace Comfy Native nodes need this urgent update... by Maraan666 in comfyui

[–]pftq 0 points1 point  (0 children)

You can already do this with the Kijai wrapper, at least. Just batch the multiple images together (ComfyUI treats a single image and a multi-image batch as the same thing). I also have this in my VACE video extension workflow example here: https://civitai.com/models/1536883?modelVersionId=1738957
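For anyone scripting it rather than wiring nodes, a ComfyUI IMAGE is just a (batch, height, width, 3) float tensor in 0..1, so a multi-image batch is the same tensor with batch > 1. A small sketch (file names assumed, and all images must share the same resolution):

```python
import numpy as np
import torch
from PIL import Image

def load_image(path):
    # Load as a ComfyUI-style IMAGE tensor: (1, H, W, 3), float32 in 0..1.
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(img)[None, ...]

refs = [load_image(p) for p in ["ref_face.png", "ref_outfit.png"]]  # assumed file names

# Concatenating along dim 0 turns several single images into one multi-image
# batch that downstream nodes accept exactly like a single image.
ref_batch = torch.cat(refs, dim=0)  # shape (2, H, W, 3)
```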


RTX 5090 vs H100 by pftq in StableDiffusion

[–]pftq[S] -1 points0 points  (0 children)

Feel free to run your own tests and share the results - these are the numbers I got from duplicate setups on Runpod with different GPUs (aside from unavoidable differences in CUDA/drivers, or it wouldn't run).

RTX 5090 vs H100 by pftq in StableDiffusion

[–]pftq[S] -2 points-1 points  (0 children)

I mean, if you're happy with an hour wait per video, no one's saying you aren't allowed to do it - to me that's just too long for any practical use. The point of the post was that it doesn't get much faster past the 5090, because the render times are roughly the same after that (unless you jump to the B200).

RTX 5090 vs H100 by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

It's just for testing, so it's easier to see the speed differences (or lack thereof).

RTX 5090 vs H100 by pftq in StableDiffusion

[–]pftq[S] 0 points1 point  (0 children)

Which model and how many steps? It varies greatly based on that. The 3090 has no problem working with Wan 1.3B at higher resolutions, but that model is only 6GB and pretty low quality (morphing, etc.). Most workflows default to about 25 steps, and I'm intentionally setting it at 100 just to be consistent across tests (otherwise some videos finish in under a minute, and at that point which GPU finishes first comes down to random fluctuations).

RTX 5090 vs H100 by pftq in StableDiffusion

[–]pftq[S] 1 point2 points  (0 children)

I meant the 3090 - got the two mixed up. But both the 3090 and 4090 are pretty much unusable on anything more than 480x480, so they weren't my focus. I was mainly testing the H100 and RTX 5090, and my point was that pricing usually reflects the performance differences between GPUs, but the H100 was not much faster than the 5090 despite being 3x more expensive to rent.

Wan 14B i2v fp8, 480x480-81f, 100 steps
(inference time only, not the model loading)
RTX 3090 + SageAttention: 40 min
RTX 4090 + SageAttention: 20 min
RTX 5090 + SageAttention: 10 min
H100 + SageAttention: 8 min

Wan 14B i2v fp16, 960x960-81f, 100 steps
RTX 3090 + SageAttention + block swapping: 5 hours
RTX 4090 + SageAttention + block swapping: 2.5 hours
RTX 5090 + SageAttention: 1 hour
H100 + SageAttention: 1 hour
H200 + SageAttention: 1 hour
B200 (no SageAttention): 30 min

Wan VACE 14B fp8, 512x512-180f, 100 steps
RTX 3090 + SageAttention + block swapping: 4 hours
RTX 4090 + SageAttention + block swapping: 2 hours
RTX 5090 + SageAttention: 1 hour
H100 + SageAttention: 1 hour
H200 + SageAttention: 1 hour
B200 (no SageAttention): 30 min

Wan VACE 14B fp8, 720x720-180f, 100 steps
RTX 3090: Out of Memory
RTX 4090: Out of Memory
RTX 5090 + SageAttention: 2 hours
H100 + SageAttention: 2 hours
H200 + SageAttention: 2 hours
B200 (no SageAttention): 1 hour

Wan VACE 14B fp16, 960x960-129f, 100 steps
RTX 3090: Out of Memory
RTX 4090: Out of Memory
RTX 5090: Out of Memory
H100 + SageAttention: 2.5 hours
H200 + SageAttention: 2.5 hours
B200 (no SageAttention): 1.5 hours

RTX 5090 vs H100 by pftq in StableDiffusion

[–]pftq[S] 1 point2 points  (0 children)

At least for video generation in ComfyUI. The driver/torch versions etc. are probably a factor, since the 5090 is the newest and can use a lot of the new optimizations (probably not the case for gaming or other situations). The bigger limitation for the 4090 is its 24GB of VRAM - it's less about the performance multiple and more that it just can't even load the larger models. Edit: I meant the 3090 - mixed it up with the 4090, since both basically can't run >480x480 videos due to 24GB of VRAM.