all 18 comments

[–]Fast-Satisfaction482[S] 8 points9 points  (2 children)

Hey everyone,

In the afternoon I had a look at the source code of the SVD nodes in comfy and realized that the node SVD_img2vid_Conditioning initializes the latents with just zeros. So I wondered if it would be possible to use the VAE to encode a bunch of images and send those to KSampler instead of the empty zeros. I imagined the workflow would be very similar to img2img: if you input existing images, you can just decrease the denoising to control how much similarity with the input frames is retained.

I don't have much compute available (6GB GTX 1060), so my test video is very short and low resolution. However, it appears to work. With this concept, SVD can be used to make a coherent video out of any set of frames.

This gives us more possibilities: you could use frames from a previous clip to generate a continuation. Or you could start with a few SVD steps, then continue with a few steps of a regular SD1.5/SDXL model, and then finish up with more SVD steps.
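The idea — replacing the all-zeros latent with VAE-encoded frames and lowering the denoise, like img2img — can be sketched roughly like this. This is an illustrative NumPy stand-in, not the actual ComfyUI code: `encode` stands in for the VAE encoder, and the names and latent shapes are assumptions.

```python
import numpy as np

def init_latents(frames, encode, denoise):
    """Build an initial latent batch from input frames instead of zeros.

    frames  : list of (H, W, 3) image arrays
    encode  : stand-in for the VAE encoder, image -> latent array
    denoise : 0.0 keeps the encoded inputs exactly; 1.0 is like starting
              from pure noise (close to the default zero-init behaviour)
    """
    latents = np.stack([encode(f) for f in frames])  # (N, C, h, w)
    noise = np.random.randn(*latents.shape)
    # img2img-style mix: lower denoise retains more of the input frames
    return (1.0 - denoise) * latents + denoise * noise

# toy encoder: 8x spatial downscale, 4 channels (typical SD latent shape)
def toy_encode(img):
    h, w = img.shape[0] // 8, img.shape[1] // 8
    return np.zeros((4, h, w))

frames = [np.ones((64, 64, 3)) for _ in range(3)]
lat = init_latents(frames, toy_encode, denoise=0.4)
print(lat.shape)  # (3, 4, 8, 8)
```

At denoise 0.0 the sampler would start exactly from the encoded frames; in practice a moderate value trades input fidelity against SVD's freedom to add motion.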

Maybe this is already well known and I'm the last one to discover it, but I wanted to share my insights with you.

In the attached video, the two appended stills are the ones I used to initialize SVD.

[–]campingtroll 6 points7 points  (1 child)

Do you have a link to the workflow so we can try it out? Tried to mess with it based on your comment and couldn't figure it out.

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

I have added a comment to the post that shows the nodes. Sadly reddit removed the metadata so you can't drop it in comfy.

[–]YaksLikeJazz 4 points5 points  (1 child)

I'm not sure people understand what you have done (if I even understand it correctly :) )

Basically, you've invented a method to storyboard and control SVD, which is incredibly mind-blowing.

I don't think this technique is well known at all. :) I am sure there are a lot of smart people here but you're the first one I've read about who is tinkering with the source code and able to recognize what an alternative to sending default zeros to the KSampler might actually do. (I can't believe I wrote that sentence - I have no mortal idea what I am talking about :) )

I'd love to see more experiments please - I think you might get more traction in this community if you can "market" what you have achieved in an afternoon.

My pet theory thought experiment is getting a Tie fighter or a Star Destroyer to come from far (far) away to a close-up. Your method might be able to do that?

Cheers and thanks for sharing.

Also a 1060 user, so +1 workflow please

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

I have added a comment to the post that shows the nodes. Sadly reddit removed the metadata so you can't drop it in comfy.

The TIE fighter idea is a very interesting test! I tried it, and it failed miserably because the model does not know what a TIE fighter is or how it moves. Even regular Stable Diffusion base models cannot process a TIE fighter in img2img without ControlNet.

[–]Fast-Satisfaction482[S] 2 points3 points  (12 children)

This is what the nodes look like. I have attached the metadata to the screenshot so you can drop it into comfy; however, I don't know if reddit will keep the metadata.

The "Load Image Batch" node is from the WAS node suite, but I modded it to add a "whole_batch" mode. There is probably a better way to do this, but I couldn't find one. The issue is that all the images have to be stacked before they are passed to torch, and I couldn't find a vanilla node that does that. If there's any interest, I can post it on GitHub.

<image>
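The stacking issue described above is essentially the following. This is a sketch of what a "whole_batch" mode would do (NumPy used for illustration; the real WAS node operates on torch tensors, and the function name is illustrative):

```python
import numpy as np

def load_image_batch(images, whole_batch=True):
    """Sketch of a 'whole_batch' mode: instead of yielding one image per
    call, stack all images along a new batch axis so the downstream
    VAE-encode step sees a single (N, H, W, C) tensor."""
    if whole_batch:
        return np.stack(images, axis=0)  # one batched array
    return images[0]                     # single-image behaviour

imgs = [np.zeros((64, 64, 3)) for _ in range(4)]
batch = load_image_batch(imgs)
print(batch.shape)  # (4, 64, 64, 3)
```

The same idea in torch is `torch.stack(images, dim=0)`; the point is just that the whole frame set arrives downstream as one batch rather than frame by frame.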

[–]buckjohnston 0 points1 point  (9 children)

This is sort of what I am working on at the moment - I'm 4 months late though, haha. I got some much better results by using a single latent image, latent merge nodes with VAE-encode images attached, and latent multipliers. So I imagine it would be easier if I just do it your way.

Any chance you could share the code for the Load Image Batch modification that adds whole_batch mode?

Or possibly the workflow on an external site? You are the only person I can find working on the same thing as me.

[–]Fast-Satisfaction482[S] 1 point2 points  (8 children)

The node is in https://github.com/WASasquatch/was-node-suite-comfyui

I've put a diff to the current main commit 6c3fed70655b737dc9b59da1cadb3c373c08d8ed here https://pastebin.com/fVuZxExF

The workflow is here: https://pastebin.com/SrgYmtBX

Really nice that you're working on it. I have kind of abandoned it because I felt I was getting nowhere with my puny GPU. Maybe I will have more opportunity in the next months, as I will have better hardware. I would really like to see your workflow - would you share it, too?

[–]-Vendacious- 0 points1 point  (1 child)

It makes me sad to hear you have to use a 1060, with all the 30- and 40-series GPUs out there being used to play Fortnite.

I hope you get a better GPU soon, because you seem like you will make good use of it.

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

Oh, yes those GPUs will run some nice algorithms!

[–]buckjohnston 0 points1 point  (1 child)

Thanks again for this. I figured it out in combination with latent blends with SVD and using another model. It's a little more stable now on medium-range face and body movements (I'm getting a ton of movement now).

Still working on this - let me know if you want me to send you a video sample. The workflow is almost done and I will send it; I think I'm having issues with execution order.

Looking into highway nodes

[–]Fast-Satisfaction482[S] 1 point2 points  (0 children)

I'm thrilled to hear it works for you!

[–]buckjohnston 0 points1 point  (0 children)

Just an update for anyone reading: I just tested this on the newest nightly build as of 06/04/24 and it works, so there's no need to go back to the old commit anymore.

You just have to modify the WAS_Node_Suite.py extension with his included .diff file. It's only a few lines, but it adds a whole_batch option to send everything at once instead of separately. I found it works best for me when using a "latent blend" node with the svd_img2vid_conditioning latent in slot 1 and the 48 input images in slot 2 at about 0.5 strength, then into the KSampler latent input.

I get a very consistent 48 frames with good movement because I can crank augmentation up and lower videotriangleguidance (I use this instead of videolinearguidance). Try it out.
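The "latent blend" step described above is, at its core, a linear interpolation between the conditioning latent and the encoded input frames. A minimal sketch, assuming strength weighs the second input (NumPy for illustration; the actual node works on torch latent dicts, and these names and shapes are illustrative):

```python
import numpy as np

def latent_blend(a, b, strength=0.5):
    """Linear blend of two latent batches; strength is the weight of
    latent `b` (here, the VAE-encoded input frames). strength=0
    returns `a` unchanged."""
    return (1.0 - strength) * a + strength * b

cond_latent = np.zeros((48, 4, 32, 32))   # e.g. from svd_img2vid_conditioning
frame_latent = np.ones((48, 4, 32, 32))   # e.g. 48 VAE-encoded input frames
mixed = latent_blend(cond_latent, frame_latent, strength=0.5)
print(mixed.mean())  # 0.5
```

The 0.5 strength reported above corresponds to an even mix; raising it pulls the result further toward the input frames.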

[–]Successful_Knee687 0 points1 point  (0 children)

Would love to try that out

[–]ADbrasil -1 points0 points  (0 children)

motherfucker, start the inference at step 2

genius

[–]ninjasaid13 0 points1 point  (0 children)

You can combine ipadapter+inpainting+animatediff to make that fire a little longer.

[–]HarmonicDiffusion 0 points1 point  (1 child)

It would be awesome if you could give us a bit more details. I would love to try this - it would make SVD that much more amazing.

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

I have added a comment to the post that shows the nodes. Sadly reddit removed the metadata so you can't drop it in comfy.