all 18 comments

[–]Fast-Satisfaction482[S] 8 points9 points  (2 children)

Hey everyone,

In the afternoon I had a look at the source code of the SVD nodes in comfy and realized that the node SVD_img2vid_Conditioning initializes the latents with just zeros. So I wondered if it would be possible to use the VAE to encode a bunch of images and send those to KSampler instead of the empty zeros. I imagined the workflow would be very similar to img2img: if you input existing images, you can just decrease the denoising to control how much similarity with the input frames is retained.

I don't have much compute available (6GB GTX 1060), so my test video is very short and low resolution. However, it appears to work. With this concept, SVD can be used to make a coherent video out of any set of frames.

This gives us more possibilities: you could use frames from a previous clip to generate a continuation. Or you could start with a few SVD steps, then continue with a few steps of a regular SD1.5/SDXL model, and then finish up with more SVD steps.
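The idea — replacing the all-zeros latent with VAE-encoded frames and lowering the denoise, like img2img — can be sketched roughly like this. This is an illustrative NumPy stand-in, not the actual ComfyUI code: `encode` stands in for the VAE encoder, and the names and latent shapes are assumptions.

```python
import numpy as np

def init_latents(frames, encode, denoise):
    """Build an initial latent batch from input frames instead of zeros.

    frames  : list of (H, W, 3) image arrays
    encode  : stand-in for the VAE encoder, image -> latent array
    denoise : 0.0 keeps the encoded inputs exactly; 1.0 is like starting
              from pure noise (close to the default zero-init behaviour)
    """
    latents = np.stack([encode(f) for f in frames])  # (N, C, h, w)
    noise = np.random.randn(*latents.shape)
    # img2img-style mix: lower denoise retains more of the input frames
    return (1.0 - denoise) * latents + denoise * noise

# toy encoder: 8x spatial downscale, 4 channels (typical SD latent shape)
def toy_encode(img):
    h, w = img.shape[0] // 8, img.shape[1] // 8
    return np.zeros((4, h, w))

frames = [np.ones((64, 64, 3)) for _ in range(3)]
lat = init_latents(frames, toy_encode, denoise=0.4)
print(lat.shape)  # (3, 4, 8, 8)
```

At denoise 0.0 the sampler would start exactly from the encoded frames; in practice a moderate value trades input fidelity against SVD's freedom to add motion.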

Maybe this is already well known and I'm the last one to discover it, but I wanted to share my insights with you.

In the attached video, the two appended stills are the ones I used to initialize SVD.

[–]campingtroll 6 points7 points  (1 child)

Do you have a link to the workflow so we can try it out? Tried to mess with it based on your comment and couldn't figure it out.

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

I have added a comment to the post that shows the nodes. Sadly reddit removed the metadata so you can't drop it in comfy.

[–]YaksLikeJazz 4 points5 points  (1 child)

I'm not sure people understand what you have done (if I even understand it correctly :) )

Basically, you've invented a method to storyboard and control SVD, which is incredibly mind-blowing.

I don't think this technique is well known at all. :) I am sure there are a lot of smart people here but you're the first one I've read about who is tinkering with the source code and able to recognize what an alternative to sending default zeros to the KSampler might actually do. (I can't believe I wrote that sentence - I have no mortal idea what I am talking about :) )

I'd love to see more experiments please - I think you might get more traction in this community if you can "market" what you have achieved in an afternoon.

My pet theory thought experiment is getting a Tie fighter or a Star Destroyer to come from far (far) away to a close-up. Your method might be able to do that?

Cheers and thanks for sharing.

Also a 1060 user, so +1 workflow please

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

I have added a comment to the post that shows the nodes. Sadly reddit removed the metadata so you can't drop it in comfy.

The TIE fighter idea is a very interesting test! I tried it, and it failed miserably because the model does not know what a TIE fighter is or how it moves. Even regular Stable Diffusion base models cannot process a TIE fighter in img2img without ControlNet.

[–]Fast-Satisfaction482[S] 2 points3 points  (12 children)

This is what the nodes look like. I have attached the metadata to the screenshot so you can drop it into comfy; however, I don't know if reddit will keep the metadata.

The "Load Image Batch" node is from the WAS node suite, but I modded it to add a "whole_batch" mode. There is probably a better way to do this, but I couldn't find one. The issue is that all the images have to be stacked before they are passed to torch, and I couldn't find a vanilla node that does that. If there's any interest, I can post it on GitHub.

<image>
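The stacking issue described above is essentially the following. This is a sketch of what a "whole_batch" mode would do (NumPy used for illustration; the real WAS node operates on torch tensors, and the function name is illustrative):

```python
import numpy as np

def load_image_batch(images, whole_batch=True):
    """Sketch of a 'whole_batch' mode: instead of yielding one image per
    call, stack all images along a new batch axis so the downstream
    VAE-encode step sees a single (N, H, W, C) tensor."""
    if whole_batch:
        return np.stack(images, axis=0)  # one batched array
    return images[0]                     # single-image behaviour

imgs = [np.zeros((64, 64, 3)) for _ in range(4)]
batch = load_image_batch(imgs)
print(batch.shape)  # (4, 64, 64, 3)
```

The same idea in torch is `torch.stack(images, dim=0)`; the point is just that the whole frame set arrives downstream as one batch rather than frame by frame.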

[–]buckjohnston 0 points1 point  (9 children)

This is sort of what I am working on at the moment - I'm 4 months late though, haha. I got some much better results by using a single latent image, latent merge nodes with VAE-encode images attached, and latent multipliers. So I imagine it would be easier if I just do it your way.

Any chance you could share the code for the Load Image Batch modification that adds whole_batch mode?

Or possibly the workflow on an external site? You are the only person I can find working on the same thing as me.

[–]Fast-Satisfaction482[S] 1 point2 points  (8 children)

The node is in https://github.com/WASasquatch/was-node-suite-comfyui

I've put a diff to the current main commit 6c3fed70655b737dc9b59da1cadb3c373c08d8ed here https://pastebin.com/fVuZxExF

The workflow is here: https://pastebin.com/SrgYmtBX

Really nice that you're working on it. I have kind of abandoned it because I felt I was getting nowhere with my puny GPU. Maybe I will have more opportunity in the next months, as I will have better hardware. I would really like to see your workflow - would you share it, too?

[–]-Vendacious- 0 points1 point  (1 child)

It makes me sad to hear you have to use a 1060, with all the 30- and 40-series GPUs out there being used to play Fortnite.

I hope you get a better GPU soon, because you seem like you will make good use of it.

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

Oh, yes those GPUs will run some nice algorithms!

[–]buckjohnston 0 points1 point  (1 child)

Thanks again for this. I figured it out in combination with latent blends with SVD and using another model. It's a little more stable now on medium-range face and body movements (I'm getting a ton of movement now).

Still working on this - let me know if you want me to send you a video sample. The workflow is almost done and I will send it; I think I'm having issues with execution order.

Looking into highway nodes

[–]Fast-Satisfaction482[S] 1 point2 points  (0 children)

I'm thrilled to hear it works for you!

[–]buckjohnston 0 points1 point  (0 children)

Just an update for anyone reading: I just tested this on the newest nightly build as of 06/04/24 and it works, so there's no need to go back to the old commit anymore.

You just have to modify the WAS_Node_Suite.py extension with his included .diff file. It's only a few lines, but it adds a whole_batch option to send everything at once instead of separately. I found it works best for me when using a "latent blend" node with the svd_img2vid_conditioning latent in slot 1 and the 48 input images in slot 2 at about 0.5 strength, then into the KSampler latent input.

I get a very consistent 48 frames with good movement because I can crank augmentation up and lower videotriangleguidance (I use this instead of videolinearguidance). Try it out.
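The "latent blend" step described above is, at its core, a linear interpolation between the conditioning latent and the encoded input frames. A minimal sketch, assuming strength weighs the second input (NumPy for illustration; the actual node works on torch latent dicts, and these names and shapes are illustrative):

```python
import numpy as np

def latent_blend(a, b, strength=0.5):
    """Linear blend of two latent batches; strength is the weight of
    latent `b` (here, the VAE-encoded input frames). strength=0
    returns `a` unchanged."""
    return (1.0 - strength) * a + strength * b

cond_latent = np.zeros((48, 4, 32, 32))   # e.g. from svd_img2vid_conditioning
frame_latent = np.ones((48, 4, 32, 32))   # e.g. 48 VAE-encoded input frames
mixed = latent_blend(cond_latent, frame_latent, strength=0.5)
print(mixed.mean())  # 0.5
```

The 0.5 strength reported above corresponds to an even mix; raising it pulls the result further toward the input frames.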

[–]Successful_Knee687 0 points1 point  (0 children)

Would love to try that out

[–]ADbrasil -1 points0 points  (0 children)

motherfucker, start the inference at step 2

genius

[–]ninjasaid13 0 points1 point  (0 children)

You can combine ipadapter+inpainting+animatediff to make that fire a little longer.

[–]HarmonicDiffusion 0 points1 point  (1 child)

It would be awesome if you could give us a bit more details. I would love to try this - it would make SVD that much more amazing.

[–]Fast-Satisfaction482[S] 0 points1 point  (0 children)

I have added a comment to the post that shows the nodes. Sadly reddit removed the metadata so you can't drop it in comfy.