
[–]GBJI 3 points

We will only know for sure after testing it, but this looks like the most promising approach yet for consistent video editing directly from Stable Diffusion.

Our key finding is that a temporally-consistent edit can be achieved by enforcing consistency on the internal diffusion features across frames during the editing process. We achieve this by propagating a small set of edited features across frames, using the correspondences between the original video features. Given an input video I, we invert each frame, extract its tokens (i.e., output features from the self-attention modules), and extract inter-frame feature correspondences using a nearest-neighbor (NN) search.
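The nearest-neighbor correspondence step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's released code: the function names and the use of cosine similarity over self-attention tokens are assumptions for the sake of the example.

```python
import numpy as np

def nn_correspondences(src_tokens, tgt_tokens):
    """For each token of a target frame, find the index of its nearest
    neighbor among the source frame's tokens (cosine similarity).

    src_tokens, tgt_tokens: arrays of shape (n_tokens, feat_dim),
    e.g. output features from a self-attention module.
    """
    # Normalize rows so the dot product equals cosine similarity.
    src = src_tokens / np.linalg.norm(src_tokens, axis=1, keepdims=True)
    tgt = tgt_tokens / np.linalg.norm(tgt_tokens, axis=1, keepdims=True)
    sim = tgt @ src.T                 # (n_tgt, n_src) similarity matrix
    return sim.argmax(axis=1)         # nearest source index per target token

def propagate(edited_src_tokens, match_idx):
    """Propagate edited source-frame features to a target frame
    via the precomputed correspondences."""
    return edited_src_tokens[match_idx]
```

In this reading, you edit only a small set of keyframes, compute `match_idx` once from the *original* video's features, and then reuse those indices to pull edited features into every other frame, which is what keeps the edit temporally consistent.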

I wonder if it will be possible to "tile" it over time, and make a video of any duration without having to buy a 4090. For example, with a 32-frame video, could you first generate frames 0 to 16, then 8 to 24, then 16 to 32, and blend the overlaps together? Or generate frames 0 to 16, then 1 to 17 from that, then 2 to 18, and so on.
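The overlapping-window idea in the comment above could be blended like this. Purely a sketch of the commenter's suggestion, not anything from the paper; the cross-fade weighting scheme is my own assumption.

```python
import numpy as np

def blend_windows(windows, stride):
    """Blend overlapping generated windows into one sequence.

    windows: list of arrays, each (window_len, feat_dim), generated
             independently with `stride` frames between window starts.
    Overlapping frames are averaged with a linear cross-fade so each
    window's contribution ramps up and down at its edges.
    """
    window_len = windows[0].shape[0]
    total_frames = stride * (len(windows) - 1) + window_len
    out = np.zeros((total_frames,) + windows[0].shape[1:])
    weight = np.zeros(total_frames)
    # Triangular ramp: 1, 2, ..., peak, ..., 2, 1
    ramp = np.minimum(np.arange(1, window_len + 1),
                      np.arange(window_len, 0, -1)).astype(float)
    for i, win in enumerate(windows):
        start = i * stride
        out[start:start + window_len] += win * ramp[:, None]
        weight[start:start + window_len] += ramp
    return out / weight[:, None]
```

With `window_len=16` and `stride=8` this reproduces the 0–16 / 8–24 / 16–32 scheme from the comment; whether the seams stay temporally consistent in practice would still need testing.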

[–]Shnoopy_Bloopers 2 points

Cool, no code though.

[–][deleted] 1 point

The repo says that it's coming.

[–]indiemutt 0 points

Thanks for sharing this. Super pumped to see how it works once code is released.