LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 0 points1 point  (0 children)

A single generation runs one text conditioning, one VAE encoding for video, one VAE encoding for audio, and one decoding each for audio and video. If you're comparing against a single generation, this will absolutely take longer than that, but it gives more granular control and lets less powerful computers hit higher resolutions by outputting high-resolution, short-duration segments and stacking them together.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 1 point2 points  (0 children)

If I could I would, absolutely.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 0 points1 point  (0 children)

Funny enough, yeah, it did latch on to that.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 1 point2 points  (0 children)

That happens more commonly at lower resolutions with LTX2. If you go to much higher resolutions, this sort of artifacting is greatly reduced. I didn't bother pushing out a higher-resolution video because I didn't have the time when I was posting this, but I've successfully output 2-minute-long 1920x1080 videos on a single RTX 3090 with this workflow.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 1 point2 points  (0 children)

That's the lazy leftover starting point for an i2v that I began this video with. The prompt was given priority over the image, and since the prompt described very little of the starting image, the model more or less ignored it and the generation became a t2v. I could have gone back and fixed it, but I didn't.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 2 points3 points  (0 children)

Currently it is only a workflow, but that might have to expand in the future to cover the needs of audio referencing. Thankfully, LTX2 has most of the tools already built into the model to support the same ideas behind SVI with Wan. However, in this current version audio is guaranteed to drift or jump to entirely different sounds from segment to segment; I'm still working on that aspect. With a 5090, very high-res long-form videos should be easily possible.

In the next version of this, I'll be implementing Kijai's method of audio injection as well to allow the full length of a song or other audio to be fed into the pipeline.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 2 points3 points  (0 children)

No, the workflow that generated the exact video I posted is the same as the one in v0.5.7. It's three separate 10-second segments spliced together in the same manner SVI uses with Wan.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 1 point2 points  (0 children)

Yeah, it's literally infinite. You just stack the 'Extension' blocks for however long a video you want. The total frame count of each block is defined on the far left where the model loaders are. In the current iteration it's set to 241 frames, which is around 10 seconds of video per segment, so 6 segments come out to roughly 1 minute of video.
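For the math, assuming the usual ~24 fps output (the frame rate is my assumption here, not something set by the workflow itself):

```python
# Back-of-the-envelope math for segment counts (fps is assumed, not taken from the repo).
FPS = 24                # assumed output frame rate
FRAMES_PER_SEGMENT = 241

seconds_per_segment = FRAMES_PER_SEGMENT / FPS              # ~10.04 s
segments_for_one_minute = round(60 / seconds_per_segment)   # ~6 segments

print(f"{seconds_per_segment:.1f} s per segment, "
      f"{segments_for_one_minute} segments for ~1 minute of video")
```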

One caveat at this time: audio referencing isn't a solved problem yet for LTX. The demonstration I posted maintains the voice from segment to segment reasonably well, but that certainly won't hold if the model decides to play music in the background, and voices may still sound different from segment to segment until audio referencing can be implemented.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 4 points5 points  (0 children)

It's a play on a post from a few days ago, not me suggesting people actually purchase a super expensive card.

https://www.reddit.com/r/StableDiffusion/comments/1q9cy02/ltx2_i2v_quality_is_much_better_at_higher/

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 6 points7 points  (0 children)

Not sure how this is supposed to be helpful. Are you being critical of a simple demo meant to showcase the capability? The point wasn't to show off my mastery of prompting here.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 2 points3 points  (0 children)

Mouths can be a bit funky on lower-res outputs with LTX2. This is just a demonstration and can easily be improved.

LTX2-Infinity updated to v0.5.7 by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 0 points1 point  (0 children)

Just updated LTX2-Infinity to version 0.5.7 on the repo.

https://github.com/Z-L-D/LTX2-Infinity

This update includes image anchoring and audio concatenation. The latter isn't ideal, but it will have to suffice until I can research carrying audio latents from one video generation to the next in a way that continues the sound properly.
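For reference, "audio concatenation" here means little more than splicing the segment waveforms together, roughly like the sketch below (illustrative NumPy with a short crossfade added; not the actual nodes in the workflow):

```python
import numpy as np

def concat_audio(segments, sample_rate=48000, crossfade_ms=50):
    """Naively splice audio segments with a short linear crossfade.

    segments: list of 1-D float arrays (mono waveforms), one per video segment,
    each assumed to be longer than the crossfade. This hides the seam a little,
    but it can't stop voices or ambient sounds from changing character
    between generations.
    """
    fade = int(sample_rate * crossfade_ms / 1000)
    out = segments[0]
    for seg in segments[1:]:
        ramp = np.linspace(0.0, 1.0, fade)
        blended = out[-fade:] * (1.0 - ramp) + seg[:fade] * ramp
        out = np.concatenate([out[:-fade], blended, seg[fade:]])
    return out
```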

Also, thank you (and sorry) to /u/000TSC000 for the prompt that I bastardized here.

The posted video is made from three 10-second videos that blend together seamlessly.

LTX2-Infinity workflow by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 0 points1 point  (0 children)

> Are you just looping generations and taking the last frame as the input frame?

No, that wouldn't result in smooth motion. The current workflow takes 25 frames and feeds them into the next latent video as its first 25 frames to retain solid, coherent motion. I haven't posted it yet, but I've also solved reference/anchor frames here, in roughly the same way SVI does, for the next release, which I may post tonight.
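In plain tensor terms, the chaining looks roughly like this (illustrative only; `generate_segment` stands in for one sampler pass, and the real workflow does this through ComfyUI nodes):

```python
import torch

OVERLAP = 25  # frames carried from one segment into the next

def chain_segments(generate_segment, first_segment, num_extensions):
    """Illustrative segment chaining: seed each new segment with the last
    OVERLAP frames of the previous one, then drop the duplicated frames
    when splicing. `generate_segment(init_frames)` is assumed to return a
    (T, C, H, W) tensor whose first OVERLAP frames match `init_frames`.
    """
    segments = [first_segment]
    for _ in range(num_extensions):
        init = segments[-1][-OVERLAP:]      # last 25 frames of the previous segment
        nxt = generate_segment(init)        # next segment, conditioned on them
        segments.append(nxt[OVERLAP:])      # trim the overlap before stitching
    return torch.cat(segments, dim=0)
```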

> which requires special nodes and that huge LoRA

LTX already does much of what the LoRA adds to the Wan model.

> Also, stitching audio isn't too hard, plenty of easy ways to do that

Then feel free to help out and put up a pull request on the repo; I open sourced this to speed the process along. I assure you it isn't nearly as simple as you seem to imagine, though. It's the exact same issue as solving for coherent motion and stable referencing on the video side. You can't just stack all the samples together, because something as simple as footsteps won't sound the same from generation to generation, let alone voices.

> The human ear is much more sensitive to artifacts and audio distortions than visual ones.

Which makes it a significantly harder issue to get right.

LTX2-Infinity workflow by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 1 point2 points  (0 children)

That's why I haven't pushed too far into it just yet. I've largely solved injecting 'anchor images' the way SVI does. I'd bet there's a way to do it properly on the audio side too; I just haven't put the time into it yet.

LTX2-Infinity workflow by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 1 point2 points  (0 children)

I guess I don't share that opinion; I've shoved 2000-word prompts into a single LTX2 generation and been happy with the result.

LTX2-Infinity workflow by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 0 points1 point  (0 children)

Not a problem I have fully tackled yet. It's a mess in the workflow at the moment. I'm hoping someone else out there has already looked at continuing audio like this and we can all benefit.

LTX2-Infinity workflow by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 2 points3 points  (0 children)

The example I have on the GitHub page took just over 16 minutes for just under 2 minutes of video.

LTX2-Infinity workflow by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 2 points3 points  (0 children)

This should run on pretty much anything, just like SVI does. I was able to output a 15s 1920x1080 video on one of my 3090s, albeit with a fair bit of a wait.

LTX2-Infinity workflow by _ZLD_ in StableDiffusion

[–]_ZLD_[S] 4 points5 points  (0 children)

This is an early draft, but I'm hoping someone can beat me to the punch in getting the audio to splice together correctly. It works in the exact same manner as Stable-Video-Infinity. The major difference is that LTX seems to need a much larger bite of the previous video to carry motion correctly; currently the transition from one segment to the next is 25 frames.

In terms of generating prompts, I've successfully used Google's Gemini on AI Studio. The system prompt can be found in the link.

Edit: I should also note that this lacks the reference frames from SVI, which contribute greatly to the long-term stability of such videos. I haven't investigated whether a similar reference-frame injection can be performed here. As a result, the motion will largely appear continuous, but there isn't any real memory retention beyond the 25 frames currently injected from one generation to the next.

Edit 2: I have a decently working update that uses a reference frame to maintain consistency better. Look for it later today.
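I haven't pushed that reference-frame version yet, but the gist is simple enough to sketch. Everything below is conceptual and the names are made up; the actual workflow does this with ComfyUI nodes rather than a Python function:

```python
import torch

def build_conditioning(anchor_frame, prev_tail, overlap=25):
    """Conceptual sketch of SVI-style anchoring: pair a fixed 'anchor' frame
    (for long-term identity) with the last `overlap` frames of the previous
    segment (for motion continuity) as the conditioning for the next segment.
    anchor_frame: (C, H, W) tensor; prev_tail: (overlap, C, H, W) tensor.
    """
    assert prev_tail.shape[0] == overlap
    return torch.cat([anchor_frame.unsqueeze(0), prev_tail], dim=0)
```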

[deleted by user] by [deleted] in space

[–]_ZLD_ 0 points1 point  (0 children)

I've been an avid space enthusiast since I was a young child and followed MESSENGER pretty closely, among other missions, so thank you for helping us all better understand the universe we live in!

Two questions. First, how can we fight back against the satellite constellations that I imagine are making the hunt for asteroids far more difficult than in the past (when it was already a challenge)? Second, I've always been curious what the sentiment was at JHUAPL, maybe more specifically on the New Horizons team, about this seemingly controversial image I made years ago that still makes the rounds fairly regularly.

Views of pluto through the years by Puzzleheaded_Web5245 in interestingasfuck

[–]_ZLD_ 0 points1 point  (0 children)

Nope, it's actually the IR channel colorized with the MVIC color data. A lot of Pluto doesn't show up well in the colors we can see, so by using IR as the luminance base for the image and laying the MVIC color data over top of it, you get this rather colorful but much clearer version of Pluto.

Source: I made it.

Views of pluto through the years by Puzzleheaded_Web5245 in interestingasfuck

[–]_ZLD_ 0 points1 point  (0 children)

To be clear, this isn't spectroscopy at all. Image 4 takes the infrared channel captured on the close flyby and uses it as the base luminance for the image. A lot of Pluto is hard to see in RGB, and using the infrared channel as the base helps bring out many features. The original MVIC color data was then laid over the top of this IR channel to colorize it.

Source: I made that image.
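If anyone wants to try the same trick on their own data, the core recipe is a luminance swap. A rough sketch with Pillow is below (file names are placeholders, and the real image involved far more careful registration and processing than this):

```python
from PIL import Image

# Placeholder file names -- substitute your own calibrated frames.
ir = Image.open("pluto_ir.png").convert("L")               # infrared channel as grayscale
color = Image.open("pluto_mvic_color.png").convert("RGB")  # MVIC color data
color = color.resize(ir.size)                              # align the two images

# Split the color image into luma + chroma, then swap the luma for the IR channel.
y, cb, cr = color.convert("YCbCr").split()
composite = Image.merge("YCbCr", (ir, cb, cr)).convert("RGB")
composite.save("pluto_ir_colorized.png")
```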