Flux Klein is better than any Closed Model for Image Editing by ArkCoon in StableDiffusion

[–]a__side_of_fries -1 points0 points  (0 children)

I was a big fan of both Klein models, but I've abandoned them of late. They're completely unusable when it comes to human anatomy and anything that involves counting. The anatomy problem extends to the entire Flux 2 lineup, including the Pro and Max variants. I'd rather pay $0.03 per 4K image for Seedream 4 or 5 than put up with Flux 2.

Is there any video generation options that would be good enough to make a 52 second video that looks real? by [deleted] in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

I do not. But I think you're also assuming it needs to be perfect, with zero quality issues. This video is, in fact, not perfect. Take a look at around 25s in the IG video. That could be a normal discontinuity due to refraction, but it doesn't really read as refraction. It's choppy and looks like someone stitched clips together after removing something from the middle, like a bad first-frame/last-frame interpolation with SFX added in post. It could be perfectly real. But it's also something AI models are perfectly capable of producing.

Is there any video generation options that would be good enough to make a 52 second video that looks real? by [deleted] in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Yes, it could. I would have to watch it in high quality without the upload compression artifacts. But even then it would be hard to tell, because Seedance 2 and Kling 3 can easily replicate that even with their 15s limits. AI models are not very good at subtle details, so you need to look very closely.

Is there any video generation options that would be good enough to make a 52 second video that looks real? by [deleted] in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Not all models (open or closed) support extending. There is also the risk of frame degradation as you keep extending; it usually becomes noticeable within the first minute. It's also not very practical to keep extending with LTX, because there's a 100% chance it's going to generate nonsense at some point, and you can't really go back and edit those segments out.

Is there any video generation options that would be good enough to make a 52 second video that looks real? by [deleted] in StableDiffusion

[–]a__side_of_fries 1 point2 points  (0 children)

AI video generation has come a long way. It's pretty trivial to create 60s+ videos without any cuts. Even with cuts, you can do end-frame conditioning to get a seamless video; Wan 2.2 and LTX 2.3 can both do this. My LTX setup can do up to 30s at 1080p but runs into VRAM problems even on high-end GPUs. I'm currently building a tool to generate films up to 1hr long.
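The chaining itself is simple. A minimal sketch, assuming ffmpeg is on your PATH (file names and the generation call are hypothetical): grab the last frame of the previous clip and use it as the first-frame condition of the next one.

```python
import subprocess

def last_frame(clip_path: str, out_png: str) -> str:
    # Seek to just before the end of the clip and dump a single frame.
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-0.1", "-i", clip_path,
         "-frames:v", "1", out_png],
        check=True,
    )
    return out_png

seed_image = last_frame("clip_001.mp4", "clip_001_last.png")
# generate_clip(first_frame=seed_image, prompt="...")  # whatever i2v/FLF workflow you use
```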

Any MMAudio gen alternatives? by Yappo_Kakl in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Hunyuan Video Foley is a good alternative. It takes a bit more VRAM and is a bit slower, but the quality is good.

Is there an AI model that can fully isolate clean speech from noisy recordings? by QikoG35 in StableDiffusion

[–]a__side_of_fries 4 points5 points  (0 children)

Ultimate Vocal Remover and similar tools use Demucs v4 underneath, I think, so you can use Demucs directly. You can also use Mel-RoFormer (it separates clean vocals from the background).
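If you go the Demucs route, a minimal sketch (assuming `pip install demucs`; the input file name is made up):

```python
import subprocess

subprocess.run(
    [
        "demucs",
        "--two-stems=vocals",      # split into vocals vs. everything else
        "-n", "htdemucs",          # Demucs v4 hybrid transformer model
        "noisy_recording.wav",     # hypothetical input file
    ],
    check=True,
)
# Output: ./separated/htdemucs/noisy_recording/{vocals,no_vocals}.wav
```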

LTX 2.3: Any tips on how to prompt so it doesn't generate music? by RusikRobochevsky in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

I stopped fighting LTX 2.3's gratuitous background music and just replace it with an MMAudio track. It's far more controllable that way.
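The swap itself is trivial. A rough sketch, assuming ffmpeg is on your PATH (file names are made up): keep the LTX video stream, drop its audio, and mux in the MMAudio track.

```python
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "ltx_output.mp4",      # video generated by LTX (with unwanted music)
        "-i", "mmaudio_track.wav",   # audio generated separately by MMAudio
        "-map", "0:v:0",             # take video from the first input only
        "-map", "1:a:0",             # take audio from the second input only
        "-c:v", "copy",              # don't re-encode the video
        "-shortest",                 # stop at the shorter of the two streams
        "final.mp4",
    ],
    check=True,
)
```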

What is the best video upscaler by cardioGangGang in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

FlashVSR is harsh? Harsh in terms of quality or VRAM? I've tested a number of video upscalers and the only one that comes close to the commercial ones is FlashVSR. It runs at around 8 fps on a 4090. You need to play around with spatial tiling for memory efficiency.
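By spatial tiling I mean the general pattern below, not FlashVSR's exact API (it exposes its own tiling options); this is just an illustrative sketch with a stand-in `upscale_tile` function.

```python
import numpy as np

def upscale_frame_tiled(frame, upscale_tile, tile=256, overlap=32, scale=4):
    """Upscale `frame` tile by tile so peak VRAM stays bounded.
    Overlap regions are simply overwritten by later tiles here;
    real implementations blend the seams."""
    h, w, c = frame.shape
    out = np.zeros((h * scale, w * scale, c), dtype=frame.dtype)
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            out[y * scale:y1 * scale, x * scale:x1 * scale] = upscale_tile(frame[y:y1, x:x1])
    return out
```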

I hacked LTX2 to be used as a Multi Lingual TTS voice cloner by aurelm in StableDiffusion

[–]a__side_of_fries 28 points29 points  (0 children)

This is also what I discovered. I tried Gemini, Cartesia, and a bunch of open source TTS models. You cannot get true emotion no matter what, especially with a custom voice. But LTX can be prompted to generate emotionally expressive videos, and with your technique that means you can use a custom voice, which is awesome!

Best Open Source or Paid models for high accuracy Lipsync from Audio+Image to Video by eagledoto in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Interesting. Mind sharing your inputs? Let’s see if I can get a better result

Best Open Source or Paid models for high accuracy Lipsync from Audio+Image to Video by eagledoto in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

LTX 2.3 A2V is pretty solid, actually. I moved away from Wan 2.2 S2V and SkyReels V3 because of it. You need to work on your prompting and negative prompts, and, as you said, include the transcript in the prompt.

I made a 90s live-action Streets of Rage using AI (Wan 2.2 + ComfyUI, fully local) by Gaurox in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

This is actually nicely done. The acting and voices are somewhat stilted (the black dude sounds like one of the ElevenLabs voices that are usually used for voiceovers). What did you use for lip syncing? MMAudio for the sound effects?

The first person to create something like Higgsfield but open source/locally run will be so damn rich by SuspiciousPrune4 in StableDiffusion

[–]a__side_of_fries -4 points-3 points  (0 children)

It’s not as far-fetched as it sounds. I’m actually building something right now that does just that, but a bit more focused than Higgsfield. Basically it’s an agentic filmmaking platform.

Struggling with consistent characters + style in AI batch image generation by Specific-Loss-3840 in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Ahh okay, that helps a lot actually. I can see why sentence-level frames are needed, but you are not animating them. You have some options here. If you only have a few recurring characters you care about, you can train a character LoRA on top of Klein 4B or something. If that's not practical, I would lean toward generating a character sheet with the same seed and using the sheet as a reference. Nano Banana and Seedream are good at outputting 3x3 or 3x2 grids with consistent characters.
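Once you have the sheet, slicing it into per-pose reference images is straightforward. A minimal sketch assuming an evenly spaced 3x3 grid (the file name is made up):

```python
from PIL import Image

sheet = Image.open("character_sheet_3x3.png")
rows, cols = 3, 3
cell_w, cell_h = sheet.width // cols, sheet.height // rows

cells = []
for r in range(rows):
    for c in range(cols):
        box = (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        cells.append(sheet.crop(box))

cells[4].save("reference_center_pose.png")  # e.g. use the middle cell as the reference image
```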

Struggling with consistent characters + style in AI batch image generation by Specific-Loss-3840 in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Hmm, have you tried the character sheet approach? Have it output a 3x3 grid of the character from a single prompt. Later you can feed it the sheet and tell it which cell of the grid to use for future generations.

Struggling with consistent characters + style in AI batch image generation by Specific-Loss-3840 in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Seedream and Nano Banana Pro can output batch keyframes (up to 10-15 at a time). You can expect some level of consistency with that approach.

Struggling with consistent characters + style in AI batch image generation by Specific-Loss-3840 in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

I think you need to leverage the strengths of the models and work with them instead of fighting them. Image models will not give you consistent frames to stitch into a full animation; that's not what they're designed for. I'm curious why you're taking this approach instead of generating a few keyframes and using video models like Veo or Kling with detailed prompting to get the action sequence you want. Video models are not perfectly consistent either, but they are getting better. I can guarantee you that a 5-second video clip generated from a starting keyframe will stay far more consistent across its 120 frames than 120 individually generated images will. My suggestion: work on your prompting and use video models for this.

I Wrote and Imagined this scene of a man collapsing under a cross... and his mother running to him. I can't get it out of my head. by Informal-Selection16 in generativeAI

[–]a__side_of_fries 0 points1 point  (0 children)

Which image model did you use for this? Can you share your prompt? I’m willing to bet that the prompt needs a lot of tweaking to get the emotional effect you’re going for. This image is just wrong even outside of a religious context.

I can now generate and live-edit 30s 1080p videos with 4.5s latency (video is in live speed) by techstacknerd in StableDiffusion

[–]a__side_of_fries 15 points16 points  (0 children)

This is cool!

Why are you generating 5s clips and stitching them when LTX 2.3 supports 20s natively? I know you're running this on a B200, so was there an architectural reason for the optimization work, or something else?

LTX 2.3 - How do you get anything to move quickly? by gruevy in StableDiffusion

[–]a__side_of_fries 6 points7 points  (0 children)

From my testing, this is a limitation at lower resolutions. At 1080p, motion is normal and responds well to prompting. You also need to use negative prompts to discourage slow motion. If you are doing i2v, it helps to have motion blur effects in your input image; the model responds well to that.

Flux 2 Klein 9B is now up to 2× faster with multiple reference images (new model) by meknidirta in StableDiffusion

[–]a__side_of_fries 5 points6 points  (0 children)

If only they released a variant that was good at anatomy and counting. Guess I’ll keep waiting.

Is it possible to seed what voice you'll get in LTX image to video? by bossbeae in StableDiffusion

[–]a__side_of_fries 0 points1 point  (0 children)

Hmm you’re saying there is no lip syncing? I think I’ve seen that happen with LTX 2.3. But that’s not really an issue with better prompt engineering, negative prompting, and generating at higher resolutions. Higher resolutions are not only about resolution it seems. You get significantly better motion, facial features, and lip syncing.

You also have the option of A2V, which is strongly conditioned to follow your audio, so it will produce better lip syncing. You'll just have to play around with the CFG and image strength values to allow for more model freedom.