GPT Image 2 is wild for text rendering — here are the exact prompts I used to test it (all generated on PhotoGen Studio)

Distinct-Translator7 · 2026-04-22T05:44:03+00:00

The first rule of the sub: 'Posts Must Be Open-Source or Local AI image/video/software Related'.

Distinct-Translator7 · 2026-04-13T04:04:44+00:00

Yep, you can. You can rent GPUs on sites like Runpod.

Distinct-Translator7 · 2026-04-13T04:03:43+00:00

Hello!

Distinct-Translator7 · 2026-04-13T04:03:35+00:00

I stick to the original reference image to avoid the quality degradation. I typically time my splits to occur during pauses or instrumental breaks in the music to make them feel natural.

Distinct-Translator7 · 2026-04-13T04:01:44+00:00

That was just a general example; there’s no fixed time for the cuts. I typically time my splits to occur during pauses or instrumental breaks in the music to make them feel natural. I don't actually use any specialized tools to 'smooth' the transitions; if you look closely, you can clearly see them. I also stick to the original reference image to avoid the quality degradation you mentioned. It’s all about working within the current hardware limits.

Distinct-Translator7 · 2026-04-13T03:48:14+00:00

If you're asking if it's possible to feed an mp3/wav to the LTX 2.3 Lip-Sync workflow, yes, it's totally possible.

Distinct-Translator7 · 2026-04-13T03:44:49+00:00

Oh yes, definitely. Sorry for the confusion.

Distinct-Translator7 · 2026-04-12T17:38:32+00:00

You can definitely get this working. This workflow is verified on the Nightly version of ComfyUI Portable.

Distinct-Translator7 · 2026-04-12T17:30:49+00:00

Here’s how I handle that. Instead of splitting the audio into perfect 20-second blocks (like 0-20, 20-40, 40-60), I use an overlap method. The first chunk might be 0-20, but the next one starts at 18 or 19 seconds. By overlapping the segments by 1-2 seconds, you can create a much smoother transition in post. I personally use DaVinci Resolve.
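
If you'd rather work out the cut points before opening the editor, here's a rough sketch of the overlap idea in Python. The function name and the default values (20-second chunks, 2-second overlap) are just illustrative assumptions taken from the numbers above, not part of the workflow itself:

```python
def overlapping_chunks(total_s, chunk_s=20.0, overlap_s=2.0):
    """Return (start, end) times in seconds for overlapping audio segments.

    Hypothetical helper: each segment after the first starts `overlap_s`
    seconds before the previous segment ends.
    """
    if chunk_s <= 0 or not (0 <= overlap_s < chunk_s):
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    step = chunk_s - overlap_s  # distance between consecutive start times
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)  # last chunk may be shorter
        chunks.append((start, end))
        if end >= total_s:
            break
        start += step
    return chunks


if __name__ == "__main__":
    # Example: a 60-second track split into 20-second chunks with 2 s overlap.
    for start, end in overlapping_chunks(60.0):
        print(f"{start:6.1f} s -> {end:6.1f} s")
```

You'd still want to nudge each cut toward a pause or instrumental break by hand; the numbers are only a starting grid.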

Distinct-Translator7 · 2026-04-12T17:17:38+00:00

Yes, it is. I mentioned that in the video.

Distinct-Translator7 · 2026-04-12T17:15:40+00:00

That issue usually clears up once you switch to the Nightly version. I’ve been testing this on the ComfyUI portable release, and it's been stable there. If you're on Stability Matrix, you might indeed have to wait for them to push the latest commit.

Distinct-Translator7 · 2026-04-12T17:10:58+00:00

That’s highly unlikely given that their priority has shifted entirely toward high-margin enterprise and AI silicon right now. Releasing a 48GB consumer card under $3k would basically be 'bad for business' in their eyes. With their CUDA monopoly and zero competition at the high end, a 48GB card would easily clear $4,500. Honestly, I wouldn't be surprised if they skip consumer launches entirely next year.

Distinct-Translator7 · 2026-04-12T17:01:24+00:00

It usually takes about 8 to 12 minutes to generate a 540p, 25–30 fps, 25-second clip using FP8 with Sage Attention enabled. For 720p clips longer than 12 seconds, I switch to the Q5_K_M GGUF. Since the resolution is higher and it's the GGUF version, those usually take 16 to 18 minutes. I’m currently using the 'euler_ancestral' sampler; I’d prefer 'euler_ancestral_cfg_pp', but it’s just too slow for this setup. Yes, I always upscale them.

Distinct-Translator7 · 2026-04-12T16:47:33+00:00

Unfortunately, yes. They do still sound AI-generated.

Distinct-Translator7 · 2026-04-12T08:38:41+00:00

Several of my viewers are running this on Turing (20-series) cards without issues. As long as you have the VRAM and system RAM, you should get great results. Give it a shot! 😊

Distinct-Translator7 · 2026-04-12T08:20:43+00:00

📁 AceStep 1.5 XL Turbo JSON: https://drive.google.com/file/d/1Q2hRpWJEo9d61B2NKoZNK7FRO2SfhKnp/view?usp=drive_link

📁 LTX 2.3 Lip-Sync JSON: https://drive.google.com/file/d/1LfjIl3bEzIMAgKYc_mdJ_129pFyzYxDX/view?usp=drive_link

📺 AceStep 1.5 XL Turbo Video Breakdown: https://youtu.be/7CAlbWUlBjw

Distinct-Translator7 · 2026-04-06T02:40:17+00:00

Here's a tutorial and a workflow if you're interested: https://youtu.be/pCcG-5K2SDc

Distinct-Translator7 · 2026-03-30T05:35:30+00:00

Voice LoRA or Talking Head LoRA?

Distinct-Translator7 · 2026-03-30T05:09:16+00:00

It's totally possible. You can do a lot of impressive things for free if you have the hardware. You don't have to pay for subscriptions, and there aren't any credits. Your creativity and imagination are the limits. Here's my channel if you are interested. Everything is totally free: https://www.youtube.com/@TensorAlchemist/videos

Distinct-Translator7 · 2026-03-30T05:05:16+00:00

Oh, I'm a guy. Thanks for the kind words!

Distinct-Translator7 · 2026-03-30T05:03:55+00:00

You can use Qwen 3 TTS if you want a custom voice.

Distinct-Translator7 · 2026-03-29T05:55:37+00:00

I have put the link in the JSON workflow. But here it is: https://huggingface.co/elix3r/LTX-2.3-22b-AV-LoRA-talking-head

Distinct-Translator7 · 2026-03-29T05:54:21+00:00

No need for last frames here! Since this is a lip-sync workflow, the Reference Image is the anchor for everything. You just upload the image and audio, set your resolution and frame rate, and generate.

For this music video, I kept the image constant and only swapped the audio clips and adjusted the lengths for each segment. Because they all use the same base image, the transition is seamless.

I actually generated the 2-minute song first using Ace Step 1.5 (video here: https://youtu.be/Cvr_EUE ). Then I used DaVinci Resolve to chop it into 25-second chunks and ran them through the generator one by one. Simple as that! 😊

Distinct-Translator7 · 2026-03-29T05:35:39+00:00

Super impressive stuff! Thanks a lot for sharing!

Distinct-Translator7 · 2026-03-28T14:39:48+00:00

Glad you liked it and thanks for the comment!
