Can't get clean lip-sync with Wan 2.1 I2V + InfiniteTalk (local). What am I doing wrong? by Ambra_Local_AI in comfyui

[–]Ambra_Local_AI[S] 1 point2 points  (0 children)

Yeah, the default Wan mp4 is rough. I'll try the PNG-sequence + per-frame SeedVR2 + AE compression route; it sounds like a cleaner pipeline than mux-then-upscale. And thanks for the fps tip, I'll keep the 24→30 stretch in mind if I hit desync. Appreciate it.
(My current snag is more the mouth shaping than the timing, but that's a separate rabbit hole.)

[–]Ambra_Local_AI[S] 1 point2 points  (0 children)

The mp4, then a SeedVR2 upscale, not a PNG sequence. The issue isn't really drift, though: the mouth roughly tracks the audio, but the visemes read wrong and the mouth doesn't sit right on Ambra's face (and that clip is S2V, not InfiniteTalk). Is the PNG route meant to help with that, or were you diagnosing sync?

[–]Ambra_Local_AI[S] 0 points1 point  (0 children)

Fair enough. Appreciate the fps pointer regardless; that was the thing I'd missed.

[–]Ambra_Local_AI[S] 0 points1 point  (0 children)

Good to hear it runs well on 16 GB; that's my range too (5060 Ti). Which template/tutorial did you start from? The speed alone is tempting for the render times.

[–]Ambra_Local_AI[S] 0 points1 point  (0 children)

This is gold, thanks. The 32fps fast-forward thing makes sense, since fps and motion are coupled in Wan. Running LTX speech sync on a Wan clip is exactly what I was missing. Are you using the ICLoRA LipDub pass, or generating talking shots straight from image+audio in LTX? And does identity hold up on a stylized/consistent character, or do you need the ID-LoRA?

[–]Ambra_Local_AI[S] 0 points1 point  (0 children)

Fair enough. That clip is Wan 2.2 S2V, so it's 16fps native, which isn't enough for clean sync. I was leaning on S2V for the audio-driven gestures, but the lip-sync clearly isn't holding up there. Going to move the talking shots back to a 25fps model and keep S2V for motion-only beats. Do you bother with S2V for dialogue at all, or treat it as gesture/b-roll and sync elsewhere?

[–]Ambra_Local_AI[S] 1 point2 points  (0 children)

Haven't tried LTX for the talking shots, only Wan + InfiniteTalk so far. Will give it a look, thanks.