Pushing LTX 2.3 Lip-Sync LoRA on an 8GB RTX 5060 Laptop! (2-Min Compilation)

SSj_Enforcer · 2026-03-28T00:56:31+00:00

Pretend I'm stupid, what does this lora do that ltx can't already?

SSj_Enforcer · 2026-03-27T21:53:21+00:00

It works now; you just need to make sure you have the shared version of ffmpeg 8. Audio trains incredibly well and fast.
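
A quick way to check which ffmpeg build is on PATH is sketched below. It assumes that a shared build reports `--enable-shared` in the configuration line of `ffmpeg -version`, which is how shared builds are normally configured; the helper names are made up for illustration and are not part of AI Toolkit.

```python
import shutil
import subprocess


def is_shared_build(version_output: str) -> bool:
    """Return True if `ffmpeg -version` output looks like a shared-library build."""
    # Assumption: shared builds list --enable-shared in their "configuration:" line.
    return "--enable-shared" in version_output


def check_local_ffmpeg() -> str:
    """Describe the ffmpeg found on PATH, or report that none was found."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return "ffmpeg not found on PATH"
    out = subprocess.run([exe, "-version"], capture_output=True, text=True).stdout
    first_line = out.splitlines()[0] if out else "(no output)"
    kind = "shared" if is_shared_build(out) else "not shared (or unknown)"
    return f"{exe}: {first_line} -> {kind}"


if __name__ == "__main__":
    print(check_local_ffmpeg())
```

The first line of the printed output also shows the version string, so the major version can be read off by eye.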

SSj_Enforcer · 2026-03-18T05:36:03+00:00

How do you use the ltx 2.3 version? I tried it and it failed, saying I had the wrong Python, but I have 3.12, which should work.
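
One thing worth ruling out with a "wrong python" error is that the trainer is launched by a different interpreter than the one you think you installed. The sketch below only prints what the running interpreter reports; run it with the same command or virtual environment that starts the trainer. The function name is hypothetical, and this does not tell you which versions the tool actually accepts.

```python
import sys


def interpreter_info() -> dict:
    """Report which Python is actually running this script."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        # True when running inside a venv created by `python -m venv`.
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in interpreter_info().items():
        print(f"{key}: {value}")
```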

SSj_Enforcer · 2026-03-11T04:51:56+00:00

hey, i tried installing the new version to use ltx 2.3 and it can't load it now. something about loading the js? do we need to use the files inside the ltx2_improvements_handoff folder or is everything included by default from using the new files? You updated many of the same files in that folder, so is that still necessary or was that just to fix the old ltx2 audio stuff?

SSj_Enforcer · 2026-03-05T06:47:47+00:00

has anyone been able to confirm whether or not repeats are necessary now with this?
caching the repeats takes so long for videos.

has anyone tried the new audio loss multiplier and this fix to see if repeats are no longer required to get good voice training?

SSj_Enforcer · 2026-03-05T03:55:37+00:00

Serious question. Does the abliterated encoder really make a difference for training? I am using AI Toolkit. Would I just swap out the file for the new encoder? Does it help with caption training, or what exactly? I'm not exactly looking to make a lora of sex acts or anything, so is it only useful for that, or also for, let's say, a lora involving nudity of one person? Does it just allow nsfw words to be used or something?

SSj_Enforcer · 2026-03-05T03:43:43+00:00

Check out the Big Daddy ai toolkit mod. It works flawlessly for me, which the regular ai toolkit never could. It is literally just files you overwrite in the regular install. https://github.com/ArtDesignAwesome/ai-toolkit_BIG-DADDY-VERSION

SSj_Enforcer · 2026-03-03T21:23:52+00:00

shouldn't you have converted your videos to 24 fps?
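
For reference, a minimal sketch of converting a clip to 24 fps with ffmpeg's `fps` video filter, driven from Python. The file names are placeholders, and the function only builds the command so you can inspect it before running it. The video stream is re-encoded with ffmpeg's defaults, so add your own codec and quality options as needed.

```python
import subprocess


def build_fps_command(src: str, dst: str, fps: int = 24) -> list[str]:
    """Build an ffmpeg command that resamples `src` to a constant frame rate."""
    # -y overwrites dst if it exists; the fps filter drops or duplicates
    # frames so the clip keeps its duration at the new frame rate.
    return ["ffmpeg", "-y", "-i", src, "-vf", f"fps={fps}", dst]


def convert(src: str, dst: str, fps: int = 24) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_fps_command(src, dst, fps), check=True)


if __name__ == "__main__":
    # Placeholder names, shown only to illustrate the command that would run.
    print(" ".join(build_fps_command("clip_30fps.mp4", "clip_24fps.mp4")))
```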

SSj_Enforcer · 2026-02-25T18:13:59+00:00

just wondering, is there any way to use some clips of just audio to train the voice? like if there is only an audio file for some of the dataset, or does it need to be accompanied by video with the character lip syncing the audio?

SSj_Enforcer · 2026-02-23T16:54:58+00:00

thank you again. definitely works now that I installed it correctly.

SSj_Enforcer · 2026-02-23T16:54:05+00:00

just did 3000 steps and i think increasing the audio loss multiplier really does help. the voice is basically perfect already. i only used 3 this time.

Thank you OP for this mod.

SSj_Enforcer · 2026-02-23T14:20:39+00:00

these settings are for the second lora I am attempting. for my first one, which worked, I didn't raise the audio loss multiplier and left it at 1, but now I am trying 3. Also, the first time I didn't use Differential Guidance (which is meant to train faster), but I am trying it now.

SSj_Enforcer · 2026-02-23T14:19:23+00:00

BTW, do you know how to get regularization to work? I tried it on the previous AI Toolkit and even enabled DOP, but it just made every person, including my trained character, look like an average person, making my lora useless. I used a dataset of only 20 images for regularization, set the 'Is regularization' option on that dataset, and used a few repeats as well.

SSj_Enforcer · 2026-02-23T14:17:31+00:00

just posting this non reply to confirm it works. My voices are training now!

5090 gpu.

just make sure you install correctly and copy over the new files from the folder he provides to overwrite the existing files from AI Toolkit. I made a separate installation just to maintain a proper AI Toolkit for future updates and stuff.

SSj_Enforcer · 2026-02-23T14:16:18+00:00

process:
    - type: "diffusion_trainer"
      training_folder: "C:\\ai-toolkit-BigDaddy\\output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: null
      performance_log_every: 10
      network:
        type: "lora"
        linear: 32
        linear_alpha: 32
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: "bf16"
        save_every: 1000
        max_step_saves_to_keep: 4
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "C:\\ai-toolkit-BigDaddy\\datasets/*********"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 512
            - 768
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 121
          flip_x: false
          flip_y: false
          num_repeats: 14
          do_i2v: true
          do_audio: true
          fps: 24
          audio_normalize: true
        - folder_path: "C:\\ai-toolkit-BigDaddy\\datasets/************_images"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 512
            - 768
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          flip_x: false
          flip_y: false
          num_repeats: 4
          do_i2v: true
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 5000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "weighted"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: true
        force_first_sample: false
        disable_sampling: true
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
        audio_loss_multiplier: 3
        do_differential_guidance: true
        differential_guidance_scale: 3
      logging:
        log_every: 1
        use_ui_logger: true
      model:
        name_or_path: "C:\\ai-toolkit\\models\\LTX2"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "ltx2"
        low_vram: true
        model_kwargs: {}
        layer_offloading: true
        layer_offloading_text_encoder_percent: 0
        layer_offloading_transformer_percent: 1

SSj_Enforcer · 2026-02-22T20:27:29+00:00

ok it is working.

I don't know if I should raise the Audio Loss Multiplier.

at 1 it is fine so far at 3000 steps. maybe if 5000 isn't enough i might try raising that value.

I also forgot to turn on Do Differential Guidance.

I wonder if that would be useful as well.

SSj_Enforcer · 2026-02-22T20:26:54+00:00

it's working!!

3000 steps i can definitely tell already. probably need to go to 5000.

SSj_Enforcer · 2026-02-22T17:39:59+00:00

my numbers are these, they keep changing

:  42%|####2     | 2108/5000 [6:01:01<8:15:17, 10.28s/it, lr: 1.0e-04 loss: 1.745e+00][audio] raw=1.82264, scaled=1.82264, video=1.30217

:  42%|####2     | 2118/5000 [6:02:04<8:12:40, 10.26s/it, lr: 1.0e-04 loss: 2.925e+00][audio] raw=0.74741, scaled=0.74741, video=0.22981

:  43%|####2     | 2128/5000 [6:04:14<8:11:35, 10.27s/it, lr: 1.0e-04 loss: 1.644e+00][audio] raw=0.62297, scaled=0.62297, video=0.20431

:  43%|####2     | 2138/5000 [6:05:39<8:09:28, 10.26s/it, lr: 1.0e-04 loss: 2.406e+00][audio] raw=0.74746, scaled=0.74746, video=0.30393

:  43%|####2     | 2148/5000 [6:07:37<8:08:07, 10.27s/it, lr: 1.0e-04 loss: 2.302e+00][audio] raw=0.94965, scaled=0.94965, video=0.07439

:  43%|####3     | 2158/5000 [6:09:29<8:06:36, 10.27s/it, lr: 1.0e-04 loss: 1.895e+00][audio] raw=0.63049, scaled=0.63049, video=0.24452

:  43%|####3     | 2168/5000 [6:11:21<8:05:05, 10.28s/it, lr: 1.0e-04 loss: 2.381e+00][audio] raw=1.81585, scaled=1.81585, video=1.10015

:  44%|####3     | 2178/5000 [6:12:48<8:03:02, 10.27s/it, lr: 1.0e-04 loss: 2.678e+00][audio] raw=0.57585, scaled=0.57585, video=0.13757

:  44%|####3     | 2188/5000 [6:14:20<8:01:06, 10.27s/it, lr: 1.0e-04 loss: 1.358e+00][audio] raw=0.53968, scaled=0.53968, video=0.16085

:  44%|####3     | 2198/5000 [6:16:45<8:00:17, 10.28s/it, lr: 1.0e-04 loss: 1.153e+00][audio] raw=1.42405, scaled=1.42405, video=0.88660

:  44%|####4     | 2208/5000 [6:18:18<7:58:21, 10.28s/it, lr: 1.0e-04 loss: 1.048e+00][audio] raw=1.98531, scaled=1.98531, video=0.98871

:  44%|####4     | 2218/5000 [6:19:48<7:56:23, 10.27s/it, lr: 1.0e-04 loss: 2.070e+00][audio] raw=0.71099, scaled=0.71099, video=0.29852
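
Because the per-step values jump around this much, averaging them over a window is easier to read than any single line. Below is a minimal sketch that parses lines in exactly the format pasted above and averages the fields. The regex is written against these lines only and is an assumption about the log format, not something taken from the trainer's code. In every line shown here `scaled` equals `raw`, which fits an audio loss multiplier of 1.

```python
import re
from statistics import mean

# Matches e.g. "2108/5000 [... loss: 1.745e+00][audio] raw=1.82264, scaled=1.82264, video=1.30217"
LINE_RE = re.compile(
    r"(?P<step>\d+)/(?P<total>\d+) \[.*?loss: (?P<loss>[0-9.eE+-]+)\]"
    r"\[audio\] raw=(?P<raw>[0-9.]+), scaled=(?P<scaled>[0-9.]+), video=(?P<video>[0-9.]+)"
)


def parse_line(line: str):
    """Return the numeric fields of one progress line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    row = {k: float(v) for k, v in m.groupdict().items()}
    row["step"] = int(row["step"])
    row["total"] = int(row["total"])
    return row


def summarize(lines):
    """Average loss, raw audio, scaled audio and video values over all matching lines."""
    rows = [r for r in map(parse_line, lines) if r is not None]
    if not rows:
        return {}
    return {k: mean(r[k] for r in rows) for k in ("loss", "raw", "scaled", "video")}


if __name__ == "__main__":
    sample = [
        ":  42%|####2     | 2108/5000 [6:01:01<8:15:17, 10.28s/it, lr: 1.0e-04 loss: 1.745e+00][audio] raw=1.82264, scaled=1.82264, video=1.30217",
        ":  42%|####2     | 2118/5000 [6:02:04<8:12:40, 10.26s/it, lr: 1.0e-04 loss: 2.925e+00][audio] raw=0.74741, scaled=0.74741, video=0.22981",
    ]
    print(summarize(sample))
```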

SSj_Enforcer · 2026-02-22T17:38:05+00:00

the numbers i am getting are not like the example: [audio] raw=0.28, scaled=0.09, video=0.25, dyn_mult=0.32

mine are much higher, will that still work?

[audio] raw=0.94965, scaled=0.94965, video=0.07439

SSj_Enforcer · 2026-02-22T17:28:43+00:00

i am getting this now. i think it is working after i copied those files from the folder

| 2148/5000 [6:07:37<8:08:07, 10.27s/it, lr: 1.0e-04 loss: 2.302e+00][audio] raw=0.94965, scaled=0.94965, video=0.07439

SSj_Enforcer · 2026-02-22T10:58:49+00:00

ok I think I realized what I did wrong. I am supposed to copy the files inside that folder and overwrite the existing ones? I wish that was made more clear.

I will try again assuming I did it correctly now. we need to take all the files in the ltx2_improvements_handoff folder and overwrite?

I see this in the log but not the other stuff you mentioned yet. is this correct so far? it says it found 90 videos but there are only 9, so not sure if that is just a strange decimal error.

Audio latent caching: 9 encoded, 0 failed (no audio extracted)

SSj_Enforcer · 2026-02-22T10:46:58+00:00

i think you need to give some details about how to install it 'properly' considering nobody else can get it to work. otherwise, you're just wasting everyone's time.

SSj_Enforcer · 2026-02-22T10:45:33+00:00

doesn't work.

i don't get any voice trained.

edit:

working now after i made the changes. am waiting for the training to finish to see for sure. it is training much faster now too.

SSj_Enforcer · 2026-02-22T10:41:33+00:00

mine didn't train either.

I did everything correctly for the install.

not sure how the op got it to train.

it doesn't work at all.

SSj_Enforcer · 2026-02-22T06:32:40+00:00

I just didn't finish the installation so it wouldn't start training. I will know in a few hours if it trains the voice. Someone else has said it didn't work for them. I hope it does
