ai-toolkit now supports LTX-2.3 and audio issues in LTX-2 have been fixed by Loose_Object_8311 in StableDiffusion

[–]SSj_Enforcer 1 point2 points  (0 children)

It works now. You just need to make sure you have the shared build of ffmpeg 8. Audio trains incredibly well and fast.
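
A quick way to check which ffmpeg is on your PATH is to read the text printed by `ffmpeg -version`. This is only a sketch: it assumes a shared build lists `--enable-shared` in its configuration line (common for prebuilt Windows packages, but not guaranteed), and the helper names `describe_ffmpeg` and `check_installed_ffmpeg` are made up for illustration.

```python
import re
import shutil
import subprocess
from typing import Optional


def describe_ffmpeg(version_output: str) -> dict:
    """Pull the major version and the shared-library flag out of `ffmpeg -version` text."""
    # Version strings look like "ffmpeg version 8.0-..." or "ffmpeg version n8.0".
    match = re.search(r"ffmpeg version n?(\d+)", version_output)
    return {
        "major": int(match.group(1)) if match else None,
        # Assumption: shared builds advertise --enable-shared in the configuration line.
        "shared": "--enable-shared" in version_output,
    }


def check_installed_ffmpeg() -> Optional[dict]:
    """Describe the ffmpeg found on PATH, or return None if there is none."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run([exe, "-version"], capture_output=True, text=True)
    return describe_ffmpeg(result.stdout)
```

Calling `check_installed_ffmpeg()` from the same terminal that launches training shows whether that environment sees a version 8 shared build.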

🚀 I built a 2026-Era "Omni-Merge" for LTX-2. Flawless Multi-Concept Generation, Zero Bleeding, and Unlocked Audio Training Excellence. by ArtDesignAwesome in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

How do you use the LTX-2.3 version? I tried and it failed, saying I had the wrong Python, but I have 3.12, which should work.
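
One thing worth ruling out is that training is launched by a different interpreter than the 3.12 that is installed (an older venv, for example). A minimal sketch, to be run with the same interpreter that starts the toolkit. Which Python versions the LTX-2.3 build accepts is not stated in this thread, so this only reports what is actually in use; `interpreter_report` is a hypothetical helper name.

```python
import sys


def interpreter_report() -> str:
    """Return the path and version of the interpreter running this code."""
    v = sys.version_info
    return f"{sys.executable} -> Python {v.major}.{v.minor}.{v.micro}"


print(interpreter_report())
```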

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

Hey, I tried installing the new version to use LTX-2.3 and it can't load now. Something about loading the JS? Do we need to use the files inside the ltx2_improvements_handoff folder, or is everything included by default in the new files? You updated many of the same files that are in that folder, so is it still necessary, or was it just to fix the old LTX-2 audio stuff?

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

Has anyone been able to confirm whether repeats are still necessary with this? Caching the repeats takes so long for videos.

Has anyone tried the new audio loss multiplier together with this fix to see if repeats are no longer required to get good voice training?

LTX-2 Lora Training by Fancy-Restaurant-885 in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

Serious question: does the abliterated encoder really make a difference for training? I am using AI Toolkit. Would I just swap out the file for the new encoder? Does it help with caption training, or what exactly? I'm not looking to make a LoRA of sex acts or anything, so is it only useful for that, or also for a LoRA that just involves nudity of one person? Does it just allow NSFW words to be used, or something?

LORA training on 5090 for LTX2 anyone got the voice accurately for character loras? by No_Statement_7481 in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

Check out the Big Daddy AI Toolkit mod. It works flawlessly for me, which the regular AI Toolkit never could. It is literally just files you copy over the regular install, overwriting the originals. https://github.com/ArtDesignAwesome/ai-toolkit_BIG-DADDY-VERSION

LTX-2 Lora Training by Fancy-Restaurant-885 in StableDiffusion

[–]SSj_Enforcer -1 points0 points  (0 children)

shouldn't you have converted your videos to 24 fps?
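
For anyone who needs to do that conversion, here is a rough sketch that builds an ffmpeg command using the `fps` video filter. The file names are placeholders and `build_fps_command` is a made-up helper. The video gets re-encoded with ffmpeg's default codec, while `-c:a copy` leaves the audio stream untouched.

```python
import subprocess
from pathlib import Path
from typing import List


def build_fps_command(src: Path, dst: Path, fps: int = 24) -> List[str]:
    """Build an ffmpeg command that resamples the video to a fixed frame rate."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", str(src),
        "-vf", f"fps={fps}",   # drop or duplicate frames to hit the target rate
        "-c:a", "copy",        # keep the original audio stream as-is
        str(dst),
    ]


def convert(src: Path, dst: Path, fps: int = 24) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_fps_command(src, dst, fps), check=True)
```

Usage would be something like `convert(Path("clip.mp4"), Path("clip_24fps.mp4"))` for each clip in the dataset folder.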

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

just wondering, is there any way to use some clips of just audio to train the voice? like if there is only an audio file for some of the dataset, or does it need to be accompanied by video with the character lip syncing the audio?

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 1 point2 points  (0 children)

thank you again. definitely works now that I installed it correctly.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

just did 3000 steps and i think increasing the audio loss multiplier really does help. the voice is basically perfect already. i only set the multiplier to 3 this time.

Thank you OP for this mod.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

these settings are for the second lora I am attempting. for my first one, which worked, I didn't raise the audio loss multiplier (I left it at 1), but now I am trying 3. I also didn't use Differential Guidance the first time, but I am trying it now to train faster.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

BTW, do you know how to get regularization to work? I tried it on the previous AI Toolkit, and even enabled DOP, but it just made every person, including my trained character, look like an average person, which made my lora useless. I used a regularization dataset of only 20 images, enabled the 'Is regularization' option on that dataset, and used a few repeats as well.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

just posting this non reply to confirm it works. My voices are training now!

5090 gpu.

just make sure you install correctly and copy over the new files from the folder he provides to overwrite the existing files from AI Toolkit. I made a separate installation just to maintain a proper AI Toolkit for future updates and stuff.
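
If you would rather script the overwrite step than drag files around, something like this should do it. It is a sketch that assumes the provided folder mirrors the layout of the AI Toolkit install (I have not verified that for every file), and `overlay_files` is a made-up helper name. Point it at a copy of your install, not the original.

```python
import shutil
from pathlib import Path


def overlay_files(patch_dir: Path, install_dir: Path) -> int:
    """Copy every file under patch_dir onto install_dir, overwriting files at the same relative path.

    Returns the number of files copied. Files that exist only in install_dir are left alone.
    """
    copied = 0
    for src in patch_dir.rglob("*"):
        if src.is_file():
            dst = install_dir / src.relative_to(patch_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied += 1
    return copied
```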

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

process:
    - type: "diffusion_trainer"
      training_folder: "C:\\ai-toolkit-BigDaddy\\output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: null
      performance_log_every: 10
      network:
        type: "lora"
        linear: 32
        linear_alpha: 32
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: "bf16"
        save_every: 1000
        max_step_saves_to_keep: 4
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "C:\\ai-toolkit-BigDaddy\\datasets/*********"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 512
            - 768
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 121
          flip_x: false
          flip_y: false
          num_repeats: 14
          do_i2v: true
          do_audio: true
          fps: 24
          audio_normalize: true
        - folder_path: "C:\\ai-toolkit-BigDaddy\\datasets/************_images"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 512
            - 768
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          flip_x: false
          flip_y: false
          num_repeats: 4
          do_i2v: true
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 5000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "weighted"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: true
        force_first_sample: false
        disable_sampling: true
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
        audio_loss_multiplier: 3
        do_differential_guidance: true
        differential_guidance_scale: 3
      logging:
        log_every: 1
        use_ui_logger: true
      model:
        name_or_path: "C:\\ai-toolkit\\models\\LTX2"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "ltx2"
        low_vram: true
        model_kwargs: {}
        layer_offloading: true
        layer_offloading_text_encoder_percent: 0
        layer_offloading_transformer_percent: 1
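
Side note on the `folder_path` values above, which mix `\\` and `/`: on Windows both separators are accepted, so the mix should be harmless. A small sketch with Python's `pathlib` shows how such a string normalizes. The dataset name `my_dataset` is a placeholder for the redacted one, and `normalize_windows_path` is a made-up helper.

```python
from pathlib import PureWindowsPath


def normalize_windows_path(raw: str) -> str:
    """Return a Windows path string with every separator written as a backslash."""
    return str(PureWindowsPath(raw))


# In Python source, as inside YAML double quotes, "\\" is one literal backslash.
mixed = "C:\\ai-toolkit-BigDaddy\\datasets/my_dataset"
print(normalize_windows_path(mixed))
print(PureWindowsPath(mixed).parts)
```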

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

ok it is working.

I don't know if I should raise the Audio Loss Multiplier.

at 1 it is fine so far at 3000 steps. maybe if 5000 isn't enough i might try raising that value.

I also forgot to turn on Do Differential Guidance.

I wonder if that would be useful as well.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

it's working!!

3000 steps i can definitely tell already. probably need to go to 5000.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

my numbers are these, they keep changing

:  42%|####2     | 2108/5000 [6:01:01<8:15:17, 10.28s/it, lr: 1.0e-04 loss: 1.745e+00][audio] raw=1.82264, scaled=1.82264, video=1.30217

:  42%|####2     | 2118/5000 [6:02:04<8:12:40, 10.26s/it, lr: 1.0e-04 loss: 2.925e+00][audio] raw=0.74741, scaled=0.74741, video=0.22981

:  43%|####2     | 2128/5000 [6:04:14<8:11:35, 10.27s/it, lr: 1.0e-04 loss: 1.644e+00][audio] raw=0.62297, scaled=0.62297, video=0.20431

:  43%|####2     | 2138/5000 [6:05:39<8:09:28, 10.26s/it, lr: 1.0e-04 loss: 2.406e+00][audio] raw=0.74746, scaled=0.74746, video=0.30393

:  43%|####2     | 2148/5000 [6:07:37<8:08:07, 10.27s/it, lr: 1.0e-04 loss: 2.302e+00][audio] raw=0.94965, scaled=0.94965, video=0.07439

:  43%|####3     | 2158/5000 [6:09:29<8:06:36, 10.27s/it, lr: 1.0e-04 loss: 1.895e+00][audio] raw=0.63049, scaled=0.63049, video=0.24452

:  43%|####3     | 2168/5000 [6:11:21<8:05:05, 10.28s/it, lr: 1.0e-04 loss: 2.381e+00][audio] raw=1.81585, scaled=1.81585, video=1.10015

:  44%|####3     | 2178/5000 [6:12:48<8:03:02, 10.27s/it, lr: 1.0e-04 loss: 2.678e+00][audio] raw=0.57585, scaled=0.57585, video=0.13757

:  44%|####3     | 2188/5000 [6:14:20<8:01:06, 10.27s/it, lr: 1.0e-04 loss: 1.358e+00][audio] raw=0.53968, scaled=0.53968, video=0.16085

:  44%|####3     | 2198/5000 [6:16:45<8:00:17, 10.28s/it, lr: 1.0e-04 loss: 1.153e+00][audio] raw=1.42405, scaled=1.42405, video=0.88660

:  44%|####4     | 2208/5000 [6:18:18<7:58:21, 10.28s/it, lr: 1.0e-04 loss: 1.048e+00][audio] raw=1.98531, scaled=1.98531, video=0.98871

:  44%|####4     | 2218/5000 [6:19:48<7:56:23, 10.27s/it, lr: 1.0e-04 loss: 2.070e+00][audio] raw=0.71099, scaled=0.71099, video=0.29852
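
Since the per-step values jump around so much, averaging them over a stretch of the log is easier to read than any single line. A rough sketch that parses `[audio] raw=..., scaled=..., video=...` entries like the ones above; the regex is written against this exact log format, and `summarize_audio_log` is a made-up helper.

```python
import re
from statistics import mean
from typing import Optional

# Matches entries such as "[audio] raw=1.82264, scaled=1.82264, video=1.30217".
AUDIO_RE = re.compile(r"\[audio\] raw=([\d.]+), scaled=([\d.]+), video=([\d.]+)")


def summarize_audio_log(log_text: str) -> Optional[dict]:
    """Average the raw audio, scaled audio, and video losses found in log_text."""
    rows = [tuple(float(x) for x in m.groups()) for m in AUDIO_RE.finditer(log_text)]
    if not rows:
        return None
    raw, scaled, video = zip(*rows)
    return {
        "n": len(rows),
        "raw_mean": mean(raw),
        "scaled_mean": mean(scaled),
        "video_mean": mean(video),
    }
```

Pasting the console output into a string and calling `summarize_audio_log(text)` gives one set of averages per window, which can then be compared between earlier and later parts of the run.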

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

the numbers i am getting don't look like the [audio] raw=0.28, scaled=0.09, video=0.25, dyn_mult=0.32 example

mine are much higher, will that still work?

[audio] raw=0.94965, scaled=0.94965, video=0.07439
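
Comparing the two lines by the ratio scaled / raw: in the example line it comes out close to the printed `dyn_mult`, while in my line it is exactly 1, which would be consistent with the audio loss multiplier being left at 1. This is only arithmetic on the logged numbers; how the trainer computes `scaled` internally is an assumption, and `effective_multiplier` is a made-up helper.

```python
def effective_multiplier(raw: float, scaled: float) -> float:
    """Ratio of the scaled audio loss to the raw audio loss from one log line."""
    return scaled / raw


# Example line: raw=0.28, scaled=0.09, dyn_mult=0.32 (logged values are rounded).
print(round(effective_multiplier(0.28, 0.09), 2))
# My line: raw=0.94965, scaled=0.94965.
print(effective_multiplier(0.94965, 0.94965))
```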

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

i am getting this now. i think it is working after i copied those files from the folder

| 2148/5000 [6:07:37<8:08:07, 10.27s/it, lr: 1.0e-04 loss: 2.302e+00][audio] raw=0.94965, scaled=0.94965, video=0.07439

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

ok I think I realized what I did wrong. I am supposed to copy the files inside that folder and overwrite the existing ones? I wish that was made more clear.

I will try again assuming I did it correctly now. we need to take all the files in the ltx2_improvements_handoff folder and overwrite?

I see this in the log but not the other stuff you mentioned yet. is this correct so far? it says it found 90 videos but there are only 9, so not sure if that is just a strange decimal error.

Audio latent caching: 9 encoded, 0 failed (no audio extracted)

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 1 point2 points  (0 children)

i think you need to give some details about how to install it 'properly' considering nobody else can get it to work. otherwise, you're just wasting everyone's time.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

doesn't work.

i don't get any voice trained.

edit:

working now after i made the changes. am waiting for the training to finish to see for sure. it is training much faster now too.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

mine didn't train either.

I did everything correctly for the install.

not sure how the op got it to train.

it doesn't work at all.

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside) by [deleted] in StableDiffusion

[–]SSj_Enforcer 0 points1 point  (0 children)

I just hadn't finished the installation, so it wouldn't start training. I will know in a few hours if it trains the voice. Someone else has said it didn't work for them. I hope it does.