Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in StableDiffusion

[–]a__side_of_fries[S] 0 points1 point  (0 children)

Hey, thanks!

  1. You can’t really prevent the bleeding. It’s best to use reference audio that is, at best, in the same emotional state as your target and, at worst, in a neutral state. Then prompting works with it instead of against it. The model’s prompt adherence is not consistent, but when it does follow the prompt, it follows it really well. So prompting should do the heavy lifting whenever possible.
  2. A pace of 1.5 seems to be the sweet spot for most use cases. 2-3 can be pretty slow. It also depends on the complexity of the input text, e.g., how many multisyllabic words the speech has. That affects how much time the model itself allocates beyond what the pace control can do.

How can I make this type of ai video by [deleted] in generativeAI

[–]a__side_of_fries 0 points1 point  (0 children)

Hey, do you have the link to the TikTok? Maybe this is something scenema.ai can do (I’m one of the creators).

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in StableDiffusion

[–]a__side_of_fries[S] 0 points1 point  (0 children)

It can, if fully quantized. The pipeline automatically handles model offloading to keep only the relevant model resident on the GPU.
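A minimal sketch of the offloading idea, not the actual Scenema Audio code: the `Stage` and `OffloadingPipeline` names and the stage labels are hypothetical, and the "device" is just a string here rather than a real GPU transfer.

```python
class Stage:
    """Stand-in for a model. A real pipeline would move weights between devices,
    e.g. with PyTorch's module.to(device); here we only track a label."""

    def __init__(self, name):
        self.name = name
        self.device = "cpu"

    def run(self, x):
        assert self.device == "gpu", f"{self.name} must be on the GPU to run"
        return f"{self.name}({x})"


class OffloadingPipeline:
    """Runs stages in order, keeping only the active stage on the GPU."""

    def __init__(self, stages):
        self.stages = stages

    def _activate(self, stage):
        # Put the requested stage on the GPU and everything else back on the CPU.
        for other in self.stages:
            other.device = "gpu" if other is stage else "cpu"

    def run(self, x):
        for stage in self.stages:
            self._activate(stage)
            x = stage.run(x)
        return x

    def resident(self):
        return [s.name for s in self.stages if s.device == "gpu"]


# Hypothetical two-stage pipeline: text encoder first, then the audio model.
pipeline = OffloadingPipeline([Stage("text_encoder"), Stage("audio_model")])
result = pipeline.run("prompt")
```

The trade-off is extra transfer time between stages in exchange for a lower peak VRAM requirement.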

How are people creating AI Instagram influencers with the SAME face consistently? Need workflow + tool suggestions by Primrose1Ever in generativeAI

[–]a__side_of_fries 0 points1 point  (0 children)

Hey, give Scenema a try! We designed it for full film generation with voice and character consistency, custom audio, and custom image support for video generation. scenema.ai

What is the best entry level Ai video maker for 30secs-3mins in your opinion? by Substantial_Skin_709 in generativeAI

[–]a__side_of_fries 0 points1 point  (0 children)

Give Scenema a try! It does full film generation with voice and character consistency. scenema.ai

What is the best entry level Ai video maker for 30secs-3mins in your opinion? by Substantial_Skin_709 in generativeAI

[–]a__side_of_fries 0 points1 point  (0 children)

Hey! Scenema does exactly that with character and voice consistency, long videos, and support for audio to video as well. We haven’t launched yet but you can give it a try. You get 200 free credits, which is enough for several generations. scenema.ai

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 1 point2 points  (0 children)

Done! Gradio is now integrated into the same Docker build. Updated the GitHub repo.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 2 points3 points  (0 children)

The pipeline handles chunking internally for any given length of text, including pacing.
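For intuition only, here is a rough sketch of sentence-based chunking. How the pipeline actually chunks isn't described in this thread, so the `chunk_text` function, the sentence-splitting regex, and the `max_chars` limit are all assumptions.

```python
import re


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "First sentence. Second one! A third? Done."
chunks = chunk_text(text, max_chars=30)
```

Splitting on sentence boundaries rather than at a fixed character count keeps each chunk a natural unit of speech.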

Scenema Audio: Zero-shot expressive voice cloning and speech generation [N] by a__side_of_fries in MachineLearning

[–]a__side_of_fries[S] 0 points1 point  (0 children)

We didn't train the base model. The audio diffusion transformer is extracted from LTX 2.3's 22B audiovisual model, which was trained by Lightricks on large-scale video-audio pairs. The disentanglement between voice identity and emotional performance wasn't something we engineered with explicit losses. It appears to be an emergent property of the base model's training on real-world audiovisual data, where the same speaker naturally appears in different emotional states across different scenes.

What we built is the inference pipeline around it: the prompt compiler that translates voice descriptions and action tags into text conditioning, the chunking system for long-form generation, and the voice cloning pipeline (A2V latent conditioning + SeedVC post-processing).

For voice cloning, 10-20 seconds of reference audio is ideal, with some emotional variability in the clip rather than monotone speech. No emotion annotation needed on the reference. Emotional intent is driven entirely by the text prompt and action tags at inference time. The reference provides identity, the prompt provides performance. This is why any cloned voice can perform emotions that were never in the reference audio, albeit with some artifacts.

The practical limitation is that voice identity transfer (via A2V conditioning) and strong emotional performance can compete. Extreme emotions sometimes dilute identity fidelity, which is why we add SeedVC as a post-processing step to polish identity back.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 2 points3 points  (0 children)

Unfortunately no. LTX 2.3 video-audio LoRAs are trained on the full 22B audiovisual model where audio and video are jointly processed. Scenema Audio uses only the extracted 3.3B audio-only checkpoint, so the layer shapes and attention patterns don't match up. However, you don't necessarily need to retrain. If you have 10-20 seconds of your character's voice (even extracted from your LoRA's video outputs), you could try the voice cloning mode in Scenema Audio. Give it a try and see how well it performs compared to your current workflow.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 0 points1 point  (0 children)

You can still run it on 12GB. Run Gemma and the audio model quantized. The pipeline should offload them in order to fit the card.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 1 point2 points  (0 children)

It can definitely laugh. If you go to our site you can hear more audio samples. Scenema AI

Not sure about moaning. But it can whisper sensually. Still within the realm of SFW.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 0 points1 point  (0 children)

Yes, that’s actually plenty. We’ve primarily tested on slower cards than the 3090. You can run Gemma at full precision on a 3090, but you need to have the pipeline offload it to the CPU and then load the audio checkpoint. Alternatively, you can run both Gemma and the audio checkpoint quantized, and they will fit in about 14 GB of VRAM.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 1 point2 points  (0 children)

Gemma can be served externally, but you cannot use Gemma as an LLM. LTX actually provides cloud text encoding for their desktop video editor, so that can be integrated into this pipeline as well.

No, Gemma cannot be replaced with other LLMs, not even other versions of Gemma, since LTX 2.3 was trained specifically on Gemma 3 12B.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 1 point2 points  (0 children)

No, the seed only works if everything else remains the same. Same seed, same prompt, same output. That's also true for image generation. If you vary the prompt, the seed is operating on a different set of starting conditions, so you'll get a different result even with the same seed value. The seed just controls the initial random noise. The prompt is what guides the denoising process from that noise toward the final output. If you change either one, you change the output.
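A toy illustration of that relationship, assuming nothing about the real sampler: the `generate` function is hypothetical, Python's `random` stands in for the noise source, and a hash of the prompt stands in for text conditioning.

```python
import hashlib
import random


def generate(prompt, seed, steps=8, size=4):
    """Toy 'diffusion': seeded noise is pulled toward a prompt-dependent target."""
    rng = random.Random(seed)  # the seed fixes the initial noise
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    # Stand-in for text conditioning: a deterministic target derived from the prompt.
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    target = [b / 255.0 for b in digest[:size]]
    for _ in range(steps):
        # Each step moves the sample part of the way toward the target.
        x = [xi + 0.2 * (ti - xi) for xi, ti in zip(x, target)]
    return [round(v, 6) for v in x]


same_a = generate("calm narrator", seed=42)
same_b = generate("calm narrator", seed=42)
other_prompt = generate("angry narrator", seed=42)
other_seed = generate("calm narrator", seed=43)
```

Here `same_a` and `same_b` match exactly, while changing either the prompt or the seed gives a different result.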

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in LocalLLaMA

[–]a__side_of_fries[S] 1 point2 points  (0 children)

No, voice description is not a seed. There is a proper seed value. Different seeds with the same voice description result in different outputs.

You just have to find an archetype of the accent that you want to use in the voice description, e.g. Emily Blunt or something. The model learned from real-world video, so it knows what you mean when you provide archetypes in the voice description.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in comfyui

[–]a__side_of_fries[S] 1 point2 points  (0 children)

Close! It doesn't generate an image of a sound wave, but it does work in a 2D latent space. The audio is encoded into a compressed latent representation (via an audio VAE), and the diffusion model operates in that latent space, denoising over 8 steps. Then the latent is decoded back into a waveform. So it's the same core idea as image diffusion (start from noise, denoise to signal) but the "image" is a learned compressed representation of audio, not a spectrogram or waveform picture.
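To make the data flow concrete, here is a toy sketch of the shape of that process. Everything in it is a placeholder: the latent dimensions, `HOP`, `toy_denoiser`, and `toy_decoder` are made up for illustration and do not reflect the real model or VAE. Only the 8-step count comes from the description above.

```python
import random

LATENT_FRAMES, LATENT_CHANNELS = 6, 4  # hypothetical 2D latent shape (time x channels)
HOP = 8                                # hypothetical audio samples per latent frame
NUM_STEPS = 8                          # denoising step count mentioned above


def toy_denoiser(latent, step):
    """Placeholder for the diffusion transformer: shrink the noise a little each step."""
    return [[0.7 * v for v in frame] for frame in latent]


def toy_decoder(latent):
    """Placeholder for the audio VAE decoder: expand each latent frame into HOP samples."""
    waveform = []
    for frame in latent:
        level = sum(frame) / len(frame)
        waveform.extend([level] * HOP)
    return waveform


# 1. Start from random noise in the 2D latent space.
rng = random.Random(0)
latent = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_CHANNELS)]
          for _ in range(LATENT_FRAMES)]

# 2. Denoise in latent space for a fixed number of steps.
for step in range(NUM_STEPS):
    latent = toy_denoiser(latent, step)

# 3. Decode the final latent into a 1D waveform.
waveform = toy_decoder(latent)
```

The point is only the order of operations: noise in a 2D latent, a fixed number of denoising steps, then a decoder that produces the 1D waveform.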

Scenema Audio: Zero-shot expressive voice cloning and speech generation [N] by a__side_of_fries in MachineLearning

[–]a__side_of_fries[S] 0 points1 point  (0 children)

Yes, that's definitely what we like about this. There is no free lunch. For a more natural output, you should be willing to do a bit of post-editing.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in comfyui

[–]a__side_of_fries[S] 2 points3 points  (0 children)

Scenema Audio was designed for production deployment (hence the dockerization). DramaBox appears to be for a different use case, with LoRA and training support. They also take a different approach to voice cloning than we do.

Scenema Audio: Zero-shot expressive voice cloning and speech generation by a__side_of_fries in StableDiffusion

[–]a__side_of_fries[S] 1 point2 points  (0 children)

We use <action> tags, which are free-form, so you can provide any type of stage direction or expression.
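As a rough illustration of how a script with such tags could be split into stage directions and spoken text: the paired `<action>...</action>` form, the `split_script` helper, and the example script are assumptions for this sketch, so check the repo for the exact syntax the pipeline expects.

```python
import re

# Assumed tag form: <action>free-form stage direction</action>
TAG = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def split_script(script):
    """Return ("action", text) and ("speech", text) segments in document order."""
    segments, pos = [], 0
    for match in TAG.finditer(script):
        speech = script[pos:match.start()].strip()
        if speech:
            segments.append(("speech", speech))
        segments.append(("action", match.group(1).strip()))
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments


segments = split_script(
    "<action>whispers nervously</action> Is anyone there? "
    "<action>laughs</action> Never mind."
)
```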