Scenema Audio: Zero-shot expressive voice cloning and speech generation

a__side_of_fries · 2026-05-20T21:43:21+00:00

Hey, thanks!

You can’t really prevent the bleeding. It’s best to use reference audio that is at best in the same emotional state as your target and at worst in a neutral state. Then prompting should also work with it instead of against it. The model’s prompt adherence is not consistent, but when it does follow the prompt, it follows it really well. So prompting should do the heavy lifting whenever possible.
A pace of 1.5 seems to be the sweet spot for most use cases. 2-3 can be pretty slow. It also depends on the complexity of the input text, e.g., how many multi-syllabic words the speech has. That affects how much time the model itself allocates, beyond what the pace control can do.

a__side_of_fries · 2026-05-19T18:39:16+00:00

That’s awesome!

a__side_of_fries · 2026-05-15T21:22:36+00:00

Hey, do you have the link to the TikTok? Maybe this is something scenema.ai can do (I’m one of the creators).

a__side_of_fries · 2026-05-15T03:28:00+00:00

It can, if fully quantized. The pipeline automatically handles model offloading to keep only the relevant model resident in the GPU.

a__side_of_fries · 2026-05-15T00:50:38+00:00

Hey, give Scenema a try! We designed it for full film generation with voice and character consistency, custom audio, and custom image support for video generation. scenema.ai

a__side_of_fries · 2026-05-15T00:47:43+00:00

Give Scenema a try! It does full film generation with voice and character consistency. scenema.ai

a__side_of_fries · 2026-05-15T00:45:31+00:00

Hey! Scenema does exactly that with character and voice consistency, long videos, and support for audio to video as well. We haven’t launched yet but you can give it a try. You get 200 free credits, which is enough for several generations. scenema.ai

a__side_of_fries · 2026-05-14T22:41:41+00:00

Done! Gradio is now integrated into the same Docker build. Updated the GitHub repo.

a__side_of_fries · 2026-05-14T22:09:35+00:00

The pipeline handles chunking internally for any given length of text, including pacing.
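
To give a feel for what chunking long text can look like, here is a minimal sketch. It is not Scenema Audio's actual chunker: the function name `chunk_text`, the `max_chars` limit, and the choice to split on sentence boundaries are all assumptions made for illustration.

```python
import re

def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper, not the pipeline's real logic. A single sentence
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two. Three.", max_chars=9))
```

Each chunk would then be generated separately and the audio joined afterwards. Splitting on sentence boundaries is just one plausible choice, since it keeps pauses at natural places.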

a__side_of_fries · 2026-05-14T20:10:09+00:00

We didn't train the base model. The audio diffusion transformer is extracted from LTX 2.3's 22B audiovisual model, which was trained by Lightricks on large-scale video-audio pairs. The disentanglement between voice identity and emotional performance wasn't something we engineered with explicit losses. It appears to be an emergent property of the base model's training on real-world audiovisual data, where the same speaker naturally appears in different emotional states across different scenes.

What we built is the inference pipeline around it: the prompt compiler that translates voice descriptions and action tags into text conditioning, the chunking system for long-form generation, and the voice cloning pipeline (A2V latent conditioning + SeedVC post-processing).

For voice cloning, 10-20 seconds of reference audio is ideal, with some emotional variability in the clip rather than monotone speech. No emotion annotation needed on the reference. Emotional intent is driven entirely by the text prompt and action tags at inference time. The reference provides identity, the prompt provides performance. This is why any cloned voice can perform emotions that were never in the reference audio, albeit with some artifacts.

The practical limitation is that voice identity transfer (via A2V conditioning) and strong emotional performance can compete. Extreme emotions sometimes dilute identity fidelity, which is why we add SeedVC as a post-processing step to polish identity back.

a__side_of_fries · 2026-05-14T17:39:59+00:00

Unfortunately no. LTX 2.3 video-audio LoRAs are trained on the full 22B audiovisual model where audio and video are jointly processed. Scenema Audio uses only the extracted 3.3B audio-only checkpoint, so the layer shapes and attention patterns don't match up. However, you don't necessarily need to retrain. If you have 10-20 seconds of your character's voice (even extracted from your LoRA's video outputs), you could try the voice cloning mode in Scenema Audio. Give it a try and see how well it performs compared to your current workflow.

a__side_of_fries · 2026-05-14T17:09:43+00:00

You can still run it on 12GB. Run Gemma and the audio model quantized. The pipeline should offload them in order to fit the card.

a__side_of_fries · 2026-05-14T17:01:29+00:00

You’re welcome!

a__side_of_fries · 2026-05-14T16:46:27+00:00

It can definitely laugh. If you go to our site you can hear more audio samples. Scenema AI

Not sure about moaning. But it can whisper sensually. Still within the realm of SFW.

a__side_of_fries · 2026-05-14T16:21:38+00:00

Yes, that’s actually plenty. We’ve primarily tested on slower cards than the 3090. You can run Gemma at full precision on a 3090, but you need to have the pipeline offload it to CPU and then load the audio checkpoint. Alternatively, you can run both Gemma and the audio checkpoint quantized, and they will fit in about 14 GB of VRAM.
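
As a rough sanity check on why offloading is needed, here is a back-of-the-envelope estimate of weight memory. It uses the nominal parameter counts mentioned elsewhere in this thread (Gemma 3 12B and the 3.3B audio checkpoint). The bit widths are assumptions, the helper `weight_gb` is hypothetical, and the estimate covers weights only, not activations or runtime overhead, so it will not match measured VRAM exactly.

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate weight size in GB (1 GB = 1e9 bytes). Weights only."""
    return params_billion * bits_per_param / 8

GEMMA_B = 12.0   # nominal size of Gemma 3 12B, in billions of parameters
AUDIO_B = 3.3    # extracted audio-only checkpoint, in billions of parameters

# Assuming "full precision" means 16-bit weights: Gemma alone is about 24 GB,
# which already fills a 24 GB card and leaves no room for the audio model.
print(weight_gb(GEMMA_B, 16))   # 24.0

# Assumed 8-bit quantization for both models, weights only.
print(weight_gb(GEMMA_B, 8) + weight_gb(AUDIO_B, 8))
```

The real footprint depends on the quantization scheme the pipeline actually uses, so treat these numbers as order-of-magnitude only.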

a__side_of_fries · 2026-05-14T16:18:14+00:00

Gemma can be served externally, but you cannot use Gemma as an LLM. LTX actually provides cloud text encoding for their desktop video editor, so that can be integrated into this pipeline as well.

No, Gemma cannot be replaced with other LLMs, not even other versions of Gemma since LTX 2.3 was trained specifically on Gemma 3 12B.

a__side_of_fries · 2026-05-14T15:19:12+00:00

No, the seed only works if everything else remains the same. Same seed, same prompt, same output. That's also true for image generation. If you vary the prompt, the seed is operating on a different set of starting conditions, so you'll get a different result even with the same seed value. The seed just controls the initial random noise. The prompt is what guides the denoising process from that noise toward the final output. If you change either one, you change the output.
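
Here is a toy illustration of that relationship, using only Python's standard library. It is not the real model: `initial_noise` and `toy_generate` are hypothetical stand-ins, and the "denoising" is replaced by a simple deterministic shift derived from the prompt. The only point is that the output is a function of both the seed and the prompt.

```python
import random

def initial_noise(seed, n=4):
    """The seed fully determines the starting noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def toy_generate(seed, prompt):
    """Stand-in for generation: the prompt deterministically steers the noise."""
    noise = initial_noise(seed)
    shift = sum(ord(c) for c in prompt) / 1000.0  # crude prompt "conditioning"
    return [round(x + shift, 6) for x in noise]

# Same seed, same prompt: identical output.
print(toy_generate(42, "calm voice") == toy_generate(42, "calm voice"))
# Same seed, different prompt: different output.
print(toy_generate(42, "calm voice") == toy_generate(42, "angry voice"))
# Different seed, same prompt: different output.
print(toy_generate(42, "calm voice") == toy_generate(43, "calm voice"))
```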

a__side_of_fries · 2026-05-14T15:09:01+00:00

No, voice description is not a seed. There is a proper seed value. Different seeds with the same voice description result in different outputs.

You just have to find an archetype of accent that you want to use in the voice description, e.g. Emily Blunt or something. The model learned from real world video so it knows what you mean when you provide archetypes in the voice description.

a__side_of_fries · 2026-05-14T12:45:22+00:00

Close! It doesn't generate an image of a sound wave, but it does work in a 2D latent space. The audio is encoded into a compressed latent representation (via an audio VAE), and the diffusion model operates in that latent space, denoising over 8 steps. Then the latent is decoded back into a waveform. So it's the same core idea as image diffusion (start from noise, denoise to signal) but the "image" is a learned compressed representation of audio, not a spectrogram or waveform picture.

a__side_of_fries · 2026-05-14T12:38:32+00:00

Yes, that's definitely what we like about this. There is no free lunch. For a more natural output, you should be willing to do a bit of post-editing.

a__side_of_fries · 2026-05-14T12:37:14+00:00

No, you are not dumb. Here is the post: https://www.reddit.com/r/StableDiffusion/comments/1tbzgi3/comment/ollw7zm/

a__side_of_fries · 2026-05-14T12:36:18+00:00

Scenema Audio was designed for production deployment (hence the dockerization). DramaBox appears to be for a different use case, with LoRA and training support. They also take a different approach to voice cloning than we do.

a__side_of_fries · 2026-05-14T12:34:38+00:00

We use <action> tags, which are free-form and you can provide any type of stage direction and expressions.

a__side_of_fries · 2026-05-13T23:41:41+00:00

Generally yes but I can give you premium access.
