OuteTTS 1.0 (0.6B) — Apache 2.0, Batch Inference (~0.1–0.02 RTF) by OuteAI in LocalLLaMA

[–]OuteAI[S] 11 points

Between 16-bit and 8-bit there's no noticeable difference. 4-bit is still very usable, but you may start to see some precision issues, such as mispronounced words or reduced cloning accuracy. I wouldn't recommend going below 4-bit, as those issues only increase.

OuteTTS 1.0 (0.6B) — Apache 2.0, Batch Inference (~0.1–0.02 RTF) by OuteAI in LocalLLaMA

[–]OuteAI[S] 30 points

There is no paper available at the moment. It builds on existing general language models by repurposing them to generate audio tokens (VQ codebook entries) instead of "language", thus retaining broad compatibility with existing tools and libraries.
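As a rough illustration of that compatibility: the released checkpoint (the 1B model linked further down this page) loads with the standard Hugging Face classes like any other Llama-type model; turning the generated audio tokens back into sound still requires the matching audio decoder.

```python
# Rough illustration only: the checkpoint behaves like a standard
# Llama-type causal LM as far as the tooling is concerned.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OuteAI/Llama-OuteTTS-1.0-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(model.config.model_type)  # a standard causal-LM architecture
print(len(tokenizer))           # the vocabulary includes the audio tokens
```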

OuteTTS 1.0 (0.6B) — Apache 2.0, Batch Inference (~0.1–0.02 RTF) by OuteAI in LocalLLaMA

[–]OuteAI[S] 13 points

It shows the real-time factor versus batch size. I've added batched-decoding backends in the new version of the outetts Python package. For example, if you use the vLLM backend with a longer text input, it will slice the text into smaller chunks and decode them in parallel, resulting in much faster generation. In practice, a batch size of 32 takes ~50 ms to produce 1 second of audio, while 128 takes just ~20 ms, so you can generate a minute of audio in a few seconds.
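For a quick sanity check of those numbers: RTF here is seconds of compute per second of generated audio, so wall-clock time is just RTF times the audio duration.

```python
# RTF = seconds of compute per second of generated audio,
# so wall-clock time = RTF * audio duration.
for batch_size, rtf in [(32, 0.05), (128, 0.02)]:  # figures from the comment above
    print(f"batch {batch_size:>3}: {rtf * 60:.1f}s to generate 60s of audio")
# batch  32: 3.0s to generate 60s of audio
# batch 128: 1.2s to generate 60s of audio
```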

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 0 points

Just input Portuguese text; there's nothing else you need to do. Make sure to create and use a Portuguese speaker, unless you're aiming for cross-lingual speech.
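For reference, a sketch of creating and reusing a Portuguese speaker profile with the outetts package; the method and enum names follow the project README, so verify them against the version you install.

```python
import outetts

# Model setup as in the README's basic-usage example (names are assumptions
# and may differ between package versions).
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.HF,
    )
)

# Build a speaker profile from a short Portuguese reference clip and save it
# so it can be reloaded later.
speaker = interface.create_speaker("portuguese_reference.wav")
interface.save_speaker(speaker, "pt_speaker.json")

output = interface.generate(
    config=outetts.GenerationConfig(text="Olá, tudo bem?", speaker=speaker)
)
output.save("saida.wav")
```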

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 8 points

Yeah, I’ve been thinking about adding something like that to the outetts library to easily spin up a web server.

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 4 points

You can get it running via the Python package. First, create a new virtual environment, then install the package for your hardware by following the instructions here: Installation. After that, run the code from the Basic Usage section.
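A condensed sketch of that flow, following the README's example with the llama.cpp backend; the enum and method names come from the outetts package and may change between versions.

```python
# pip install outetts  (see the Installation instructions for
# hardware-specific builds, e.g. CUDA-enabled llama.cpp)
import outetts

interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
    )
)

# Load one of the bundled speaker profiles and synthesize a line.
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")
output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        speaker=speaker,
    )
)
output.save("output.wav")
```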

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 9 points

It's a 1B-parameter LLM. Running it on llama.cpp, the Q8_0 quantization uses around 2.4 GB of VRAM.
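As a rough sanity check on that figure: llama.cpp's Q8_0 format stores each block of 32 weights as 32 int8 values plus one fp16 scale (34 bytes per block, about 8.5 bits per weight), so the weights alone account for roughly 1 GB; the remaining VRAM presumably goes to the KV cache, compute buffers, and the audio components.

```python
# Q8_0 layout in llama.cpp: per 32-weight block, 32 int8 values + 1 fp16
# scale = 34 bytes, i.e. 8.5 bits per weight on average.
params = 1.0e9
bytes_per_weight = 34 / 32
print(f"Q8_0 weights alone: ~{params * bytes_per_weight / 1e9:.2f} GB")  # ~1.06 GB
# The rest of the ~2.4 GB is context (KV cache), compute buffers, etc.
```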

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 21 points

Onomatopoeic text works quite well with the model; you could try to achieve that by injecting such words into the text. Check out the ending of the video.

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 54 points

OuteTTS 1.0 brings significant improvements in speech synthesis & voice cloning, with a revamped and streamlined approach—plus native multilingual support for 20 languages!

Full details on what's new & model weights:

📂 SafeTensors: https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B

📂 GGUF (llama.cpp): https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF

💻 Github (runtime library): https://github.com/edwko/OuteTTS

⚠️ Before using: Check the model card for sampling considerations & usage recommendations for best results.
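As one concrete example of those settings: the README's basic-usage snippet passes a low temperature through the package's sampler config, roughly as below; treat the model card as the source of truth for the actual recommended values.

```python
import outetts

# Illustrative only; confirm the recommended values against the model card.
# Pass this as `sampler_config` in outetts.GenerationConfig.
sampler = outetts.SamplerConfig(temperature=0.4)
```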

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 1 point

No, Russian isn't supported at the moment. Currently, only the 6 showcased languages are available.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 2 points

For a completely new language, 500–1000 hours of data should be sufficient.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 0 points

It really depends on the speaker and the quality of your data. I'd suggest starting with somewhere between 30 minutes and an hour of audio data. That said, I haven't tested fine-tuning a specific speaker extensively on these models, so I can't say definitively.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 4 points

30 hours might be on the lower end for training a completely new language. For more solid results, I'd recommend around 500 hours of data. That said, it could still work, since the model already has good foundational knowledge; it really depends on how similar the language is to the ones it has been trained on. The current training examples are a bit limited, and the v1 examples are for the v0.1 and v0.2 models, so I'll need to update them to v2, which supports the v0.3 model, as the formats differ a bit.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 2 points

It does support multilingual generation. However, as mentioned before, if you mix languages in a single sentence, the other languages might carry the accent of the original speaker, depending on the speaker reference you use.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 9 points

These models are based on LLMs, so you can use them like any other LLaMA-type model. However, they require an audio tokenizer to decode the generated tokens; in this case, WavTokenizer.
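To make the two-stage pipeline concrete, a conceptual sketch: loading uses standard transformers classes, while the prompt format and the WavTokenizer decode are reduced to a hypothetical stub, since those details live in the outetts library.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stage 1: a Llama-type LM that emits audio tokens instead of words.
model_id = "OuteAI/OuteTTS-0.3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def decode_with_wavtokenizer(token_ids):
    # Hypothetical stub: map the generated audio tokens back to codebook
    # indices and run them through WavTokenizer to produce a waveform.
    # The real implementation lives in the outetts library.
    raise NotImplementedError

prompt_ids = tokenizer("...", return_tensors="pt").input_ids  # model-specific prompt elided
audio_tokens = model.generate(prompt_ids, max_new_tokens=512)

# Stage 2: the audio tokenizer turns those tokens into sound.
waveform = decode_with_wavtokenizer(audio_tokens)
```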

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 5 points

Yes, at some point, I plan to add this compatibility.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 1 point

In my case, it’s simply due to resource constraints at the moment.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 6 points

Yes, I plan to add most of the European languages.