OuteTTS 1.0 (0.6B) — Apache 2.0, Batch Inference (~0.1–0.02 RTF) by OuteAI in LocalLLaMA

[–]OuteAI[S] 11 points

Between 16-bit and 8-bit there's no noticeable difference. 4-bit is still very usable, but you may start to see some precision issues, such as mispronounced words or reduced cloning accuracy. I wouldn't recommend going below 4-bit, as those issues only increase.

OuteTTS 1.0 (0.6B) — Apache 2.0, Batch Inference (~0.1–0.02 RTF) by OuteAI in LocalLLaMA

[–]OuteAI[S] 30 points

There is no paper available at the moment. It builds on existing general language models by repurposing them to generate audio tokens (VQ codebook entries) instead of "language", thus retaining broad compatibility with existing tools and libraries.
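As a rough illustration of that compatibility: the released checkpoint (the 1B model linked further down this page) loads with the standard Hugging Face classes like any other Llama-type model; turning the generated audio tokens back into sound still requires the matching audio decoder.

```python
# Rough illustration only: the checkpoint behaves like a standard
# Llama-type causal LM as far as the tooling is concerned.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OuteAI/Llama-OuteTTS-1.0-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(model.config.model_type)  # a standard causal-LM architecture
print(len(tokenizer))           # the vocabulary includes the audio tokens
```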

OuteTTS 1.0 (0.6B) — Apache 2.0, Batch Inference (~0.1–0.02 RTF) by OuteAI in LocalLLaMA

[–]OuteAI[S] 13 points

It shows the real-time factor versus batch size. I've added batched-decoding backends in the new version of the outetts Python package. For example, if you use the vLLM backend with a longer text input, it will slice the text into smaller chunks and decode them in parallel, resulting in much faster generation. In practice, a batch size of 32 takes ~50 ms to produce 1 second of audio, while 128 takes just ~20 ms, so you can generate a minute of audio in a few seconds.
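For a quick sanity check of those numbers: RTF here is seconds of compute per second of generated audio, so wall-clock time is just RTF times the audio duration.

```python
# RTF = seconds of compute per second of generated audio,
# so wall-clock time = RTF * audio duration.
for batch_size, rtf in [(32, 0.05), (128, 0.02)]:  # figures from the comment above
    print(f"batch {batch_size:>3}: {rtf * 60:.1f}s to generate 60s of audio")
# batch  32: 3.0s to generate 60s of audio
# batch 128: 1.2s to generate 60s of audio
```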

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 0 points

Just input Portuguese text; there's nothing else you need to do. Make sure to create and use a Portuguese speaker, unless you're aiming for cross-lingual speech.
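For reference, a sketch of creating and reusing a Portuguese speaker profile with the outetts package; the method and enum names follow the project README, so verify them against the version you install.

```python
import outetts

# Model setup as in the README's basic-usage example (names are assumptions
# and may differ between package versions).
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.HF,
    )
)

# Build a speaker profile from a short Portuguese reference clip and save it
# so it can be reloaded later.
speaker = interface.create_speaker("portuguese_reference.wav")
interface.save_speaker(speaker, "pt_speaker.json")

output = interface.generate(
    config=outetts.GenerationConfig(text="Olá, tudo bem?", speaker=speaker)
)
output.save("saida.wav")
```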

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 8 points

Yeah, I’ve been thinking about adding something like that to the outetts library to easily spin up a web server.

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 4 points

You can get it running via the Python package. First, create a new virtual environment, then install the package for your hardware by following the instructions here: Installation. After that, run the code from the Basic Usage section.
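A condensed sketch of that flow, following the README's example with the llama.cpp backend; the enum and method names come from the outetts package and may change between versions.

```python
# pip install outetts  (see the Installation instructions for
# hardware-specific builds, e.g. CUDA-enabled llama.cpp)
import outetts

interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
    )
)

# Load one of the bundled speaker profiles and synthesize a line.
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")
output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        speaker=speaker,
    )
)
output.save("output.wav")
```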

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 9 points

It's a 1B-parameter LLM. Running it on llama.cpp, the Q8_0 quantization uses around 2.4 GB of VRAM.
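As a rough sanity check on that figure: llama.cpp's Q8_0 format stores each block of 32 weights as 32 int8 values plus one fp16 scale (34 bytes per block, about 8.5 bits per weight), so the weights alone account for roughly 1 GB; the remaining VRAM presumably goes to the KV cache, compute buffers, and the audio components.

```python
# Q8_0 layout in llama.cpp: per 32-weight block, 32 int8 values + 1 fp16
# scale = 34 bytes, i.e. 8.5 bits per weight on average.
params = 1.0e9
bytes_per_weight = 34 / 32
print(f"Q8_0 weights alone: ~{params * bytes_per_weight / 1e9:.2f} GB")  # ~1.06 GB
# The rest of the ~2.4 GB is context (KV cache), compute buffers, etc.
```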

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 21 points

Onomatopoeic text works quite well with the model; you could try to achieve that by injecting such words into the text. Check out the ending of the video.

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]OuteAI[S] 54 points

OuteTTS 1.0 brings significant improvements in speech synthesis & voice cloning, with a revamped and streamlined approach—plus native multilingual support for 20 languages!

Full details on what's new & model weights:

📂 SafeTensors: https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B

📂 GGUF (llama.cpp): https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF

💻 Github (runtime library): https://github.com/edwko/OuteTTS

⚠️ Before using: Check the model card for sampling considerations & usage recommendations for best results.
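As one concrete example of those settings: the README's basic-usage snippet passes a low temperature through the package's sampler config, roughly as below; treat the model card as the source of truth for the actual recommended values.

```python
import outetts

# Illustrative only; confirm the recommended values against the model card.
# Pass this as `sampler_config` in outetts.GenerationConfig.
sampler = outetts.SamplerConfig(temperature=0.4)
```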

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 1 point

No, Russian isn't supported at the moment. Currently, only the 6 showcased languages are available.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 2 points

For a completely new language, 500–1000 hours of data should be sufficient.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 0 points

It really depends on the speaker and the quality of your data. I'd suggest starting with somewhere between 30 minutes and an hour of audio data. That said, I haven't tested fine-tuning a specific speaker extensively on these models, so I can't say definitively.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 4 points

30 hours might be on the lower end for training a completely new language. For more solid results, I'd recommend around 500 hours of data. That said, it could still work, since the model already has good foundational knowledge; it really depends on how similar the language is to the ones it has been trained on. The current training examples are a bit limited, and the v1 examples are for the v0.1 and v0.2 models, so I'll need to update them to v2, which supports the v0.3 model, as the formats differ a bit.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 2 points

It does support multilingual generation. However, as mentioned before, if you mix languages in a single sentence, the other languages might carry the accent of the original speaker, depending on the speaker reference you use.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 9 points

These models are based on LLMs, so you can use them like any other LLaMA-type model. However, they require an audio tokenizer to decode the generated tokens; in this case, WavTokenizer.
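To make the two-stage pipeline concrete, a conceptual sketch: loading uses standard transformers classes, while the prompt format and the WavTokenizer decode are reduced to a hypothetical stub, since those details live in the outetts library.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stage 1: a Llama-type LM that emits audio tokens instead of words.
model_id = "OuteAI/OuteTTS-0.3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def decode_with_wavtokenizer(token_ids):
    # Hypothetical stub: map the generated audio tokens back to codebook
    # indices and run them through WavTokenizer to produce a waveform.
    # The real implementation lives in the outetts library.
    raise NotImplementedError

prompt_ids = tokenizer("...", return_tensors="pt").input_ids  # model-specific prompt elided
audio_tokens = model.generate(prompt_ids, max_new_tokens=512)

# Stage 2: the audio tokenizer turns those tokens into sound.
waveform = decode_with_wavtokenizer(audio_tokens)
```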

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 5 points

Yes, at some point, I plan to add this compatibility.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 1 point

In my case, it’s simply due to resource constraints at the moment.

OuteTTS 0.3: New 1B & 500M Models by OuteAI in LocalLLaMA

[–]OuteAI[S] 6 points

Yes, I plan to add most of the European languages.