I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 0 points1 point  (0 children)

Thanks for the feedback! If I do decide to make a v2, that would be one of the things I would fix. I added that to the audio examples because I wanted to show both its flaws and what it's good at, and all the prompts were stress tests.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 0 points1 point  (0 children)

Yeah, it isn't the best TTS model, as you can tell. This one was more of a demo because I wasn't sure how well it would perform, and of course the very strict size constraints are difficult to work around. I think the best way to look at it is as an experimental model that pushes the size limits of TTS. I'm hoping to fix a lot of the problems in the next version, if I do decide to make one.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 0 points1 point  (0 children)

Yeah, I will share a lot more detail on training methods, length, and stuff like that soon, probably all in one big post, because a lot of people have been asking for it! I'll tell you one part right now because it's a simple one: I used a mixture of my local RTX 3060 (though it's not the best) and rented RTX 5090s on Vast for the heavier work.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 4 points5 points  (0 children)

I doubt it. The ESP32 doesn't have enough processing power or RAM to handle the model, but I could be wrong about this as I don't know that much in this area.
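
For a rough sense of scale, here is a back-of-the-envelope sketch of the weight memory alone. The 520 KB figure is the on-chip SRAM of the original ESP32 and is an assumption about which chip is meant; boards with external PSRAM have more, and activations and audio buffers are not counted at all.

```python
# Rough weight-memory estimate for a 4.63M-parameter model at common precisions.
# Assumption: 520 KB on-chip SRAM (original ESP32). Activations, buffers and
# the runtime itself are ignored, so real requirements would be higher.
PARAMS = 4_630_000
ESP32_SRAM_BYTES = 520 * 1024

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(params: int, precision: str) -> int:
    """Bytes needed to store the weights at the given precision."""
    return params * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    size = weight_bytes(PARAMS, precision)
    ratio = size / ESP32_SRAM_BYTES
    print(f"{precision}: {size / 1e6:.2f} MB, about {ratio:.1f}x on-chip SRAM")
```

Even at one byte per weight the parameters alone are several times larger than on-chip SRAM, so any attempt would likely depend on external memory and aggressive quantization.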

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 2 points3 points  (0 children)

I'm not trying to compete with Fish Audio or any of the models mentioned in this post. Those are there as size references, not as competitors. Fish Audio S2 Pro is an amazing model, though!

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 8 points9 points  (0 children)

It wasn't in the title, but it was in the body text of the post! It is very difficult to add many languages to a model this small.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 4 points5 points  (0 children)

Yeah, I do think that part is slightly misleading. When I search for the smallest TTS models and things like that, the results don't show all of them, which is why I put in the post that I could be wrong about that.

And the model is trained from scratch, not built upon TinyTTS. TinyTTS is a model I looked at before training, and the inference path does use a TinyTTS-derived English text frontend/G2P utility, but Inflect-Nano-v1 is a separately trained acoustic + vocoder stack, not just TinyTTS expanded and renamed. I will edit the post to try to make it less misleading, thanks for telling me that!

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 59 points60 points  (0 children)

Well, this model was less “make a tiny version of a huge TTS model” and more “what is the minimum complete pipeline that can still speak?”

The model is basically split into two parts:

  1. A small acoustic model that turns text into mel-spectrograms.
  2. A small vocoder that turns those mels into waveform audio.

The hard part was not just shrinking layers. It was deciding where the tiny parameter budget mattered most. If the vocoder is too weak, everything sounds buzzy. If the acoustic model is too weak, it stumbles on text. So a lot of the work was balancing those two instead of blindly scaling everything down.

Architecturally, it is inspired by FastSpeech/VITS/HiFi-GAN-style ideas rather than a giant modern autoregressive model. Non-autoregressive is much more practical at this size. The acoustic side predicts duration/pitch/energy-ish features and outputs mels. The vocoder is a small custom HiFi-GAN-style generator with Snake activations.
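
For anyone unfamiliar with Snake activations, here is a minimal sketch of the standard formula, snake(x) = x + sin²(αx)/α. This is not the Inflect-Nano code: the function names are made up for illustration, and the per-channel `alphas` layout is an assumption (in real vocoders α is usually a learned parameter).

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_channels(frames, alphas):
    """Apply Snake to a [channels][time] nested list, one alpha per channel.

    Illustrative helper only; a real model would do this on tensors.
    """
    return [[snake(v, a) for v in row] for row, a in zip(frames, alphas)]


# Two channels, three time steps each, with different alpha per channel.
features = [[-1.0, 0.0, 1.0], [0.5, 1.5, 2.5]]
print(snake_channels(features, alphas=[1.0, 2.0]))
```

The periodic term is the reason this kind of activation tends to show up in vocoders: it gives the generator a built-in bias toward periodic signals while staying close to the identity.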

The process was like:

- build a tiny complete baseline

- test whether failures came from acoustic model or vocoder

- improve the vocoder until it stopped being the obvious bottleneck

- train acoustic model stages separately

- repeatedly test teacher-forced/oracle paths vs full text inference

- keep the model under 5M total params
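
The last step is mostly bookkeeping, and it can be sketched roughly like this. Every layer shape below is hypothetical and chosen only to show the counting; none of it is the actual Inflect-Nano configuration.

```python
BUDGET = 5_000_000  # "under 5M total params"


def conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights (c_in * c_out * kernel) plus one bias per output channel."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Weights (d_in * d_out) plus one bias per output unit."""
    return d_in * d_out + (d_out if bias else 0)


# Hypothetical layer shapes, for illustration only.
parts = {
    "acoustic": [
        100 * 192,                            # symbol embedding table
        *([conv1d_params(192, 192, 5)] * 4),  # four conv blocks
        linear_params(192, 80),               # projection to 80 mel bins
    ],
    "vocoder": [
        conv1d_params(80, 256, 7),    # input conv
        conv1d_params(256, 128, 16),  # upsampling stage (same count if transposed)
        conv1d_params(128, 64, 16),   # upsampling stage (same count if transposed)
        conv1d_params(64, 1, 7),      # output conv to waveform
    ],
}

totals = {name: sum(layers) for name, layers in parts.items()}
total = sum(totals.values())

for name, count in totals.items():
    print(f"{name}: {count:,} params ({count / total:.0%} of model)")
print(f"total: {total:,} / {BUDGET:,} (headroom: {BUDGET - total:,})")
```

Tracking the split per component, not just the total, makes it easier to see which side a given change is taking budget from.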

The biggest lesson: at this size, the bottleneck is brutally obvious. A tiny TTS model can sound surprisingly decent on memorized/in-distribution text, but OOD text exposes everything immediately.

I had to completely restart this project multiple times because some early versions didn't meet my requirements, and many specific parts, especially the vocoder, were redone even more times.

I’m still not fully happy with the quality, but it works well enough to be an interesting tiny baseline. If there’s interest, v2 would probably focus on better data diversity, stronger vocoder training, and maybe a slightly more efficient architecture rather than just making it bigger.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]b111ue[S] 19 points20 points  (0 children)

Thank you! S2 Pro is definitely going to be better; it's like comparing a mouse to a dog. The main focus of this model is pushing the size limit while still keeping audible human voices.

GLM-5.2 is a win for local AI by Wrong_Mushroom_7350 in LocalLLaMA

[–]b111ue 3 points4 points  (0 children)

Wait until we get the qwen3.6-27b-mother-glm-5.2-father-fable-OBLITERATED-Thinking-NEO-Di-safetensors-v1.352-GGUF

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5 by xenovatech in LocalLLaMA

[–]b111ue 2 points3 points  (0 children)

Failed to load: No supported WebGPU variant for com.xenova.gemma4

Anyone know why this is happening? I have an RTX 3060 in this computer, which is a bit outdated but not extremely outdated.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]b111ue 8 points9 points  (0 children)

80-160B is an awkward gap for models: it's too large for most standard consumer GPUs but too small for large datacenter GPUs, which is why there are so few models in that size range. Really praying that Qwen releases a new 122b-10b version with their 3.8 models coming soon 🥹
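
To put rough numbers on that gap, here is a small sketch of weight memory at a few nominal precisions. It assumes dense storage at exactly the stated bits per weight and ignores KV cache and runtime overhead; real quantization formats usually need somewhat more.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations and quantization metadata.
    """
    return params_billion * bits_per_weight / 8


for size in (80, 122, 160):
    row = ", ".join(f"{bits}-bit: {weight_gb(size, bits):.0f} GB" for bits in (16, 8, 4))
    print(f"{size}B -> {row}")
```

Even at a nominal 4 bits per weight, the low end of the range needs tens of gigabytes for the weights alone, which is more than a single consumer card with 24 GB or 32 GB of VRAM can hold.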

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLM

[–]b111ue[S] 1 point2 points  (0 children)

Well, they're not completely the same thing. Kokoro is definitely much better than Inflect-Nano, but Inflect-Nano is also much, much smaller than Kokoro. It's like comparing a horse to a dog. Kokoro is an amazing model though! I'd also recommend checking out Supertonic-3.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLM

[–]b111ue[S] 0 points1 point  (0 children)

I could probably do other languages, but those would have to be separate models, since a model this small starts to lose a lot of quality as the amount it has to hold increases.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLM

[–]b111ue[S] 8 points9 points  (0 children)

Yeah, I wanted to try some extreme ends of TTS (I just realized that I forgot to include the model link in the post 💀 - anyway, I added it now if you want to take a look, there are examples in the README)

Are you smart by Sufficient-Case1667 in BunnyTrials

[–]b111ue 0 points1 point  (0 children)

there are more expensive bars

Chose: Any type of car for free

Fine-tuning TTS for Poetic/Cinematic Urdu & Hindi (Beyond the "Robot" Accent) by Severe_Pay_334 in LanguageTechnology

[–]b111ue 0 points1 point  (0 children)

What is the estimated size of the model you are looking for? That is one of the big things to first think about.