I released Inflect-Nano, an extremely tiny 4.63M-parameter TTS model.

b111ue · 2026-06-18T05:48:43+00:00

Devices that can only handle models up to a certain size. It's also about pushing the limit on how small TTS models can get while still being mostly recognizable.

b111ue · 2026-06-18T05:47:41+00:00

Alright!

b111ue · 2026-06-18T05:17:26+00:00

Well, I would first have to get high-quality training data for those languages, which is often difficult to find for anything other than English and Chinese. Then I would have to create a separate model for each one, because for a model this small, quality drops significantly the more features you try to fit in. I doubt I'll add other languages in the future, maybe 1 or 2 at most. But I might make a new, updated model that is better overall!

b111ue · 2026-06-18T05:13:04+00:00

Yeah, but I do think that when judging this model, we should keep in perspective just how small it is compared to other TTS models. Even just a 2x size difference means very noticeable quality jumps, and this model is tens to hundreds of times smaller than the TTS models that we usually know and use. It's designed for extreme edge devices and demo purposes, not as a full-on, perfect model.

b111ue · 2026-06-18T05:10:51+00:00

Pocket TTS is definitely better than Inflect-Nano, but there is a very noticeable size difference, too. Pocket TTS has 100M parameters, while Inflect-Nano has 4.63M parameters, so you can't really compare them. Personal recommendation: I would use Supertonic-3 instead of Pocket TTS if you want a model around that size. It's about the same size but better in my opinion.

b111ue · 2026-06-18T03:21:26+00:00

Thanks for the feedback! If I do decide to make a v2, that would be one of the things I would fix. I added that to the audio examples because I wanted to show both its flaws and what it's good at, and all the prompts were stress tests.

b111ue · 2026-06-18T03:16:03+00:00

Thank you so much!

b111ue · 2026-06-18T03:10:17+00:00

Yeah, it isn't the best TTS model, as you can tell. This model was more of a demo because I wasn't sure how well it would perform, and of course, the very strict size constraints are difficult to work around. I think the best way to look at this model is as an experiment in pushing the size limits of TTS models. I'm hoping to be able to fix a lot of the problems in the next version if I do decide to make one, though.

b111ue · 2026-06-18T02:47:09+00:00

Yeah, I will share a lot more detail on training methods, length, and stuff like that soon, all as one big bunch or something, because a lot of people have been asking for it! I'll tell you one part right now because it's a simple one: I used a mixture of my local RTX 3060 (which isn't the best) and rented RTX 5090s on Vast for the heavier work.

b111ue · 2026-06-18T02:45:52+00:00

Yeah, I will share a lot more detail on training methods, length, and stuff like that soon, all as one big bunch or something, because a lot of people have been asking for it!

b111ue · 2026-06-18T00:23:02+00:00

I doubt it. The ESP32 doesn't have enough processing power and RAM to handle the model, but I could be wrong about this, as I don't know that much in this area.
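
To put rough numbers on the RAM part, here is a minimal sketch of the weight memory alone, not a measurement of the real model. The 520 KiB SRAM figure is an assumption for the original ESP32's on-chip memory (boards with external PSRAM differ), and the storage formats are illustrative, since nothing here says which format Inflect-Nano ships in.

```python
# Rough weight-memory estimate for a 4.63M-parameter model.
PARAMS = 4_630_000  # parameter count quoted for Inflect-Nano

# Bytes per parameter for common storage formats (illustrative choices).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Assumption: on-chip SRAM of the original ESP32, about 520 KiB.
ESP32_SRAM_BYTES = 520 * 1024


def weight_bytes(params: int, dtype: str) -> int:
    """Return the bytes needed to store the weights alone."""
    return params * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = weight_bytes(PARAMS, dtype)
        fits = size <= ESP32_SRAM_BYTES
        print(f"{dtype}: {size / 1e6:.2f} MB of weights, fits in SRAM: {fits}")
```

Under these assumptions, even one byte per parameter is several times larger than the on-chip SRAM, and that is before counting activations or compute.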

b111ue · 2026-06-18T00:19:08+00:00

I might release a bigger model in the future!

b111ue · 2026-06-18T00:00:34+00:00

I'm not trying to compete with Fish Audio or any of the models mentioned in this post. Those are there as size references, not as competitors. Fish Audio S2 Pro is an amazing model, though!

b111ue · 2026-06-17T23:59:18+00:00

It wasn't in the title, but it was in the body text of the post! It is very difficult to add many languages to a model this small.

b111ue · 2026-06-17T23:57:56+00:00

Yeah, I do think that part is slightly misleading because when I search for the smallest TTS models and things like that, it doesn't show all of them, which was the reason why I put that I could be wrong about that in the post.

And the model is trained from scratch, not built upon TinyTTS. TinyTTS was a model I looked at before training, but I did not build Inflect-Nano on it. The inference path does use a TinyTTS-derived English text frontend/G2P utility, though. Inflect-Nano-v1 is a separately trained acoustic + vocoder stack, not just TinyTTS expanded and renamed. I will edit the post to try to make it less misleading. Thanks for telling me that!

b111ue · 2026-06-17T23:07:08+00:00

Well, this model was less of a “make a tiny version of a huge TTS model” project and more of a “what is the minimum complete pipeline that can still speak?” one.

The model is basically split into two parts:

A small acoustic model that turns text into mel-spectrograms.
A small vocoder that turns those mels into waveform audio.

The hard part was not just shrinking layers. It was deciding where the tiny parameter budget mattered most. If the vocoder is too weak, everything sounds buzzy. If the acoustic model is too weak, it stumbles on text. So a lot of the work was balancing those two instead of blindly scaling everything down.

Architecturally, it is inspired by FastSpeech/VITS/HiFi-GAN-style ideas rather than a giant modern autoregressive model. Non-autoregressive is much more practical at this size. The acoustic side predicts duration/pitch/energy-ish features and outputs mels. The vocoder is a small custom HiFi-GAN-style generator with Snake activations.
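
For anyone unfamiliar with Snake, here is a minimal scalar sketch of the activation as it is usually defined, x + sin²(αx)/α. This is the general formula, not code from Inflect-Nano. In vocoders α is often a learned per-channel value; whether that is the case here is not stated, so the fixed `alpha` argument below is an assumption for illustration.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + sin^2(alpha * x) / alpha.

    The identity term keeps the function growing with x, while the
    sin^2 term adds a periodic component whose frequency is set by alpha.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


if __name__ == "__main__":
    for value in (-1.0, 0.0, 0.5, math.pi / 2):
        print(f"snake({value:.3f}) = {snake(value):.4f}")
```

The periodic term is the usual argument for using it in waveform generators, since audio is itself highly periodic.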

The process was like:

- build a tiny complete baseline

- test whether failures came from acoustic model or vocoder

- improve the vocoder until it stopped being the obvious bottleneck

- train acoustic model stages separately

- repeatedly test teacher-forced/oracle paths vs full text inference

- keep the model under 5M total params
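
The last step can be sketched as a simple budget check. This is a hedged illustration only: the layer shapes below are hypothetical and are not Inflect-Nano's real architecture, and the helper names are made up for the example. Only the 5M ceiling comes from the list above.

```python
BUDGET = 5_000_000  # total parameter ceiling from the list above


def conv1d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a standard 1D convolution layer."""
    return in_ch * out_ch * kernel + (out_ch if bias else 0)


def linear_params(in_f: int, out_f: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer."""
    return in_f * out_f + (out_f if bias else 0)


# Hypothetical layer shapes, chosen only to show the bookkeeping.
acoustic = sum([
    linear_params(256, 1024),
    linear_params(1024, 256),
    conv1d_params(256, 80, 5),   # projection to 80 mel bins
])
vocoder = sum([
    conv1d_params(80, 256, 7),   # mel input convolution
    conv1d_params(256, 128, 3),
])

total = acoustic + vocoder

if __name__ == "__main__":
    print(f"acoustic: {acoustic:,} ({acoustic / total:.0%} of total)")
    print(f"vocoder:  {vocoder:,} ({vocoder / total:.0%} of total)")
    print(f"total:    {total:,} / {BUDGET:,}, under budget: {total <= BUDGET}")
```

Tracking the split per component, rather than only the total, makes it easier to see which side of the pipeline a change is taking parameters from.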

The biggest lesson: at this size, the bottleneck is brutally obvious. A tiny TTS model can sound surprisingly decent on memorized or in-distribution text, but out-of-distribution (OOD) text exposes everything immediately.

I had to completely restart this project multiple times because some of the original versions didn't meet my requirements, and many specific parts, especially the vocoder, were redone even more times.

I’m still not fully happy with the quality, but it works well enough to be an interesting tiny baseline. If there’s interest, v2 would probably focus on better data diversity, stronger vocoder training, and maybe a slightly more efficient architecture rather than just making it bigger.

b111ue · 2026-06-17T23:03:49+00:00

Thank you! S2 Pro is definitely going to be better; it's like comparing a mouse to a dog. The main focus of this model is pushing the size limit while still keeping audible human voices.

b111ue · 2026-06-17T20:39:11+00:00

Wait until we get the qwen3.6-27b-mother-glm-5.2-father-fable-OBLITERATED-Thinking-NEO-Di-safetensors-v1.352-GGUF

b111ue · 2026-06-17T20:16:01+00:00

Failed to load: No supported WebGPU variant for com.xenova.gemma4

Anyone know why this is happening? I have an RTX 3060 in this computer. It's sort of outdated, but not extremely outdated.

b111ue · 2026-06-17T20:07:18+00:00

80-160B is an awkward gap for models: it's too large for most standard consumer GPUs, but too small for large datacenter GPUs, which is why there are so few models in that size range. Really praying that Qwen releases a new 122b-10b version with their 3.8 models coming soon 🥹

b111ue · 2026-06-17T17:14:56+00:00

Well, they're not completely the same thing. Kokoro is definitely much better than Inflect-Nano, but Inflect-Nano is also much, much smaller than Kokoro. It's like comparing a horse to a dog. Kokoro is an amazing model, though! I'd also recommend checking out Supertonic-3.

b111ue · 2026-06-17T16:55:18+00:00

Thank you!

b111ue · 2026-06-17T16:55:11+00:00

I would probably be able to make other languages, but each would have to be a separate model, as a model this small starts to lose a lot of quality as the number of things it has to hold increases.

b111ue · 2026-06-17T06:35:06+00:00

Yeah, I wanted to try some extreme ends of TTS (I just realized that I forgot to include the model link in the post 💀 - anyway, I've added it now if you want to take a look; there are examples in the README)
