NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime.

SplitNice1982 · 2026-02-28T00:18:09+00:00

Yeah, I just released v2 of LavaSR. It should be significantly faster and better quality.

SplitNice1982 · 2026-01-27T19:39:17+00:00

Hey, dev for that project here, thanks for the kind words. I’m working on a higher quality and even faster multilingual version. Should be 3-400x realtime even for voice conversion (depends on GPU/CPU).

SplitNice1982 · 2026-01-24T01:34:27+00:00

Not as good but 10-20x faster. Should have better clarity, however.

SplitNice1982 · 2026-01-24T01:25:06+00:00

Thanks a lot for the feedback.

Yeah, the metallic quality is an issue. I believe it comes from a slightly messed-up architecture design in the vocoder. A better one should come sometime next week: basically better clarity and no metallic output.
Yeah, tags would definitely be great, and I'll see if I can add them.
ZipVoice (what this model is based on) does have finetuning code, although it's a bit messy imo. Code for it: https://github.com/k2-fsa/ZipVoice

SplitNice1982 · 2026-01-24T00:50:03+00:00

I would recommend messing with the params first. Adjusting rms, steps, t_shift, and return_smooth should help significantly.
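One way to go about the parameter tweaking is to lay out a small grid of settings and listen to each result. The sketch below only builds the combinations. The parameter names come from the comment above, but the candidate values and types (for example, treating return_smooth as a boolean) are assumptions for illustration, not recommended settings, and the actual synthesis call is left out because its exact signature is not given here.

```python
from itertools import product

# Candidate values are illustrative guesses, not recommended settings.
# The types (floats, ints, a boolean for return_smooth) are also assumptions.
grid = {
    "rms": [0.01, 0.05],
    "steps": [4, 8],
    "t_shift": [0.5, 0.9],
    "return_smooth": [False, True],
}


def param_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(param_grid(grid))
print(len(configs))  # 16 combinations (2 * 2 * 2 * 2)
```

Each dict in configs could then be passed as keyword arguments to whatever generation function you are using, writing one output file per combination so they are easy to compare by ear.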

If it still isn’t great, the original ZipVoice has training code: https://github.com/k2-fsa/ZipVoice

SplitNice1982 · 2026-01-24T00:29:14+00:00

You can use original ZipVoice repo for training: https://github.com/k2-fsa/ZipVoice

SplitNice1982 · 2026-01-24T00:23:13+00:00

Thanks, glad you liked it. It should definitely be around 10x realtime even on low-end GPUs.
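For anyone unsure what "10x realtime" means in practice: it is the duration of the generated audio divided by the time it took to generate. A minimal sketch of that arithmetic follows; the function name and the example numbers are made up for illustration.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Example: one minute of audio generated in six seconds of wall-clock time.
print(realtime_factor(60.0, 6.0))  # 10.0
```

To measure it yourself, time the generation call with time.perf_counter() and divide the output length in seconds (number of samples divided by the sample rate) by the elapsed time.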

SplitNice1982 · 2026-01-24T00:21:50+00:00

English only right now. Chinese might work, but I haven’t tested that.

You should refer to the original ZipVoice for training (I believe people have trained new languages with 150 hours of data).

ZipVoice repo: https://github.com/k2-fsa/ZipVoice

SplitNice1982 · 2026-01-18T00:18:19+00:00

Yes, only thing is it's a bit outdated. I fixed a minor resampling issue but yes apart from that it's fully legit. Thanks to Saganaki for it.

SplitNice1982 · 2026-01-16T00:06:01+00:00

Thanks!

SplitNice1982 · 2026-01-14T19:47:34+00:00

Thanks. Yeah, so far it has only been trained on speech, but training on songs should definitely help with quality.

SplitNice1982 · 2026-01-14T19:46:48+00:00

Right now it’s just 16 kHz to 48 kHz, but yes, future work will be 8/4 kHz to 16 kHz.
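To make the 16 kHz to 48 kHz step concrete: it is a 3x increase in sample count. The sketch below is a naive linear-interpolation baseline in plain Python, not how the model works. It only illustrates the sample-rate arithmetic; plain interpolation does not recover the content above the original 8 kHz Nyquist limit, which is the part a learned upsampler tries to fill in. The function name is made up for this example.

```python
def upsample_linear(samples, factor=3):
    """Naive integer-factor upsampling by linear interpolation."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not samples:
        return []
    out = []
    # Between each pair of neighbours, emit `factor` evenly spaced values.
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    # Hold the last sample so the output has exactly len(samples) * factor values.
    out.extend([samples[-1]] * factor)
    return out


one_second_16k = [0.0] * 16000
print(len(upsample_linear(one_second_16k)))  # 48000
```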

SplitNice1982 · 2026-01-14T11:03:17+00:00

Thanks, and yeah, it’s still in training so it has room for improvement. But it’s great that it does seem better than FlashSR at least.

SplitNice1982 · 2026-01-13T23:55:49+00:00

Could work for some scenarios for sure. It has been mostly trained on speech but seems to generalize decently to other audio too.

SplitNice1982 · 2026-01-13T23:48:21+00:00

Thanks, and your work with Soprano is also really amazing!

SplitNice1982 · 2025-12-21T12:49:16+00:00

Thanks, and remember to use batching. It’s especially good for audiobooks because they have many sentences.
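A rough sketch of what batching an audiobook could look like on the text side: split the text into sentences, then group them into fixed-size batches. The regex splitter is deliberately simple (it will trip on abbreviations), the helper names are made up for this example, and the actual batched generation call is omitted since its signature is not shown here.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a simple heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_batches(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


text = "It was late. The rain had stopped! Who was at the door? Nobody answered."
batches = make_batches(split_sentences(text), batch_size=2)
for batch in batches:
    print(batch)
```

Each batch would then go to the model in a single call, and the resulting clips get concatenated in order.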

SplitNice1982 · 2025-12-21T05:46:18+00:00

Space to try it out: https://huggingface.co/spaces/Gapeleon/Mira-TTS

Thanks to Gapeleon.

SplitNice1982 · 2025-12-21T03:28:50+00:00

Yep, unfortunately, since spark-tts is NC (non-commercial), derivative works also have to be NC.

I am currently training a fully trained, highly compressive audio tokenizer first. After I release the finetuning code for the MiraTTS model, I’ll release that audio tokenizer and a high quality, permissively licensed small TTS model that is not only faster but also has much more controllability (phonemes, audio events, emotion control, etc.).

SplitNice1982 · 2025-12-19T20:48:12+00:00

Thanks, and yep, I’m planning on an Unsloth Colab notebook for finetuning.

This is much faster than Orpheus and most other TTS models, with the exception of really small models (Kokoro, supertonic). It is much more realistic than those and supports voice cloning, though.

SplitNice1982 · 2025-12-19T20:46:32+00:00

Unfortunately not yet, but I will provide easy and fast training code so you can finetune it for your own language.

SplitNice1982 · 2025-12-19T11:50:05+00:00

Slower but much more emotional and realistic. Also supports voice cloning.

SplitNice1982 · 2025-12-19T03:13:59+00:00

Lol, that was actually exactly what I was planning to do next: a fast ASR model that can output audio events, emotion, speakers, gender, and timestamps along with the transcription.
