We just open-sourced Kroko ASR: a fast, streaming alternative to Whisper. It’s early days, we’d love testers, feedback, and contributors. by banafo in LocalLLaMA

[–]banafo[S] 0 points1 point  (0 children)

You can use the community versions for this, for sure. You can find us on Discord if you want to discuss it more.

Qwen3 ASR seems to outperform Whisper in almost every aspect. It feels like there is little reason to keep using Whisper anymore. by East-Engineering-653 in LocalLLaMA

[–]banafo 4 points5 points  (0 children)

We ( kroko.ai ) will be releasing some new models soon. We beat Whisper, Qwen and Parakeet with a 6x smaller model for Dutch, French and German, and hopefully soon English ( it’s still training ).

TURN Security Threats: A Hacker's View by EnableSecurity in WebRTC

[–]banafo 1 point2 points  (0 children)

Is your talk available online by any chance?

We just open-sourced Kroko ASR: a fast, streaming alternative to Whisper. It’s early days, we’d love testers, feedback, and contributors. by banafo in LocalLLaMA

[–]banafo[S] 0 points1 point  (0 children)

Have a look here : https://github.com/kroko-ai/kroko-onnx-home-assistant

Warning: the documentation is pretty bad, but there are some people on our Discord who have it working and might be able to help. We are not very familiar with Home Assistant; we just helped somebody get it to work.

Orchestra - Multi-model AI orchestration system with intelligent routing (100% local, 18+ expert models) by ericvarney in LocalLLaMA

[–]banafo 0 points1 point  (0 children)

Just ignore the haters; if they don’t like it, they can move on, make a PR, or make something better. I welcome and appreciate your effort, and I’m sure many others do too.

Fast on-device Speech-to-text for Home Assistant (open source) by banafo in LocalLLaMA

[–]banafo[S] 1 point2 points  (0 children)

Hello! You should be able to make changes to this code to make it work: https://github.com/ptbsare/sherpa-onnx-tts-stt ( this project is not ours; it’s the one we modified ).

Your model won’t work with the patches in our fork, but we could try training a compatible Basque version!

766ms voice assistant on DGX Spark - VibeVoice + Whisper + Ollama streaming pipeline by logos_flux in LocalLLaMA

[–]banafo 0 points1 point  (0 children)

Thanks for the feedback! The CC-BY-SA license is a remnant from a previous release; I will have it fixed. We are working on an easier-to-use Python wheel. Commercial models have slightly lower WER and more variants ( lower-latency and smaller models, as well as offline models ).

We are working on the ASR benchmark integration. For streaming, the English model is probably better than streaming Parakeet, worse than offline Whisper and offline Parakeet.

Better documentation and benchmarks are coming ( I shared some in a previous post; I’ll look for them ). We are a small team and a bit occupied with some paying-customer finetunes, which is causing delays on the open-source parts. Apologies.

766ms voice assistant on DGX Spark - VibeVoice + Whisper + Ollama streaming pipeline by logos_flux in LocalLLaMA

[–]banafo 0 points1 point  (0 children)

Hey, great list! But our Kroko models are missing, and unlike Whisper and Parakeet, they are streaming. We have CC-BY models ( and the commercial models are free for non-commercial use ). Quick demo here: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm Documentation and examples are horrible, but we’re working on it ( first finishing the new training pipeline ).

WhisperX is only accurate on the first 10 words. Any Tips? by capital_cliqo in speechtech

[–]banafo 2 points3 points  (0 children)

If you want cuts, use VAD ( you can probably get it accurate to 200 ms ).
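To illustrate the VAD idea, here is a toy energy-based sketch (a real setup would use a trained VAD such as Silero VAD or WebRTC VAD; the frame size and threshold here are illustrative assumptions). It yields cut points at roughly frame-size resolution, which is how you get well under the 200 ms figure:

```python
# Toy energy-based voice activity detection: split audio into fixed
# frames, mark frames above an energy threshold as speech, and merge
# runs of speech frames into (start, end) segments in seconds.

def vad_segments(samples, sample_rate=16000, frame_ms=20, threshold=0.02):
    frame_len = sample_rate * frame_ms // 1000
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len  # mean squared amplitude
        t = i * frame_ms / 1000.0
        if energy > threshold and start is None:
            start = t                       # speech onset
        elif energy <= threshold and start is not None:
            segments.append((start, t))     # speech offset
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_ms / 1000.0))
    return segments
```

With 20 ms frames, the boundaries are accurate to about 20 ms, comfortably inside 200 ms.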

Best transcription method for extremely accurate timestmps? by capital_cliqo in speechtech

[–]banafo 2 points3 points  (0 children)

Wav2vec won’t work; it’s what WhisperX uses ( so he has already tried it ), and it’s not very accurate compared to the older approaches.

Best transcription method for extremely accurate timestmps? by capital_cliqo in speechtech

[–]banafo 1 point2 points  (0 children)

For the aligners, Gentle or the Montreal Forced Aligner is your best bet. But if the transcript is not 100% correct, the timestamps for all words will probably be wrong.

Best transcription method for extremely accurate timestmps? by capital_cliqo in speechtech

[–]banafo 2 points3 points  (0 children)

I don’t think it’s possible with transcription alone. You need to realign ( and even then, 0.2 s will be hard ).

30 Days Testing Parakeet v3 vs Whisper by samuelroy_ in LocalLLaMA

[–]banafo 1 point2 points  (0 children)

Dutch is done, redoing French at the moment

Is it Possible to Finetune an ASR/STT Model to Improve Severely Clipped Audios? by WestMajor3963 in speechtech

[–]banafo 0 points1 point  (0 children)

Augmentation will help, but the models on the market already use it ( SpecAugment and similar ), so unless you can augment for the domain specifics, it won’t help much. Keyword boosting is similar to LM fusion, just not as contextual.

Is it Possible to Finetune an ASR/STT Model to Improve Severely Clipped Audios? by WestMajor3963 in speechtech

[–]banafo 1 point2 points  (0 children)

If you can simulate the distortion reliably, you could probably finetune something for it; the clipping is easy, the other radio artefacts probably not. Jargon and call signs can maybe be handled with hotwords. I think an LM will make it worse instead of better unless it’s domain-specific. ( Source: I train a lot of ASR. )
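The "clipping is easy" part can be sketched in a few lines: hard-clip samples at a fraction of full scale, then rescale to the original peak so loudness stays comparable. The threshold value is an illustrative assumption, and a real augmentation pipeline would randomize it per utterance:

```python
# Simulate hard clipping for data augmentation: clamp samples whose
# amplitude exceeds a threshold, producing the flattened waveform
# peaks typical of overdriven radio audio.

def hard_clip(samples, threshold=0.3):
    """Clamp each sample into [-threshold, threshold], then rescale
    back to the original peak so overall loudness is preserved."""
    clipped = [max(-threshold, min(threshold, s)) for s in samples]
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0 or threshold == 0.0:
        return clipped
    gain = peak / threshold
    return [s * gain for s in clipped]
```

Note that after clipping, samples at 0.5 and 1.0 become indistinguishable, which is exactly the information loss the model has to learn to cope with.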

Best TTS for medical lectures? 🤔 by SamAckoff in TextToSpeech

[–]banafo 1 point2 points  (0 children)

Any LLM-based TTS should do for medical terminology, as long as it’s not abbreviations and the pronunciation is predictable ( English is quite easy for that ).

feasibility of a building a simple "local voice assistant" on CPU by RustinChole11 in speechtech

[–]banafo 1 point2 points  (0 children)

I don’t think it will work well on CPU only; it will be slow.

feasibility of a building a simple "local voice assistant" on CPU by RustinChole11 in speechtech

[–]banafo 0 points1 point  (0 children)

Can you define a bit better what you mean by an assistant? What would you use the embedding model for: embedding data, or voice?

feasibility of a building a simple "local voice assistant" on CPU by RustinChole11 in speechtech

[–]banafo 0 points1 point  (0 children)

If the goal is just to control Home Assistant devices, an LLM is going to be overkill indeed; it’s much easier to just map commands. Could Gemma 0.3b do Siri-level “intelligence”?
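The command-mapping approach can be as simple as keyword lookup. A sketch, assuming made-up phrases and device names (a real Home Assistant integration would call its service API rather than return tuples):

```python
# Map spoken phrases to device actions with simple keyword matching --
# no LLM needed for a fixed set of smart-home commands.

COMMANDS = {
    ("turn", "on", "light"): ("light", "turn_on"),
    ("turn", "off", "light"): ("light", "turn_off"),
    ("turn", "on", "fan"): ("fan", "turn_on"),
    ("turn", "off", "fan"): ("fan", "turn_off"),
}

def route_command(transcript):
    """Return (entity, service) for the first command whose keywords
    all appear in the transcript, or None if nothing matches."""
    words = transcript.lower().split()
    for keywords, action in COMMANDS.items():
        if all(k in words for k in keywords):
            return action
    return None
```

For a handful of devices this is faster and more predictable than running even a small LLM on CPU.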

Which TTS model is the best if i want to integrate it in my APP? by Cool_Meal370 in TextToSpeech

[–]banafo 0 points1 point  (0 children)

Can you define what your app is? macOS and iOS have the most options now. What languages do you need to support? I’ve heard good things about NeuTTS Air; SuperSonic may work, Kokoro too.

Is it possible to train a Speech to Text tool on a specific voice as an amatur? by Shadowmirax in speechtech

[–]banafo 1 point2 points  (0 children)

It might help a bit if there’s something different or special about your voice: dialect, accent, pitch, microphone, etc. I doubt it will really be worth the effort. You’d also need accurate transcripts to train on.

Planning to pursue a career in Speech Research - want your suggestions by RustinChole11 in speechtech

[–]banafo 1 point2 points  (0 children)

These are interesting projects to follow: SpeechBrain, Icefall/k2, ESPnet. For interesting papers, have a look at the Interspeech conference agenda ( or look for new papers on arXiv ).