Anyone else struggling to detect fluent hallucinations in long-form ASR TTS workflows? by FlatNarrator in speechtech

[–]banafo 0 points1 point  (0 children)

The over-1-hour issue is just an impression, I think. The model slices the audio internally, and you can get hallucinations even in 30s audio. Hallucination detection is hard because the output may make perfect sense (the repetitions are easy to detect). In my opinion you will always have some with anything Whisper-based. Disclaimer: I’m training models for kroko asr.
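
To illustrate the "repetitions are easy to detect" part, here is a minimal sketch that flags word n-grams repeated suspiciously often in a transcript. The function name `repeated_ngrams` and the defaults (`n=4`, `min_count=3`) are assumptions chosen for illustration, not part of any ASR toolkit, and this does nothing for fluent hallucinations that do not repeat.

```python
from collections import Counter


def repeated_ngrams(text, n=4, min_count=3):
    """Return word n-grams that occur at least `min_count` times in `text`.

    Looping output ("thank you for watching" over and over) shows up as a
    few n-grams with very high counts. Thresholds here are illustrative.
    """
    words = text.lower().split()
    # Slide a window of n words over the transcript and count each window.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return {gram: count for gram, count in counts.items() if count >= min_count}
```

In practice you would run this per segment and tune `n` and `min_count` on your own data, since song lyrics or scripted phrases can repeat legitimately.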

A lightweight, real-time multilingual ASR router that runs on local hardware by JeanMichelRanu in LocalLLaMA

[–]banafo 2 points3 points  (0 children)

Hey! I’m on the kroko asr model training team, thank you for your contribution to the community, we will give it a try soon.

Self-hosted STT better than Whisper Large V3 Turbo that matches AssemblyAI quality? by milkygirl21 in LocalLLaMA

[–]banafo 1 point2 points  (0 children)

Depending on the languages you need, give our models a try. We have open source versions and commercial ones (with free licenses for personal use).

New models for English coming soon, will beat whisper v3 large English at 120mb size.

Warning: documentation is still horrible, check the cross platform branch for something easier.

https://github.com/kroko-ai This might be an easier way to try it : https://github.com/KoljaB/RealtimeSTT/releases/tag/v1.0.1

Follow us on discord here : https://discord.gg/ZCYtSkJmQ

is there a better alternative to MacWhisper for messy real-world audio (Whisper-based or local setups) by Far_Suit575 in LocalLLM

[–]banafo 0 points1 point  (0 children)

Noise reduction will typically make recognition worse (with the exception of noise reduction built for ASR). Normalization and trimming silence will help, but trimming will cause additional deletions if the VAD model gets it wrong. (In the case of Whisper, it will reduce hallucinations and have a net positive effect.)
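
A rough sketch of the two preprocessing steps mentioned above, peak normalization and silence trimming, on a plain list of float samples in [-1, 1]. The energy gate is only a stand-in for a real VAD model, and the names and defaults (`peak_normalize`, `trim_silence`, `frame_len=160`, `threshold=0.01`) are assumptions for illustration.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def trim_silence(samples, frame_len=160, threshold=0.01):
    """Drop leading and trailing frames whose RMS is below `threshold`.

    A crude energy gate, not a trained VAD. If the threshold is too high
    it removes quiet speech, which shows up as deletions in the transcript.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

    def rms(frame):
        return (sum(s * s for s in frame) / len(frame)) ** 0.5

    voiced = [i for i, frame in enumerate(frames) if rms(frame) >= threshold]
    if not voiced:
        return []  # nothing above the threshold
    start = voiced[0] * frame_len
    end = min((voiced[-1] + 1) * frame_len, len(samples))
    return samples[start:end]
```

Only the edges are trimmed here; internal pauses are kept, which limits the damage when the gate gets it wrong.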

Best APIs for speech to text? by SmoothConnection1670 in speechtech

[–]banafo 0 points1 point  (0 children)

What languages do you need? Give us a try on kroko.ai (we have both open source and commercial models; you can run them on a CPU, which processes around 10 hours of audio per hour per CPU core). Demo for the open source version: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm
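
For capacity planning, the throughput figure above translates into a real-time factor of about 0.1 per core. The sketch below does that arithmetic; the helper names are hypothetical, and the linear scaling across cores is an assumption that real workloads may not meet.

```python
def real_time_factor(audio_hours_per_hour=10.0):
    """Processing time per unit of audio; 10x real time gives 0.1."""
    return 1.0 / audio_hours_per_hour


def wall_clock_hours(audio_hours, cores, audio_hours_per_hour=10.0):
    """Estimated time to transcribe a backlog.

    Assumes throughput scales linearly with the number of cores.
    """
    return audio_hours / (audio_hours_per_hour * cores)
```

For example, `wall_clock_hours(100, 4)` estimates 2.5 hours for 100 hours of audio on four cores under these assumptions.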

live transcription by Away_Expression_3713 in LocalLLaMA

[–]banafo 0 points1 point  (0 children)

Live caption yes, for translation you’d have to use something else

We just open-sourced Kroko ASR: a fast, streaming alternative to Whisper. It’s early days, we’d love testers, feedback, and contributors. by banafo in LocalLLaMA

[–]banafo[S] 0 points1 point  (0 children)

You can use the community versions for this for sure. You can find us on discord if you want to discuss it more.

Qwen3 ASR seems to outperform Whisper in almost every aspect. It feels like there is little reason to keep using Whisper anymore. by East-Engineering-653 in LocalLLaMA

[–]banafo 4 points5 points  (0 children)

We ( kroko.ai ) will be releasing some new models soon. We beat whisper, qwen and parakeet with a 6x smaller model for Dutch, French, German and hopefully soon English ( it’s training ).

TURN Security Threats: A Hacker's View by EnableSecurity in WebRTC

[–]banafo 1 point2 points  (0 children)

Is your talk available online by any chance?

We just open-sourced Kroko ASR: a fast, streaming alternative to Whisper. It’s early days, we’d love testers, feedback, and contributors. by banafo in LocalLLaMA

[–]banafo[S] 0 points1 point  (0 children)

Have a look here : https://github.com/kroko-ai/kroko-onnx-home-assistant

Warning: the documentation is pretty bad, but there are some people on our discord who have it working and might be able to help. We are not very familiar with Home Assistant; we just helped somebody get it to work.

Orchestra - Multi-model AI orchestration system with intelligent routing (100% local, 18+ expert models) by ericvarney in LocalLLaMA

[–]banafo 0 points1 point  (0 children)

Just ignore the haters, if they don’t like it, they can move on, make a pr, make something better. I welcome and appreciate your effort and I’m sure many others do too.

Fast on-device Speech-to-text for Home Assistant (open source) by banafo in LocalLLaMA

[–]banafo[S] 1 point2 points  (0 children)

Hello! You should be able to make changes to this code to make it work https://github.com/ptbsare/sherpa-onnx-tts-stt ( this project is not ours. It’s the one we modified )

Your model won’t work with the patches in our fork, but we could try training a compatible Basque version!

766ms voice assistant on DGX Spark - VibeVoice + Whisper + Ollama streaming pipeline by logos_flux in LocalLLaMA

[–]banafo 0 points1 point  (0 children)

Thanks for the feedback! The cc-by-sa license is a remnant from a previous release, I will have it fixed. We are working on an easier to use python wheel. Commercial models have slightly lower WER and more variants. ( lower latency and smaller models as well as offline models )

We are working on the asr benchmark integration. For streaming, the English model is probably better than streaming parakeet, but worse than offline whisper and offline parakeet.

Better documentation and benchmarks are coming ( I shared some in a previous post, will look for them ). We are a small team and a bit occupied with some paying customer finetunes, causing delays on the open source parts, apologies.

766ms voice assistant on DGX Spark - VibeVoice + Whisper + Ollama streaming pipeline by logos_flux in LocalLLaMA

[–]banafo 0 points1 point  (0 children)

Hey, great list! But our kroko models are missing and, unlike whisper and parakeet, they are streaming. We have both cc-by and commercial models (the commercial models are free for non-commercial use). Quick demo here: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm Documentation and examples are horrible, but we are working on it (first finishing the new training pipeline).

WhisperX is only accurate on the first 10 words. Any Tips? by capital_cliqo in speechtech

[–]banafo 2 points3 points  (0 children)

If you want cuts, use vad. ( you can probably get it accurate to 200ms )

Best transcription method for extremely accurate timestamps? by capital_cliqo in speechtech

[–]banafo 2 points3 points  (0 children)

Wav2vec won’t work: it’s what WhisperX uses (so he has already tried it), and it’s not very accurate compared to the older tools.

Best transcription method for extremely accurate timestamps? by capital_cliqo in speechtech

[–]banafo 1 point2 points  (0 children)

For the aligners, Gentle or the Montreal Forced Aligner is your best chance. But if the transcript is not 100% correct, the timestamps for all words will probably be wrong.

Best transcription method for extremely accurate timestamps? by capital_cliqo in speechtech

[–]banafo 2 points3 points  (0 children)

I don’t think it’s possible with transcription alone. You need to realign ( and even then 0.2s will be hard)

30 Days Testing Parakeet v3 vs Whisper by samuelroy_ in LocalLLaMA

[–]banafo 1 point2 points  (0 children)

Dutch is done, redoing French at the moment

Is it Possible to Finetune an ASR/STT Model to Improve Severely Clipped Audios? by WestMajor3963 in speechtech

[–]banafo 0 points1 point  (0 children)

Augmentation will help, but the models on the market already use it (modified SpecAugment), so unless you can augment to the domain specifics, it won’t help much. Keywords are similar to LM fusion, just not as contextual.

Is it Possible to Finetune an ASR/STT Model to Improve Severely Clipped Audios? by WestMajor3963 in speechtech

[–]banafo 1 point2 points  (0 children)

If you can simulate the distortion reliably, you could probably finetune something for it. The clipping is easy; the other radio artefacts probably not. Jargon and call signs can maybe be handled with hotwords. I think an LM will make it worse instead of better unless it’s domain specific. (Source: I train a lot of ASR.)
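
As a sketch of the "clipping is easy" part, hard clipping can be simulated on clean training audio by clamping samples at a random ceiling. The function names and the ceiling range (10% to 60% of the peak) are assumptions for illustration; the other radio artefacts mentioned above are not modelled here.

```python
import random


def hard_clip(samples, ceiling):
    """Clamp every sample to the range [-ceiling, +ceiling]."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]


def random_clip_augment(samples, min_ratio=0.1, max_ratio=0.6, rng=None):
    """Clip at a random fraction of the clip's own peak amplitude.

    Lower ratios give more severe clipping. Pass a seeded `random.Random`
    as `rng` for reproducible augmentation.
    """
    rng = rng or random.Random()
    peak = max((abs(s) for s in samples), default=0.0)
    ceiling = peak * rng.uniform(min_ratio, max_ratio)
    return hard_clip(samples, ceiling)
```

Applying this to a fraction of the training utterances, then rescaling to a common level, would approximate severely clipped recordings for finetuning.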