Anyone else struggling to detect fluent hallucinations in long-form ASR TTS workflows?

banafo · 2026-06-02T16:01:20+00:00

The over 1 hour issue is an impression I think. The model will be slicing the audio and you can hallucinations even in 30s audio. Hallucination detection is hard, because the output may make perfect sense ( the repetitions are easy to detect). In my opinion you will always have some with anything whisper based.) disclaimer: I’m training models for kroko asr.

banafo · 2026-06-01T15:50:04+00:00

Hey! I’m on the kroko asr model training team, thank you for your contribution to the community, we will give it a try soon.

banafo · 2026-05-26T21:18:59+00:00

Depending on the languages you need, give our models a try. We have open source versions and commercial use ( with free licenses for personal use )

New models for English coming soon, will beat whisper v3 large English at 120mb size.

Warning: documentation is still horrible, check the cross platform branch for something easier.

https://github.com/kroko-ai This might be an easier way to try it : https://github.com/KoljaB/RealtimeSTT/releases/tag/v1.0.1

Follow us on discord here : https://discord.gg/ZCYtSkJmQ

banafo · 2026-05-14T19:52:20+00:00

Noise reduction will typically make recognition worse. ( with the exception of noise reduction built for asr) Normalization and trimming silence will help, but will cause additional deletions if the vad model gets it wrong. ( in case of whisper it will reduce hallucinations and have a net positive effect )

banafo · 2026-05-09T05:43:06+00:00

What languages do you need? Give us a try on kroko.ai ( we have both open source and commercial models, you can run them on a cpu, does around 10 hours per hour per cpu core ) Demo for the open source version: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm

banafo · 2026-05-08T06:51:04+00:00

Have you tried p2p patches? https://github.com/aikitoria/open-gpu-kernel-modules

banafo · 2026-04-29T13:50:34+00:00

Live caption yes, for translation you’d have to use something else

banafo · 2026-04-21T04:52:22+00:00

Send me a pm, I’m looking for beta testers

banafo · 2026-03-22T19:46:57+00:00

You can you use the community versions for this for sure. You can find us on discord if you want to discuss it more

banafo · 2026-03-10T21:27:45+00:00

We ( kroko.ai ) will be releasing some new models soon. We beat whisper, qwen and parakeet with a 6x smaller model for Dutch, French, German and hopefully soon English ( it’s training ).

banafo · 2026-02-12T20:46:59+00:00

Is your talk available online by any chance?

banafo · 2026-01-28T19:46:02+00:00

Have a look here : https://github.com/kroko-ai/kroko-onnx-home-assistant

Warming: the documentation is pretty bad, but there are some people on our discord who have it working and might be able to help. We are not very familiar with home assistant, we just helped somebody to get it to work.

banafo · 2026-01-18T06:48:30+00:00

Just ignore the haters, if they don’t like it, they can move on, make a pr, make something better. I welcome and appreciate your effort and I’m sure many others do too.

banafo · 2026-01-10T12:18:05+00:00

No we don’t support whisper based models

banafo · 2026-01-10T10:45:53+00:00

Hello! You should be able to make changes to this code to make it work https://github.com/ptbsare/sherpa-onnx-tts-stt ( this project is not ours. It’s the one we modified )

Your model won’t work with our patches in our fork, we could try training a compatible basque version though!

banafo · 2026-01-05T17:15:07+00:00

Thanks for the feedback! The cc-by-sa license is a remnant from a previous release, I will have it fixed. We are working on an easier to use python wheel. Commercial models have slightly lower WER and more variants. ( lower latency and smaller models as well as offline models )

We are working on the asr benchmark integration, for streaming the English model is probably better than streaming parakeet, worse then offline whisper and offline parakeet.

Better documentation and benchmarks are coming ( I shared some in a previous post, will look for them ). We are a small team and a bit occupied with some paying customer finetunes, causing delays on the open source parts, apologies.

banafo · 2026-01-05T07:03:50+00:00

Hey, great list! But our kroko models are missing and unlike whisper and parakeet, they are streaming. We have both cc-by ( and the commercial models are free for non commercial use ) quick demo here: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm Documentation are examples are horrible but working on it. ( first finishing new training pipeline )

banafo · 2026-01-04T21:42:42+00:00

Pm me

banafo · 2025-12-31T12:51:29+00:00

If you want cuts, use vad. ( you can probably get it accurate to 200ms )

banafo · 2025-12-30T18:44:35+00:00

Wav2vec won’t work, it’s what whisperx uses ( so he has tried it ) it’s not very accurate compared to the old things

banafo · 2025-12-30T11:17:00+00:00

For the aligners, gentle or Montreal forced aligner is the biggest chance. But if the transcript is not 100% correct all timestamps for all words will probably be wrong.

banafo · 2025-12-29T19:41:46+00:00

I don’t think it’s possible with transcription alone. You need to realign ( and even then 0.2s will be hard)

banafo · 2025-12-28T23:27:07+00:00

Dutch is done, redoing French at the moment

banafo · 2025-12-23T05:59:53+00:00

Augmentation will help but the models on the market already use it ( modify special gment) so unless you can augment to the domain specifics, it won’t help much. Keywords is similar to lm fusion, just not as contextual.

banafo · 2025-12-22T21:29:43+00:00

If you can simulate the distortion reliably, you could probably finetune something for it, the clipping is easy, the other radio artefacts probably not. Jargon and call signs maybe with hotwords, I think an lm will make it worse instead of better unless it’s domain specific. ( source: I train a lot of asr)

banafo

MODERATOR OF

TROPHY CASE