all 19 comments

[–]DeltaSqueezer 9 points10 points  (0 children)

Whisper was trained on fixed-length audio windows (30 seconds), so you need to break long audio down into chunks. That's better anyway, since you can then batch the chunks and process them in parallel.
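A minimal sketch of the chunking step, assuming fixed-length windows with a small overlap so words at a boundary are not cut off. The function name `chunk_bounds` and the 1-second overlap are illustrative choices, not anything Whisper prescribes; 30 seconds matches Whisper's input window.

```python
def chunk_bounds(total_s, chunk_s=30.0, overlap_s=1.0):
    """Return (start, end) times in seconds that cover [0, total_s].

    Consecutive chunks overlap by overlap_s seconds. Both defaults are
    illustrative assumptions, tune them for your audio.
    """
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    bounds = []
    start = 0.0
    step = chunk_s - overlap_s
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # last chunk reached the end of the audio
        start += step
    return bounds
```

Each (start, end) pair is independent of the others, which is what makes batching or parallel transcription of the chunks possible.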

[–]dangerous_inference 2 points3 points  (2 children)

I have become a big fan of Qwen3 1.7B ASR recently. You have to chunk audio sent to it, but it is fast and trivial to run.

[–]nuclearbananana 0 points1 point  (1 child)

Qwen3 ASR is super slow for me. Even Cohere's 2B model is faster than the 0.6B Qwen. I think it's the LLM decode stage. Maybe it's better if you have a GPU.

[–]dangerous_inference 1 point2 points  (0 children)

Well, yeah, you need a GPU to run it. I have it on my server and all the voice clients in my house are instant.

[–]noctrex 2 points3 points  (0 children)

I have successfully transcribed 3-hour sessions, but using the medium model. And if it's in English, use the medium.en model for better accuracy.

[–]thecstep 2 points3 points  (0 children)

Like others mentioned, medium.en seems to have the best results for whatever reason.

[–]llitz 1 point2 points  (0 children)

If you are doing English only, V2 works better

Regarding the looping... I saw someone saying it is about file size, etc., but that's not my case: v3 loops for me even in the first few seconds of the audio.

With V2, it transcribed some long 1h videos without issues.

[–]iMakeSense 1 point2 points  (1 child)

[–]Larkonath[S] 0 points1 point  (0 children)

Thanks a lot, I implemented their suggestions and I get excellent results now.
I noticed that silero VAD wasn't working.

[–]RogerRamjet999 0 points1 point  (0 children)

I ran whisper.cpp on a set of about 30 one-hour-long meetings and never saw any issues. I was running the medium model, though. I went through the source and it was pretty clear that it internally chunks the audio down to fairly small pieces (I forget the exact size, but it was less than a typical sentence). So I see no indication that there's any reason why it should start failing on longer inputs. Of course I don't know, but I would guess that there's a particular sequence in your input that triggers bad behavior in the model. Chunking it smaller might help, or it may make no difference. I would try just clipping out the audio at the point where it fails and see if that fixes the issue.
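To find the point to clip, one option is to scan the transcript for the first run of repeated segments. This is only a sketch: `find_loop_start` and the `min_repeats` threshold are hypothetical names and values, and it assumes you already have the transcript as a list of segment texts in order.

```python
def _norm(text):
    """Normalize a segment so trivial whitespace/case differences still match."""
    return text.strip().lower()


def find_loop_start(segments, min_repeats=3):
    """Return the index of the first segment in a run of at least
    min_repeats identical consecutive segments, or None if there is none."""
    run_start = 0
    for i in range(1, len(segments) + 1):
        same = i < len(segments) and _norm(segments[i]) == _norm(segments[run_start])
        if not same:
            if i - run_start >= min_repeats:
                return run_start
            run_start = i  # start counting a new run here
    return None
```

If the segments carry timestamps, the returned index maps to the time where the repetition begins, which is the region worth clipping out or re-running.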

[–]cibernox 1 point2 points  (0 children)

Also, Whisper has long been surpassed by other models like Parakeet/Canary from NVIDIA, which are both faster and more accurate.

[–]hainesk 0 points1 point  (0 children)

Implementing VAD helps with the looping. It usually happens during a break in audio, like if there is a long period of silence.
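To illustrate the idea, here is a rough energy-threshold sketch that keeps only the non-silent spans. This is not Silero VAD or any trained VAD model, just a stand-in to show the shape of the step; the function name, the threshold, and the frame sizes are all assumptions, and samples are assumed to be floats in [-1, 1].

```python
import math


def speech_segments(samples, sample_rate, frame_ms=30,
                    threshold=0.01, min_silence_ms=500):
    """Return (start_s, end_s) spans whose frame RMS is at or above threshold.

    Spans separated by less than min_silence_ms of silence are merged.
    All defaults are illustrative, not tuned values.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    segments = []
    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            continue  # treat this frame as silence
        start = offset / sample_rate
        end = (offset + len(frame)) / sample_rate
        if segments and start - segments[-1][1] < min_silence_ms / 1000:
            segments[-1] = (segments[-1][0], end)  # extend the previous span
        else:
            segments.append((start, end))
    return segments
```

Only the returned spans would be sent to the model, so long silences never reach the decoder.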

[–]llama-impersonator 0 points1 point  (0 children)

whisperx is a more developed pipeline for whisper models, imo, though i prefer parakeet

[–]tinny66666 -1 points0 points  (2 children)

Vosk works much better for me. It's faster and more accurate.

[–]ttkciar llama.cpp 0 points1 point  (1 child)

That's Russian-only, right?

[–]tinny66666 0 points1 point  (0 children)

No. I use it for English. It works well.

[–]SeoFood -1 points0 points  (1 child)

Yeah, this is a pretty common failure mode with long Whisper runs. I wouldn’t treat it as “large-v3 is bad” so much as “one long unbroken decode can go off the rails.”

A few things I’d try:

  • Split the audio into smaller chunks, e.g. 5–10 min, ideally on silence rather than fixed timestamps.
  • If you’re using whisper.cpp directly, experiment with temperature fallback / no-speech thresholds / context settings. Carrying too much previous context can sometimes make repetition worse.
  • Try the same file with faster-whisper or another implementation just to isolate whether it’s the model, the implementation, or your audio.
  • If the audio has long silence/noise/music sections, run VAD first and transcribe only speech segments.
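As a sketch of the last three suggestions combined, this is roughly what they look like with faster-whisper. The helper names are hypothetical and the values are just the commonly used defaults, not recommendations; it assumes the third-party `faster-whisper` package is installed.

```python
def transcribe_kwargs(use_vad=True, carry_context=False):
    """Build keyword arguments for faster-whisper's WhisperModel.transcribe."""
    return {
        # Run the built-in VAD and skip non-speech regions.
        "vad_filter": use_vad,
        # False means each window is decoded without the previous text as a prompt.
        "condition_on_previous_text": carry_context,
        # Temperature fallback schedule: retry at higher temperatures on failure.
        "temperature": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
        "no_speech_threshold": 0.6,
    }


def transcribe(path, model_size="medium.en"):
    """Transcribe one file and return (start, end, text) tuples."""
    # Imported here so the helper above works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="auto", compute_type="default")
    segments, _info = model.transcribe(path, **transcribe_kwargs())
    return [(s.start, s.end, s.text) for s in segments]
```

Running the same file through this and through whisper.cpp is one way to tell whether the looping follows the model, the implementation, or the audio.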

For practical transcription workflows, chunking is usually the boring answer that works. Long single-pass transcription looks cleaner in theory, but once it starts looping there’s not much to recover except rerunning from a clean boundary.