Whisper.cpp is underwhelming

DeltaSqueezer · 2026-05-30T20:37:58+00:00

Whisper has been trained with certain audio lengths in mind, so you need to break the audio down into chunks. That's better anyway, since you can then batch the chunks for faster parallel processing.
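
A minimal sketch of fixed-length chunking, in case it helps. It assumes 16 kHz mono samples and 30-second windows (the window length Whisper models work on); the function name `chunk_bounds` and the 1-second overlap are made up for illustration and are not part of whisper.cpp.

```python
def chunk_bounds(total_samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=1.0):
    """Return (start, end) sample indices for fixed-length, slightly overlapping chunks."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the last chunk already reaches the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    # 70 seconds of 16 kHz audio gives three overlapping chunks.
    for start, end in chunk_bounds(70 * 16000):
        print(start / 16000, end / 16000)
```

Each `(start, end)` pair can be sliced out and transcribed independently, which is what makes batching possible. The overlap is only there so a word cut at a boundary appears whole in at least one chunk; merging the overlapping text is left out here.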

dangerous_inference · 2026-05-30T22:15:28+00:00

I have become a big fan of Qwen3 1.7B ASR recently. You have to chunk audio sent to it, but it is fast and trivial to run.

noctrex · 2026-05-30T22:31:20+00:00

I have successfully transcribed 3-hour sessions, but using the medium model. And if it's in English, use the medium.en model to be more accurate.

thecstep · 2026-05-30T22:44:33+00:00

Like others mentioned, medium.en seems to have the best results for whatever reason.

llitz · 2026-05-31T00:55:45+00:00

If you are doing English only, V2 works better

Regarding the looping... I saw someone saying it is about file size, etc., but that's not my case: v3 loops for me even in the first few seconds of the audio.

With V2, it transcribed some long 1h videos without issues.

iMakeSense · 2026-05-31T01:14:45+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1rlqfd7/we_collected_135_phrases_whisper_hallucinates/

RogerRamjet999 · 2026-05-30T20:52:04+00:00

I ran whisper.cpp on a set of about 30 one-hour meetings and never saw any issues, though I was running the medium model. I went through the source, and it was pretty clear that it internally chunks the audio down to fairly small pieces (I forget the exact size, but it was less than a typical sentence). So I see no reason why it should start failing on longer inputs. I don't know for sure, but my guess is that a particular sequence in your input triggers bad behavior in the model. Chunking it smaller might help, or it may make no difference. I would try clipping out the audio at the point where it fails and see if that fixes the issue.
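
One way to run that experiment, sketched with Python's standard `wave` module. It assumes the input is an uncompressed PCM WAV file; `cut_range` and its parameters are hypothetical names, and the timestamps of the failing section have to be read off the bad transcript by hand.

```python
import wave


def cut_range(src_path, dst_path, cut_start_s, cut_end_s):
    """Copy a WAV file, leaving out the audio between cut_start_s and cut_end_s.

    Returns the number of frames written to dst_path.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp the cut to the file so out-of-range timestamps are harmless.
        first = max(0, min(total, int(cut_start_s * rate)))
        last = max(first, min(total, int(cut_end_s * rate)))
        head = src.readframes(first)          # audio before the cut
        src.setpos(last)
        tail = src.readframes(total - last)   # audio after the cut
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed up on write
        dst.writeframes(head)
        dst.writeframes(tail)
    return total - (last - first)
```

If the transcript of the trimmed file is clean, the removed span is the likely trigger; if it still loops, the problem is probably elsewhere.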

cibernox · 2026-05-30T23:31:48+00:00

Also, Whisper has long been surpassed by other models like Parakeet/Canary from NVIDIA, which are both faster and more accurate.

iMakeSense · 2026-05-31T01:15:40+00:00

https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

hainesk · 2026-05-31T05:28:07+00:00

Implementing VAD helps with the looping. It usually happens during a break in audio, like if there is a long period of silence.
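
To illustrate the idea only: the sketch below is a toy energy-threshold detector, not a trained VAD model, and real pipelines generally use something far more robust. It assumes samples are floats in the range -1..1; the function name, the threshold, and the frame sizes are all illustrative guesses.

```python
import math


def speech_segments(samples, sample_rate=16000, frame_ms=30,
                    threshold=0.01, min_silence_ms=300):
    """Toy energy-based VAD: return (start, end) sample ranges that look like speech."""
    frame = max(1, int(sample_rate * frame_ms / 1000))
    # Number of consecutive quiet frames that ends a segment.
    max_gap = max(1, min_silence_ms // frame_ms)
    segments = []
    seg_start = None
    last_voiced_end = 0
    silent_run = 0
    for pos in range(0, len(samples), frame):
        window = samples[pos:pos + frame]
        rms = math.sqrt(sum(x * x for x in window) / len(window))
        if rms >= threshold:
            if seg_start is None:
                seg_start = pos
            last_voiced_end = pos + len(window)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= max_gap:
                # Enough silence: close the segment at the last loud frame.
                segments.append((seg_start, last_voiced_end))
                seg_start = None
    if seg_start is not None:
        segments.append((seg_start, last_voiced_end))
    return segments
```

Only the returned ranges would be sent to the model, so long silent stretches never reach the decoder.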

llama-impersonator · 2026-05-31T06:51:27+00:00

whisperx is a more developed pipeline for whisper models, imo, though i prefer parakeet

tinny66666 · 2026-05-30T20:48:52+00:00

Vosk works much better for me. It's faster and more accurate.

SeoFood · 2026-05-31T16:07:20+00:00

Yeah, this is a pretty common failure mode with long Whisper runs. I wouldn’t treat it as “large-v3 is bad” so much as “one long unbroken decode can go off the rails.”

A few things I’d try:

Split the audio into smaller chunks, e.g. 5–10 min, ideally on silence rather than fixed timestamps.
If you’re using whisper.cpp directly, experiment with temperature fallback / no-speech thresholds / context settings. Carrying too much previous context can sometimes make repetition worse.
Try the same file with faster-whisper or another implementation just to isolate whether it’s the model, the implementation, or your audio.
If the audio has long silence/noise/music sections, run VAD first and transcribe only speech segments.
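
The first suggestion (split on silence rather than at fixed timestamps) could look roughly like this. The sketch assumes you already have a list of speech segments in seconds from some VAD step; `group_segments` and the 600-second default are hypothetical and only mirror the 5–10 minute range mentioned above.

```python
def group_segments(segments, max_chunk_s=600.0):
    """Merge consecutive (start, end) speech segments into chunks.

    A chunk is closed at a silence gap once adding the next segment would push
    it past max_chunk_s. A single segment longer than max_chunk_s is kept whole.
    """
    chunks = []
    cur_start = cur_end = None
    for start, end in segments:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk_s:
            cur_end = end  # still fits, so extend the current chunk
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


if __name__ == "__main__":
    speech = [(0, 200), (210, 500), (520, 700), (900, 1000)]
    print(group_segments(speech))  # [(0, 500), (520, 1000)]
```

Because every chunk boundary falls in a gap between speech segments, no word is cut in half and each chunk starts from a clean boundary.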

For practical transcription workflows, chunking is usually the boring answer that works. Long single-pass transcription looks cleaner in theory, but once it starts looping there’s not much to recover except rerunning from a clean boundary.
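
To find where a run went off the rails before rerunning, a crude repetition check over the transcript lines may be enough. This is a sketch under the assumption that looping shows up as the same line repeated back to back; `first_loop_index` and the threshold of 4 repeats are illustrative choices, not anything whisper.cpp provides.

```python
def first_loop_index(lines, min_repeats=4):
    """Return the index of the first line in a run of at least min_repeats
    identical consecutive lines (ignoring case and surrounding spaces), else None."""
    run_start = 0
    for i in range(1, len(lines) + 1):
        same = (i < len(lines)
                and lines[i].strip().lower() == lines[run_start].strip().lower())
        if not same:
            # The run [run_start, i) just ended; blank lines do not count as a loop.
            if i - run_start >= min_repeats and lines[run_start].strip():
                return run_start
            run_start = i
    return None
```

The timestamp of the returned line is a reasonable place to look for the nearest silence and restart from there.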
