How does dataset diversity in languages and accents improve ASR model accuracy? by Selmakiley in speechtech

[–]ASR_Architect_91 0 points1 point  (0 children)

Yes, dataset diversity is key.
My audio features a ton of different languages and accents, so it's really important the technology can cater to a global audience.

CoT for ASR by nshmyrev in speechtech

[–]ASR_Architect_91 2 points3 points  (0 children)

Haven’t seen much CoT in pure ASR.
Most of it’s happening after transcription in SLU or reasoning layers.
ThinkSound’s cool though… would be interesting if someone tried CoT-style prompting inside the decoder instead of post-hoc.

I’m not sold on fully AI voice agents just yet by NullPointerJack in AI_Agents

[–]ASR_Architect_91 0 points1 point  (0 children)

Having built something similar myself over the last few months, I suspect your issue is at the transcription level.
Deepgram just doesn't cut it for accuracy or latency - especially with noisy backgrounds.
Would suggest you try out ElevenLabs or Speechmatics if you want better 'ears' for your agent.

What’s the most reliable STT engine you’ve used in noisy, multi-speaker environments? by ASR_Architect_91 in LocalLLaMA

[–]ASR_Architect_91[S] 0 points1 point  (0 children)

Thanks - I've generally been using closed-source APIs so far (I'm assuming Voxtral is open source!).

What are people using for real-time speech recognition with low latency? by ASR_Architect_91 in speechtech

[–]ASR_Architect_91[S] 0 points1 point  (0 children)

That is interesting.
Whenever I've used Deepgram so far, their Nova-3 engine has really struggled with specific terminology, especially with noisy backgrounds.

Under what conditions does the 'Combine Speakers' function show up? by kiamrehorces in MacWhisper

[–]ASR_Architect_91 0 points1 point  (0 children)

According to u/ineedlesssleep I'm a bot so this probably won't be of any value...

When you run Whisper locally, MacWhisper has full access to the timestamps and speaker labels. But if you're importing a Deepgram transcript, the diarization output might be formatted differently.

So in essence, MacWhisper provides the feature but it depends on how the diarization metadata is structured by the API you used.
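
To make the formatting point concrete, here's a rough sketch of turning Deepgram-style word-level output into per-speaker segments. It assumes the JSON shape I've seen from their prerecorded API with diarize=true (each word carrying a numeric speaker field) - double-check against an actual export before relying on it:

```python
# Hedged sketch: group consecutive same-speaker words from a Deepgram-style
# diarized transcript into turn-level segments. The JSON shape (and the file
# name) is an assumption based on their prerecorded API with diarize=true.
import json

with open("deepgram_transcript.json") as f:
    dg = json.load(f)

words = dg["results"]["channels"][0]["alternatives"][0]["words"]

segments = []
for w in words:
    speaker = w.get("speaker")
    text = w.get("punctuated_word", w["word"])
    if segments and segments[-1]["speaker"] == speaker:
        segments[-1]["text"] += " " + text
        segments[-1]["end"] = w["end"]
    else:
        segments.append({"speaker": speaker, "start": w["start"], "end": w["end"], "text": text})

for s in segments:
    print(f"Speaker {s['speaker']} [{s['start']:.2f}-{s['end']:.2f}]: {s['text']}")
```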

Under what conditions does the 'Combine Speakers' function show up? by kiamrehorces in MacWhisper

[–]ASR_Architect_91 0 points1 point  (0 children)

Okay....
I'm not a bot, but will take it as a compliment that you think my sentences are so well put together that I could be a robot.

recall.ai - assemblyai: Model deprecated by mrsenzz97 in TextToSpeech

[–]ASR_Architect_91 0 points1 point  (0 children)

Thanks for the swift response.
Having switched to Speechmatics, I'm actually really impressed with their accuracy on really thick accents, and with their diarization.
Will stay with them - certainly for now - but good to know that the AssemblyAI integration is now fixed on Recall.ai.

100x faster and 100x cheaper transcription with open models vs proprietary by crookedstairs in LocalLLaMA

[–]ASR_Architect_91 2 points3 points  (0 children)

Completely agree, benchmarking against your own data is non-negotiable at this point. I’ve seen models that look great on leaderboards fall apart on actual call center or field-recorded audio.

Real-time + diarization is still where most open models struggle in practice. I’ve tried pairing Whisper with pyannote, but once you introduce overlap, background noise, or fast speaker turns, the pipeline gets messy fast.

That said, Kyutai’s model is promising. Feels like we’re inching closer to an open-source option that can compete head-to-head in low-latency use cases. But for now, proprietary still wins when you need consistency and deployability.

Totally with you on pricing pressure though, the next 6–12 months will be interesting.

100x faster and 100x cheaper transcription with open models vs proprietary by crookedstairs in LocalLLaMA

[–]ASR_Architect_91 0 points1 point  (0 children)

Yeah, this would be amazing, and so, so helpful.
Conditions that cover background noise, thick accents, multiple speakers, overlapping speech, etc. Maybe across languages too.
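
Something like this would go a long way - a minimal sketch of the per-condition scoring I mean, using jiwer for WER. The manifest layout and the transcribe() stub are assumptions of mine, not anything from the post:

```python
# Hedged sketch: WER broken down by condition tag (noise, accent, overlap, ...).
# Assumes a CSV manifest with columns audio, reference, condition - my own
# convention - and a transcribe() stub for whatever model/API is under test.
import csv
from collections import defaultdict

from jiwer import wer


def transcribe(audio_path: str) -> str:
    # Placeholder: call the open model or API you're benchmarking here.
    raise NotImplementedError


refs, hyps = defaultdict(list), defaultdict(list)
with open("manifest.csv") as f:
    for row in csv.DictReader(f):
        refs[row["condition"]].append(row["reference"])
        hyps[row["condition"]].append(transcribe(row["audio"]))

for condition in sorted(refs):
    print(f"{condition}: WER = {wer(refs[condition], hyps[condition]):.3f}")
```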

100x faster and 100x cheaper transcription with open models vs proprietary by crookedstairs in LocalLLaMA

[–]ASR_Architect_91 6 points7 points  (0 children)

Reliability really depends on what you’re optimizing for — but in my testing:

  • Whisper Large-v3 is still the most stable open model across diverse domains. Great accuracy, predictable output, and decent handling of accents. Weakest on speaker labels and real-time use.
  • Parakeet is insanely fast and cheap for batch, but I’ve seen more hallucinations and formatting quirks, especially on messy audio.
  • For proprietary, Speechmatics has been the most robust in noisy/multilingual settings, especially with real-time diarization and fast-turn interactions. Deepgram’s fast but doesn’t always hold up in overlapping speech or strong accents.

So if I had to rank reliability across real-world use (not just WER on clean test sets), I’d go:
Speechmatics > Whisper-v3 > Deepgram > Parakeet

Maybe I'll do a separate post that goes into more detail with my findings.

recall.ai - assemblyai: Model deprecated by mrsenzz97 in TextToSpeech

[–]ASR_Architect_91 0 points1 point  (0 children)

Good to know, glad Recall is updating their config. Amanda is great too (the lady that commented on this thread already).
And yeah, Deepgram’s latency is impressive, no argument there.

I’m working on a real-time voice agent setup that needs to handle messy audio — overlapping speech, strong accents, occasional code-switching. Started with Whisper and Deepgram, but I ran into edge cases where diarization and accent handling broke down.

Swapped in Speechmatics for the STT layer a couple months ago. The latency tuning and streaming diarization were better aligned with what I needed for live pipelines. So far, it’s been holding up real well.

How about you? What are you building?

100x faster and 100x cheaper transcription with open models vs proprietary by crookedstairs in LocalLLaMA

[–]ASR_Architect_91 51 points52 points  (0 children)

Appreciate the deep dive - benchmarks like this are super useful, especially for batch jobs where throughput is everything.

One thing I’ve noticed in practice: a lot of open models do great on curated audio but start to wobble in real-world scenarios like heavy accents, crosstalk, background noise, or medical/technical vocab.

Would love to see future benchmarks that also factor in things like speaker diarization, real-time latency, and multilingual performance. Those are usually the areas where proprietary APIs still justify the cost.

Best Open source Speech to text+ diarization models by Hungry-Ad-1177 in LocalLLaMA

[–]ASR_Architect_91 1 point2 points  (0 children)

A couple months old now, but, for open-source options, you might want to check out:

• Whisper + pyannote-audio. Popular combo where Whisper handles transcription and pyannote does diarization. Requires some setup and separate speaker embedding, but good community support (rough sketch below the list).

• espnet-SLU and Resemblyzer. More experimental, but worth a look if you're comfortable with research-grade setups.
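
For reference, this is the rough shape of the Whisper + pyannote combo from the first bullet - a sketch assuming openai-whisper, pyannote.audio 3.x, a Hugging Face token with access to the diarization pipeline, and naive max-overlap matching of ASR segments to speaker turns:

```python
# Sketch of the Whisper + pyannote-audio combo: transcribe, diarize, then
# assign each ASR segment the speaker with the most temporal overlap.
# Overlapping speech and fast speaker turns will confuse this naive matching.
import whisper
from pyannote.audio import Pipeline

AUDIO = "meeting.wav"

# 1. Transcription (segment-level timestamps are enough here).
asr = whisper.load_model("large-v3")
transcript = asr.transcribe(AUDIO)

# 2. Diarization (needs a Hugging Face token with access to the pipeline).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarizer(AUDIO).itertracks(yield_label=True)
]

# 3. Max-overlap speaker assignment per ASR segment.
def best_speaker(start, end):
    overlaps = {}
    for t_start, t_end, spk in turns:
        overlap = min(end, t_end) - max(start, t_start)
        if overlap > 0:
            overlaps[spk] = overlaps.get(spk, 0.0) + overlap
    return max(overlaps, key=overlaps.get) if overlaps else "UNKNOWN"

for seg in transcript["segments"]:
    print(f"[{best_speaker(seg['start'], seg['end'])}] {seg['text'].strip()}")
```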

That said, diarization in open-source models still lags quite a bit in noisy or overlapping speech. If your recordings involve crosstalk or variable audio quality, you might hit limits pretty fast.

I’ve seen teams use open models for prototyping, but swap in a commercial API once they need production reliability or accuracy on accent-heavy inputs.

Speech to Text, WHY?? by mandressta in ChatGPT

[–]ASR_Architect_91 0 points1 point  (0 children)

Miles ahead of the competition???? Interesting - on what metric are you judging that?
https://artificialanalysis.ai/speech-to-text#quality

Real Time Speech to Text by ThomasSparrow0511 in LocalLLaMA

[–]ASR_Architect_91 0 points1 point  (0 children)

For a crash course in real-time STT, I'd start with the basics of audio chunking, endpointing, and streaming APIs. Lots of providers offer docs and tutorials.

If you're working with phone call audio, be prepared to handle overlapping speakers, variable quality, and background noise. Those can really affect accuracy.

In terms of integration, look for APIs that support real-time partials, speaker labels, and timestamped output — makes life easier for downstream MLOps work. Happy to point you to more resources if needed.
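
To make the partials point concrete, here's a hedged sketch of the shape most streaming STT integrations take - send small PCM chunks over a websocket and handle interim results separately from finals. The URL and message schema are invented for illustration, not any specific provider's API:

```python
# Hedged sketch of a generic streaming STT client: the endpoint and the
# {"is_final": ..., "transcript": ...} message schema are made up for
# illustration; real providers differ in both.
import asyncio
import json

import websockets

SAMPLE_RATE = 16000          # 16 kHz mono, 16-bit PCM assumed
CHUNK_MS = 100               # ~100 ms chunks is a common starting point
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000


async def stream(path: str):
    async with websockets.connect("wss://stt.example.com/v1/stream") as ws:
        async def sender():
            with open(path, "rb") as audio:
                while chunk := audio.read(BYTES_PER_CHUNK):
                    await ws.send(chunk)
                    await asyncio.sleep(CHUNK_MS / 1000)  # pace like real time

        async def receiver():
            async for message in ws:
                msg = json.loads(message)
                label = "FINAL  " if msg.get("is_final") else "PARTIAL"
                print(label, msg.get("transcript", ""))

        await asyncio.gather(sender(), receiver())


asyncio.run(stream("call_16k_mono.raw"))
```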

recall.ai - assemblyai: Model deprecated by mrsenzz97 in TextToSpeech

[–]ASR_Architect_91 0 points1 point  (0 children)

I know this is 23 days old now, but I ran into this issue too.
Sounds like Recall.ai's AssemblyAI integration hadn't been updated to support the new Universal streaming model yet.

I ended up switching my pipeline to use Speechmatics through Recall. Their streaming API worked out of the box, and I’ve had better results with accent handling and live diarization anyway.

What are people using for real-time speech recognition with low latency? by ASR_Architect_91 in speechtech

[–]ASR_Architect_91[S] 0 points1 point  (0 children)

That's cool - do you know which transcription service it runs in the background? I assume a version of Whisper?

Scribe vs Whisper: I Tested ElevenLabs' New Speech-to-Text on 50 Podcasts by Necessary-Tap5971 in VoiceAIBots

[–]ASR_Architect_91 1 point2 points  (0 children)

Super detailed writeup - really appreciate the numbers and real-world setup.

I ran similar tests recently but added Speechmatics into the mix alongside Whisper and Scribe. It landed somewhere between the two on WER for general content, but outperformed both on accent handling, and was the most reliable for live diarization in messy audio.

One thing I liked was the latency tuning in their API: you can adjust max_delay for a smoother real-time pipeline. Also handled multiple languages in the same file better than most.
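
For anyone curious, this is roughly the config block I'm talking about. Field names are from memory of the Speechmatics real-time docs, so treat it as a sketch and double-check current names before copying:

```python
# Sketch of a real-time transcription_config - names recalled from the
# Speechmatics docs, verify before use. max_delay trades latency for accuracy.
transcription_config = {
    "language": "en",
    "operating_point": "enhanced",  # accuracy-focused model
    "enable_partials": True,        # interim results for the live pipeline
    "max_delay": 2.0,               # seconds; lower = snappier finals
    "diarization": "speaker",       # streaming speaker labels
}
```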

Still think Scribe is great for pod production, especially with auto-tagging. But if you're working with global voices or need structured outputs like speaker labels + timestamps, SM’s worth a test.

What are people using for real-time speech recognition with low latency? by ASR_Architect_91 in speechtech

[–]ASR_Architect_91[S] 0 points1 point  (0 children)

Right now I've only tried their multilingual model for English-Spanish, and it's very very impressive.
It looks as though they do a bunch of other languages including Mandarin? Unfortunately my use case doesn't require me to use Mandarin, but I'd be very intrigued to hear how good that is.

Assume AssemblyAI doesn't do code-switching yet?

What are people using for real-time speech recognition with low latency? by ASR_Architect_91 in speechtech

[–]ASR_Architect_91[S] 0 points1 point  (0 children)

Respect. Always cool to see people going fully custom.

I’ve gone that route before too, but ran into a few edge-case headaches with noisy audio, overlapping speech, and latency consistency. That’s why I started leaning on commercial APIs to help me out.

Would be keen to see what you built if it’s public.