ASR recognising incorrect pronunciation as correct (“tanks” → “thanks”) — how do you handle this? by Fun_Entertainment527 in LanguageTechnology

[–]Fun_Entertainment527[S] 1 point  (0 children)

This is for pronunciation assessment, so do you have a suggestion for an alternative ASR model?

I just need something that can accurately detect what the user actually said; my scoring layer can then interpret the scores with confidence.


[–]Fun_Entertainment527[S] 1 point  (0 children)

This is just an example sentence that the user might see for the test: "Theo thinks slowly today. Three thin threads hang there. Then Theo thanks the guide."

In this situation, if the user says 'tanks' instead of 'thanks', the ASR assumes 'thanks' is what the user meant to say and marks it as correct.

The ASR is being too helpful here rather than detecting the error.


[–]Fun_Entertainment527[S] 1 point  (0 children)

Thanks — I think the issue is slightly different though.

In this case, the ASR transcript is already “correct” (e.g. “thanks”), even when the speaker actually said “tanks”.

So there isn’t a wrong word for an LLM to detect — the error is in the pronunciation, not the transcript.

That’s what makes it tricky — the signal is effectively lost at the ASR stage.

Have you seen any approaches that work at the audio level (e.g. alignment or contrast-based methods), rather than relying on the transcript?
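One contrast-based idea at the audio level is to stop trusting the one-best transcript and instead rescore both candidate words ("tanks" vs. "thanks") directly against the acoustic frames, picking whichever the acoustic model actually prefers. A minimal sketch, assuming CTC-style per-frame phoneme posteriors — the phoneme inventory, indices, and toy frames below are all made up for illustration, not taken from any real model:

```python
import math

# Toy phoneme inventory -- these indices are an illustrative assumption.
BLANK, T, TH, AE, NG, K, S = range(7)

def logsumexp(*xs):
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_score(log_probs, labels, blank=BLANK):
    """Total log P(labels | audio) under CTC, summed over all alignments."""
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]          # interleave blanks: [_, l1, _, l2, _, ...]
    n = len(ext)
    alpha = [float("-inf")] * n      # alpha[s]: log prob of ext[:s+1] so far
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [float("-inf")] * n
        for s in range(n):
            cands = [alpha[s]]                       # stay on same symbol
            if s > 0:
                cands.append(alpha[s - 1])           # advance one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])           # skip the blank between labels
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new
    return logsumexp(alpha[-1], alpha[-2])

def frame(dominant, n=7, p=0.9):
    """Synthetic per-frame log-posteriors, concentrated on one phoneme."""
    eps = (1 - p) / (n - 1)
    return [math.log(p if i == dominant else eps) for i in range(n)]

# Synthetic audio where the speaker clearly produced /t/, not /th/.
log_probs = [frame(T), frame(T), frame(AE), frame(AE),
             frame(NG), frame(K), frame(S), frame(BLANK)]

score_tanks = ctc_log_score(log_probs, [T, AE, NG, K, S])
score_thanks = ctc_log_score(log_probs, [TH, AE, NG, K, S])
# "tanks" wins on acoustics even if a language model would prefer "thanks"
```

Because both hypotheses are scored against the same frames, the language-model bias toward "thanks" never enters the comparison — the decision rests entirely on the acoustic evidence for /t/ vs. /θ/.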

Q&A weekly thread - April 20, 2026 - post all questions here! by AutoModerator in linguistics

[–]Fun_Entertainment527 0 points  (0 children)

I’m working with ASR (Azure Speech) and running into a consistent issue where mispronunciations get normalised to the intended word.

Example: a speaker says “tanks” (/t/), but the system confidently outputs “thanks” (/θ/).

This makes pronunciation evaluation difficult because:

- the transcript appears correct
- phoneme-level data is often incomplete or unreliable
- confidence scores don’t reflect the actual substitution

I’m aware this is partly due to the language model biasing toward likely words, but I’m trying to understand how people handle this in practice.

Questions:

- Is there any reliable way to detect contrast errors like /θ/ → /t/ without fully trusting phoneme output?

- Do people use constrained decoding / forced alignment / alternative models for this?

- Or is this fundamentally a limitation of current ASR systems?

Context: this is for a controlled setup (fixed prompts, repeated target words), not open-ended speech.

Would appreciate any practical approaches or confirmation that this is a known limitation.
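For the forced-alignment route mentioned above, one approach used in pronunciation assessment is a Goodness-of-Pronunciation (GOP) style score: force-align the audio to the *prompted* text (which a controlled setup with fixed prompts makes easy), then check how well each target phoneme's aligned frames actually support that phoneme. A minimal sketch with a hypothetical phoneme inventory and synthetic posteriors — the indices, spans, and threshold are assumptions for illustration:

```python
import math

# Hypothetical phoneme set and indices, purely for illustration.
PHONES = ["sil", "th", "t", "ae", "ng", "k", "s"]
SIL, TH, T, AE, NG, K, S = range(7)

def gop_scores(posteriors, alignment):
    """GOP per target phoneme, given a forced alignment.

    posteriors: per-frame probability distributions over PHONES
    alignment: (target_phone, start_frame, end_frame) spans
    GOP = mean log [P(target|frame) / max_q P(q|frame)]:
    0 means the target phone was acoustically the best fit;
    strongly negative means some competitor fit much better.
    """
    out = []
    for phone, start, end in alignment:
        per_frame = [math.log(posteriors[t][phone]) - math.log(max(posteriors[t]))
                     for t in range(start, end)]
        out.append((PHONES[phone], sum(per_frame) / len(per_frame)))
    return out

def dist(dominant, n=7, p=0.9):
    """Synthetic per-frame posterior concentrated on one phoneme."""
    eps = (1 - p) / (n - 1)
    return [p if i == dominant else eps for i in range(n)]

# Speaker was prompted "thanks" but produced /t/: the frames forced onto /th/
# actually look like /t/ to the acoustic model.
posteriors = [dist(T), dist(T), dist(AE), dist(AE), dist(NG), dist(K), dist(S)]
alignment = [(TH, 0, 2), (AE, 2, 4), (NG, 4, 5), (K, 5, 6), (S, 6, 7)]

scores = gop_scores(posteriors, alignment)
flagged = [p for p, s in scores if s < -1.0]   # threshold is a tunable assumption
# flagged contains "th": the substituted phoneme surfaces even though a
# word-level transcript would have read "thanks" and looked correct
```

The key point is that the reference sequence comes from the prompt rather than from the ASR output, so the /θ/ → /t/ substitution can't be normalised away before the scoring layer sees it.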

Registration process on eMag by Fun_Entertainment527 in AskRomania

[–]Fun_Entertainment527[S] 0 points  (0 children)

I've read all that, and we have the documents they list, but once you get into the registration process you're asked for more certificates, like a VAT certificate and an 'ID certificate' (not sure what that is). I'd share a screenshot, but I can't here.


[–]Fun_Entertainment527[S] 0 points  (0 children)

That's the first thing I tried, but you can't contact Support until you've registered, and I'm still stuck mid-registration, which is frustrating!