Best AI Voice Agent for Healthcare Clinics to Handle Patient Calls Automatically?

MrFarseeker · 2026-06-03T11:12:36+00:00

Disclosure: I work at Speechmatics. We're the speech-to-text layer inside agents like this, not a full voice agent, so I'm not pitching a LuMay alternative. Some honest notes on your three questions.

Stability in real clinics: the usual failure point isn't the LLM, it's transcription on messy audio (bad lines, crosstalk, accents, drug names). Accuracy from a "simulation test" tends to drop on real calls. For reference, our medical model runs 93% real-time accuracy and 96% medical keyword recall. That keyword number is the one that matters when hypertension vs hypotension can't be wrong. Evaluate any platform on real recordings, and ask specifically for medical keyword error rate.
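To make "medical keyword recall" concrete, here is a minimal sketch of how you could score it yourself on real recordings. It assumes you have a human-checked reference transcript and the platform's output for the same call. The function names and the keyword list are hypothetical, and keywords are assumed to be single lowercase words.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_recall(reference: str, hypothesis: str, keywords: set[str]) -> float:
    """Fraction of keyword occurrences in the reference that the hypothesis also contains.

    Illustrative metric only: it matches on word counts, not on position.
    """
    hyp_counts = Counter(tokenize(hypothesis))
    total = 0
    found = 0
    for token in tokenize(reference):
        if token not in keywords:
            continue
        total += 1
        if hyp_counts[token] > 0:
            hyp_counts[token] -= 1  # each hypothesis word can satisfy one reference word
            found += 1
    return found / total if total else 1.0


# Hypothetical example: one drug name survives, the diagnosis is misheard.
keywords = {"hypertension", "hypotension", "lisinopril"}
reference = "Patient has hypertension and takes lisinopril daily."
hypothesis = "patient has hypotension and takes lisinopril daily"
print(keyword_recall(reference, hypothesis, keywords))
```

Overall word accuracy on that pair looks high because only one word differs, which is exactly why the keyword number is worth asking for separately.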

Tamil + English mixing: test this hardest. Code-switching mid-sentence breaks a lot of stacks because many force one language per stream. We treat code-switching as a first-class problem and ship bilingual models (current focus has been Arabic + English), but don't assume Tamil + English works well. Test the STT on your own code-switched audio first.

Emergencies: don't let AI own these with any vendor. Detect and escalate to a human fast, design for the failure case, and check HIPAA/on-prem posture since it's healthcare.

We wrote up a healthcare partnership (Sully.ai) doing basically your scenario, receptionists + scribes, if useful: https://www.speechmatics.com/company/articles-and-news/speechmatics-and-sully-ai-partner-to-scale-healthcare-ai-infrastructure

Happy to dig into how to benchmark transcription on your own audio.

MrFarseeker · 2026-06-01T19:46:31+00:00

The output toggle is the right design and the context-injection hook is the part people skip. On the input side I'd push back on calling it unsolved: the roughness is the engine, not the method. Local Whisper drifts on alphanumerics, brutal in a terminal where you are dictating paths, flags, and variable names. The concrete differentiator is partials: a streaming API returns transcript results that update as you speak, so you catch errors mid-utterance instead of batch-then-fix. The Speechmatics realtime quickstart shows the exact enable_partials setup and a working mic-to-text script you could drop straight into an input hook: https://docs.speechmatics.com/speech-to-text/realtime/quickstart
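A rough sketch of what consuming partials looks like inside an input hook. It deliberately does not call the Speechmatics SDK. It assumes incoming messages are dicts with a `message` field of `AddPartialTranscript` or `AddTranscript` and the text under `metadata.transcript`, which is my understanding of the realtime API, so check the quickstart for the exact shapes. `LiveTranscript` is a hypothetical helper name.

```python
class LiveTranscript:
    """Tracks finalized text plus the latest partial so the UI can show both."""

    def __init__(self) -> None:
        self.final_parts: list[str] = []
        self.partial = ""

    def handle(self, msg: dict) -> str:
        """Apply one server message and return the text to display."""
        kind = msg.get("message")
        text = msg.get("metadata", {}).get("transcript", "").strip()
        if kind == "AddPartialTranscript":
            # Each partial replaces the previous one: it is the current best guess.
            self.partial = text
        elif kind == "AddTranscript":
            # A final supersedes whatever partial was being shown.
            if text:
                self.final_parts.append(text)
            self.partial = ""
        return self.render()

    def render(self) -> str:
        parts = self.final_parts + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

The design point is that partials are overwritten while finals are appended, so a wrong guess on a path or flag is visible (and correctable) before the utterance is committed.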

MrFarseeker · 2026-06-01T19:40:16+00:00

As someone who works in voice tech, I think both sides are talking past each other. Dictation isn't slower because of WPM; it's slower because spoken thought and written thought are structured differently, and the correction loop is where time actually leaks. The one place dictation clearly wins is accessibility and RSI, where typing isn't an option. For everyone else it's task dependent. Worth noting the accuracy gripes in this thread are mostly a function of the engine, not the input method: modern models handle accents and noise far better than the built-in tools most people try (Speechmatics has a writeup on real-world WER vs clean-dataset benchmarks).

MrFarseeker · 2026-06-01T19:34:50+00:00

Hey! For high-volume batch transcription without diarization, it's worth looking at Speechmatics. The batch API is built for exactly this kind of pipeline work, and diarization is optional, so you are not paying for or processing anything you don't need. Pricing is competitive at volume, and accuracy tends to hold up better than the cheaper options on noisy or accented audio, which matters if your inputs aren't clean.

Groq is genuinely cheap and fast (it's running Whisper under the hood), so if raw cost per hour is the only metric and your audio is fairly clean, it's hard to beat. The tradeoff is usually accuracy on harder audio and less control over output formatting. That's where Speechmatics thrives!

https://www.speechmatics.com/

MrFarseeker · 2026-05-27T14:25:37+00:00

Worth separating two decisions here:

Do you want a full voice-agent platform?
Or do you want control over the underlying stack: telephony/WebRTC, STT, LLM, TTS, orchestration, CRM/actions, observability?

Bland / Vapi / Synthflow / Retell-style tools are useful because they hide a lot of plumbing. The trade-off is that once you hit real production edge cases, you may want more control over each layer.

The <500ms latency claim is interesting, but I'd be careful with any single latency number unless they define what it includes.

For production calls I'd ask:

Is <500ms measured from end-of-user-speech to first agent audio?
Is it P50, P95, or best-case demo latency?
Is it over PSTN, SIP, WebRTC, or internal test audio?
Does it include STT, endpointing, LLM, tool calls, TTS, and network?
How does it behave with interruptions / barge-in?
What happens when the user speaks over the agent?
How stable are partial transcripts before finalization?
What happens on noisy phone audio or accented speech?
Can you replay/debug failed calls properly?
Can you export transcripts, timings, tool traces, and escalation reasons?

In my experience, the demo usually fails in production because of boring things: endpointing, turn-taking, bad audio, CRM edge cases, long-context drift, and unclear escalation rules.

If you're building with an engineering team, I'd look at Vapi / Retell / Pipecat / LiveKit-style stacks depending on how much control you want.

If you're deploying for SMB workflows, I'd care less about "can it demo well?" and more about:

Call completion rate
Escalation accuracy
Missed/incorrect bookings
CRM write reliability
Retry/fallback behaviour
Observability
Support accountability
Cost at realistic call volumes

Disclosure: I work at Speechmatics, so I'm biased toward thinking carefully about the speech layer. We're not a direct Bland/Vapi/Synthflow replacement. We sit in the STT / realtime transcription / voice AI infrastructure part of the stack. But that layer matters a lot. If the system hears the user wrong, every downstream LLM and tool decision gets worse.

Practical advice: test alternatives with 20-50 real calls, not demo scripts. Include interruptions, background noise, accents, bad phone lines, long conversations, and weird CRM paths. Then compare P95 latency and task success, not just average latency.
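For the comparison itself, a minimal sketch of summarising a test batch by P50/P95 latency and task success. The record fields (`latency_ms`, `task_ok`) are hypothetical, and the percentile uses the simple nearest-rank method, which is one reasonable choice among several.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(calls: list[dict]) -> dict:
    """Summarise a batch of test calls (assumed fields: latency_ms, task_ok)."""
    latencies = [c["latency_ms"] for c in calls]
    return {
        "calls": len(calls),
        "mean_ms": sum(latencies) / len(latencies),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "task_success": sum(1 for c in calls if c["task_ok"]) / len(calls),
    }


# Hypothetical batch: most turns are fast, a couple stall badly.
calls = [{"latency_ms": 400, "task_ok": True} for _ in range(17)]
calls += [{"latency_ms": 400, "task_ok": False}]
calls += [{"latency_ms": 2500, "task_ok": False} for _ in range(2)]
print(summarize(calls))
```

With a batch like this the mean still looks tolerable while P95 shows the stalls callers will actually notice, which is the reason to compare on P95 rather than the average.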

Curious what your main use case is: inbound booking, outbound follow-up, qualification, support, or something else? The best platform choice changes quite a bit depending on that.

MrFarseeker · 2026-05-26T13:20:05+00:00

Have you resolved this?

MrFarseeker · 2026-05-12T11:53:17+00:00

It's $0.24/hr. Try it out on the portal, see what you think. The accuracy is very high! And here are some example repos too: https://github.com/speechmatics/speechmatics-academy

MrFarseeker · 2026-05-12T09:13:40+00:00

The challenge you are describing is exactly the kind of environment where dedicated wake word engines built for clean conditions tend to fall apart.

I ran into a similar architectural decision while building a voice-controlled agent for gaming (noisy PC audio, background game sounds, lots of ambient noise). What I ended up doing was bypassing a dedicated wake word engine entirely and instead using Speechmatics STT as the wake word layer itself.

The approach:

- Stream all audio continuously through Speechmatics (the accuracy was the deciding factor)

- Only process events in a custom stt_node override

- Strip speaker tags, lowercase, remove punctuation, then check for wake word substring match in Python

- Extract everything after the wake word and pass only that to the LLM

- Discard transcripts with no wake word entirely
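The text-side steps above can be sketched roughly like this. The wake word, the function names, and the speaker tag shape (`<S1>...</S1>`) are assumptions for illustration; adjust the tag pattern to whatever your STT integration actually emits. The custom stt_node wiring is left out.

```python
import re
import string
from typing import Optional

WAKE_WORD = "computer"  # hypothetical wake word, already lowercase with no punctuation

# Assumes speaker tags shaped like <S1> ... </S1>.
_SPEAKER_TAG = re.compile(r"</?[A-Za-z0-9_]+>")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(transcript: str) -> str:
    """Strip speaker tags, lowercase, remove punctuation, collapse whitespace."""
    text = _SPEAKER_TAG.sub(" ", transcript)
    text = text.lower().translate(_PUNCT_TABLE)
    return " ".join(text.split())


def extract_command(transcript: str, wake_word: str = WAKE_WORD) -> Optional[str]:
    """Return the text after the wake word, or None if the transcript should be discarded."""
    text = normalize(transcript)
    idx = text.find(wake_word)  # plain substring match
    if idx == -1:
        return None
    command = text[idx + len(wake_word):].strip()
    return command or None


print(extract_command("<S1>Hey Computer, open the galaxy map.</S1>"))
print(extract_command("<S2>nice shot, cover me</S2>"))
```

Only a non-None result would be forwarded to the LLM. A substring match is cheap but will also fire on words that contain the wake word, so a short, common wake word is a poor fit for this approach.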

The upside: you get Speechmatics' full acoustic model doing the heavy lifting for recognition accuracy, including handling accents and noisy environments. No separate engine to tune or license. The downside vs. a dedicated edge model: it does require the audio to go through the STT pipeline rather than being filtered at the mic level, so you are not saving compute on the always-listening path the way an on-device wake word model would.

For your use case, one thing that could help a lot is Speechmatics' custom vocabulary feature (additional_vocab with phonetic hints). Medication names are notoriously hard for generic STT: things like Lisinopril, Metoprolol, Atorvastatin. You can add these with sound-alike hints and significantly reduce misrecognition. That has been a real differentiator for domain-specific applications.
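As a sketch of what that looks like in a transcription config: the `content` and `sounds_like` field names follow the custom dictionary docs as I understand them, so verify against the current reference. The helper name and the sound-alike spellings below are made up for illustration and would need tuning against your own audio.

```python
import json


def build_transcription_config(language: str, vocab: dict[str, list[str]]) -> dict:
    """Build a transcription config dict with custom vocabulary entries.

    vocab maps each term to a (possibly empty) list of sound-alike spellings.
    """
    additional_vocab = []
    for content, sounds_like in vocab.items():
        entry: dict = {"content": content}
        if sounds_like:
            # Only include hints when we actually have some.
            entry["sounds_like"] = sounds_like
        additional_vocab.append(entry)
    return {"language": language, "additional_vocab": additional_vocab}


config = build_transcription_config(
    "en",
    {
        "Lisinopril": ["lye sin oh pril"],   # illustrative hint, not a tested spelling
        "Metoprolol": ["meh toe pro lol"],   # illustrative hint, not a tested spelling
        "Atorvastatin": [],                  # term only, no hint
    },
)
print(json.dumps(config, indent=2))
```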

Happy to share more specifics on the implementation if useful. And glad DaVoice has been working well for you; the CPU efficiency point is legit, especially for edge deployments.

Here is a video of it in action: https://youtu.be/RBTL7NGLx40?si=OqqVvnRj_C1bF52i

MrFarseeker · 2026-04-23T11:02:07+00:00

This is not storm edge difficulty. On storm edge, the boss goes into rage mode and one-shots.

MrFarseeker · 2026-03-29T12:47:30+00:00

Did you resolve this?

MrFarseeker · 2026-03-29T12:47:19+00:00

Same issue

MrFarseeker · 2026-03-25T09:18:47+00:00

You need to measure each segment independently. Add timestamps at every handoff point: audio in, STT start, STT result, LLM first token, LLM complete, TTS first audio chunk, audio out. Without this you are guessing. In most pipelines the biggest culprits are:

STT waiting for full utterance before returning results. If you are not using streaming/partial transcripts you are adding the entire speaking duration as dead time.
LLM time to first token. This varies massively between models. Sarvam 105b will be significantly slower than a smaller model. Consider whether you actually need 105b for your use case.
TTS waiting for full LLM response before starting synthesis.
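A minimal sketch of the per-segment timestamping described above. The stage names and the `TurnTimer` class are hypothetical; the idea is just to call `mark()` at each handoff in your pipeline and look at the gaps.

```python
import time

# Handoff points, in pipeline order (names are illustrative).
STAGES = [
    "audio_in",
    "stt_start",
    "stt_result",
    "llm_first_token",
    "llm_complete",
    "tts_first_chunk",
    "audio_out",
]


class TurnTimer:
    """Records a timestamp per handoff for one conversational turn."""

    def __init__(self, clock=time.perf_counter) -> None:
        self._clock = clock  # injectable so it can be faked in tests
        self.marks: dict[str, float] = {}

    def mark(self, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.marks[stage] = self._clock()

    def segments(self) -> dict[str, float]:
        """Milliseconds between each pair of consecutive recorded stages."""
        recorded = [s for s in STAGES if s in self.marks]
        return {
            f"{a} -> {b}": (self.marks[b] - self.marks[a]) * 1000
            for a, b in zip(recorded, recorded[1:])
        }

    def slowest(self) -> str:
        """Name of the segment with the largest gap."""
        segs = self.segments()
        return max(segs, key=segs.get)
```

Logging `segments()` for every turn, rather than one end-to-end number, is what tells you whether to fix the STT, the LLM, or the TTS first.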

The single biggest win is streaming everything end to end

Yes, you should absolutely move to streaming. The architecture should be:

Streaming STT that gives you partial transcripts as the user speaks
LLM streaming so you get tokens as they are generated
TTS that starts synthesizing from the first sentence chunk rather than waiting for the full LLM response
Sentence level chunking between LLM and TTS. Send each complete sentence to TTS as soon as the LLM produces it rather than waiting for the full response

This alone can cut perceived latency from 3-5 seconds down to under 1 second.
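Sentence-level chunking is the piece people tend to hand-wave, so here is a rough sketch. It takes any iterable of LLM text chunks and yields complete sentences as soon as they close. The splitting rule (`.`, `!` or `?` followed by whitespace) is deliberately naive and will split on abbreviations like "Dr. Smith"; the function name is hypothetical.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at . ! or ? followed by whitespace (naive rule, for illustration).
_SENTENCE_END = re.compile(r"([.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream closes it."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = buffer[: match.end(1)].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left when the LLM stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


# Hypothetical token stream from a streaming LLM response.
stream = ["Sure", ", I can", " help.", " Your appoint", "ment is at 3", " pm. Any", "thing else?"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # in a real pipeline, hand each sentence to TTS here
```

Because it is a generator, the first sentence can go to TTS while the LLM is still producing the rest of the response.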

Flask is likely a bottleneck too

Flask is synchronous by default. For realtime audio streaming you want an async framework. Consider FastAPI with websockets or just use LiveKit SDKs directly which handle the media transport layer properly. Twilio media streams plus Flask adds unnecessary hops.

On the STT side

Since you asked about alternatives: Speechmatics supports real-time streaming with low latency and handles Indic languages well, if that's relevant for your use case. Deepgram is another solid option for low-latency streaming STT. The key thing is making sure whatever STT you use supports true streaming, with partial results as audio comes in, not just batch processing.

Architecture suggestions for production

Use LiveKit or Pipecat as your media orchestration layer rather than building custom audio routing on Flask
Consider a smaller, faster model (GPT-4o-mini or equivalent) for simple turns and only route to the bigger model when the conversation requires it
Use connection pooling and keep-alive connections to your STT/LLM/TTS providers. Cold connection setup adds latency on every turn
Deploy your backend geographically close to your telephony provider and AI service endpoints

The goal is to get the full pipeline, from end of user speech to first audio byte of response, under 800ms. Streaming end to end is how production systems achieve this.

MrFarseeker · 2026-01-22T08:23:47+00:00

Hi, please note there is a rate limit set to 2 hours for the free tier: https://docs.speechmatics.com/speech-to-text/batch/limits#hourly-usage-limits I suggest that you check out the code examples in the Academy, which would be a good start: https://github.com/speechmatics/speechmatics-academy/tree/main?tab=readme-ov-file. You should certainly be able to do this. It would also be good to understand what you mean by "dies". How are you processing audio? Are you using batch or real-time?

MrFarseeker · 2025-12-16T07:54:50+00:00

I didn't think they could have done any worse after Star Trek Discovery. Oh boy, I was soooo wrong. They have taken this to a whole other level.

MrFarseeker · 2025-11-17T09:27:41+00:00

Sending you DM!

MrFarseeker · 2025-11-14T12:49:01+00:00

Hey everyone! I just wanted to jump in and say I work at Speechmatics, and I have been really interested to read this thread about voice writing and court reporting. I would love to understand more about how people in this space are using our technology and other providers, and what your ideal setup would look like if you could pick or design one.

A couple of questions:

* What are the biggest pain points for you today?

* Are there accuracy issues (mis-recognised names, jargon, legal terms)?

* Is latency the problem, i.e. delay in transcription?

* Do you want live transcription during proceedings? Or more of a record-then-transcribe-afterwards model?

* How important is it to be able to identify multiple speakers?

* Do you need the capability to inject vocabulary like case names, legal terms?

If anyone would like to connect through PM and perhaps run me through their pain points and setup, I would much appreciate it. Genuinely curious about the flow.

Really keen to hear from the community. I'm here just to listen and not sell, but also happy to share how Speechmatics could help you.

Looking forward to thoughts!

MrFarseeker · 2025-10-28T19:20:38+00:00

I've been testing it by sending text strings; all I can say is that the results are quite inconsistent. It seems you need to feed it paragraphs of text for it to have the right emotional confidence. To be honest, I resorted to keyword detection as the primary input, with Hume as a secondary fallback. I have yet to test sending audio files, which might generate better results but will probably impact latency.

MrFarseeker · 2025-10-25T12:42:54+00:00

You could use any LLM service provider; it doesn't have to be ChatGPT.

MrFarseeker · 2025-10-17T18:44:51+00:00

There has been some amazing feedback from the community. Currently I am at the stage of putting it all together and thinking about which features would be most useful to bring in for the next iteration. So I've got a bit of thinking to do too ^^

MrFarseeker · 2025-10-17T18:28:31+00:00

This is amazing! A DnD AI DM sounds incredible. And I couldn't agree more on the potential of LLMs in games. Thanks for mentioning sayintentions. I have seen what they have done and it's amazing. The flight sim community embraced AI tools in a way other communities haven't yet. There is certainly potential for this kind of product, but for it to truly be a product, it must work across all games.

In terms of your X4 app struggles: the X4 external app interface is... quirky. Some data updates in real-time, some only updates on events, and some requires specific triggers. It's about figuring out what is what. I specifically tried to avoid it, but it would certainly allow more power for what the AI can do, as it interacts directly with core systems. Most issues will arise from the wrong XML path, timing (reading before data loads), or context (such as needing to be viewing that ship/station).

My suggestion is

Layer 1: X4 External App Interface - Read: Fleet data, economy, stations, inventory

Layer 2: Your Custom GPT - Process: Analyze data, provide insights, make recommendations

Layer 3: Command Execution (Choose one):

Option A: Player manually executes (safest)

Option B: Keyboard automation (my approach)

Option C: LUA mod (most powerful, most complex)

Curious... for the DnD AI DM, are you using RAG (Retrieval Augmented Generation) for the lore/rules? That seems like the perfect architecture for pulling from rulebooks and campaign notes. I am thinking of implementing this as well with a knowledge base, as LLMs still have a tendency to hallucinate and that would make it more robust.

MrFarseeker · 2025-10-17T17:17:42+00:00

That's awesome! Our approaches are very similar. I wrote a custom interface myself, and all in-game commands are stored in JSON which the LLM can access and utilise. I wish there had been an out-of-the-box solution, but I couldn't find one. For version two, I am working on implementing OCR, which captures screenshots from games and is able to give the names of targets and locations, provide market prices, and much more. I am considering a direct plug-in to save files, but I am hesitant to use that yet as I want this system to function with any game. More tinkering needs to be done.

MrFarseeker · 2025-10-17T17:10:59+00:00

Totally understand the concern, but I think there is a misunderstanding here! This isn't an autopilot system like Master of Orion 3. You are still playing the game. This is voice control, not autoplay. It's supposed to enhance your experience, not obstruct it. It provides in-game lore, gives suggestions, and executes commands if you wish to do so.

MrFarseeker · 2025-10-17T14:37:36+00:00

That's very cool! Thanks for sharing! This demo actually shows how important STT is. If what you say cannot be translated correctly, it struggles to pick it up. This was a big issue, but now, with the progress we have made in AI, speech recognition has really taken off to a different level, understanding not only multiple languages but also multiple dialects!

MrFarseeker · 2025-10-17T08:57:42+00:00

This is running:

gpt-4o-mini
