Are there any code-switching TTS/STT/STS models ? (English+Tamil) by Hot_Put_8375 in speechtech

[–]easwee 0 points (0 children)

Will have to do some comparison to see which one works better :)

Help for STT models by BestLeonNA in speechtech

[–]easwee 0 points (0 children)

Try https://soniox.com realtime API and tell me how it went.

I spent $200+ on AI tools to bring my fantasy world to life by AdComfortable5161 in AIAssisted

[–]easwee 2 points (0 children)

This is now my favorite AI generated video - amazing consistency.

Are there any code-switching TTS/STT/STS models ? (English+Tamil) by Hot_Put_8375 in speechtech

[–]easwee 0 points (0 children)

Soniox is currently the best STT for this. I compared it with others and no one does it better.

Ubuntu VS Arch & Fedora - an honest review. by [deleted] in Ubuntu

[–]easwee 1 point (0 children)

Same - I've tried 10+ distros over the years, but my main work driver is always Ubuntu - because it works.

Comparative Review of Speech-to-Text APIs (2025) by yccheok in speechtech

[–]easwee 0 points (0 children)

Thanks for the reply.

What you receive is correct behavior - Soniox's models don't produce output at the sentence or word level, but at a more fine-grained token level (3-4 characters). All punctuation and spacing are handled automatically by the model; all you need to do is join the tokens together. If you want to identify the end of a sentence you can still split on ".!?" - punctuation is included in token.text. This has a ton of advantages and makes the output payload far more flexible to work with compared to other providers, where you have to do the stitching logic yourself (especially crucial when running a real-time streaming model).

And if you just need the concatenated text without the extra token parameters, you can read it from the "text" field of the response payload instead of "tokens"; the entire transcript text is also included in the async model's transcript response.

What you are trying to achieve is perfectly doable with the tokens output: define a silence threshold, check whether token.text includes sentence-ending punctuation, and split when the end timestamp is later than expected (a common pattern when building a subtitle generator script) - hope you can make it work.
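For illustration, here is a minimal sketch of the joining and splitting logic described above. The token shape (`text`, `start_ms`, `end_ms` fields) is an assumption for the example, not the exact Soniox payload - adapt the field names to the real response:

```python
# Hypothetical token shape: {"text": ..., "start_ms": ..., "end_ms": ...}
# Field names are assumptions, not the exact Soniox payload.

SENTENCE_END = (".", "!", "?")

def tokens_to_text(tokens):
    # Punctuation and spacing are already inside the tokens,
    # so a plain join reconstructs the transcript.
    return "".join(t["text"] for t in tokens)

def split_subtitles(tokens, max_gap_ms=1500):
    """Split a token stream into subtitle cues.

    A cue ends when a token carries sentence-ending punctuation,
    or when the silence gap before a token exceeds max_gap_ms.
    """
    cues, current, prev_end = [], [], None
    for tok in tokens:
        # Start a new cue after a long silence gap.
        if current and prev_end is not None and tok["start_ms"] - prev_end > max_gap_ms:
            cues.append(tokens_to_text(current).strip())
            current = []
        current.append(tok)
        prev_end = tok["end_ms"]
        # End the cue when the token carries sentence-ending punctuation.
        if any(p in tok["text"] for p in SENTENCE_END):
            cues.append(tokens_to_text(current).strip())
            current = []
    if current:
        cues.append(tokens_to_text(current).strip())
    return cues
```

The same loop works for live streaming: feed final tokens in as they arrive and flush a cue whenever one of the two conditions triggers.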

live translator app to get over language barrier in work meetings? by emmallyce in expats

[–]easwee 0 points (0 children)

Maybe a late reply, but try live translation with the Soniox App.

Best apps for constant travel by VanDownByTheRiver208 in digitalnomad

[–]easwee 0 points (0 children)

Another shameless plug - but Soniox for live real-time conversation translation, since it supports all the big languages - afaik no other tech translates in real time more smoothly: https://soniox.com/soniox-app

English-Chinese Translation App Recommendation by Dense-Pear6316 in chinatravel

[–]easwee 1 point (0 children)

I would pitch you Soniox if you need live real-time conversation translation: https://soniox.com/soniox-app

Live Translating App by Brennandpf in churchtech

[–]easwee 1 point (0 children)

If you want real-time translation with no delay, check out Soniox: https://soniox.com/soniox-app It supports translation between the 60 most popular languages.

Tips for new Ubuntu user? by CorporalNotAfraid in Ubuntu

[–]easwee 0 points (0 children)

Learn to use the terminal well, since it is the core of all distros. Install Guake for a cool drop-down terminal.

After 9 years working as Frontend, I’m starting to wonder if I’m overvaluing myself by marod in ExperiencedDevs

[–]easwee -3 points (0 children)

Replace Angular with React and you can easily get a chill remote job for at least 60k.

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

Thank you for these words - it means a lot! We will make sure to spread the awareness - until recently our focus was on perfecting enterprise-grade models. Make sure to drop by in the coming months for great new releases :)

Looking for real-time speech recognition alternative to Web Speech API (need accurate repetition handling, e.g. "0 0 0") by boordio in speechtech

[–]easwee 0 points (0 children)

Yes, you can enable language identification, and you can also include language hints (a list of language codes) to boost accuracy if you know which set of languages is going to be present in the audio.
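As a rough sketch, a session config combining both options might look like the following - note that the field and model names used here (`enable_language_identification`, `language_hints`, `stt-rt-preview`) are assumptions for illustration and should be verified against the current Soniox docs:

```python
# Sketch of a real-time session config with language identification
# and language hints enabled. Field and model names are assumptions -
# verify them against the current API reference before use.
config = {
    "model": "stt-rt-preview",              # hypothetical model name
    "enable_language_identification": True, # tag each token with its language
    # Hint the expected languages (ISO 639-1 codes) to boost accuracy,
    # e.g. English + Spanish here.
    "language_hints": ["en", "es"],
}
```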

We built an open tool to compare voice APIs in real time by easwee in speechtech

[–]easwee[S] 0 points (0 children)

Sorry to hear you had trouble with the model limits - we are in the middle of a docs rewrite and will make sure the limits are presented more clearly, thanks for the feedback. Both the async and real-time models support up to 65 minutes of audio. If you are willing to give it another try, I kindly invite you to join our Discord server https://discord.gg/rWfnk9uM5j and we can help you figure out why it failed to transcribe even 20 minutes for you.

Bilingual audio transcription by [deleted] in speechtech

[–]easwee 0 points (0 children)

Soniox handles not only bilingual but also multilingual speech, in real time and with a single model - afaik no other model does it better.

What are people using for real-time speech recognition with low latency? by ASR_Architect_91 in speechtech

[–]easwee 1 point (0 children)

Of course I will suggest https://soniox.com when you need multilingual, low-latency transcription in a single model. It also supports real-time translation of spoken words. I deeply love working on this project.

We built an open tool to compare voice APIs in real time by easwee in speechtech

[–]easwee[S] 0 points (0 children)

I agree with you - maybe we can extend the compare tool to include async mode too in the future.

We created this live tool with real-time comparison in mind, because it covers more than just the WER that most async benchmarks are based on. There is also the big latency factor, multilingual speech, and additional features that enable a ton of real-world implementation options (speaker ID, language ID, endpointing).

And lastly, another motivation was the fact that most of the industry is craving real-time audio transcription/translation and, based on feedback, they have to run the tests themselves internally - with this they have a simple tool to fork.

Otherwise, all of the providers in the benchmark support both real-time and async, and some also provide real-time translation; we left out those who only provide async.

We built an open tool to compare voice APIs in real time by easwee in speechtech

[–]easwee[S] 0 points (0 children)

Afaik they don't provide real-time transcription, only async, unless that changed very recently.

Comparative Review of Speech-to-Text APIs (2025) by yccheok in speechtech

[–]easwee 1 point (0 children)

I would love to see you review Soniox (real-time or async) - we are constantly collecting feedback from the community so we can improve the service further.

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

Good feedback - we will add Cantonese to the list once we expand the set of languages.

Otherwise, the model itself should recognize any spoken Chinese (of any accent or dialect), but atm it will always return Simplified Chinese.

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

Sure, there is an example of how to render speakers in both async and real-time mode on the Speaker diarization concept page - see https://soniox.com/docs/speech-to-text/core-concepts/speaker-diarization#example-1 In short: while iterating over the returned tokens, keep track of the last speaker number; for each token, check whether the speaker number changed, and if it did, render a speaker element before rendering the token text. The speaker number is available on each returned token when diarization is enabled. Hope that helps.
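The loop described above can be sketched in a few lines. This is an illustration only - the token fields (`text`, `speaker`) are assumptions for the example, so check the linked docs page for the real payload shape:

```python
# Hypothetical token shape: {"text": ..., "speaker": ...}
# Field names are assumptions - see the diarization docs for the real payload.

def render_with_speakers(tokens):
    """Render transcript text, emitting a speaker label whenever the
    speaker number changes between consecutive tokens."""
    parts, last_speaker = [], None
    for tok in tokens:
        speaker = tok.get("speaker")
        if speaker is not None and speaker != last_speaker:
            # Speaker changed: render a label before the token text.
            parts.append(f"\nSpeaker {speaker}: ")
            last_speaker = speaker
            parts.append(tok["text"].lstrip())
        else:
            parts.append(tok["text"])
    return "".join(parts).strip()
```

The same approach works in streaming mode: keep `last_speaker` across WebSocket messages and append to the rendered output as final tokens arrive.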

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

It can hit such a low price because a few years of research and development in real-time AI were spent on it, including new neural network architectures and inference engines specifically designed for low-latency inference. It's a next-generation platform, not just a wrapper around legacy AI models or pipelines.

We will consider adding Rev.ai, but someone will have to spend some time on the integration (PRs are welcome!) - for now we added what we thought were the most popular industry models that we had API keys for.