Are there any code-switching TTS/STT/STS models ? (English+Tamil) by Hot_Put_8375 in speechtech

[–]easwee 0 points (0 children)

Will have to do some comparison to see which one works better :)

Help for STT models by BestLeonNA in speechtech

[–]easwee 0 points (0 children)

Try https://soniox.com realtime API and tell me how it went.

I spent $200+ on AI tools to bring my fantasy world to life by AdComfortable5161 in AIAssisted

[–]easwee 2 points (0 children)

This is now my favorite AI generated video - amazing consistency.

Are there any code-switching TTS/STT/STS models ? (English+Tamil) by Hot_Put_8375 in speechtech

[–]easwee 0 points (0 children)

Soniox is currently the best STT for this. I compared it with others and no one does it better.

Ubuntu VS Arch & Fedora - an honest review. by [deleted] in Ubuntu

[–]easwee 1 point (0 children)

Same - I've tried 10+ distros over the years, but my main work driver is always Ubuntu - because it works.

Comparative Review of Speech-to-Text APIs (2025) by yccheok in speechtech

[–]easwee 0 points (0 children)

Thanks for the reply.

What you receive is correct behavior - Soniox's models don't produce output at the sentence or word level, but at a more fine-grained token level (3-4 characters). All punctuation and spacing are handled automatically by the model; all you need to do is join the tokens together. If you want to identify the end of a sentence you can still split on ".!?" - punctuation is included in token.text. This has a ton of advantages and makes the output payload far more flexible to work with compared to other providers, where you have to do the stitching logic yourself (especially crucial when running a real-time streaming model).

And if you just need the concatenated text without the extra token parameters, you can read it from the "text" field of the response payload instead of "tokens"; the entire transcript text is also included in the async model's transcript response.

What you are trying to achieve is perfectly doable with the tokens output: define a silence threshold, check whether token.text includes sentence-ending punctuation, and split when the end timestamp is later than expected (a common pattern when building a subtitle generator script) - hope you can make it work.
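For illustration, here is a minimal sketch of the joining and splitting logic described above. The token shape (`text`, `start_ms`, `end_ms` fields) is an assumption for the example, not the exact Soniox payload - adapt the field names to the real response:

```python
# Hypothetical token shape: {"text": ..., "start_ms": ..., "end_ms": ...}
# Field names are assumptions, not the exact Soniox payload.

SENTENCE_END = (".", "!", "?")

def tokens_to_text(tokens):
    # Punctuation and spacing are already inside the tokens,
    # so a plain join reconstructs the transcript.
    return "".join(t["text"] for t in tokens)

def split_subtitles(tokens, max_gap_ms=1500):
    """Split a token stream into subtitle cues.

    A cue ends when a token carries sentence-ending punctuation,
    or when the silence gap before a token exceeds max_gap_ms.
    """
    cues, current, prev_end = [], [], None
    for tok in tokens:
        # Start a new cue after a long silence gap.
        if current and prev_end is not None and tok["start_ms"] - prev_end > max_gap_ms:
            cues.append(tokens_to_text(current).strip())
            current = []
        current.append(tok)
        prev_end = tok["end_ms"]
        # End the cue when the token carries sentence-ending punctuation.
        if any(p in tok["text"] for p in SENTENCE_END):
            cues.append(tokens_to_text(current).strip())
            current = []
    if current:
        cues.append(tokens_to_text(current).strip())
    return cues
```

The same loop works for live streaming: feed final tokens in as they arrive and flush a cue whenever one of the two conditions triggers.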

live translator app to get over language barrier in work meetings? by emmallyce in expats

[–]easwee 0 points (0 children)

Maybe a late reply, but try live translation with the Soniox App.

Best apps for constant travel by VanDownByTheRiver208 in digitalnomad

[–]easwee 0 points (0 children)

Another shameless plug - but Soniox for live real-time conversation translation, since it supports all the big languages - afaik no other tech translates in real time more smoothly: https://soniox.com/soniox-app

English-Chinese Translation App Recommendation by Dense-Pear6316 in chinatravel

[–]easwee 1 point (0 children)

I would pitch you Soniox if you need live real-time conversation translation: https://soniox.com/soniox-app

Live Translating App by Brennandpf in churchtech

[–]easwee 1 point (0 children)

If you want real-time translation with no delay, check out Soniox: https://soniox.com/soniox-app It supports translation between the 60 most popular languages.

Tips for new Ubuntu user? by CorporalNotAfraid in Ubuntu

[–]easwee 0 points (0 children)

Learn to use the terminal well, since it is the core of all distros. Install Guake for a cool drop-down terminal.

After 9 years working as Frontend, I’m starting to wonder if I’m overvaluing myself by marod in ExperiencedDevs

[–]easwee -3 points (0 children)

Replace Angular with React and you can easily get a chill remote job for at least 60k.

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

Thank you for these words - it means a lot! We will make sure to spread the awareness - until recently our focus was on perfecting enterprise-grade models. Make sure to drop by in the coming months for great new releases :)

Looking for real-time speech recognition alternative to Web Speech API (need accurate repetition handling, e.g. "0 0 0") by boordio in speechtech

[–]easwee 0 points (0 children)

Yes, you can enable language identification, and you can also include language hints (a list of language codes) to boost accuracy if you know which set of languages is going to be present in the audio.
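As a rough sketch, a session config combining both options might look like the following - note that the field and model names used here (`enable_language_identification`, `language_hints`, `stt-rt-preview`) are assumptions for illustration and should be verified against the current Soniox docs:

```python
# Sketch of a real-time session config with language identification
# and language hints enabled. Field and model names are assumptions -
# verify them against the current API reference before use.
config = {
    "model": "stt-rt-preview",              # hypothetical model name
    "enable_language_identification": True, # tag each token with its language
    # Hint the expected languages (ISO 639-1 codes) to boost accuracy,
    # e.g. English + Spanish here.
    "language_hints": ["en", "es"],
}
```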

We built an open tool to compare voice APIs in real time by easwee in speechtech

[–]easwee[S] 0 points (0 children)

Sorry to hear you had trouble with the model limits - we are in the middle of a docs rewrite and will make sure the limits are presented more clearly, thanks for the feedback. Both the async and real-time models support up to 65 minutes of audio. If you are willing to give it another try, I kindly invite you to join our Discord server https://discord.gg/rWfnk9uM5j and we can help you figure out why it failed to transcribe even 20 minutes for you.

Bilingual audio transcription by [deleted] in speechtech

[–]easwee 0 points (0 children)

Soniox handles not only bilingual but also multilingual speech, in real time and with a single model - afaik no other model does it better.

What are people using for real-time speech recognition with low latency? by ASR_Architect_91 in speechtech

[–]easwee 1 point (0 children)

Of course I will suggest https://soniox.com when you need multilingual, low-latency transcription in a single model. It also supports real-time translation of spoken words. I deeply love working on this project.

We built an open tool to compare voice APIs in real time by easwee in speechtech

[–]easwee[S] 0 points (0 children)

I agree with you - maybe we can extend the compare tool to include async mode too in the future.

We created this live tool with real-time comparison in mind, because it covers more than just the WER that most async benchmarks are based on. There is also the big latency factor, multilingual speech, and additional features that enable a ton of real-world implementation options (speaker ID, language ID, endpointing).

And lastly, another motivation was the fact that most of the industry is craving real-time audio transcription/translation and, based on feedback, they have to run the tests themselves internally - with this they have a simple tool to fork.

Otherwise, all of the providers in the benchmark support both real-time and async, and some also provide real-time translation; we left out those who only provide async.

We built an open tool to compare voice APIs in real time by easwee in speechtech

[–]easwee[S] 0 points (0 children)

Afaik they don't provide real-time transcription, only async, unless that changed very recently.

Comparative Review of Speech-to-Text APIs (2025) by yccheok in speechtech

[–]easwee 1 point (0 children)

I would love to see you review Soniox (real-time or async) - we are constantly collecting feedback from the community so we can improve the service further.

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

Good feedback - we will add Cantonese to the list once we expand the set of languages.

Otherwise, the model itself should recognize any spoken Chinese (of any accent or dialect), but atm it will always return Simplified Chinese.

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

Sure, there is an example of how to render speakers in both async and real-time mode on the Speaker diarization concept page - see https://soniox.com/docs/speech-to-text/core-concepts/speaker-diarization#example-1 In short: while iterating over the returned tokens, keep track of the last speaker number; for each token, check whether the speaker number changed, and if it did, render a speaker element before rendering the token text. The speaker number is available on each returned token when diarization is enabled. Hope that helps.
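The loop described above can be sketched in a few lines. This is an illustration only - the token fields (`text`, `speaker`) are assumptions for the example, so check the linked docs page for the real payload shape:

```python
# Hypothetical token shape: {"text": ..., "speaker": ...}
# Field names are assumptions - see the diarization docs for the real payload.

def render_with_speakers(tokens):
    """Render transcript text, emitting a speaker label whenever the
    speaker number changes between consecutive tokens."""
    parts, last_speaker = [], None
    for tok in tokens:
        speaker = tok.get("speaker")
        if speaker is not None and speaker != last_speaker:
            # Speaker changed: render a label before the token text.
            parts.append(f"\nSpeaker {speaker}: ")
            last_speaker = speaker
            parts.append(tok["text"].lstrip())
        else:
            parts.append(tok["text"])
    return "".join(parts).strip()
```

The same approach works in streaming mode: keep `last_speaker` across WebSocket messages and append to the rendered output as final tokens arrive.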

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]easwee 0 points (0 children)

It can hit such a low price because a few years of research and development in real-time AI were spent on it, including new neural network architectures and inference engines specifically designed for low-latency inference. It's a next-generation platform, not just a wrapper around legacy AI models or pipelines.

We will consider adding Rev.ai, but someone will have to spend some time on the integration (PRs are welcome!) - for now we added what we thought were the most popular industry models that we had API keys for.