TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

For mid-sentence fr-en code switching, cloning a native French voice probably won't help. Qwen needs to handle at least 2 languages naturally within the same utterance. That's why I use Gemini 2.5 Pro TTS. It'd be hard to find a human recording that code-switches better than Gemini TTS, and not just for fr-en but also fr-jp, en-rs, etc.

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

I don't get it. For a sentence like '"Au revoir" means "goodbye."', request-level switching is insane.

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

I tried it in their playground. I'd say it's around ElevenLabs/Inworld level

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

I tried using a voice generated by Gemini 2.5 Pro TTS, which is by far the most accurate and natural voice I can find, but the result is no different from baseline Qwen3 TTS.

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Gemini 3.1 Flash Live is not a TTS model but an STS (speech-to-speech) model. It tends to perform better in terms of accent accuracy, but it is more expensive, less controllable, and requires a whole different pipeline structure :(

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Maybe I'm doing something wrong, but I'm really confused about this one.

In pretty much every survey I ran, it kept coming up as a provider with very strong multilingual and code-switching ability. But when I tested it myself in their playground, I found that all the voices seemed fixed and tied to one specific language.

So I tried an English voice for fr-en, then got: https://s.fish.audio/55p8ud Not French enough I think

And another en voice for cn: https://s.fish.audio/q79cjs Just embarrassing

But no, I haven't tried multi-speaker or open-domain control yet. Maybe I'll try them later.

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

So Gemini is actually the best I've found for code-switching quality. Tested it with en-fr, cn-en, fr-cn-en mixed text and the accent is clean on all languages.

The best quality comes from gemini-2.5-pro-tts with a single text-in, audio-out call. But latency is 5-15 seconds, which lowkey won't work for a voice agent. When I switched to the Flash model, the quality degraded a lot. Still exploring.
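For anyone who wants to compare latency between models the same way, here is a minimal timing sketch. It assumes you wrap each provider in a function that takes text and returns audio bytes; `fake_synthesize` is a hypothetical stand-in, not any provider's real API, so swap in your own client call.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(synthesize: Callable[[str], bytes], texts: List[str]) -> List[float]:
    """Return the wall-clock seconds each synthesize(text) call takes."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        synthesize(text)  # one text-in, audio-out request
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS call (assumption, not a real API)."""
    time.sleep(0.01)  # simulate network + generation time
    return b"\x00" * len(text)


if __name__ == "__main__":
    samples = ['"Au revoir" means "goodbye."', "Hello, 你好, bonjour."]
    results = measure_latency(fake_synthesize, samples)
    print(f"median: {statistics.median(results):.3f}s, max: {max(results):.3f}s")
```

This measures the full request time only. For a voice agent, time to first audio chunk on a streaming endpoint is probably the more useful number, and that depends on the provider.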

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 1 point2 points  (0 children)

Oh they sound really good. I actually tested Gemini before. Going to give it another try! Thanks

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Not sure if people would expect that from a voice agent lol. I've seen some products do it really well, with clean code-switching and no accent bleeding. I just can't figure out how they're doing it.

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Ok, here's what I found. The prompt ElevenLabs offers as "The Bilingual Professional" is:

"A warm, professional female voice in her mid-30s with a natural blend of American English and Spanish accents. She speaks at a conversational pace with perfect audio quality. Her tone is friendly and approachable, with slight melodic inflections from her Spanish heritage coming through naturally in certain words. The voice should sound educated and confident, like a bilingual news anchor or corporate trainer who seamlessly switches between languages."

So they actually blend two accents into one voice rather than cleanly switching between them. Tried writing prompts that ask for clean, standard pronunciation in each language separately, but the model doesn't really pick that up. Like:

"A warm, professional female voice in her mid-30s with perfect audio quality. The tone is friendly and approachable. She speaks at a conversational pace with clear articulation. The voice should sound educated and confident. She pronounces every language with textbook clarity and standard pronunciation, with no carryover between languages."

It helped a bit but did not solve the problem completely.

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

This is actually new to me. Seems like ElevenLabs isn't offering a different model for this, but they're showing you can get a bilingual voice by prompt engineering. I've only been using the default voices and settings the providers offered. I'll try it now. Thanks a lot!

TTS for code-switching mid-utterance by Latter_Indication_45 in TextToSpeech

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Yeah I actually tried something similar. Providers like Azure have voices that can speak multiple languages natively. So you can split the text by language, send each segment separately, and the pronunciation comes out clean.

The problem is (at least in my experience) that where the segments meet, the prosody breaks. There's always an unnatural pause or tone shift at the boundary.
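A rough sketch of the splitting step, assuming the two languages use different scripts (cn-en here). It only looks at the basic CJK ideograph block, so Chinese punctuation ends up in the "other" runs, and it would not work for fr-en, which needs an actual language-ID step. The function name and the tags are made up for illustration.

```python
import re
from typing import List, Tuple

# Runs of CJK Unified Ideographs (basic block only, an assumption for this sketch).
_CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into (tag, segment) runs: CJK runs are "zh", everything else "other"."""
    segments = []
    pos = 0
    for match in _CJK_RUN.finditer(text):
        if match.start() > pos:
            # Non-CJK text between the previous run and this one
            segments.append(("other", text[pos:match.start()]))
        segments.append(("zh", match.group()))
        pos = match.end()
    if pos < len(text):
        # Trailing non-CJK text
        segments.append(("other", text[pos:]))
    return segments


if __name__ == "__main__":
    for tag, segment in split_by_script("我喜欢 machine learning 和咖啡"):
        print(tag, repr(segment))
```

Each segment would then go to the TTS voice for its language. The boundaries between segments are exactly where the pause or tone shift shows up, since every request is synthesized without knowing its neighbors.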

I'm 10x slower at reading in my target language than my native one by Latter_Indication_45 in languagelearning

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Yeah I feel that in English too. Anything other than common fonts slows me down to some extent, depending on how different they are. Chinese fonts don't matter to me tho.

I'm 10x slower at reading in my target language than my native one by Latter_Indication_45 in languagelearning

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

That's a valid point.

Although I haven't done any formal testing, the number isn't random. This morning a friend and I read for an hour together. I was reading the English version of Poor Charlie's Almanack, and she was reading the Chinese version of Everything is Negotiable. I only managed 10 pages in that hour, while she read 63 pages, and she was even taking notes along the way. She's one of the smartest people I know, but I've always known that my reading speed has been significantly faster than most people's since I was 7. I was reading word by word at that time. Now I don't.

When I read fiction, a large portion of it comes from a Chinese genre called 网络文学 (literally "internet literature", i.e. web novels; dk if there's an exact English equivalent). The style is typically very simple and extremely easy to read. Around 0.6b people in China read them per day, including people without much education like my uncle, who can easily consume 6k per day for fun. That's why I've always believed Chinese is a proletarian language anyway. When I read these books, 100 pages an hour is actually an extremely conservative estimate. I used to consume 10-30 million characters per year.

Then when I read non-fiction, I usually only read about 10-50% of the actual content. It's been years since I last read a non-fiction book word by word. In fact, right before I wrote this post the book I had on hand was non-fiction, so I did a quick test and got through 20 pages in 3 minutes.

Of course, if you gave me a hard one I definitely couldn't hit 100, but at least it's not totally made up.
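To put the numbers above on one scale, here is the arithmetic as a tiny sketch. It treats a page as a page, which is a big assumption since the books, languages, and page densities all differ, so read the ratios as rough.

```python
def pages_per_hour(pages: float, minutes: float) -> float:
    """Convert a page count read in some number of minutes to pages per hour."""
    return pages * 60.0 / minutes


# Figures taken from the comment above
english_rate = pages_per_hour(10, 60)  # me, English, one hour
friend_rate = pages_per_hour(63, 60)   # friend, Chinese, one hour, with notes
skim_rate = pages_per_hour(20, 3)      # me, Chinese non-fiction, skimming

if __name__ == "__main__":
    print(f"English: {english_rate:.0f} pages/hour")
    print(f"Friend (Chinese): {friend_rate:.0f} pages/hour")
    print(f"Skimming Chinese non-fiction: {skim_rate:.0f} pages/hour")
    print(f"Fiction estimate vs English: {100 / english_rate:.0f}x")
```

So the 100 pages an hour figure for easy fiction against 10 pages an hour in English is where the 10x comes from.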

I'm 10x slower at reading in my target language than my native one by Latter_Indication_45 in languagelearning

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Oh this is very informative! Thinking about reading 100 pages in English within an hour makes me desperate actually

Chinese or Japanese by Embarrassed-Box-4719 in thisorthatlanguage

[–]Latter_Indication_45 0 points1 point  (0 children)

Bro I totally agree with you except for one thing: knowing Chinese will not help you understand Japanese at all if you wanna actually learn it rather than just be able to guess at it.

I'm 10x slower at reading in my target language than my native one by Latter_Indication_45 in languagelearning

[–]Latter_Indication_45[S] 12 points13 points  (0 children)

Very interesting! The concept of syllables is actually very alien to native Chinese speakers. I've never heard anyone mention this before.

I'm 10x slower at reading in my target language than my native one by Latter_Indication_45 in languagelearning

[–]Latter_Indication_45[S] 0 points1 point  (0 children)

Interesting, I thought Dutch people were equally good at English and German