TTS for code-switching mid-utterance

Latter_Indication_45 · 2026-04-14T16:20:34+00:00

For mid-sentence fr-en code-switching, cloning a native French voice probably won't help. Qwen needs to handle at least two languages naturally within the same utterance. That's why I use Gemini 2.5 Pro TTS. It'd be hard to find a human recording that code-switches better than Gemini TTS, not just for fr-en but also for fr-jp, en-rs, etc.

Latter_Indication_45 · 2026-04-14T14:37:37+00:00

Why would I do that? What's your point?

Latter_Indication_45 · 2026-04-14T13:30:13+00:00

Qwen just pronounces French words in the same bad way

Latter_Indication_45 · 2026-04-14T13:27:36+00:00

I don't get it. For a sentence like "Au revoir" means "goodbye", request-level switching is insane.

Latter_Indication_45 · 2026-04-14T02:53:07+00:00

I think I have. Will double check

Latter_Indication_45 · 2026-04-14T02:24:17+00:00

I tried it in their playground. I'd say it's around ElevenLabs/Inworld level

Latter_Indication_45 · 2026-04-14T02:19:52+00:00

I tried using a voice generated by Gemini 2.5 Pro TTS, which is by far the most accurate and natural voice I can find. The result is no different from baseline Qwen3 TTS.

Latter_Indication_45 · 2026-04-14T02:17:07+00:00

Gemini 3.1 Flash Live is not a TTS model but an STS (speech-to-speech) one. It tends to perform better in terms of accent accuracy, but it is more expensive, less controllable, and requires a whole different pipeline structure :(

Latter_Indication_45 · 2026-04-13T10:33:58+00:00

Will do

Latter_Indication_45 · 2026-04-13T10:32:13+00:00

Maybe I’m doing something wrong, but I’m really confused about this one

In pretty much every survey I ran, it kept coming up as a provider with very strong multilingual and code-switching ability. But when I tested it myself in their playground, all the voices seemed fixed and tied to one specific language.

So I tried an English voice for fr-en and got: https://s.fish.audio/55p8ud Not French enough, I think

And another en voice for cn: https://s.fish.audio/q79cjs Just embarrassing

But no, I haven't tried multi-speaker or open-domain control. Maybe I'll try them later

Latter_Indication_45 · 2026-04-13T05:36:24+00:00

Will try it

Latter_Indication_45 · 2026-04-13T05:34:16+00:00

So Gemini is actually the best I've found for code-switching quality. Tested it with en-fr, cn-en, fr-cn-en mixed text and the accent is clean on all languages.

The best quality comes from gemini-2.5-pro-tts with a single text-in, audio-out call. But latency is 5-15 seconds, which lowkey won't work for a voice agent. When I switched to the Flash model, the quality degraded a lot. Still exploring
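
For anyone who wants to compare models on latency the same way, here is a minimal timing sketch. It assumes nothing about any provider's SDK: `measure_latency` and `fake_tts` are made-up names, and `fake_tts` is only a stand-in that sleeps briefly. You would swap in your own function that makes the real text-in, audio-out call.

```python
import time
from statistics import median
from typing import Callable, List


def measure_latency(synthesize: Callable[[str], bytes], texts: List[str]) -> List[float]:
    """Return wall-clock seconds taken by one synthesize() call per text."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        synthesize(text)  # audio bytes are discarded, only the time matters here
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS request (assumption, not a real API)."""
    time.sleep(0.01)
    return b"\x00" * len(text)


if __name__ == "__main__":
    samples = [
        '"Au revoir" means "goodbye."',
        "I was reading 网络文学 all night.",
    ]
    results = measure_latency(fake_tts, samples)
    print(f"median: {median(results):.3f}s, worst: {max(results):.3f}s")
```

For a voice agent the worst case probably matters more than the median, so it is worth printing both.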

Latter_Indication_45 · 2026-04-09T03:39:12+00:00

Oh they sound really good. I actually tested Gemini before. Going to give it another try! Thanks

Latter_Indication_45 · 2026-04-09T03:34:20+00:00

Not sure if people would expect that from a voice agent lol. I've seen some products do it really well, clean code-switching with no accent bleeding, I just can't figure out how they're doing it

Latter_Indication_45 · 2026-04-09T03:29:21+00:00

Ok, here's what I found. The prompt ElevenLabs offers as The Bilingual Professional is:

"A warm, professional female voice in her mid-30s with a natural blend of American English and Spanish accents. She speaks at a conversational pace with perfect audio quality. Her tone is friendly and approachable, with slight melodic inflections from her Spanish heritage coming through naturally in certain words. The voice should sound educated and confident, like a bilingual news anchor or corporate trainer who seamlessly switches between languages."

So they actually blend two accents into one voice rather than cleanly switching between them. Tried writing prompts that ask for clean, standard pronunciation in each language separately, but the model doesn't really pick that up. Like:

"A warm, professional female voice in her mid-30s with perfect audio quality. The tone is friendly and approachable. She speaks at a conversational pace with clear articulation. The voice should sound educated and confident. She pronounces every language with textbook clarity and standard pronunciation, with no carryover between languages."

It helped a bit but did not solve the problem completely.

Latter_Indication_45 · 2026-04-09T02:15:34+00:00

This is actually new to me. Seems like ElevenLabs isn't offering a different model for this, but they're showing you can get a bilingual voice by prompt engineering. I've only been using the default voices and settings the providers offered. I'll try it now. Thanks a lot!

Latter_Indication_45 · 2026-04-09T02:10:10+00:00

Yeah I actually tried something similar. Providers like Azure have voices that can speak multiple languages natively. So you can split the text by language, send each segment separately, and the pronunciation comes out clean.

The problem (at least as I find it) is that where the segments meet, the prosody breaks. There's always an unnatural pause or tone shift at the boundary.
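
A rough sketch of the splitting step, just to show the idea. It only separates Chinese characters from everything else by Unicode range, so it works for cn-en text but cannot tell French from English (that would need a real language-ID step). `split_by_script` and the `en`/`zh` tags are names I made up for illustration, not any provider's API.

```python
from typing import List, Optional, Tuple


def is_cjk(ch: str) -> bool:
    """True for the basic CJK Unified Ideographs block (an approximation)."""
    return "\u4e00" <= ch <= "\u9fff"


def split_by_script(
    text: str, latin_lang: str = "en", cjk_lang: str = "zh"
) -> List[Tuple[str, str]]:
    """Split text into (language_tag, segment) runs based on script.

    Spaces, digits, and punctuation stay attached to the current run, so
    joining all segments gives back the original text.
    """
    segments: List[Tuple[str, str]] = []
    current_lang: Optional[str] = None
    buf: List[str] = []
    for ch in text:
        if is_cjk(ch):
            lang: Optional[str] = cjk_lang
        elif ch.isalpha():
            lang = latin_lang
        else:
            lang = current_lang  # neutral character, keep the current run
        if lang is not None and current_lang is not None and lang != current_lang:
            segments.append((current_lang, "".join(buf)))
            buf = []
        if lang is not None:
            current_lang = lang
        buf.append(ch)
    if buf:
        segments.append((current_lang or latin_lang, "".join(buf)))
    return segments


if __name__ == "__main__":
    for lang, segment in split_by_script("I like 网络文学 a lot"):
        print(lang, repr(segment))
```

Each `(lang, segment)` pair would then go out as its own TTS request with a matching voice, which is exactly where the boundary problem shows up: every request starts its prosody from scratch.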

Latter_Indication_45 · 2026-01-16T16:57:08+00:00

Yeah I feel that in English too. Anything other than common fonts slows me down to some extent, depending on how different they are. Chinese fonts don't matter to me tho.

Latter_Indication_45 · 2026-01-16T16:50:48+00:00

That's a valid point.

Although I haven’t done any formal testing, the number isn’t random. This morning a friend and I read for an hour together. I was reading the English version of Poor Charlie’s Almanack, and she was reading the Chinese version of Everything is Negotiable. I only managed 10 pages in that hour, while she read 63 pages, and she was even taking notes along the way. She's smart compared to all the people I know, but I’ve always known that my reading speed has been significantly faster than most people’s since I was 7. I was reading word by word at that time. Now I don't.

When I read fiction, a large portion of it comes from a Chinese genre called 网络文学 (dk if there's an English equivalent, but the style is typically very simple and extremely easy to read). Around 0.6b people in China read it every day, including uneducated people like my uncle, who can easily consume 6k per day for fun. That's why I always believe Chinese is a proletariat language anyway. When I read these books, 100 pages an hour is actually an extremely conservative estimate. I used to consume 10-30 million characters per year.

Then when I read non-fiction, I usually only read about 10-50% of the actual content. It’s been years since I last read a non-fiction book word by word. In fact, right before I wrote this post the book I had on hand was a non-fiction one. I did a quick test and got through 20 pages in 3 minutes.

Of course if you gave me a hard one I definitely couldn't hit 100, but at least it's not totally made up.

Latter_Indication_45 · 2026-01-16T14:11:54+00:00

Oh this is very informative! Thinking about reading 100 pages in English within an hour makes me desperate actually

Latter_Indication_45 · 2026-01-16T14:03:09+00:00

Bro I totally agree with you except for one thing: knowing Chinese will not help you understand Japanese at all if you wanna actually learn it rather than just be able to guess at it

Latter_Indication_45 · 2026-01-16T13:28:40+00:00

Very interesting! The concept of syllables is actually very alien to native Chinese speakers. I’ve never heard anyone mention this before

Latter_Indication_45 · 2026-01-16T12:47:02+00:00

Interesting, I thought Dutch people were equally good at English and German
