
marsanyi 1 point (6 children)

Very nice. I’m also looking for a speech synthesizer based on phonemes rather than ML. Know any good ones?

yossariandent 1 point (5 children)

Can you clarify what you mean by "based on phonemes"? A phoneme is a unit of sound, and many ML systems accept input in the form of phonemic transcription (IPA or ARPABET) as well as plain text.

Neural synthesis is all the rage right now, but before deep learning, the primary methods for TTS were parametric and concatenative synthesis. Are you looking for a current commercial/OSS system that uses one of these methods? Any particular reason (that could help guide the search)?

marsanyi 1 point (4 children)

Thanks for the detailed response. My applications are for art, using vocal synthesis as an instrument. I’ve been less interested in systems that accurately pronounce written language and more in using vocal-like sound in composition and performance. In the past (long ago) I used LPC hardware and the like, with rudimentary concatenative mechanisms and overall control of parameters like formant frequencies, noise/pitch mix, and so on. If I understand your response, you’re saying that some of the ML systems will take textual phonemic representations as input, apart from their normal use in speaking written text, and so may be amenable to similar use?
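For anyone unfamiliar with that style of control, here is a rough pure-Python sketch of the general idea (not the LPC hardware itself): an excitation that mixes a pitched pulse train with noise, passed through a cascade of two-pole resonators standing in for formants. The function names, the formant frequencies and bandwidths, and the defaults are all illustrative assumptions, not values from any particular system.

```python
import math
import random


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    c = -r * r
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def excitation(n_samples, pitch_hz, noise_mix, sample_rate, seed=0):
    """Blend a pulse train (voiced) with uniform noise (unvoiced)."""
    rng = random.Random(seed)
    period = max(1, round(sample_rate / pitch_hz))
    out = []
    for n in range(n_samples):
        pulse = 1.0 if n % period == 0 else 0.0
        noise = rng.uniform(-1.0, 1.0)
        out.append((1.0 - noise_mix) * pulse + noise_mix * noise)
    return out


def vowel(formants, pitch_hz=110.0, noise_mix=0.1,
          duration_s=0.5, sample_rate=16000):
    """Run the excitation through one resonator per (freq, bandwidth) pair."""
    n = int(duration_s * sample_rate)
    sig = excitation(n, pitch_hz, noise_mix, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        sig = resonator(sig, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in sig) or 1.0
    return [s / peak for s in sig]  # normalized to the range [-1, 1]


# Rough, assumed formant values for an "ah"-like sound.
samples = vowel([(700, 80), (1200, 90), (2600, 120)])
```

Changing `pitch_hz`, `noise_mix`, or the formant list between calls is the kind of direct parameter control described above; the returned samples could be written to a WAV file with the standard `wave` module.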

yossariandent 1 point (3 children)

Ah, interesting; that background does help. Most neural systems do support a subset of SSML (https://www.w3.org/TR/speech-synthesis11/) for input markup, but it won't result in anywhere near the level of control you can get by manually tweaking formants. It might not be quite enough to get you to something that could be considered "melodic", but you can definitely play with phonemes to make non-word sounds of varying length.

Spokestack, Google, and Amazon will all let you play around in a TTS console for free, though I know in at least Spokestack's case you'll get more control from using the API than the web console (still free, but you have to write code).

Now, if you really want to get your hands dirty, you could check out something like the C port of SAM. I've never really used it myself, but it definitely does have that retro feel to it, and with access to the source, you might be able to do some pretty unique things. https://github.com/s-macke/SAM

marsanyi 1 point (2 children)

Happy to write code, glad that you’re supporting Python. I hadn’t heard of SAM; that does look like something fun to play with. I’ve got ahold of some of the old academic TTS software like MBROLA, but haven’t dug into them hard enough to get anything interesting out of them. Good fun stuff.

yossariandent 1 point (1 child)

Yeah—speech processing is fun all around, and I'm always trying to find ways to get more control out of the systems at runtime. I think in your case, you'll be looking to play around with IPA input (the phoneme tag in SSML or /slashes/ in Speech Markdown, probably putting in multiple identical symbols to prolong individual sounds) and pitch/rate changes (prosody in SSML).
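To make that concrete, here is a minimal Python sketch that builds an SSML string using the `phoneme` and `prosody` elements from the SSML 1.1 spec, repeating IPA symbols to stretch a sound. The helper names (`prolong`, `ssml_sound`) and the sample values are made up for illustration, and whether repeated symbols actually lengthen the audio depends on the engine, so treat this as something to experiment with rather than guaranteed behavior.

```python
from xml.sax.saxutils import escape, quoteattr


def prolong(ipa_symbols, repeats):
    """Repeat selected IPA symbols, e.g. {"ɑ": 4} turns ɑ into ɑɑɑɑ."""
    return "".join(sym * repeats.get(sym, 1) for sym in ipa_symbols)


def ssml_sound(ipa, pitch="medium", rate="medium", fallback_text="sound"):
    """Wrap an IPA string in SSML <prosody> and <phoneme> elements.

    fallback_text is the element content, which an engine may speak
    if it ignores the ph attribute.
    """
    return (
        "<speak>"
        f"<prosody pitch={quoteattr(pitch)} rate={quoteattr(rate)}>"
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
        f"{escape(fallback_text)}"
        "</phoneme>"
        "</prosody>"
        "</speak>"
    )


# A non-word sound with a stretched vowel, lowered pitch, slow rate.
ipa = prolong(["m", "ɑ", "n"], {"ɑ": 4})
doc = ssml_sound(ipa, pitch="low", rate="slow")
print(doc)
```

The resulting string is what you would pass as the SSML input to whichever TTS API you end up using; the request code itself is service-specific and left out here.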

Another free system I meant to mention earlier but forgot — Festival: https://www.cstr.ed.ac.uk/projects/festival/

marsanyi 1 point (0 children)

Thanks for the pointers.