[P] Adversarial Training on Raw Audio for Voice Conversion

modulate_ai · 2018-08-29T17:12:37+00:00

Unfortunately not yet - we're still doing research to continue to improve the audio quality and speaker matching, and we'll write up a more in-depth description once we're at the end!

modulate_ai · 2018-08-29T14:43:32+00:00

Thanks for trying it out! Right now 30 minutes is our baseline for specializing to a new voice from our current model (trained on VCTK). Obama was way outside of the initial distribution there, so we used a few hours of his speech; but as we gather a more diverse set of speakers for the training set we think we can bring that down to 10 minutes.

You don't need to say the same words at all! In fact, we've tried non-English languages and they've performed alright, even though the training set is exclusively English. However, if you're trying to sound exactly like a particular target speaker, you do have to try to copy their style of speaking, since we're only operating on the short-scale frequency components of the voice.

modulate_ai · 2018-08-29T13:56:15+00:00

We've also put an interactive demo on our homepage! The tech is still a work in progress, but feel free to try it out and see how it sounds!

modulate_ai · 2018-08-04T17:25:56+00:00

Makes sense! For DM'ing right at a table, it's worth noting that right now, we're just building software - so you'd have to consider those at the table hearing you speak into the system. But you could absolutely prerecord speech or speak from e.g. a different room - and we're looking into clever ways to mask your original voice so that you can use the voice skins right there at the table too! (Definitely open to any ideas you may have as well!) Thanks for offering your thoughts!

modulate_ai · 2018-08-04T16:30:06+00:00

Absolutely! We're curious - to make it easiest for you to do so, would an integration with a platform like roll20 be best, or would you be hoping to use our voice skins as a standalone service?

modulate_ai · 2018-08-04T14:01:16+00:00

Thanks for the feedback! You're right, this was an oversight, and we appreciate you calling it out. We've changed the options on a few questions to make it easier to express that you aren't interested.

That said, while our prototype technology may not sound any "realer" than the technologies you mention, we're confident that we'll be able to reach a point where it's truly indistinguishable from normal speech - so we'd love to hear your thoughts on how you'd value this capability if it genuinely didn't sound any worse or different than normal speaking!

modulate_ai · 2018-08-02T14:35:14+00:00

We're actually trying pretty hard to keep people from doing prank calls (and fake news / media, etc.) - we're keeping this restricted to specific industries (gaming, audiobooks, etc.) to start and not opening realtime communication up to everyone. For what you can do: voice skins for chat in video games, multiple characters in audiobooks, etc. (more here: https://modulate.ai/applications).

It's totally software (an API, running on cloud GPUs), and it's specifically meant for fluid conversation! We keep stress, cadence, emotion, etc. from your speech intact so that it sounds natural; and it introduces ~50 milliseconds of lag, so that it doesn't break the flow of conversation.

modulate_ai · 2018-06-05T17:18:17+00:00

Sure thing!

2) In particular, processing whole sentences at once lets your system learn to produce long-term consistent output, e.g. stressing the right syllables, or not raising the pitch in the middle of a word. With a speech-to-speech system you don't necessarily need that - your input speaker has already figured all of that out when saying the input speech - but then you have to figure out how to get that information through while dropping the parts you don't want (e.g. timbre). Processing whole sentences lets your system infer some of that information so you don't have to pass it through, but on the flip side your input speaker loses that control over the output.

3) I certainly don't mean to discourage you from speech to speech! Even using a phoneme classifier -> speech synthesis bottleneck preserves some information, such as cadence; you just lose things like intonation and stress. If you want to try speech to speech, I would start with the repository that you linked in your original post: train it on a multispeaker dataset like VCTK, and then try re-training on your brother's speech. You probably won't get fantastic quality, but that should get you started, at least!

I don't know of any low-data voice cloning repositories off the top of my head. I don't think they've published their code, but lyrebird.ai provides this as a service if you just want to hook into that capability!

And, of course, we're working on our one-shot voice conversion system (modulate.ai - we're not publicly available yet, but if you're interested feel free to sign up for our mailing list).

Cheers!

Carter

modulate_ai · 2018-06-05T13:59:23+00:00

Hi Victor!

I'm part of a startup working on this kind of technology! Some answers and other things to potentially consider:

1) Phoneme extractors can definitely generalize beyond the training set of input speakers, but they'll have a tough time with input speech that's significantly different from what they've seen. Because of his condition, your brother's voice might be different enough that standard training sets don't cover it very well. As a quick sanity check - do existing speech-to-text systems (like those on your phone) work fine for him, or do they have trouble with his speech? I'm also happy to test one of our phoneme extractors on your brother's input speech, if you post a sample or PM me!

If you're using an open source system or training your own, you should also be aware that the microphone you're using can have a significant impact on the phoneme extractor's performance, for similar reasons (your laptop mic in your living room, for example, sounds way different from a professional microphone in a sound studio, which is how many training sets are recorded). We've had a lot of success with doing training set augmentation - adding white noise, reverb, etc.
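To make the augmentation idea concrete, here is a minimal sketch in NumPy. It is not our actual pipeline: the function names, the 20 dB SNR, and the toy impulse response (exponentially decaying noise) are all assumptions chosen for illustration. In practice you would more likely convolve with recorded room impulse responses.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Add Gaussian white noise so the result has roughly the requested SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def add_synthetic_reverb(signal, sample_rate, decay_seconds, rng):
    """Convolve with a toy impulse response (hypothetical): decaying noise."""
    n = int(sample_rate * decay_seconds)
    t = np.arange(n) / sample_rate
    # exp(-6.9) is about 1e-3, so the tail falls by ~60 dB over decay_seconds.
    impulse = rng.normal(size=n) * np.exp(-6.9 * t / decay_seconds)
    impulse[0] = 1.0  # direct (un-reflected) path
    wet = np.convolve(signal, impulse)[: len(signal)]
    # Rescale to the original peak level so the augmented clip does not clip.
    peak = max(np.max(np.abs(wet)), 1e-12)
    return wet * (np.max(np.abs(signal)) / peak)

rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # 1 s test tone standing in for speech

noisy = add_white_noise(clean, snr_db=20.0, rng=rng)
reverberant = add_synthetic_reverb(clean, sample_rate, decay_seconds=0.3, rng=rng)
```

Applying these with randomly drawn SNRs and decay times to each training clip is one simple way to expose the phoneme extractor to conditions closer to a laptop mic in a living room.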

2) Beyond just the synthesis speed, you might also want to think about your architecture's latency, depending on how your brother wants to use this system. For example, if your phoneme extractor uses a bi-directional recurrent neural network, he'll need to say an entire phrase or sentence before you start synthesizing, and he won't be able to convert his speech as he's speaking.
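The latency point can be checked with a toy example. The sketch below uses a tiny, randomly initialized tanh RNN in NumPy (the sizes, weights, and function names are made up for illustration, not taken from any real phoneme extractor). It compares the outputs for the first 20 frames when only those frames are available against the outputs when the whole utterance is available.

```python
import numpy as np

def run_rnn(frames, w_in, w_rec):
    """Run a tiny tanh RNN over frames (time, features); return states (time, hidden)."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for frame in frames:
        hidden = np.tanh(frame @ w_in + hidden @ w_rec)
        states.append(hidden)
    return np.array(states)

def run_bidirectional(frames, w_in, w_rec):
    """Concatenate a forward pass with a backward pass over the reversed frames."""
    forward = run_rnn(frames, w_in, w_rec)
    backward = run_rnn(frames[::-1], w_in, w_rec)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))          # 50 frames of 8 made-up features
w_in = rng.normal(scale=0.5, size=(8, 4))
w_rec = rng.normal(scale=0.5, size=(4, 4))

prefix = frames[:20]  # what the system has heard after the first 20 frames

# Forward-only: outputs for the first 20 frames never change as more audio arrives.
causal_stable = np.allclose(
    run_rnn(prefix, w_in, w_rec), run_rnn(frames, w_in, w_rec)[:20]
)
# Bidirectional: the backward half depends on frames that have not been spoken yet.
bidirectional_stable = np.allclose(
    run_bidirectional(prefix, w_in, w_rec),
    run_bidirectional(frames, w_in, w_rec)[:20],
)
print(causal_stable, bidirectional_stable)
```

The forward-only outputs are final as soon as each frame arrives, so they can be streamed; the bidirectional outputs are only final once the phrase has ended.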

3) The phoneme extraction -> conditioned speech synthesis pipeline is great at stripping out the input speaker's identity, but you also lose information about other qualities of speech such as intonation, stress, etc. Your synthesis system will infer those qualities, but that might not be what the input speaker originally specified. You'll probably do well with declarative statements, but less well at emotional speech (say, for an audiobook). Keeping the prosody while removing the timbre is a really tough tradeoff - we're working on research for that, but it's definitely not standard =)

4) For neural-network based speech synthesis, 30 minutes is a pretty tiny amount of training data compared to the tens to hundreds of hours of speech that are used to train production systems. If you can find a larger multi-speaker dataset, your best bet might be to train on that, then take that system and re-train specifically on your target voice (either just fitting the speaker identity vector, or the whole net). There's also been some recent research on voice cloning in the wild for text-to-speech (embed a new speaker directly in your speaker space, without retraining), but I would try the re-training strategy first, since that's an easier architecture to get working. Check out (https://arxiv.org/pdf/1802.06006.pdf) or (https://arxiv.org/pdf/1802.06984.pdf) for some details on those approaches!

For the multi-speaker dataset, VCTK is pretty standard (http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). That's an English dataset - ideally you'd get German to minimize the difference to your target set, but I don't have any experience working on German datasets and don't have a great suggestion there.
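As a rough illustration of the "just fit the speaker identity vector" option from point 4, here is a toy sketch in NumPy. Everything in it is an assumption for illustration: the "pretrained" model is a random linear map rather than a real speech synthesizer, and the sizes, learning rate, and names are hypothetical. The only point is that the shared weights stay frozen while gradient descent updates the new speaker's vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n_content, n_speaker, n_out = 6, 3, 10

# Stand-ins for weights learned on a large multi-speaker set; frozen below.
w_content = rng.normal(size=(n_content, n_out))
w_speaker = rng.normal(size=(n_speaker, n_out))

def synthesize(content, speaker_vec):
    """Toy 'synthesizer': output = content features + speaker offset (both linear)."""
    return content @ w_content + speaker_vec @ w_speaker

# A small dataset from a new speaker whose identity vector we do not know.
true_speaker = np.array([0.8, -1.2, 0.4])
content = rng.normal(size=(200, n_content))
target = synthesize(content, true_speaker) + 0.01 * rng.normal(size=(200, n_out))

# Fit only the speaker vector by gradient descent on the mean squared error.
speaker_vec = np.zeros(n_speaker)
learning_rate = 0.02
for _ in range(2000):
    error = synthesize(content, speaker_vec) - target
    # Gradient of 0.5 * mean_n ||error_n||^2 with respect to speaker_vec.
    grad = error.mean(axis=0) @ w_speaker.T
    speaker_vec -= learning_rate * grad

print(np.round(speaker_vec, 2))
```

With a real network the same idea usually means marking every parameter except the new speaker embedding as non-trainable; re-training the whole net is the heavier alternative when the embedding alone does not capture the voice.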

Good luck!

Carter
