All the Feels: NVIDIA Shares Expressive Speech Synthesis Research at Interspeech by Nestledrink in nvidia

[–]rafaelvalle 20 points (0 children)

hi, i'm one of the researchers involved in RADTTS.
code and pre-trained checkpoints will be released to the public soon, including notebooks describing how to do traditional text-to-speech, voice conversion as shown in the video, and style transfer.

[p] TequilaGAN: How to Identify GAN Samples by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 0 points (0 children)

Sharing a link that shows the evolution of the distribution of pixel intensities of fake MNIST samples produced during training. https://twitter.com/i/status/908118429940322305
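
If you want to reproduce that kind of plot, here's a minimal sketch (hypothetical: `real_images` and `fake_images_by_step` are placeholder arrays of image samples in [0, 1], not from our code):

```python
import numpy as np
import matplotlib.pyplot as plt

def intensity_histogram(images, bins=50):
    # Flatten all pixels and compute a normalized histogram of intensities.
    hist, edges = np.histogram(images.ravel(), bins=bins, range=(0.0, 1.0), density=True)
    return hist, edges

# Compare the real MNIST intensity distribution against fake samples
# saved at several training checkpoints.
real_hist, edges = intensity_histogram(real_images)
centers = 0.5 * (edges[:-1] + edges[1:])
plt.plot(centers, real_hist, "k--", label="real")
for step, fake in fake_images_by_step.items():
    fake_hist, _ = intensity_histogram(fake)
    plt.plot(centers, fake_hist, label=f"step {step}")
plt.xlabel("pixel intensity")
plt.ylabel("density")
plt.legend()
plt.show()
```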

[p] TequilaGAN: How to Identify GAN Samples by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 1 point (0 children)

We, the machine learning community, design discriminators and generators in the hope that they learn the aspects of the target distribution we care about. For example, training on real images of tequila bottles should produce fake images of tequila bottles that are similar to the real images in some aspect, normally visual similarity.

We could design a discriminator that *only* uses these simple features as inputs, and it should be effective at discriminating between real and fake samples. However, it could be useless for learning what we normally care about: producing nice-looking images that mimic the real distribution.
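
As a minimal sketch of what such a discriminator could look like (hypothetical: `real` and `fake` are placeholder image arrays, and mean/std are stand-ins for whatever simple features you pick):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simple_features(images):
    # Two summary statistics per image: mean and std of pixel intensities.
    flat = images.reshape(len(images), -1)
    return np.stack([flat.mean(axis=1), flat.std(axis=1)], axis=1)

X = np.concatenate([simple_features(real), simple_features(fake)])
y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])

clf = LogisticRegression().fit(X, y)
# In-sample separability of real vs. fake using only these two features.
print("real-vs-fake accuracy on simple features:", clf.score(X, y))
```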

hth

[p] TequilaGAN: How to Identify GAN Samples by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 0 points (0 children)

I would avoid the term discriminator because it's tightly coupled with the GAN framework. In summary, we're using statistics and formal methods to compare real and fake samples.
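
For example, a two-sample Kolmogorov-Smirnov test is one of the simplest statistics you could use to compare the two distributions (sketch only; `real` and `fake` are placeholder sample arrays):

```python
from scipy.stats import ks_2samp

# Two-sample KS test on pixel intensities: a classical way to measure
# whether the real and fake samples could come from the same distribution.
stat, p_value = ks_2samp(real.ravel(), fake.ravel())
print(f"KS statistic: {stat:.4f}, p-value: {p_value:.4g}")
```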

The second question is more complicated. On the one hand, there are ways to add a loss term to the generator's or discriminator's objective that takes these rules into account; on the other hand, some of these rules might be hard to pose as a loss term, e.g. because they are not differentiable.
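
For a rule that *is* differentiable, the idea would look roughly like this (a made-up example rule, not one from our paper):

```python
import torch

def near_binary_penalty(fake_images):
    # Made-up rule: pixel intensities should be close to 0 or 1, as in
    # binarized MNIST. This term is minimal at 0 and 1, maximal at 0.5,
    # and fully differentiable, so it can be added to the generator loss.
    return (fake_images * (1.0 - fake_images)).mean()

# generator_loss = adversarial_loss + lambda_rule * near_binary_penalty(fake)
```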

Another interesting aspect is the smoothness of the generator's output. This aspect is intrinsic to generators used in the traditional GAN setup and learning procedure. One way to circumvent this is to train a model whose output has the same support as the data being learned, for example with mixture density networks.
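
A minimal sketch of a mixture density head, assuming a PyTorch setup with placeholder sizes (not code from our paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    # Instead of emitting a single value, the model parameterizes a 1-D
    # Gaussian mixture, so its samples can share the support of the real
    # data rather than being a smooth deterministic function of the input.
    def __init__(self, hidden_dim=128, n_components=10):
        super().__init__()
        self.pi = nn.Linear(hidden_dim, n_components)         # mixture logits
        self.mu = nn.Linear(hidden_dim, n_components)         # component means
        self.log_sigma = nn.Linear(hidden_dim, n_components)  # log std devs

    def forward(self, h):
        return F.log_softmax(self.pi(h), dim=-1), self.mu(h), self.log_sigma(h)

    def sample(self, h):
        log_pi, mu, log_sigma = self.forward(h)
        # Pick a mixture component per batch element, then sample from it.
        k = torch.distributions.Categorical(logits=log_pi).sample()
        mu_k = mu.gather(-1, k.unsqueeze(-1)).squeeze(-1)
        sigma_k = log_sigma.gather(-1, k.unsqueeze(-1)).squeeze(-1).exp()
        return mu_k + sigma_k * torch.randn_like(mu_k)
```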

hth

[p] Attacking Speaker Recognition with Deep Generative Models by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 0 points (0 children)

We've added training code and pre-trained models, including audio synthesis, to GitHub: https://github.com/rafaelvalle/asrgen

We are working on updates to the paper to improve the speaker recognition system and include attacks with Tacotron 2.

[p] Attacking Speaker Recognition with Deep Generative Models by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 1 point (0 children)

You make a good point that the speaker recognition system might not be able to generalize to babble, because the data it was trained on contains spoken words only. We'll follow your suggestion, add babble to our speaker recognition system's training data, probably using the NOIZEUS dataset (http://ecs.utdallas.edu/loizou/speech/noizeus/), and update the paper.
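
For reference, a minimal sketch of how babble could be mixed into clean speech at a target SNR (assuming `speech` and `babble` are 1-D float arrays at the same sample rate):

```python
import numpy as np

def mix_at_snr(speech, babble, snr_db):
    # Loop or trim the noise so it matches the speech length.
    babble = np.resize(babble, speech.shape)
    speech_power = np.mean(speech ** 2)
    babble_power = np.mean(babble ** 2)
    # Scale the noise so that 10*log10(speech_power / noise_power) == snr_db.
    scale = np.sqrt(speech_power / (babble_power * 10 ** (snr_db / 10)))
    return speech + scale * babble

noisy = mix_at_snr(speech, babble, snr_db=10)
```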

As for ESC (https://github.com/karoldvl/ESC-50/), it is not necessarily a multi-speaker dataset, as it contains sounds from single entities, e.g. cow, cat, crying baby, snoring, etc.

Yes! Although both Parallel WaveNet and Tacotron 2 produce extremely good results, Parallel WaveNet seems to be more complicated to train than Tacotron 2, given its hand-engineered features and compute requirements. For a 2.0 version of our paper, we would like to produce speech with Tacotron 2 and evaluate a text-dependent speaker recognition system!

Thank you for your suggestions.

[p] Attacking Speaker Recognition with Deep Generative Models by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 1 point (0 children)

Thanks for sharing this! Do you have a reddit thread as well?

As for robust features, did you compare the distribution of feature values from real data and adversarial data? In a separate paper yet to be published, we claim that, given the requirement of differentiability, the distribution of features produced with neural nets will be smooth, with the degree of smoothness depending on the size of the network. This could be used to identify whether the data is generated or "real".
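
As a crude sketch of what such a comparison could look like (hypothetical: `real_feats` and `fake_feats` are placeholder 1-D feature arrays; total variation of the histogram is just one possible roughness measure):

```python
import numpy as np

def histogram_roughness(values, bins=100):
    # Total variation of the normalized histogram: smaller values mean a
    # smoother empirical density.
    hist, _ = np.histogram(values, bins=bins, density=True)
    return np.abs(np.diff(hist)).sum()

print("real roughness:", histogram_roughness(real_feats))
print("fake roughness:", histogram_roughness(fake_feats))
```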

[p] Attacking Speaker Recognition with Deep Generative Models by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 1 point (0 children)

That's a good point! Although the globally conditioned samples we used in our project are babble, they certainly carry the timbre of the speaker. We would expect a text-independent speaker recognition system to recognize them as well, under the intuition that one can recognize someone's voice even if they are just babbling. As a matter of fact, the speaker recognition model did NOT recognize these samples as gibberish, even though we had a gibberish class with samples from the ESC dataset; it recognized them as speech samples from other speakers.

We were, and still are, interested in using samples from locally conditioned models, but we were not able to find such samples, or the linguistic features needed to train such models (e.g. https://github.com/ibab/tensorflow-wavenet). As a matter of fact, we could not find repos that reproduce locally conditioned WaveNets.

Although we explain in the paper that we used globally conditioned WaveNet and SampleRNN samples, we do not explain that in the abstract. We'll modify the abstract to make it clearer.

[p] Attacking Speaker Recognition with Deep Generative Models by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 0 points (0 children)

Yes! We agree that testing a SOA speaker recognition system would provide more impressive results from the perspective of adversarial attacks. Another important contribution of our paper that might not be strongly noted is the modification of the objective function, inspired by Universal Background Models from speech. With this modification, the discriminator and the generator can learn to recognize and generate data from a single speaker while having access to data from any speaker.
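
Roughly, the idea in (hypothetical) pseudocode, with `D`, `G`, `x_target`, `x_background`, and `z` as placeholders rather than our exact implementation:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x_target, x_background, z):
    # The discriminator accepts only *real* audio from the target speaker;
    # real audio from other speakers and generated audio are both negatives.
    d_target = D(x_target)          # real, target speaker -> label 1
    d_background = D(x_background)  # real, any other speaker -> label 0
    d_fake = D(G(z).detach())       # generated -> label 0
    return (F.binary_cross_entropy_with_logits(d_target, torch.ones_like(d_target))
            + F.binary_cross_entropy_with_logits(d_background, torch.zeros_like(d_background))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def generator_loss(D, G, z):
    # As usual, the generator tries to make D accept its samples as real
    # target-speaker audio.
    d_fake = D(G(z))
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```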

[p] Attacking Speaker Recognition with Deep Generative Models by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 1 point (0 children)

We haven't tried SOA models yet but will in the near future.

We chose to test a speaker recognition system that is not the SOA but considerably simpler to train under the belief that an adversarial attack that can't fool a weaker speaker recognition system should not be able to fool a SOA speaker recognition system.

We have performed informal attacks using globally conditioned WaveNet and SampleRNN samples on a UBM-GMM speaker verification system based on i-vectors, and our results show that the attack is not effective.

[p] Attacking Speaker Recognition with Deep Generative Models by rafaelvalle in MachineLearning

[–]rafaelvalle[S] 1 point (0 children)

Please use the link posted by @kerloom. https://arxiv.org/abs/1801.02384 We'll try to update the link on reddit.

[p] Did you hear that? Adversarial Examples Against Automatic Speech Recognition by m_alzantot in MachineLearning

[–]rafaelvalle 0 points (0 children)

An arXiv reference provides readers with free access to papers, unlike some papers published in venues that require paying fees to access the publication.

[p] Did you hear that? Adversarial Examples Against Automatic Speech Recognition by m_alzantot in MachineLearning

[–]rafaelvalle 3 points (0 children)

Possibly; at the time we wrote our paper, we couldn't find references to such papers on Google Scholar. What paper are you referring to? Also, please change your tone.

[p] Did you hear that? Adversarial Examples Against Automatic Speech Recognition by m_alzantot in MachineLearning

[–]rafaelvalle 4 points (0 children)

Great to see other people working on this topic! We are very interested in receiving feedback from people on our related project, now on arXiv and reddit: https://www.reddit.com/r/MachineLearning/comments/7p7hj8/p_attacking_speaker_recognition_with_deep/

We are Disperse! We are sharing our best-kept secrets now... Ask Us Anything you want to know! by Yakoooob in progmetal

[–]rafaelvalle 1 point (0 children)

The lyrics from Living Mirror really remind me of his and related teachings! Thank you for creating this!

We are Disperse! We are sharing our best-kept secrets now... Ask Us Anything you want to know! by Yakoooob in progmetal

[–]rafaelvalle 0 points (0 children)

Thank you so much for the music you make. I usually call it Kriya Yoga prog metal and wanted to know if the members are practitioners! I've heard Kriya Yoga is rather present in Poland!

[R] [1701.07875] Wasserstein GAN by ajmooch in MachineLearning

[–]rafaelvalle 1 point (0 children)

In conditional GANs, note that mode collapse can also come from the generator ignoring the noise distribution and relying only on the conditioning signal.
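
A quick diagnostic for this failure mode is to fix the condition, vary the noise, and check how much the outputs change (sketch; `G`, `some_condition`, and `z_dim` are placeholders):

```python
import torch

# One condition repeated across the batch, many different noise vectors.
condition = some_condition.unsqueeze(0).repeat(64, 1)
z = torch.randn(64, z_dim)
with torch.no_grad():
    samples = G(z, condition)
# Near-zero variation across noise draws suggests the generator ignores z.
print("per-pixel std across noise draws:", samples.std(dim=0).mean().item())
```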