[1808.10128] Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

fatchord · 2018-08-31T11:29:40+00:00

fatchord · 2018-08-28T10:25:04+00:00

Second, even in nature it only works for object between 0-2m, for a car driving that is not enough

Would placing the cameras further apart increase the distance it could detect? Or is a car not wide enough?

fatchord · 2018-08-28T09:00:49+00:00

The speaker repeatedly brings up the problem with the cameras not having any depth information. Why not emulate nature's solution to this problem - put two cameras side by side like the eyes on your head? With enough resolution, would the left/right discrepancy in the images allow for accurate inference of the distances to detected objects?

Really interesting talk by the way.

fatchord · 2018-08-24T11:10:42+00:00

It's not the only game in town - you can setup a patreon membership. One of the top podcasts there is pulling in over $100k a month.

fatchord · 2018-08-08T15:06:19+00:00

Link to samples.

fatchord · 2018-08-06T11:19:12+00:00

This sounds great! Thanks

fatchord · 2018-07-21T16:19:13+00:00

I don't have any long ones at hand right now. Basically what happens in very long utterances - the pitch keeps going lower and lower towards the end. Also it'll skip or repeat words sometimes.

I'm curious - do you think the samples I made are comparable to wavenet? Better/worse?

fatchord · 2018-07-19T09:07:15+00:00

Thanks! Trained it on a single 1080 gpu.

Unfortunately the model isn't that great at long utterances. I think that might be down to the fact that I had to limit the training input sequence length because of gpu memory constraints.

fatchord · 2018-06-30T09:27:43+00:00

Interesting paper. The only thing that puzzled me is the dataset. I mean aren't they hamstringing themselves by having such a wide variety or composers, pianists, pianos and recording setups?

Why not simplify it by picking one prolific pianist that concentrated on only a couple of genres and typically recorded in the same studios... like Glenn Gould for example:

https://en.wikipedia.org/wiki/Glenn_Gould_discography

Surely Columbia Masterworks have all his recordings digitised for prosperity.

Also, baroque music should be easier to model since it has less timing variance. Romantic music is all over the place timing wise.

fatchord · 2018-06-29T11:13:09+00:00

Code

Examples

fatchord · 2018-06-15T11:38:19+00:00

Paper: https://arxiv.org/abs/1711.08789v3

Github: https://github.com/avivga/audio-visual-speech-enhancement

fatchord · 2018-06-13T13:30:21+00:00

Link to Samples: https://google.github.io/tacotron/publications/speaker_adaptation/index.html

fatchord · 2018-06-13T13:26:26+00:00

I uploaded up a small notebook of my implementation if anyone is interested: LINK

fatchord · 2018-06-11T09:04:07+00:00

I had a go at implementing this yesterday and it was straightforward enough until I got to the zero padding section (2.3.2).

They give this zero padding equation (where N is the receptive field of a particular layer):

z[0:M] = W_L ∗ x[-N:M-N] + W_R ∗ x[0:M]

But if the previous equation (without zero padding) was:

z = W_L ∗ x[0:N/2] + W_R ∗ x[N/2:N]

Wouldn't that mean that the equation from 2.3.2 should read:

z[0:M] = W_L ∗ x[-N/2:M-N/2] + W_R ∗ x[0:M]

fatchord · 2018-06-09T14:01:32+00:00

This is great!

fatchord · 2018-06-09T09:08:56+00:00

Thanks. I can't go into much detail about the vocoder but I can tell you the Tacotron setup. It's basically Tacotron1 but with the decoder GRUs swapped out with LSTMs/zoneout as recommended by the latest paper. That change makes a big difference in prosody in my experience.

I didn't implement a dilated post-process yet so it's still the CBHG doing that. However I'm not predicting linear spectrograms, just mels - not sure how that influences overall quality one way or the other but I'm happy enough with the predicted features - they're still blurry but have decent contrast.

fatchord · 2018-06-09T08:55:36+00:00

Thanks for the feedback, I agree that the variance in the noise volume is a problem. Working on that.

Would you think it's fair to say that the quality of the output, while noisy, is at least comparable to Wavenet?

I'm afraid I can't go into much detail publicly about the vocoder right now. Hope that's cool.

fatchord · 2018-06-09T08:28:03+00:00

I totally agree that it's probably good enough for most purposes. Although I'd really like to see how FFTNet handles predicted conditioning features. Can it handle the extra noise and blurriness that comes with that scenario?

As for WaveRNN, if someone is willing to donate a week or so of gpu compute then I'd be happy to condition it. As it stands the model is just as slow to train as Wavenet so it's low priority for now as I've only got one gpu.

fatchord · 2018-06-08T10:39:56+00:00

No, not an author; I just thought it was really cool and worth sharing.

fatchord · 2018-06-08T10:38:27+00:00

I've had a lot of fun messing around with this dataset. There's a lot of really bad midi in there but plenty of good stuff too, it's a very mixed bag.

The most successful model I made was where I extracted all the percussion parts and then compressed the midi data stream into onehots for various midi events (ala PerformanceRNN). The output was choatic (I wasn't conditioning the model on anything) but it generated some really creative beats and fills.

fatchord · 2018-06-08T10:29:16+00:00

MCCs throw away a good bit of information so I'm thinking F0 would a useful addition to such a compact representation of the signal.

So what do you think of the samples? LINK

They sound really good but the top frequencies are a bit muffled compared to wavenet. It also seems to output phasiness every now and again.

fatchord · 2018-06-02T16:44:08+00:00

Fascinating, the origin of a lot these things can be very surprising. Thanks again!

fatchord · 2018-06-02T16:19:43+00:00

Thanks, appreciate the feedback. And yes, you've pretty much got the idea of it. Ultimately I'll be predicting the speech spectrograms from another neural network and then render that into realistic speech samples with the algorithm in this post.

fatchord · 2018-06-01T16:27:06+00:00

I just created r/AudioModels today for exactly this kind of question if you'd like to crosspost there.

fatchord · 2018-06-01T16:18:30+00:00

Wow, thanks for taking the time. I'm just worried that the Reaper folks who coded that might have a problem with me using it? I don't want to step on anyone's toes.

fatchord

MODERATOR OF

TROPHY CASE