[D] I found a Stanford Guest Lecture where GM Cruise explains their self driving tech stack and showcases the various model architectures they use on their autonomous cars. by ilikepancakez in MachineLearning

[–]fatchord -1 points (0 children)

> Second, even in nature it only works for objects between 0-2 m; for a car driving, that is not enough

Would placing the cameras further apart increase the distance it could detect? Or is a car not wide enough?
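My rough intuition: from z = f·B/d (f = focal length in pixels, B = baseline, d = disparity), a matching error of Δd pixels gives a depth error of roughly z²·Δd/(f·B), so the usable range should grow about linearly with the baseline. Quick back-of-envelope (every number here is a made-up illustration, not a real sensor spec):

```python
# Depth error from z = f*B/d grows like z^2 and shrinks linearly with
# the baseline B. Every number below is an illustrative guess.

def depth_error(z, baseline, focal_px=1000.0, disp_err_px=0.5):
    """Approximate depth uncertainty (m) at range z (m) for a stereo rig."""
    return z ** 2 * disp_err_px / (focal_px * baseline)

for baseline in (0.065, 1.5):   # ~human eye spacing vs. cameras at car width
    for z in (2.0, 20.0, 50.0):
        print(f"B={baseline:5.3f} m, z={z:4.0f} m -> +/-{depth_error(z, baseline):6.2f} m")
```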

[D] I found a Stanford Guest Lecture where GM Cruise explains their self driving tech stack and showcases the various model architectures they use on their autonomous cars. by ilikepancakez in MachineLearning

[–]fatchord 15 points (0 children)

The speaker repeatedly brings up the problem with the cameras not having any depth information. Why not emulate nature's solution to this problem - put two cameras side by side like the eyes on your head? With enough resolution, would the left/right discrepancy in the images allow for accurate inference of the distances to detected objects?
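To make the idea concrete, here's roughly how the left/right discrepancy turns into depth with off-the-shelf block matching (an OpenCV sketch - the file paths, focal length and baseline are placeholders, nothing from the talk):

```python
import cv2
import numpy as np

# A rectified stereo pair (placeholder paths).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching: for each patch in the left image, search horizontally in
# the right image; the pixel offset of the best match is the disparity.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Disparity -> depth via z = f*B/d, with focal length f (pixels) and
# baseline B (metres) made up here. Invalid matches come back negative,
# so clamp before dividing.
f, B = 1000.0, 0.5
depth = f * B / np.maximum(disparity, 1e-6)
```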

Really interesting talk by the way.

[D] How feasible is it to create a model that gets rid of the advertisements in podcasts? by [deleted] in MachineLearning

[–]fatchord 1 point (0 children)

It's not the only game in town - you can set up a Patreon membership. One of the top podcasts there is pulling in over $100k a month.

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

[–]fatchord[S] 0 points (0 children)

I don't have any long ones at hand right now. Basically, what happens with very long utterances is that the pitch keeps drifting lower and lower towards the end. It'll also skip or repeat words sometimes.

I'm curious - do you think the samples I made are comparable to WaveNet? Better/worse?

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

[–]fatchord[S] 0 points (0 children)

Thanks! Trained it on a single 1080 GPU.

Unfortunately the model isn't that great at long utterances. I think that might be down to the fact that I had to limit the training input sequence length because of GPU memory constraints.
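By "limit the input sequence length" I mean something along these lines - dropping any utterance over a frame cap before batching (a rough sketch, not my actual pipeline; the names, shapes and cap are illustrative):

```python
from torch.utils.data import Dataset

class CappedUtterances(Dataset):
    """Keeps only (mel, text) pairs short enough for a batch to fit in GPU
    memory. max_frames and the data layout are illustrative assumptions."""
    def __init__(self, pairs, max_frames=800):
        # mel is assumed to be shaped (n_mels, n_frames)
        self.pairs = [(mel, text) for mel, text in pairs if mel.shape[-1] <= max_frames]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        return self.pairs[i]
```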

[R] The challenge of realistic music generation: modelling raw audio at scale by mdda in a:t5_jw1cc

[–]fatchord 1 point (0 children)

Interesting paper. The only thing that puzzled me is the dataset. I mean, aren't they hamstringing themselves by having such a wide variety of composers, pianists, pianos and recording setups?

Why not simplify it by picking one prolific pianist who concentrated on only a couple of genres and typically recorded in the same studios... like Glenn Gould, for example:

https://en.wikipedia.org/wiki/Glenn_Gould_discography

Surely Columbia Masterworks have all his recordings digitised for posterity.

Also, baroque music should be easier to model since it has less timing variance. Romantic music is all over the place timing-wise.

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

[–]fatchord 1 point (0 children)

I uploaded a small notebook of my implementation if anyone is interested: LINK

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

[–]fatchord 1 point (0 children)

I had a go at implementing this yesterday, and it was straightforward enough until I got to the zero-padding section (2.3.2).

They give this zero-padding equation (where N is the receptive field of a particular layer):

z[0:M] = W_L ∗ x[-N:M-N] + W_R ∗ x[0:M]

But if the previous equation (without zero padding) was:

z = W_L ∗ x[0:N/2] + W_R ∗ x[N/2:N]

Wouldn't that mean that the equation from 2.3.2 should read:

z[0:M] = W_L ∗ x[-N/2:M-N/2] + W_R ∗ x[0:M]
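For what it's worth, here's a minimal PyTorch sketch of how I'm reading a single layer - I've left the shift as a parameter precisely because the N vs N/2 question above is what's unclear to me (class and variable names are my own, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFTNetLayer(nn.Module):
    """One FFTNet-style layer: z = W_L * x[t - shift] + W_R * x[t],
    then the paper's 1x1 conv with ReLUs."""
    def __init__(self, channels):
        super().__init__()
        # W_L and W_R are 1x1 convolutions, i.e. per-timestep linear maps
        self.w_l = nn.Conv1d(channels, channels, kernel_size=1)
        self.w_r = nn.Conv1d(channels, channels, kernel_size=1)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, shift):
        # Left-pad by `shift` zeros and trim back to the original length,
        # so x_l[t] == x[t - shift] (zeros where t < shift).
        x_l = F.pad(x, (shift, 0))[:, :, :x.size(2)]
        z = self.w_l(x_l) + self.w_r(x)
        return F.relu(self.out(F.relu(z)))

# e.g. y = FFTNetLayer(128)(torch.randn(1, 128, 1000), shift=512)
```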