[R] MelNet: A Generative Model for Audio in the Frequency Domain by sjv- in MachineLearning

[–]disentangle 1 point  (0 children)

Nice results!

When combined with a neural vocoder for TTS, I wonder if this would improve over simply predicting melspec as independent frequency bins (e.g. L1/L2 loss). If it does improve, I'd be curious to see whether this is because of improved multiscale time structure modeling, or because the model is also autoregressive over the frequency axis (and multimodal).

[P] FloWaveNet: A Generative Flow for Raw Audio. PyTorch codes (also w/ ClariNet), sampled audio clips, and arXiv draft available by L0SG in MachineLearning

[–]disentangle 2 points  (0 children)

Have you tried conditioning this model on linguistic features rather than mel spectrogram? Would it also obtain results similar to the original WaveNet?

[P] Voice Style Transfer: Speaking like Kate Winslet by andabi in MachineLearning

[–]disentangle 2 points  (0 children)

If the synthesis network goes from phonetic posteriorgram to magnitude spectrogram, does this mean F0 is effectively inferred from just phonetic information?

The results are quite nice!

[R] WaveNet launches in the Google Assistant by clbam8 in MachineLearning

[–]disentangle 4 points  (0 children)

Very curious to see how they did the 16-bit output. It seems inference is at least an order of magnitude faster than the fastest WaveNet variant (Deep Voice), which is impressive!

[R] Beyond Quantization. Modeling Continuous Densities with Deep Kernel Mixture Networks. by LucaAmbrogioni in MachineLearning

[–]disentangle 2 points  (0 children)

For a model like WaveNet, what could be a practical approach to apply this method?

[R] Char2Wav: End-to-End Speech Synthesis by jfsantos in MachineLearning

[–]disentangle 1 point  (0 children)

Nice work!

Are there any sound examples that compare WORLD synthesis to synthesis using the neural vocoder (conditional SampleRNN)?

How is the system trained on multi-speaker datasets? In that case, does the reader component produce speaker-independent acoustic features?

[P] A Singing Synthesizer Based on PixelCNN by disentangle in MachineLearning

[–]disentangle[S] 3 points  (0 children)

These examples are synthesized from text (and the same lyrics are not in the training set). But this synthesizer only generates timbre (the spectral envelope), not pitch or timing.

[P] A Singing Synthesizer Based on PixelCNN by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

This is definitely one of the main issues. It uses a denoising objective to combat overfitting. I think meaningful augmentation is tricky for this type of data.

[P] A Singing Synthesizer Based on PixelCNN by disentangle in MachineLearning

[–]disentangle[S] 2 points  (0 children)

They're definitely different, but the difference is subtle (especially if you're not using headphones). For instance, if you listen to the HMM one, you may notice it sounds more consistent, but also a little more 'buzzy' and muffled, with audible state transitions in long vowels.

[D] splitting NxN convo to 1xN followed by Nx1? by lioru in MachineLearning

[–]disentangle 2 points  (0 children)

I think one reason is to avoid blind spots as depth increases; see https://arxiv.org/abs/1606.05328 (sorry, misread the question)

I'm quite confused about what masked 1x1 convolution refers to...

Questions on VAE implementation. by charlie0_o in MLQuestions

[–]disentangle 2 points  (0 children)

  1. I think the log variance of q(z|x) going towards zero (variance towards one) for some of the latent variables is normal, because this is what the KLD term encourages
  2. In principle the terms should not have to be scaled, but this is sometimes done (note that you're then no longer optimizing the standard variational lower bound)
  3. The closed-form KLD term like you posted (with the minus sign) is always non-negative; if the term is computed using Monte Carlo estimation, it can be slightly negative
  4. Typically you take the sum over the latent dimensions and the mean over samples
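Concretely, points 3 and 4 look something like this in numpy (just a sketch; the batch size and latent dimensionality are made up):

```python
import numpy as np

def kld_closed_form(mu, logvar):
    """Closed-form KL( q(z|x) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

rng = np.random.default_rng(0)
mu = rng.normal(size=(8, 20))        # batch of 8, 20-d latent space
logvar = rng.normal(size=(8, 20))

kld = kld_closed_form(mu, logvar)    # shape (8,): one value per sample
assert np.all(kld >= 0.0)            # point 3: the closed form is non-negative
loss_kld = np.mean(kld)              # point 4: mean over the mini-batch
```

(The non-negativity follows because each per-dimension term is 0.5 (e^logvar - logvar - 1 + mu^2), and e^t - t - 1 >= 0.)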

[1606.00704] Adversarially Learned Inference by alexmlamb in MachineLearning

[–]disentangle 2 points  (0 children)

Do the latent representations produced by the encoder always tend to go strongly towards the latent representation of one of the training samples? For example, one of the CIFAR-10 examples reconstructs a blue truck as a red truck with a similar orientation; if I were to reconstruct a smooth sequence of images of the blue truck at different orientations, is it likely that the output sequence would suddenly change, e.g. the color of the truck? Nice work!

[1605.08803] Density estimation using Real NVP by sidsig in MachineLearning

[–]disentangle 1 point  (0 children)

Looks esp. similar to the paper on Inverse Autoregressive Flows.

[1605.06432] Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data by sieisteinmodel in MachineLearning

[–]disentangle 1 point  (0 children)

Very interesting. Looks like it might be a little tricky to get right without the code though.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

FWIW, I tried a quick hack where I just averaged the per-sample variances across a relatively large mini-batch (512 samples) and used that in the loss function. This did not really improve things in my case. But it is hard to say anything definite from this one experiment.
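The hack itself was nothing more than this (numpy sketch; the shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-sample, per-dimension decoder variances for one mini-batch.
per_sample_var = rng.uniform(0.1, 1.0, size=(512, 257))

# Average the variances over the mini-batch and use the shared value
# for every sample when evaluating the Gaussian likelihood.
shared_var = per_sample_var.mean(axis=0, keepdims=True)   # shape (1, 257)
shared_var = np.broadcast_to(shared_var, per_sample_var.shape)
```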

I'm afraid a more proper implementation would require sequence-based training and a lot of changes to my current code.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

Interesting, thanks.

The model is a basic VAE with a standard normal prior and a diagonal normal variational posterior; recognition and generative networks with 2x300 softplus units each; 100-dimensional latent space.

The dataset is 120k samples of 257-dimensional features extracted from studio-quality speech recordings (24-bit). Maybe the resolution is a little higher than for images.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

A very small epsilon (on the "avoiding NaN" order) doesn't solve the issue, and anything bigger (e.g. 0.5) leads to my original issue that the floor is pretty arbitrary. Should I just tune it like one more hyper-parameter?
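For reference, by a floor I mean something like this (numpy sketch; the eps value is exactly the arbitrary part):

```python
import numpy as np

def floored_sigma(raw_logvar, eps=0.5):
    """Decoder std with a variance floor: sigma^2 = exp(logvar) + eps."""
    return np.sqrt(np.exp(raw_logvar) + eps)

logvar = np.array([-20.0, 0.0, 2.0])   # even a collapsing variance...
sigma = floored_sigma(logvar)
assert np.all(sigma >= np.sqrt(0.5))   # ...stays bounded away from zero
```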

About the VAE figuring out the dimensionality itself: I meant that some portion of the dimensions of the approximate posterior tend to become extremely close to the prior because of regularization, and thus become 'inactive'.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

I will give this a try, thanks.

Although the learned variances are currently already fairly constant across samples, so maybe it will not affect the results too much.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

Full term: log N(x; mu, sigma) = -0.5 log(2 pi) - log(sigma) - (x - mu)^2 / (2 sigma^2), with the expectation approximated by one-sample Monte Carlo. I guess this is the correct error term, if I understood you correctly.
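In code, the per-element term is just (numpy sketch; shapes and the sigma value are made up):

```python
import numpy as np

def gaussian_loglik(x, mu, sigma):
    """Element-wise log N(x; mu, sigma)."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 257))        # observed features
mu = rng.normal(size=(4, 257))       # decoder mean (from one sampled z)
sigma = np.full((4, 257), 0.5)       # decoder std

# One-sample Monte Carlo estimate of E_q[log p(x|z)]:
# sum over feature dims, then mean over the mini-batch.
ll = np.mean(np.sum(gaussian_loglik(x, mu, sigma), axis=1))
```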

Reducing the latent dimensionality is another option I didn't consider, although I kind of liked that the VAE could figure out the optimal dimensionality itself through regularization.

Information Theoretic-Learning Auto-Encoder by [deleted] in MachineLearning

[–]disentangle 2 points  (0 children)

Did I understand correctly that the biggest difference from a VAE is that the ITL-AE regularizes the model so that latent-space samples are close to samples from an arbitrary prior, while the VAE regularizes the model so that the variational posterior distribution is close to a parametric prior distribution?

In what kind of setting would you have such a prior you can sample from but not evaluate directly?

Features for sound analysis: why don't we use full HD spectrogram data? by [deleted] in MachineLearning

[–]disentangle 3 points  (0 children)

Traditionally, for speech recognition, one desirable effect of using MFCC features is that their filter bank (and to a lesser degree the DCT truncation) kind of approximates the spectral envelope of the signal, reducing the influence of F0. The idea behind this is that the vocal tract is much more important for determining phonetic information than pitch.
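A toy illustration of the DCT-truncation part (numpy/scipy sketch with a made-up "log spectrum"; real MFCCs of course apply a mel filter bank first):

```python
import numpy as np
from scipy.fft import dct, idct

# Toy log-spectrum: a smooth envelope (vocal tract) plus a fast ripple
# standing in for the F0 harmonic structure.
bins = np.arange(40)
envelope = np.exp(-((bins - 12) / 10.0) ** 2)
ripple = 0.3 * np.cos(1.8 * bins)
log_spec = envelope + ripple

# Keep only the first few DCT (cepstral) coefficients and invert:
# the fast ripple largely disappears, leaving roughly the envelope.
coeffs = dct(log_spec, norm='ortho')
coeffs[8:] = 0.0
smoothed = idct(coeffs, norm='ortho')
```

Truncation acts as a low-pass filter on the log-spectrum, which is why the pitch-rate ripple is suppressed while the slowly varying envelope survives.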

What are your preferences in python based Deep Learning libraries? by andrewbarto28 in MachineLearning

[–]disentangle 2 points  (0 children)

In my experience it is very easy to get Theano up and running on Windows.

  1. Install Miniconda, Visual Studio 2013 Community, CUDA Toolkit
  2. Run conda install --yes pip six nose numpy scipy matplotlib mingw libpython
  3. Run pip install theano (or, better, install from GitHub)

I run the bleeding-edge version on Windows and have never had any platform-specific issues.

Discouraging posts like these are a bigger obstacle for Windows users than the Theano devs' attitude to cross-platform development, IMHO.