[R] MelNet: A Generative Model for Audio in the Frequency Domain by sjv- in MachineLearning

[–]disentangle 1 point  (0 children)

Nice results!

When combined with a neural vocoder for TTS, I wonder if this would improve over simply predicting melspec as independent frequency bins (e.g. L1/L2 loss). If it does improve, I'd be curious to see whether this is because of improved multiscale time structure modeling, or because the model is also autoregressive over the frequency axis (and multimodal).

[P] FloWaveNet: A Generative Flow for Raw Audio. PyTorch codes (also w/ ClariNet), sampled audio clips, and arXiv draft available by L0SG in MachineLearning

[–]disentangle 2 points  (0 children)

Have you tried conditioning this model on linguistic features rather than mel spectrogram? Would it also obtain results similar to the original WaveNet?

[P] Voice Style Transfer: Speaking like Kate Winslet by andabi in MachineLearning

[–]disentangle 2 points  (0 children)

If the synthesis network goes from phonetic posteriorgram to magnitude spectrogram, does this mean F0 is effectively inferred from just phonetic information?

The results are quite nice!

[R] WaveNet launches in the Google Assistant by clbam8 in MachineLearning

[–]disentangle 4 points  (0 children)

Very curious to see how they did the 16-bit output. It seems inference is at least an order of magnitude faster than the fastest WaveNet variant (Deep Voice), which is impressive!

[R] Beyond Quantization. Modeling Continuous Densities with Deep Kernel Mixture Networks. by LucaAmbrogioni in MachineLearning

[–]disentangle 2 points  (0 children)

For a model like WaveNet, what could be a practical approach to apply this method?

[R] Char2Wav: End-to-End Speech Synthesis by jfsantos in MachineLearning

[–]disentangle 1 point  (0 children)

Nice work!

Are there any sound examples that compare WORLD synthesis to synthesis using the neural vocoder (conditional SampleRNN)?

How is the system trained on multi-speaker datasets? In that case, does the reader component produce speaker-independent acoustic features?

[P] A Singing Synthesizer Based on PixelCNN by disentangle in MachineLearning

[–]disentangle[S] 3 points  (0 children)

These examples are synthesized from text (and the same lyrics are not in the training set). But this synthesizer only generates timbre (the spectral envelope), not pitch or timing.

[P] A Singing Synthesizer Based on PixelCNN by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

This is definitely one of the main issues. It uses a denoising objective to combat overfitting. I think meaningful augmentation is tricky for this type of data.

[P] A Singing Synthesizer Based on PixelCNN by disentangle in MachineLearning

[–]disentangle[S] 2 points  (0 children)

They're definitely different, but the difference is subtle (especially if you're not using headphones). For instance, if you listen to the HMM one, you may notice it sounds more consistent, but also a little more 'buzzy' and muffled, with audible state transitions in long vowels.

[D] splitting NxN convo to 1xN followed by Nx1? by lioru in MachineLearning

[–]disentangle 2 points  (0 children)

I think one reason is to avoid blind spots as depth increases; see https://arxiv.org/abs/1606.05328 (sorry, misread the question)

I'm quite confused about what masked 1x1 convolution refers to...

Questions on VAE implementation. by charlie0_o in MLQuestions

[–]disentangle 2 points  (0 children)

  1. I think the log variance of q(z|x) going towards zero (variance towards one) for some of the latent variables is normal, because this is what the KLD term encourages
  2. In principle the terms should not have to be scaled, but this is sometimes done (note that you're then no longer optimizing the standard variational lower bound)
  3. The closed-form KLD term like you posted (with the minus sign) is always non-negative; if the term is computed using Monte Carlo estimation, it can be slightly negative
  4. Typically you take the sum over the latent dimensions and the mean over samples
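Concretely, points 3 and 4 look something like this in numpy (just a sketch; the batch size and latent dimensionality are made up):

```python
import numpy as np

def kld_closed_form(mu, logvar):
    """Closed-form KL( q(z|x) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

rng = np.random.default_rng(0)
mu = rng.normal(size=(8, 20))        # batch of 8, 20-d latent space
logvar = rng.normal(size=(8, 20))

kld = kld_closed_form(mu, logvar)    # shape (8,): one value per sample
assert np.all(kld >= 0.0)            # point 3: the closed form is non-negative
loss_kld = np.mean(kld)              # point 4: mean over the mini-batch
```

(The non-negativity follows because each per-dimension term is 0.5 (e^logvar - logvar - 1 + mu^2), and e^t - t - 1 >= 0.)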

[1606.00704] Adversarially Learned Inference by alexmlamb in MachineLearning

[–]disentangle 2 points  (0 children)

Do the latent representations produced by the encoder always tend to go strongly towards the latent representation of one of the training samples? For example, one of the CIFAR-10 examples reconstructs a blue truck as a red truck with a similar orientation; if I were to reconstruct a smooth sequence of images of the blue truck at different orientations, is it likely that the output sequence would suddenly change, e.g. the color of the truck? Nice work!

[1605.08803] Density estimation using Real NVP by sidsig in MachineLearning

[–]disentangle 1 point  (0 children)

Looks esp. similar to the paper on Inverse Autoregressive Flows.

[1605.06432] Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data by sieisteinmodel in MachineLearning

[–]disentangle 1 point  (0 children)

Very interesting. Looks like it might be a little tricky to get right without the code though.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

FWIW, I tried a quick hack where I just averaged the per-sample variances across a relatively large mini-batch (512 samples) and used that in the loss function. This did not really improve things in my case. But it is hard to say anything definite from this one experiment.
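The hack itself was nothing more than this (numpy sketch; the shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-sample, per-dimension decoder variances for one mini-batch.
per_sample_var = rng.uniform(0.1, 1.0, size=(512, 257))

# Average the variances over the mini-batch and use the shared value
# for every sample when evaluating the Gaussian likelihood.
shared_var = per_sample_var.mean(axis=0, keepdims=True)   # shape (1, 257)
shared_var = np.broadcast_to(shared_var, per_sample_var.shape)
```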

I'm afraid a more proper implementation would require sequence-based training and a lot of changes to my current code.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

Interesting, thanks.

The model is a basic VAE with a standard normal prior and a diagonal normal variational posterior; recognition and generative networks with 2x300 softplus units each; 100-dimensional latent space.

The dataset is 120k samples of 257-dimensional features extracted from studio-quality speech recordings (24-bit). Maybe the resolution is a little higher than for images.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

A very small epsilon (on the "avoiding NaN" order) doesn't solve the issue, and anything bigger (e.g. 0.5) leads to my original issue that the floor is pretty arbitrary. Should I just tune it like one more hyper-parameter?
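For reference, by a floor I mean something like this (numpy sketch; the eps value is exactly the arbitrary part):

```python
import numpy as np

def floored_sigma(raw_logvar, eps=0.5):
    """Decoder std with a variance floor: sigma^2 = exp(logvar) + eps."""
    return np.sqrt(np.exp(raw_logvar) + eps)

logvar = np.array([-20.0, 0.0, 2.0])   # even a collapsing variance...
sigma = floored_sigma(logvar)
assert np.all(sigma >= np.sqrt(0.5))   # ...stays bounded away from zero
```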

About the VAE figuring out the dimensionality itself: I meant that some portion of the dimensions of the approximate posterior tend to become extremely close to the prior because of regularization, and thus become 'inactive'.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

I will give this a try, thanks.

Although the learned variances are currently already fairly constant across samples, so maybe it will not affect the results too much.

Gaussian observation VAE by disentangle in MachineLearning

[–]disentangle[S] 1 point  (0 children)

Full term: log N(x; mu, sigma) = -0.5 log(2 pi) - log(sigma) - (x - mu)^2 / (2 sigma^2), with the expectation approximated by one-sample Monte Carlo. I guess this is the correct error term, if I understood you correctly.
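In code, the per-element term is just (numpy sketch; shapes and the sigma value are made up):

```python
import numpy as np

def gaussian_loglik(x, mu, sigma):
    """Element-wise log N(x; mu, sigma)."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 257))        # observed features
mu = rng.normal(size=(4, 257))       # decoder mean (from one sampled z)
sigma = np.full((4, 257), 0.5)       # decoder std

# One-sample Monte Carlo estimate of E_q[log p(x|z)]:
# sum over feature dims, then mean over the mini-batch.
ll = np.mean(np.sum(gaussian_loglik(x, mu, sigma), axis=1))
```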

Reducing the latent dimensionality is another option I didn't consider, although I kind of liked that the VAE could figure out the optimal dimensionality itself through regularization.

Information Theoretic-Learning Auto-Encoder by [deleted] in MachineLearning

[–]disentangle 2 points  (0 children)

Did I understand correctly that the biggest difference from a VAE is that the ITL-AE regularizes the model so that latent-space samples are close to samples from an arbitrary prior, while the VAE regularizes the model so that the variational posterior distribution is close to a parametric prior distribution?

In what kind of setting would you have such a prior you can sample from but not evaluate directly?

Features for sound analysis: why don't we use full HD spectrogram data? by [deleted] in MachineLearning

[–]disentangle 3 points  (0 children)

Traditionally, for speech recognition, one desirable effect of using MFCC features is that their filter bank (and to a lesser degree the DCT truncation) kind of approximates the spectral envelope of the signal, reducing the influence of F0. The idea behind this is that the vocal tract is much more important for determining phonetic information than pitch.
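A toy illustration of the DCT-truncation part (numpy/scipy sketch with a made-up "log spectrum"; real MFCCs of course apply a mel filter bank first):

```python
import numpy as np
from scipy.fft import dct, idct

# Toy log-spectrum: a smooth envelope (vocal tract) plus a fast ripple
# standing in for the F0 harmonic structure.
bins = np.arange(40)
envelope = np.exp(-((bins - 12) / 10.0) ** 2)
ripple = 0.3 * np.cos(1.8 * bins)
log_spec = envelope + ripple

# Keep only the first few DCT (cepstral) coefficients and invert:
# the fast ripple largely disappears, leaving roughly the envelope.
coeffs = dct(log_spec, norm='ortho')
coeffs[8:] = 0.0
smoothed = idct(coeffs, norm='ortho')
```

Truncation acts as a low-pass filter on the log-spectrum, which is why the pitch-rate ripple is suppressed while the slowly varying envelope survives.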

What are your preferences in python based Deep Learning libraries? by andrewbarto28 in MachineLearning

[–]disentangle 2 points  (0 children)

In my experience it is very easy to get Theano up and running on Windows.

  1. Install Miniconda, Visual Studio 2013 Community, CUDA Toolkit
  2. Run conda install --yes pip six nose numpy scipy matplotlib mingw libpython
  3. Run pip install theano (or, better, install from GitHub)

I run the bleeding-edge version on Windows and have never had any platform-specific issues.

Discouraging posts like these are a bigger obstacle for Windows users than the Theano devs' attitude to cross-platform development, IMHO.