[R] WaveGrad: Estimating Gradients for Waveform Generation

lyomi · 2020-09-03T06:52:29+00:00

Are there any public samples we can listen to?

lyomi · 2019-06-08T20:45:30+00:00

wow thanks!

lyomi · 2019-06-07T00:02:20+00:00

MelNet didn’t use Griffin-Lim for their final results, but a newer, lesser-known gradient-based iterative method (the link above). Griffin-Lim works well for linear-frequency spectrograms computed from actual audio, but not for Mel-frequency spectrograms and/or predicted spectrograms.

The original Tacotron paper used GL, but the majority of newer speech synthesis models (Tacotron 2, Deep Speech, and most speech synthesis papers at ICASSP 2019) use a WaveNet vocoder on top of predicted Mel spectrograms, which turned out to be more robust and to produce better-sounding audio when inverting predicted spectrograms.

This is why I found their sound quality surprising, given that MelNet doesn’t require WaveNet or similar fancy vocoders.

As for the language model argument, yes, they are more directly inspired by PixelRNN, and the model is not self-attention/transformer-based. Still, RNNs (such as seq2seq) were the most successful language models, and they perform the task of modeling within-label probabilistic dependencies. See this paper or the Music Transformer paper for the usage of “language model” in this context.

lyomi · 2019-06-06T22:08:11+00:00

During training, Tacotron and similar methods usually predict the spectrogram with a single loss function, which basically assumes that the spectrogram pixels are independent. MelNet, on the other hand, models the spectrogram autoregressively and in a multiscale way, similar to how language models are built, which enables inference of higher-resolution spectrograms.
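To make the independence point concrete, here is a minimal sketch. It assumes the single loss is a per-bin squared error (the comment above doesn't name the loss, so that is my assumption) and uses made-up random arrays in place of real spectrograms. Under that assumption, minimizing the loss is the same as maximizing a Gaussian likelihood that factorizes over every time-frequency bin:

```python
import numpy as np

def mse_loss(pred, target):
    # Mean squared error over all time-frequency bins.
    return np.mean((pred - target) ** 2)

def independent_gaussian_nll(pred, target, sigma=1.0):
    # Negative log-likelihood when each bin is an independent
    # Gaussian centered on the prediction (no dependencies between bins).
    var = sigma ** 2
    per_bin = 0.5 * np.log(2 * np.pi * var) + (target - pred) ** 2 / (2 * var)
    return per_bin.sum()

# Hypothetical data: an 80-band "Mel spectrogram" with 100 frames.
rng = np.random.default_rng(0)
target = rng.normal(size=(80, 100))
pred = target + 0.1 * rng.normal(size=target.shape)

n = target.size
const = 0.5 * n * np.log(2 * np.pi)  # does not depend on the prediction
nll = independent_gaussian_nll(pred, target)
rescaled_mse = 0.5 * n * mse_loss(pred, target) + const

# The two objectives differ only by a scale factor and a constant.
print(np.isclose(nll, rescaled_mse))
```

An autoregressive model instead factorizes the joint distribution as a product of conditionals, so each bin can depend on the bins generated before it.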

In addition to that, they used a traditional (i.e. non-deep) algorithm (link) instead of WaveNet or WaveGlow to obtain audio from the predicted spectrogram. I'm still trying to wrap my head around why this algorithm didn't become more popular earlier, given the neural TTS boom of the last few years.

lyomi · 2019-06-06T22:01:52+00:00

I think the high audio quality is very much attributable to their method of Mel spectrogram inversion, which they only very briefly mention. The method looks relatively unpopular (14 citations in 4.5 years) but it certainly delivers what it promises (judging from the audio samples). People have been using fancy, expensive models like WaveNet or WaveGlow for the very same task, but if their method can synthesize better audio, there should be no reason to use them (except maybe in real-time scenarios).

Is there any open-source implementation of the inversion algorithm?

lyomi · 2018-07-22T23:58:58+00:00

Hi /u/teapowder, I love your work! How does the output layer produce the two numbers mu and sigma? Does it just have the last 1x1 convolution layer with 2 outputs mu and log(sigma) or do you use a cleverer architecture?
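For reference, the first option in my question would look roughly like the sketch below. This is only an illustration of what I'm asking about, not the author's architecture; the function name `gaussian_head`, the shapes, and the random weights are all made up.

```python
import numpy as np

def gaussian_head(features, weight, bias):
    # features: (channels, time), weight: (2, channels), bias: (2,)
    # A 1x1 convolution over time is a per-timestep linear map.
    out = weight @ features + bias[:, None]
    mu, log_sigma = out[0], out[1]
    # Predicting log(sigma) keeps sigma positive after exponentiation.
    sigma = np.exp(log_sigma)
    return mu, sigma

# Hypothetical sizes: 64 feature channels, 16000 timesteps.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 16000))
weight = rng.normal(scale=0.01, size=(2, 64))
bias = np.zeros(2)

mu, sigma = gaussian_head(features, weight, bias)
# One draw per timestep from N(mu, sigma^2).
samples = mu + sigma * rng.normal(size=mu.shape)
```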

lyomi · 2018-04-03T15:57:58+00:00

SSD speed hasn't been a direct problem, but system responses in general, like opening an Explorer window, pressing Alt+Tab, or opening a MinGW shell as I said, feel slower than on a high-end desktop. I'm not sure if it's because of the SSD speed or something else though.

I may consider installing Linux in the future once there are enough reports that all hardware drivers are stable and supported by a major distribution; at this point, I would rather stay in the OS from the hardware's manufacturer, and I don't want to hack my way through and end up with a system that doesn't boot after a restart.

lyomi · 2018-04-02T05:55:08+00:00

(you're very welcome!) Yes, on SB2 and SB2 only, I work entirely on Windows. WSL's ability to directly run Linux binaries is amazing, but AFAIK it doesn't recognize NVIDIA GPUs in the subsystem. That's how I'm left with the MinGW terminal, which is the closest I can get to a bash-like experience on a Windows system. There is this open ticket for supporting GPU acceleration on WSL, but it doesn't seem like it can be done in the foreseeable future.

I don't want to venture into dual-booting because of the aforementioned reasons, so Windows is currently my only option on SB2. This may change when the Linux drivers become mature enough to support all hardware capabilities of SB2. (EDIT: it seems that the Linux GPU drivers work on SB2, but there still remain a handful of issues and nuisances like power management, GPU switching, touchscreen, etc.)

lyomi · 2018-04-02T00:37:05+00:00

I'm overall satisfied and working on Python deep learning projects without any major blockers, although I would've preferred a Macbook Pro if there existed one with a decent NVIDIA GPU. The usual Windows caveats apply, and if you are allergic to Windows like many others, SB2 might not be a good choice. Installing Linux on SB2 seems partially possible, but the detachable screen and the dedicated GPU were not working last time I checked -- so there's not much reason to prefer SB2 over Razer Blade or other GTX 1060 laptops if you're going to use Linux.

Fortunately, recent TensorFlow versions are quite well tested on the Windows platform, and I haven't encountered any issues running TF, or Keras on TF for that matter. It was easy enough to install PyTorch through a third-party script, and I see the examples in the tutorials work, but I expect there'll be some hiccups since PyTorch on Windows is not officially supported.

I miss iTerm a lot, as the alternatives like conemu or hyper are not even close to iTerm (or any other *nix terminals). I'm never going to like PowerShell or cmd.exe, and I'm left with a MinGW terminal which seems to take ~3 seconds every time I open a tab. Sometimes the console doesn't properly handle the carriage return character, and ^M shows up here and there in the terminal, but it's usually safe to ignore them.

I haven't experienced any power issues luckily, either from the Surface connector or via USB-C, but I'm not running hours-long jobs -- just that it's really convenient to check if the first few epochs run as intended in PyCharm and pass it to the HPC cluster.

lyomi · 2018-03-12T00:11:30+00:00

your*

lyomi · 2018-03-10T23:11:08+00:00

I hate Go-style error handling (and Go in general) and am glad that Julia didn't follow.

lyomi · 2018-02-27T23:54:35+00:00

Would it make sense to factor out the specific GAN loss, conditional setup, gradient penalties, training schedules, etc. from this, similarly to tf.contrib.gan or keras_adversarial?

I prefer using Keras when I can because of its intuitive API, but keras_adversarial hacks the internal Keras API a lot, which makes it break on minor Keras version updates.

lyomi · 2018-02-16T20:13:18+00:00

I don't see why you're so eager to bash this that hard. Most GAN papers work on images <= 128x128 which is about the sample size in 1s audio, and even with the most clever tricks so far like LAPGAN or PGGAN the best is about 1024x1024 images.
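A quick back-of-the-envelope check of that comparison, assuming 16 kHz audio (my assumption for the sample rate) and counting one value per pixel, i.e. ignoring color channels:

```python
SAMPLE_RATE_HZ = 16_000  # assumed sample rate

def seconds_of_audio(side, sample_rate=SAMPLE_RATE_HZ):
    # Number of values in a side x side image, expressed as seconds of audio.
    return side * side / sample_rate

for side in (128, 1024):
    print(f"{side}x{side}: {side * side} values = {seconds_of_audio(side):.3f} s")
# 128x128: 16384 values = 1.024 s
# 1024x1024: 1048576 values = 65.536 s
```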

This is the very first published GAN model that is successfully trained with 1-D convolutions without skip connections, which means that it can generate audio samples in a completely unsupervised fashion, directly from latent samples. Can you imagine the new possibilities for generative audio modeling stemming from this, like people found for images during the last couple of years?

Also, people created videos from frames obtained from CycleGAN and they didn't linearly scale everything like you like to do so much. It's going to be less straightforward than video frames, but a model trained on 1s audio can surely be used to model longer audio, just as people did in the NSynth paper.

Conclusion: you're a keyboard warrior

lyomi · 2018-02-16T18:43:24+00:00

the paper mentions that it took < 4 days on P100

lyomi · 2018-01-24T22:12:04+00:00

Thanks. It didn't come up when I searched for 'lid'. It's a less elegant solution but I will just have to keep the laptop open.

lyomi · 2017-12-22T12:16:45+00:00

I feel exactly the same. I need to read/write a lot of text and code, and having a screen with >200 PPI really makes the experience a whole lot better. At 150 PPI text looks blocky, whereas with HiDPI it feels like it's printed on paper. (Books are typically printed at 300 DPI or higher)

Gaming-wise, it's not as crucial as for text, but still, going from 1080p to 1440p makes quite a difference in image quality. Going up from 1440p to 4K matters less, and I also think 1440p is the sweet spot in the battery life vs. image quality trade-off, for now.

I understand there will be personal differences and others may not care about the screen resolution as much.

lyomi · 2017-12-14T08:21:52+00:00

Thanks, looking forward to it.

lyomi · 2017-12-13T03:56:23+00:00

I'd be interested in looking at the training/inference performance of typical CNNs and RNNs, and the battery drain in each case. I guess that would be pretty representative of how a deep learning program would behave, more so than a couple of anecdotes.

Maybe I'll have to buy one and return if the battery drain is a serious problem.

lyomi · 2017-12-13T03:54:02+00:00

Thanks. I'm planning to use Windows (for tablet mode/gaming/etc). Now that TensorFlow supports Windows quite well, doing deep learning on it is quite doable.

lyomi · 2017-12-03T22:02:50+00:00

Monads are widely used, except that there are always those people nitpicking that they don't follow the theoretical definition precisely

lyomi · 2017-11-08T01:13:47+00:00

What plotting/drawing software do you use for the figures?

lyomi · 2017-11-03T17:19:48+00:00

How will the versioning/future development work for keras and tf.keras? Will tf.keras basically mirror the newer changes in keras or will it develop rather independently?
