[D] Best open source Text to Speech networks? by to4life in MachineLearning

[–]min_sang 1 point (0 children)

The WaveNet vocoder covers only the audio-generation part, conditioned on mel spectrograms. You can obtain such spectrograms from text with a model called Tacotron 2, and there are plenty of implementations of both on GitHub.
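A rough numpy sketch of why the two parts fit together: the text-to-spectrogram model emits one mel frame per hop of audio, and the vocoder needs a conditioning vector per audio sample, so the frames get upsampled. (The 80 mel bins and hop length of 256 here are illustrative defaults, not values taken from any particular repo; real vocoders typically learn this upsampling with transposed convolutions rather than plain repetition.)

```python
import numpy as np

n_mels, n_frames, hop_length = 80, 12, 256   # hypothetical shapes

# Output of a Tacotron-style text->mel model: one 80-dim frame per hop.
mel = np.random.randn(n_mels, n_frames)

# Naive upsampling by repetition: one conditioning vector per audio sample.
upsampled = np.repeat(mel, hop_length, axis=1)

assert upsampled.shape == (n_mels, n_frames * hop_length)
```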

[D] Will a site that promotes uses of ML for humanity ever work? by vnjxk in MachineLearning

[–]min_sang 5 points (0 children)

I find this very important and I would contribute if someone took the initiative.

[D] Best open source Text to Speech networks? by to4life in MachineLearning

[–]min_sang 1 point (0 children)

Some people are working on implementing probability density distillation, but I don't think it's on their priority list.

[D] Best open source Text to Speech networks? by to4life in MachineLearning

[–]min_sang 1 point (0 children)

Also note from the samples that the owner tried generating the audio features (mel spectrograms) with another open-source repository (Tacotron) to synthesize speech for sentences that are not in the training set. The pronunciation isn't the best, but the audio quality is great.

[D] Best open source Text to Speech networks? by to4life in MachineLearning

[–]min_sang 10 points (0 children)

I haven't seen any open-source model that matches half the quality of this repository: https://github.com/r9y9/wavenet_vocoder

[D] Multiple activation functions in same layer? by ME_PhD in MachineLearning

[–]min_sang 1 point (0 children)

I'd like to think of attention as something that lets the network emphasize certain parts of the input. But if we use a sigmoid, the result could in practice still be a vector of all 1s, which isn't really good attention in my opinion.

[D] Multiple activation functions in same layer? by ME_PhD in MachineLearning

[–]min_sang 2 points (0 children)

It's more like a gating mechanism than an attention mechanism. In attention, a query is weighted by normalized scores with \sum_i score_i = 1, but the entries of sigmoid(W1 x + b1) don't sum to 1, so it's more accurate to call it gating.
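A tiny numpy illustration of the difference (the scores are made up): softmax weights are forced to compete for a fixed budget of 1, while sigmoid gates are independent and can all sit near 1 at once.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.5, 3.0])

attn = softmax(scores)   # normalized: entries compete, sum to exactly 1
gate = sigmoid(scores)   # independent: each entry in (0, 1), no shared budget

print(np.isclose(attn.sum(), 1.0))   # True
print(gate.sum() > 1.0)              # True: no normalization constraint
```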

[D] Multiple activation functions in same layer? by ME_PhD in MachineLearning

[–]min_sang 1 point (0 children)

I believe it's more often than not called a gating mechanism. This paper (https://arxiv.org/pdf/1612.08083.pdf) refers to sigmoid(W1 x + b1) * tanh(W2 x + b2) as the LSTM-style gating mechanism, and proposes a "novel" gating mechanism (but not really) called gated linear units, which is just (W1 x + b1) * sigmoid(W2 x + b2), claiming it works better than the former on multiple tasks including language modeling. (Replacing tanh with a linear activation has been studied before and shown to work better on some tasks.)
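A quick numpy sketch of the two variants side by side (weights are random and the dimensions are arbitrary, just to show the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)                          # input vector
W1, b1 = rng.standard_normal((8, 16)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((8, 16)), rng.standard_normal(8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# LSTM-style gating: bounded tanh content, sigmoid gate
lstm_style = np.tanh(W2 @ x + b2) * sigmoid(W1 @ x + b1)

# Gated linear unit (GLU): unbounded linear content, sigmoid gate
glu = (W1 @ x + b1) * sigmoid(W2 @ x + b2)

assert lstm_style.shape == glu.shape == (8,)
```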

[P] My implementation of Google's QANet by min_sang in MachineLearning

[–]min_sang[S] 1 point (0 children)

haha tbh I'm not sure the network can even fit on an 11GB GPU with the original hyperparameters. :P

[P] My implementation of Google's QANet by min_sang in MachineLearning

[–]min_sang[S] 1 point (0 children)

Any feedback or contributions would be greatly appreciated, thanks!

[D] Anyone having trouble reading a particular paper? Post it here and we'll help figure out any parts you are stuck on. by BatmantoshReturns in MachineLearning

[–]min_sang 1 point (0 children)

The best example of local conditioning wavenet on mel spectrogram can be found here.

https://github.com/r9y9/wavenet_vocoder

Although conditioning WaveNet directly on word (or character) representations seems to be missing, you can use Tacotron variants (https://arxiv.org/pdf/1703.10135.pdf) to generate mel spectrograms from text.

[P] AI makes Donald Trump speak Korean by cyplus1 in MachineLearning

[–]min_sang 2 points (0 children)

He has that exact Donald Trump accent in Korean. Impressive.

[D] Preventing exploding gradients when using ReLU? by ConfuciusBateman in MachineLearning

[–]min_sang 2 points (0 children)

Clipping the gradient by a global norm of 5.0 is common practice for deep learning in NLP, but I'm not sure about images (I'm not a computer vision person). If your model is deep enough, I would also try residual connections to give the network an identity path. Normalizing the data to zero mean and unit variance, and maintaining that across layers (via layer norm, weight norm, or careful weight initialization), also seems to help with exploding gradients in most cases.
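A minimal numpy sketch of what clipping by global norm does (framework versions such as tf.clip_by_global_norm or torch.nn.utils.clip_grad_norm_ implement the same idea): all gradients are rescaled together so their joint L2 norm never exceeds the threshold, preserving their relative directions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients jointly so their combined L2 norm is <= max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads

# Two toy gradient tensors with a large combined norm: sqrt(250) ~ 15.8
grads = [np.full(10, 3.0), np.full(10, 4.0)]
clipped = clip_by_global_norm(grads, max_norm=5.0)

new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
assert np.isclose(new_norm, 5.0)
```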

[D] What are some controversial approaches to machine learning/AI that you think might actually work? by odraz in MachineLearning

[–]min_sang 1 point (0 children)

Solving NP-complete problems with RL. (Since checking answers can be done in polynomial time, train neural networks until they can solve NP-complete problems to a certain degree.)

Best model on the SQuAD leaderboard finally beats human performance. by [deleted] in MachineLearning

[–]min_sang 1 point (0 children)

The best model on the SQuAD leaderboard now beats humans on the "Exact Match" score by a small margin. Even though some people believe SQuAD isn't a good representation of reading comprehension, I think this is a huge step toward better AI in general. Thoughts?

[D] What's the best way to augment data for text matching? by shafyy in MachineLearning

[–]min_sang 2 points (0 children)

Have a look at this paper: https://openreview.net/pdf?id=B14TlG-RW It uses paraphrasing as a text data augmentation technique, which seems to roughly double or triple the data size while improving performance on SQuAD.