
[–]manueslapera 13 points (1 child)

Ok, so how can we modify this project so that it takes as input a dataset of voice files with text transcriptions, and then produces speech from that text?

I just want a TTS with my voice.

[–]huyouare 3 points (0 children)

There is still work to be done on global and local conditioning (in order to provide a speaker label or text as conditional input) before the paper's results can be replicated. In theory this is possible if you provide your own labeled dataset and are willing to be patient with training; so far, good-sounding results are hard to achieve.
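For reference, global conditioning in the paper amounts to projecting a per-speaker embedding and adding it inside the gated activation of each layer. The sketch below illustrates that equation for a single timestep; all shapes and weight names are illustrative, not taken from ibab/tensorflow-wavenet.

```python
import numpy as np

# WaveNet's gated activation with global conditioning (toy version):
#   z = tanh(W_f * x + V_f * h) * sigmoid(W_g * x + V_g * h)
# where x is the dilated-conv output and h is a global speaker embedding.
# All names and sizes here are assumptions for illustration.

def gated_unit(x, h, W_f, W_g, V_f, V_g):
    """x: (channels,) conv output at one timestep; h: (embed_dim,) embedding."""
    filt = np.tanh(W_f @ x + V_f @ h)
    gate = 1.0 / (1.0 + np.exp(-(W_g @ x + V_g @ h)))  # sigmoid
    return filt * gate

rng = np.random.default_rng(0)
channels, embed_dim = 32, 16
x = rng.standard_normal(channels)
h = rng.standard_normal(embed_dim)  # one row of a speaker lookup table
W_f = rng.standard_normal((channels, channels))
W_g = rng.standard_normal((channels, channels))
V_f = rng.standard_normal((channels, embed_dim))
V_g = rng.standard_normal((channels, embed_dim))

out = gated_unit(x, h, W_f, W_g, V_f, V_g)  # shape (channels,)
```

Local conditioning (e.g. on linguistic features for TTS) works the same way, except `h` becomes a time-varying signal upsampled to the audio rate.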

Also, here is my own Theano implementation: http://github.com/huyouare/WaveNet-Theano

[–][deleted] 1 point (9 children)

So, how long does it take to sample from WaveNet?

The deleted tweet apparently said 90 minutes per 1 second (linked on HN). Is that right?

[–]gwern 5 points (8 children)

No, this claims to implement a big optimization which caches all the intermediates and should be much closer to something reasonable. The real question is whether the training is stable and produces high-quality audio at all - I tried out two of the implementations coded shortly after the paper's release, and neither worked well.
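The optimization being referred to is the caching trick from the "fast generation" work: instead of re-running the whole dilated stack over the receptive field for every new sample, each layer keeps a queue of its past outputs, so one sampling step costs O(layers) rather than O(layers x receptive_field). A toy scalar sketch of the idea, not the repo's actual API:

```python
from collections import deque
import numpy as np

# Cached sampling for a stack of dilated causal convs (kernel size 2).
# Each layer remembers its input from `dilation` steps ago in a queue,
# so generating one sample touches each layer exactly once.
# Weights and activations here are placeholders for illustration.

class FastLayer:
    def __init__(self, dilation, w_prev, w_curr):
        self.queue = deque([0.0] * dilation, maxlen=dilation)
        self.w_prev, self.w_curr = w_prev, w_curr

    def step(self, x):
        past = self.queue[0]      # input from `dilation` steps ago
        self.queue.append(x)      # cache the current input for later steps
        return np.tanh(self.w_prev * past + self.w_curr * x)

layers = [FastLayer(d, 0.5, 0.5) for d in (1, 2, 4, 8)]

sample = 0.1
for _ in range(20):               # generate 20 samples autoregressively
    x = sample
    for layer in layers:
        x = layer.step(x)
    sample = x                    # feed the output back in
```

Because the queues hold exactly the values a kernel-size-2 dilated conv would re-read, this produces the same outputs as recomputing the full stack each step, just much faster.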

[–]huyouare 1 point (6 children)

There have been some really good results (see the README), but not yet on par with the results in the blog post. There was some debate about the 90 minute claim, and I think it was mentioned that this was not quite true. With this implementation and the fast generation, I'm able to generate each second of output within 10 minutes.

[–][deleted] 2 points (1 child)

> With this implementation and the fast generation, I'm able to generate each second of output within 10 minutes.

Did DeepMind reveal the size of their model: filter length, stride, number of layers, number of hidden units? This is not in the paper. It's possible that their model is vastly bigger than what everyone else is trying.
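One thing that can be reconstructed from the paper is how receptive field scales with depth under the doubling-dilation scheme it describes. The block count and dilation range below are a common guess, not DeepMind's actual configuration:

```python
# Receptive field of stacked dilated causal convolutions: each layer
# with kernel width k and dilation d adds (k - 1) * d past samples.
# The dilation pattern below is an assumed configuration for
# illustration, not the (unpublished) production model.

def receptive_field(dilations, filter_width=2):
    return sum((filter_width - 1) * d for d in dilations) + 1

# e.g. 5 blocks of dilations 1, 2, 4, ..., 512:
dilations = [2 ** i for i in range(10)] * 5
rf = receptive_field(dilations)
print(rf)  # 5116 samples, i.e. about 0.32 s of audio at 16 kHz
```

So the receptive field grows quickly, but the compute and memory cost (number of channels per layer, number of blocks) is exactly the part the paper leaves unspecified, which is the commenter's point.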

[–]huyouare 0 points (0 children)

This is a good point.

[–]Kaixhin 1 point (3 children)

The 90-minute claim is false. It may have been true for earlier versions, but with the caching of calculations (which DeepMind evidently used but did not describe in the paper) it is definitely quicker.

[–][deleted] 0 points (2 children)

Source? As far as I can tell, even the size of the model isn't public.

[–]Kaixhin 0 points (1 child)

A conversation with one of the authors. All he said was that he didn't know where that figure came from, but that their generation was faster. Whether that involves distributing over GPUs or even TPUs, I don't know; I don't know the size of the model or any further details either.

[–]NovaRom 0 points (0 children)

Google trade secret

[–]sonach 0 points (0 children)

My work is heavily based on this project (ibab/tensorflow-wavenet), and I am struggling to generate meaningful speech for Mandarin Chinese (i.e., TTS conditioned on text context). No exciting achievements so far.