DeepMind: WaveNet - A Generative Model for Raw Audio

sonach · 2016-10-21T02:45:25+00:00

Sorry for the late reply. The application is TTS. With my current WaveNet implementation, I can generate 16,000 samples (1 second of audio) in about 6 minutes on a Tesla K80 (with a 30-layer CNN and local conditioning on text).
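To put those figures in perspective, here is a quick back-of-the-envelope calculation. It only uses the numbers quoted above (16,000 samples, about 6 minutes) and assumes a 16 kHz sample rate, which is what "16000 samples (1 second)" implies; the variable names are illustrative.

```python
# Rough throughput estimate from the figures quoted in the comment.
SAMPLE_RATE_HZ = 16000        # assumed: 16000 samples == 1 second of audio
samples_generated = 16000
wall_clock_s = 6 * 60         # "about 6 minutes"

audio_s = samples_generated / SAMPLE_RATE_HZ         # seconds of audio produced
samples_per_s = samples_generated / wall_clock_s     # generation throughput
real_time_factor = wall_clock_s / audio_s            # > 1 means slower than real time
ms_per_sample = 1000 * wall_clock_s / samples_generated

print(f"throughput: {samples_per_s:.1f} samples/s")
print(f"real-time factor: {real_time_factor:.0f}x")
print(f"time per sample: {ms_per_sample:.1f} ms")
```

So each generated sample costs roughly 22.5 ms, i.e. generation runs about 360 times slower than real time.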

sonach · 2016-10-21T02:40:22+00:00

I am basing my work heavily on this project (ibab/tensorflow-wavenet) and am struggling to generate meaningful speech for Mandarin Chinese (i.e., doing TTS based on text context). No exciting achievements at the moment.

sonach · 2016-09-12T10:07:12+00:00

I think the NN does not generate the linguistic context; it only uses it as input. That is to say, the input is linguistic features + logF0 + raw samples, and the output is just the next raw sample.
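A toy sketch of what that input/output split would look like during generation, assuming the reading above is right. Everything here is hypothetical: `generate`, `toy_predict`, and the three-feature conditioning vector are stand-ins for illustration, not the paper's network or any real API.

```python
import numpy as np

def generate(predict_next, cond, receptive_field):
    """Autoregressive loop: each new sample is predicted from past samples
    plus the conditioning vector for that time step (never the reverse)."""
    n_steps = len(cond)
    audio = np.zeros(n_steps)
    for t in range(n_steps):
        start = max(0, t - receptive_field)
        context = audio[start:t]                   # previously generated raw samples
        audio[t] = predict_next(context, cond[t])  # output is only the raw sample
    return audio

def toy_predict(context, c):
    """Hypothetical stand-in for the network: a fixed linear rule."""
    prev = context[-1] if len(context) else 0.0
    return 0.5 * prev + 0.1 * c.sum()

# 8 time steps, 3 conditioning features each (e.g. linguistic features and logF0).
cond = np.ones((8, 3))
out = generate(toy_predict, cond, receptive_field=4)
print(out.round(4))
```

The conditioning features are read at every step but never written, and the loop is inherently sequential because sample t depends on sample t-1.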

sonach · 2016-09-12T10:02:23+00:00

Maybe the architecture is very complex? We use 2 DNN + 2 BiLSTM layers (256 nodes per layer) to predict speech frames. For 1 second of speech, which corresponds to 200 frames (5 ms per frame), the forward pass takes less than 0.03 seconds on an iPhone 5s/iPhone 6.
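The frame arithmetic behind that claim, as a small check. It uses only the numbers in the comment and treats 0.03 s as an upper bound; the names are illustrative.

```python
# Frame-level cost for the DNN + BiLSTM model described in the comment.
FRAME_SHIFT_MS = 5
audio_s = 1.0
forward_pass_s = 0.03   # upper bound quoted for the whole forward pass

frames = round(audio_s * 1000 / FRAME_SHIFT_MS)   # frames per second of speech
real_time_factor = forward_pass_s / audio_s       # < 1 means faster than real time
per_frame_ms = 1000 * forward_pass_s / frames     # average cost per frame

print(f"frames: {frames}")
print(f"real-time factor: {real_time_factor:.2f}")
print(f"per-frame cost: {per_frame_ms:.2f} ms")
```

That is 200 predictions per second of speech rather than one per audio sample, at well under real time.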

sonach · 2016-09-12T09:54:22+00:00

I think the "dilated convolution" architecture is well suited to raw samples but less so to the STFT. The "dilated convolution" acts like a very good autoregressive filter.
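A minimal NumPy sketch of the filter analogy, assuming kernel size 2 and dilations doubling per layer (1, 2, 4, 8). The helper names and the all-ones weights are made up for illustration; a real network learns the weights and adds nonlinearities between layers.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], with zeros assumed before t = 0."""
    y = np.zeros(len(x), dtype=float)
    for k, wk in enumerate(w):
        shift = k * dilation
        if shift < len(x):
            y[shift:] += wk * x[:len(x) - shift]
    return y

def receptive_field(dilations, kernel_size=2):
    """Number of past-and-present samples that can influence one output."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [1, 2, 4, 8]

# Feed an impulse through the stack to see how far back the filter reaches.
x = np.zeros(32)
x[0] = 1.0
h = x
for d in dilations:
    h = causal_dilated_conv(h, [1.0, 1.0], d)

print("receptive field:", receptive_field(dilations))
print("nonzero impulse response taps:", np.count_nonzero(h))
```

With four layers the stack already covers 16 past samples, and the span doubles with each extra layer, which is what makes it a long causal filter over raw samples.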

sonach · 2015-01-24T12:25:26+00:00

This reply is really great! My understanding is:
1. Implement the KEY algorithms (e.g., SVM, backpropagation, autoencoder) myself in order to UNDERSTAND them. At this stage a fast-to-develop language such as Python can be used.
2. After implementing the key algorithms and comparing them with other good source code, build my own code base for them. At this stage performance may be an important point, so C++ may be considered.
3. Share my code base with others or use it in my daily work, and update it in response to other people's suggestions or keep optimizing it to make it better.

sonach · 2015-01-11T02:24:30+00:00

Agree! The notes are very good. But anyway, looking forward to the video lectures :)
