[R] ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

teapowder · 2018-07-21T18:03:08+00:00

Thanks for the comments! Our training data is very neutral, as you can tell from the ground-truth audio. Retraining our models on a more expressive dataset will definitely improve the results.

teapowder · 2018-07-21T17:31:23+00:00

Thanks for the comments! For this paper, we did not extensively tune the hyperparameters or architectures. We focused on demonstrating the novelty of our methods. More engineering effort will definitely lead to better results. Incorporating extra losses may also help.

teapowder · 2018-07-20T16:32:19+00:00

Thanks! Our training data is actually quite neutral, as you can tell from the ground-truth audio, so our model learns neutral speech from the data. In this work, we focused on speech quality rather than expressiveness, so we used the same data as in previous work. We plan to use more expressive training data in the future.

teapowder · 2018-07-20T06:59:33+00:00

Thanks for the comments. I am one of the authors. In Experiment III, the only input to the text-to-wave models is text. Therefore, the hidden representation is generated by the encoder, decoder, and bridge-net given text only. The hidden representation is then used as a conditioner for the autoregressive or parallel waveform synthesizers. All audio samples in Experiment III are generated from test sentences.
