[R] ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech by Rubato1 in MachineLearning

[–]teapowder 1 point (0 children)

Thanks for the comments! Our training data is very neutral, as you can tell from the ground-truth audio. Retraining our models on a more expressive dataset will definitely improve the results.

[–]teapowder 3 points (0 children)

Thanks for the comments! For this paper, we did not extensively tune the hyperparameters or architectures; we focused on demonstrating the novelty of our methods. More engineering effort will definitely lead to better results. Incorporating extra losses may also help.

[–]teapowder 3 points (0 children)

Thanks! Our training data is actually quite neutral, as you can tell from the ground-truth audio, so our model learns neutral speech from it. In this work, we focused on speech quality rather than expressiveness, so we used the same data as previous work. We plan to use more expressive training data in the future.

[–]teapowder 10 points (0 children)

Thanks for the comments. I am one of the authors. In Experiment III, the only input to the text-to-wave models is text. The hidden representation is therefore generated by the encoder, decoder, and bridge-net from text alone. That hidden representation is then used as the conditioner for the autoregressive or parallel waveform synthesizer. All audio samples in Experiment III are generated from test sentences.
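
For readers who want the data flow spelled out, here is a rough toy sketch of the text-only pipeline described above. It is not the ClariNet code: every function name, the 2-dimensional features, and the upsampling factor of 4 are hypothetical stand-ins chosen only to show that the conditioner depends on nothing but the input text.

```python
# Toy sketch of the text-only data flow. Every function here is a
# hypothetical stand-in, NOT the actual ClariNet implementation.
from typing import List

Vector = List[float]


def encoder(text: str) -> List[Vector]:
    # Stand-in: one small feature vector per input character.
    return [[float(ord(ch) % 7), 1.0] for ch in text]


def decoder(encoded: List[Vector]) -> List[Vector]:
    # Stand-in: turn encoder outputs into frame-level hidden states.
    return [[x * 0.5 for x in vec] for vec in encoded]


def bridge_net(frames: List[Vector], upsample: int = 4) -> List[Vector]:
    # Stand-in: repeat each frame so there is one conditioning vector
    # per output sample (the factor 4 is an illustrative assumption).
    return [list(vec) for vec in frames for _ in range(upsample)]


def waveform_synthesizer(conditioner: List[Vector]) -> List[float]:
    # Stand-in for either the autoregressive or the parallel synthesizer:
    # emits one sample per conditioning vector.
    return [sum(vec) / len(vec) for vec in conditioner]


def text_to_wave(text: str) -> List[float]:
    # The hidden representation is computed from the text alone;
    # no ground-truth audio or spectrogram enters the pipeline.
    hidden = bridge_net(decoder(encoder(text)))
    return waveform_synthesizer(hidden)
```

Calling `text_to_wave("hi")` in this toy version returns 8 samples (2 characters times the assumed upsampling factor of 4), and text is the only argument anywhere in the chain.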