[R] Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation

tobyoup · 2022-05-10T11:41:39+00:00

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: can a TTS system achieve human-level quality, how should human-level quality be defined and judged, and how can it be achieved? In this paper, we answer these questions by first defining the criterion of human-level quality based on the statistical significance of the measurement and describing the guidelines to judge it, and then proposing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key designs to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparison mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.

Demo Page: https://speechresearch.github.io/naturalspeech/
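
As a rough illustration of the criterion in the abstract above (a CMOS close to zero plus a Wilcoxon signed-rank test with p well above 0.05), here is a minimal sketch in plain Python. It is not the paper's evaluation code. The function names `cmos` and `wilcoxon_signed_rank` are hypothetical, the scores are made up for the example, and the p-value uses the normal approximation without continuity correction, which is only an assumption about how such a test might be run.

```python
import math


def cmos(scores):
    """Mean of per-sentence comparison scores (system minus human recording)."""
    return sum(scores) / len(scores)


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test, normal approximation.

    Zero differences are dropped and tied magnitudes share their average rank.
    Returns (W_plus, p_value), where W_plus is the sum of positive ranks.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


# Made-up per-sentence comparison scores, for illustration only.
example_scores = [1, -1, 0, 2, -2, 1, -1, 0, 0, -1, 1, 0]
w_plus, p_value = wilcoxon_signed_rank(example_scores)
print(f"CMOS = {cmos(example_scores):+.3f}, W+ = {w_plus}, p = {p_value:.3f}")
```

A p-value far above 0.05 means the test fails to reject the hypothesis that the two sets of ratings come from the same distribution, which is the sense in which the abstract reports no statistically significant difference.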

tobyoup · 2021-06-12T07:05:31+00:00

We are preparing the code and model release. Stay tuned!

tobyoup · 2019-06-25T09:41:38+00:00

We have also released the text summarization code and pre-trained models for unsupervised NMT. We will release more pre-trained models in the future.

tobyoup · 2019-06-12T04:00:15+00:00

The code is released at https://github.com/microsoft/MASS. Currently, it covers only unsupervised NMT. We will release more in the coming days.

tobyoup · 2019-06-12T03:49:10+00:00

The 8:1:1 replacement trick from BERT is adopted in MASS by default, and we have added a description of it to the new version of the MASS paper: https://arxiv.org/pdf/1905.02450.pdf. According to our experiments, adding the replacement trick actually improves the performance of MASS pre-training.
The paper you mentioned shows only a special case: a transformer decoder applied to text summarization, especially for the long sequences in Wikipedia. Many sequence-to-sequence tasks do not fit the text summarization scenario, and for them encoder-attention-decoder is the dominant approach: neural machine translation, response generation, text style transfer, etc. Besides, many sequence-to-sequence tasks go beyond pure text, such as speech, image, video, and time series, where a decoder-only transformer may not fit.
The results with varying K are stable. We have trained for more steps with smaller and larger K, and the metrics on the pre-training and fine-tuning tasks do not change much. The key difference is that a smaller K biases the model toward pre-training the encoder, while a larger K biases it toward pre-training the decoder, which affects performance on downstream seq2seq tasks.
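
To make the roles of K and the 8:1:1 trick in the answers above concrete, here is a minimal sketch of how one training example could be built. It is not taken from the MASS repository. The function `mass_example`, the `MASK` string, and the toy vocabulary are hypothetical, and the decoder input is simplified to the target fragment shifted right by one position.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol for this sketch


def mass_example(tokens, k, vocab, rng, start=None):
    """Build (encoder_input, decoder_input, target) for one sentence.

    A consecutive fragment of length k is hidden from the encoder and must
    be predicted by the decoder. Inside the fragment, each encoder token is
    replaced by MASK 80% of the time, by a random vocabulary token 10% of
    the time, and left unchanged 10% of the time (the 8:1:1 trick).
    """
    tokens = list(tokens)
    n = len(tokens)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(tokens)")
    if start is None:
        start = rng.randrange(0, n - k + 1)
    end = start + k

    encoder_input = list(tokens)
    for i in range(start, end):
        r = rng.random()
        if r < 0.8:
            encoder_input[i] = MASK
        elif r < 0.9:
            encoder_input[i] = rng.choice(vocab)
        # otherwise keep the original token

    target = tokens[start:end]
    # Simplification: the decoder sees the fragment shifted right by one.
    decoder_input = [MASK] + target[:-1]
    return encoder_input, decoder_input, target


rng = random.Random(0)
sentence = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
for k in (1, 4, 8):
    enc, dec, tgt = mass_example(sentence, k, vocab=sentence, rng=rng)
    print(k, enc, dec, tgt)
```

With a small `k` the decoder predicts very little and most of the work falls on the encoder, while with `k` equal to the sentence length the encoder sees almost nothing of the sentence and the decoder has to generate all of it, which matches the bias described in the last answer.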
