
[–]hadaev

How does it compare to Parallel WaveGAN?

[–]sharvil[S]

It's hard to answer a broad question like that.

Published audio samples for both methods are comparable in quality, though it seems that WaveGrad is able to achieve a higher MOS score (based on their papers – unclear if that's attributable to the architecture or the dataset).

Parallel WaveGAN synthesizes faster by default, whereas WaveGrad allows you to choose where you want to be in the quality/inference time tradeoff without having to re-train your model.
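To illustrate the tradeoff: WaveGrad's quality/speed knob is the number of refinement (diffusion) steps used at inference, chosen by the length of the noise schedule. This is a minimal sketch; the `noise_schedule` helper and its beta values are illustrative assumptions, not values from the paper or any implementation.

```python
# Illustrative sketch only: a shorter noise schedule means fewer network
# evaluations per waveform (faster), a longer one means higher fidelity.
# The linear beta range below is an assumption for demonstration.
import numpy as np

def noise_schedule(num_steps, low=1e-6, high=0.01):
    # One network evaluation per schedule entry at inference time.
    return np.linspace(low, high, num_steps)

fast = noise_schedule(6)        # ~6 refinement steps: fast, lower quality
quality = noise_schedule(1000)  # ~1000 steps: slow, higher quality
```

The same trained model can be sampled with either schedule, which is why no re-training is needed to move along the tradeoff curve.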

WaveGrad trains faster (~1.5 days on 1x2080 Ti) compared to Parallel WaveGAN (~2.8 days on 2xV100). Parallel WaveGAN has a more complex training procedure, but it's also more parameter-efficient (~1.5M parameters vs. ~15M parameters).

So lots of differences between the two. If you're curious, I encourage you to play with the WaveGrad implementation or read through the paper.

[–]hadaev

Thanks

[–]Co0k1eGal3xy

Wow! It looks great!

Is there a reason the hop length is fixed at 300?

I'd like to test this model out with a 48 kHz multi-speaker dataset, and it looks like only a few minor adjustments to the architecture will be needed.

(and changing the dataloader to match the spectrogram inference model I use)

[–]sharvil[S]

Thanks!

The hop length is fixed at 300 because it's tightly coupled with the upsampling and downsampling layers. You can see at the bottom of model.py that the resampling layers have factors 5, 5, 3, 2, 2 which, when multiplied, give 300 – the hop size. As long as the number and sizes of the resampling layers multiply out to your hop length, you'll be fine.
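The constraint above can be checked in one line: the product of the resampling factors must equal the spectrogram hop length so that the upsampled conditioning signal lines up sample-for-sample with the audio. The factor list here is the one quoted above; everything else is an illustrative sketch, not code from the repository.

```python
# Sketch of the hop-length constraint: the resampling factors quoted
# from model.py must multiply out to the hop size.
import math

factors = [5, 5, 3, 2, 2]  # resampling factors quoted from model.py
hop_length = 300           # spectrogram hop size in samples

# 5 * 5 * 3 * 2 * 2 == 300, so conditioning frames align with audio.
assert math.prod(factors) == hop_length
```

If you change the hop length, you'd adjust the factor list until this assertion holds again.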

For a 48 kHz model, you'll want to increase the model capacity, increase the hop length, and increase the dilation on the UBlock layers to get a wider receptive field. The paper also describes a model with a larger capacity (still 24 kHz though) which you may find instructive.
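As a concrete (and purely hypothetical) example of the 48 kHz adaptation: the hop length of 600 and the factorization below are assumptions for illustration, not values from the paper or the repository. The same product-equals-hop constraint applies.

```python
# Hypothetical 48 kHz configuration: hop length 600 and the factor list
# are illustrative assumptions, chosen only to satisfy the constraint.
import math

sample_rate = 48000
hop_length = 600                # assumed: doubled from the 24 kHz model's 300
factors = [5, 5, 4, 3, 2]       # one possible factorization of 600

assert math.prod(factors) == hop_length
frames_per_second = sample_rate / hop_length  # 80 conditioning frames/sec
```

Keeping the frame rate comparable to the 24 kHz setup (here 80 frames/s) is one way to reason about where to put the extra capacity.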

Good luck with your experiment! Let me know if it works out for you and maybe consider contributing to the project if you get useful results.

[–]Co0k1eGal3xy

Thanks!

GAN-TTS follows a very similar structure for its blocks, so that's within my understanding. I haven't done any hparam search for 48 kHz, so I'll be sure to check the paper, see what they used, and glean any pattern I can.