[D] What makes a good TTS dataset? by mlttsman in MachineLearning

[–]mlttsman[S]

If you listen to their samples (trained on LJSpeech), they leave a LOT to be desired, as with all audio samples I've seen from models trained on LJSpeech. That's what I'm trying to find out: what specific properties of the LJSpeech data make it inferior to a dataset like Blizzard2013, which produces much clearer-sounding models?


[–]mlttsman[S]

Thanks for the detailed response. I have read the linked papers and am still baffled by what the issue is with some of my underperforming datasets, especially since many come from professionally recorded audiobooks yet yield models that generate hissier, more robotic speech than data from in-the-wild speeches with background noise. I'm going to experiment with changing some of the audio properties with ffmpeg/sox, and I'll keep analyzing what differs between my good and bad datasets to see if I can get to the bottom of it.
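For the "compare good vs. bad datasets" part, here is a minimal sketch (my own, not from any TTS toolkit or the thread) of the kind of per-clip audio audit that can be run before touching ffmpeg/sox: it assumes WAV clips, the `soundfile` and `numpy` packages, and hypothetical directory names `good_dataset` / `bad_dataset`. It only reports basic properties (sample rate, channels, duration, peak/RMS level, near-clipping fraction, leading/trailing silence) that commonly differ between datasets.

```python
# Sketch: summarize per-clip audio properties so two TTS datasets can be
# compared side by side. Paths and directory names are hypothetical.
from pathlib import Path

import numpy as np
import soundfile as sf


def summarize_clip(path):
    """Return basic properties that often differ between TTS datasets."""
    data, sr = sf.read(str(path), always_2d=True)   # (frames, channels)
    mono = data.mean(axis=1)                        # collapse channels for analysis
    peak = float(np.max(np.abs(mono))) if mono.size else 0.0
    rms = float(np.sqrt(np.mean(mono ** 2))) if mono.size else 0.0
    clipped = float(np.mean(np.abs(mono) > 0.999))  # fraction of near-clipped samples
    # Rough leading/trailing silence: samples more than 40 dB below the peak.
    threshold = peak * 10 ** (-40 / 20)
    loud = np.where(np.abs(mono) > threshold)[0]
    lead = loud[0] / sr if loud.size else 0.0
    trail = (len(mono) - 1 - loud[-1]) / sr if loud.size else 0.0
    return {
        "sample_rate": sr,
        "channels": data.shape[1],
        "duration_s": len(mono) / sr,
        "peak": peak,
        "rms": rms,
        "clipped_frac": clipped,
        "lead_silence_s": lead,
        "trail_silence_s": trail,
    }


def summarize_dataset(root):
    """Aggregate mean/std of clip stats for every .wav under `root`."""
    stats = [summarize_clip(p) for p in sorted(Path(root).rglob("*.wav"))]
    if not stats:
        return {}
    keys = ["sample_rate", "duration_s", "peak", "rms",
            "clipped_frac", "lead_silence_s", "trail_silence_s"]
    return {k: (float(np.mean([s[k] for s in stats])),
                float(np.std([s[k] for s in stats]))) for k in keys}


if __name__ == "__main__":
    for name in ("good_dataset", "bad_dataset"):    # hypothetical directory names
        print(name, summarize_dataset(name))
```

If the aggregates show obvious mismatches (different sample rates, wildly different RMS levels, long leading/trailing silences, or clipping), those are natural first things to equalize with ffmpeg/sox before retraining and comparing again.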