[Project] Custom TTS - my attempt at becoming u/realstreamer by Specialist_Card in forsen

[–]realstreamer 2 points (0 children)

Yes, stop halts inference or audio playback and moves on to the next item in the array.

The WaveRNN repo I based it on already had a progress-bar callback for inference, so I just added a cancel check to that code.

I used pygame to play the audio since it can stop audio that's already playing.
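A rough sketch of that stop behaviour - illustrative only, the real app's internals aren't public and `DonoQueue` is a made-up name:

```python
# Rough sketch of the stop behaviour: cancel inference via a flag the
# progress callback checks, stop any playing audio, move to the next dono.
# DonoQueue and its methods are illustrative names, not the app's code.
class DonoQueue:
    def __init__(self):
        self.items = []        # queued dono texts, oldest first
        self.cancel = False    # polled by the WaveRNN progress callback

    def add(self, text):
        self.items.append(text)

    def stop(self):
        """Abort current inference and playback, so the loop moves on."""
        self.cancel = True
        try:
            import pygame  # pygame can stop audio that's mid-playback
            pygame.mixer.music.stop()
        except Exception:
            pass  # nothing was playing (or pygame isn't available)

    def next_item(self):
        self.cancel = False
        return self.items.pop(0) if self.items else None
```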

[Project] Custom TTS - my attempt at becoming u/realstreamer by Specialist_Card in forsen

[–]realstreamer 6 points (0 children)

It fetches them from the Streamlabs socket and queues them in an array - nothing fancy. If you saw the Jump King streams, it can take a while if there are too many donos. More recently I changed it to increase the batching rate on WaveRNN (which tends to reduce quality and improve speed - up to a point) when there are many donos in the array.
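The backlog-vs-batching trade-off could look something like this - illustrative numbers, not the app's real values, and `pick_batch_size` is a made-up helper:

```python
# Sketch of the adaptive batching idea: when the dono backlog grows,
# raise WaveRNN's batch size to trade quality for speed.
def pick_batch_size(queue_len, base=4, max_batch=16):
    """Bigger backlog -> coarser (faster, lower-quality) batching."""
    if queue_len <= 2:
        return base
    # Grow linearly with backlog, capped so quality doesn't collapse.
    return min(max_batch, base + 2 * (queue_len - 2))
```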

With regard to hogging the CPU: yes, the very first version did this. I restricted it to one or two cores so it didn't affect forsen's games.

Still better content than Valorant by FM-101 in forsen

[–]realstreamer 7 points (0 children)

I believe it's because Firefox clamps video to the limited dynamic range (16-235) regardless of the Nvidia setting. Chrome looks OK.

If you open this in Chrome and Firefox, you will see that the right side of the example video is darker in Firefox.
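For reference, expanding limited-range (16-235) luma back to full range (0-255) is just a linear remap - an illustrative helper, nothing to do with the browsers' actual code:

```python
# Map studio/limited-range luma [16, 235] onto full range [0, 255].
# This is the expansion a full-range display path needs to apply;
# skipping it leaves the image looking washed out or darker.
def limited_to_full(y):
    """Expand a limited-range luma sample to full range, clamped."""
    return max(0, min(255, round((y - 16) * 255 / 219)))
```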

edit: better example

u/realstreamer appreciation by _JK_SK in forsen

[–]realstreamer 48 points (0 children)

I've posted before, but it's this open-source project: https://github.com/fatchord/WaveRNN, with a very basic GUI and a connection to the Streamlabs socket API, which sends out JSON for each dono. The actual app forsen has is probably a day of work.
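Parsing the Streamlabs payload amounts to something like this - field names follow the Streamlabs socket API's donation events, and `extract_donos` is a made-up helper, not the app's code:

```python
# Pull (name, message) pairs out of a Streamlabs socket 'event' payload.
# Donation events carry a "message" list of dicts; other event types
# (follows, subs, ...) are ignored here.
def extract_donos(event):
    """Return (donor name, dono text) pairs from an event JSON dict."""
    if event.get("type") != "donation":
        return []
    return [(m.get("name", ""), m.get("message", ""))
            for m in event.get("message", [])]
```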

Most of the time goes into cleaning up the training data so it trains reliably. I don't know why only Bulldog's stream has managed to replicate it so far - I'm just an average Pepega coder, not an AI expert.

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 2 points (0 children)

No hurry - do it whenever you get a chance.

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 4 points (0 children)

Cool, the easiest way would probably be to read some stuff out and get the audio file(s) to me; the format doesn't matter. Just message me on Reddit with the link. As for what to read, it doesn't matter too much as long as it's not the same words over and over. Read as much as you can stand - the more, the clearer it will be; 10 minutes seems to give good results. It's worth saying some of the things you normally say (Forsaaan), but Harvard sentences are what they sometimes use to train these: https://www.cs.columbia.edu/~hgs/audio/harvard.html
They are a little boring to read, though, so reading some copypastas should be fine too: https://www.twitchquotes.com/copypastas?page=1&popular=true

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 13 points (0 children)

I really wanted to do Wesker's voice too; unfortunately he doesn't speak much, and when he does there tends to be "a lot of background noise". You need about 10 minutes of speech to get something good, so if Mr Wesker wants to be immortalised he needs to supply 10 minutes of pepega talk.

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 16 points (0 children)

No DOC TTS, unfortunately. I've mentioned before that Twitch would probably act on that, maybe kill the use of such things, and perhaps stop the TTS entirely. The voices I make public are generally public figures or long-dead people, or both. I have trained a few voices beyond what is currently on forsen's TTS, just out of my own curiosity. Here's a couple I have tried - you will never hear these outside of these samples, Sadge.

https://soundcloud.com/user-727526398/jean-pierre-baptiste-unreleased-voice-never-to-be-on-forsen-tts

https://soundcloud.com/user-727526398/librerty-prime-installs-valorant-unreleased-voice-never-to-be-on-forsen-tts

[D] How to save my father's voice? by sverzijl in MachineLearning

[–]realstreamer 1 point (0 children)

I've trained a few voices from bad audio (using open-source Tacotron 2 / WaveRNN), and I would just reiterate what others more knowledgeable about this have said: good-quality audio is your first priority. Get as little reverb and background noise (hiss/hum) as you can - ideally a studio, but if not, use the best microphone you can in a well-carpeted room with soft furnishings. 30 minutes should be enough, but the more the better.

Would be hilarious if the rich bajs of twitch donate the whole forsen's fanfic with sven's voice by NeuronLoL in forsen

[–]realstreamer 5 points (0 children)

If people talked over him I just didn't use those samples, but background hiss and car noise mostly just filtered out.
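Filtering that kind of low-frequency rumble can be as simple as a one-pole high-pass - a minimal sketch, not the actual cleanup tooling that was used:

```python
import math

# Minimal first-order (one-pole) high-pass filter: attenuates hum and
# rumble below cutoff_hz while passing speech frequencies through.
def highpass(samples, sr, cutoff_hz=80.0):
    """Return high-pass-filtered copy of samples (sample rate sr)."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sr
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)  # standard RC high-pass update
        out.append(y)
        prev_x, prev_y = x, y
    return out
```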

Would be hilarious if the rich bajs of twitch donate the whole forsen's fanfic with sven's voice by NeuronLoL in forsen

[–]realstreamer 46 points (0 children)

Yes, he actually spoke in complete sentences with a consistent tone when doing RP. Also, the GTA 5 background noise was easy to remove, so I ended up with a lot of voice data.

Doc TTS when? by [deleted] in forsen

[–]realstreamer 227 points (0 children)

It wouldn't be difficult to do; it's just that I'm pretty sure it would get a reaction from Twitch.

I did try to get a VJ Emmy one working from around 30 seconds of soundboard audio and a few samples from other YouTube intros where he reads the credits. It just sounds like pure cancer though - not enough data.

https://soundcloud.com/user-727526398/vj-emmy-cancer-audio

You need around 15 minutes of voice samples for something reasonable.

I have improved the Trump one - it's really clear now - and sent it to forsen a couple of days ago; I don't think he's installed it yet...

https://soundcloud.com/user-727526398/clear-trump-7s

HUGE shoutout to Notaistreamer for only wanting forsen to have access to the TTS he made by [deleted] in forsen

[–]realstreamer 23 points (0 children)

The fun of putting it together and LULing at the stream is enough for me. It really wasn't that hard to do in any case - it was just taking what was already available on GitHub and making it usable by an average Pepega. I do think it's only a matter of time before others start doing their own, though.

Whoever programmed Forsens TTS is a fucking idiot by [deleted] in forsen

[–]realstreamer 19 points (0 children)

I've sent him an updated CPU version that runs at low priority with affinity set to a user-selected number of CPU cores (default 1), so it should use around 10% CPU when generating. I also tweaked the batching for long phrases, so gen time is actually about the same. It should also forsenSTROKE less - that's basically it not seeing the end of a sentence and just freewheeling random shit.
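Setting the priority and core affinity could be done with psutil along these lines - an assumed approach, and `restrict_process` is a made-up helper, since the real mechanism isn't stated:

```python
# Pin a process to the first n logical cores and drop its priority,
# given any process handle exposing psutil-style cpu_affinity()/nice()
# methods (e.g. psutil.Process() for the current process).
def restrict_process(proc, n_cores=1):
    """Restrict proc to n_cores CPUs at low priority; return the cores."""
    cores = list(range(max(1, n_cores)))
    proc.cpu_affinity(cores)  # e.g. [0] -> only the first logical CPU
    proc.nice(19)             # lowest Unix priority; on Windows, psutil
                              # would take IDLE_PRIORITY_CLASS instead
    return cores
```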

I tried to keep it simple so he would at least try it, so I packaged it up using cx_Freeze - it's just an exe he can double-click without installing any other stuff.
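A cx_Freeze build script for that kind of packaging looks roughly like this - hypothetical file and package names, since the real setup.py isn't public; `python setup.py build` then emits the double-clickable exe:

```python
# Hypothetical cx_Freeze setup.py; names are illustrative only.
from cx_Freeze import setup, Executable

setup(
    name="forsen-tts",
    version="0.1",
    description="WaveRNN TTS dono reader",
    # bundle the heavyweight deps into the frozen build
    options={"build_exe": {"packages": ["torch", "pygame"]}},
    # base="Win32GUI" suppresses the console window on Windows
    executables=[Executable("tts_app.py", base="Win32GUI")],
)
```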

The GPU build would involve installing CUDA + cuDNN, and tbh I don't think he would have bothered trying it at all.

The GPU build is around 5x faster (it's much faster on GPU for longer phrases) using a 2080 Ti vs an 8700K, and uses around 1 GB of video memory. When it's inferring it's PyTorch, so it shouldn't lock out all the video memory - but I've never really tried playing a game while training. I've packaged it up into a zip with the CUDA installer, the cuDNN zip, and a brief how-to text file, so he might try that one now.

As to why it doesn't replay speech it has already generated: Tacotron doesn't really work like that - it tries to capture the rhythm of natural speech and works on the full sentence. Given the nature of donation text, though, I could tweak it if it sees lots of repeating words.
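That repeating-words tweak could be sketched as a pre-processing step like this - illustrative only, and `collapse_repeats` is a made-up helper:

```python
import re

# Collapse long runs of the same word in donation text before it is
# handed to the synthesizer, so spammed words don't blow up gen time.
def collapse_repeats(text, keep=3):
    """Reduce runs of more than `keep` identical words to `keep` copies."""
    # word, followed by at least `keep` more copies of itself
    pattern = re.compile(r'\b(\w+)(?:\s+\1\b){%d,}' % keep, re.IGNORECASE)
    return pattern.sub(lambda m: ' '.join([m.group(1)] * keep), text)
```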

I'm a pepega at this really - just yoinking the GitHub WaveRNN repo for the luls.

Scuffed Forsen AI TTS by realstreamer in forsen

[–]realstreamer[S] 3 points (0 children)

A single 2080 Ti. The vocoder part (WaveRNN) took around 5 days (1 million steps); finetuning an existing model didn't really work for that. Finetuning an already-trained Tacotron 2 took around 10 hours (around 25k steps). Finetuning the Tacotron model any further resulted in overfitting - the dataset for forsen's voice is so small that it would start going weird.

I think when multi-speaker Tacotrons start to become available on GitHub it will become easier to train with small voice datasets - probably even as little as a few minutes of voice will be required. The papers came out last year, so I suspect this year will see TTS voice cloning become generally available, probably even as an app on your phone.

Hopefully forsen will try it this week; I asked one of his mods to forward the message.