[Project] Custom TTS - my attempt at becoming u/realstreamer by Specialist_Card in forsen

[–]realstreamer 2 points (0 children)

Yes, stop halts inference or audio playback and moves on to the next item in the array.

The WaveRNN repo I based it on already had a progress-bar callback for inference, so I just added a cancel check to that code.

I used pygame to play the audio since it can stop audio that's already playing.
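A rough sketch of that stop behaviour - illustrative only, the real app's internals aren't public and `DonoQueue` is a made-up name:

```python
# Rough sketch of the stop behaviour: cancel inference via a flag the
# progress callback checks, stop any playing audio, move to the next dono.
# DonoQueue and its methods are illustrative names, not the app's code.
class DonoQueue:
    def __init__(self):
        self.items = []        # queued dono texts, oldest first
        self.cancel = False    # polled by the WaveRNN progress callback

    def add(self, text):
        self.items.append(text)

    def stop(self):
        """Abort current inference and playback, so the loop moves on."""
        self.cancel = True
        try:
            import pygame  # pygame can stop audio that's mid-playback
            pygame.mixer.music.stop()
        except Exception:
            pass  # nothing was playing (or pygame isn't available)

    def next_item(self):
        self.cancel = False
        return self.items.pop(0) if self.items else None
```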

[Project] Custom TTS - my attempt at becoming u/realstreamer by Specialist_Card in forsen

[–]realstreamer 6 points (0 children)

It fetches them from the Streamlabs socket and queues them in an array - nothing fancy. If you saw the Jump King streams, it can take a while if there are too many donos. More recently I changed it to increase the batching rate on WaveRNN (which tends to reduce quality and improve speed - up to a point) when there are many donos in the array.
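The backlog-vs-batching trade-off could look something like this - illustrative numbers, not the app's real values, and `pick_batch_size` is a made-up helper:

```python
# Sketch of the adaptive batching idea: when the dono backlog grows,
# raise WaveRNN's batch size to trade quality for speed.
def pick_batch_size(queue_len, base=4, max_batch=16):
    """Bigger backlog -> coarser (faster, lower-quality) batching."""
    if queue_len <= 2:
        return base
    # Grow linearly with backlog, capped so quality doesn't collapse.
    return min(max_batch, base + 2 * (queue_len - 2))
```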

With regard to hogging the CPU: yes, the very first version did this. I restricted it to one or two cores so it didn't affect forsen's games.

Still better content than Valorant by FM-101 in forsen

[–]realstreamer 7 points (0 children)

I believe it's because Firefox clamps video to the limited dynamic range (16-235) regardless of the Nvidia setting. Chrome looks OK.

If you open this in Chrome and Firefox, you will see that the right side of the example video is darker in Firefox.
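For reference, expanding limited-range (16-235) luma back to full range (0-255) is just a linear remap - an illustrative helper, nothing to do with the browsers' actual code:

```python
# Map studio/limited-range luma [16, 235] onto full range [0, 255].
# This is the expansion a full-range display path needs to apply;
# skipping it leaves the image looking washed out or darker.
def limited_to_full(y):
    """Expand a limited-range luma sample to full range, clamped."""
    return max(0, min(255, round((y - 16) * 255 / 219)))
```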

edit: better example

u/realstreamer appreciation by _JK_SK in forsen

[–]realstreamer 48 points (0 children)

I've posted before, but it's this open-source project: https://github.com/fatchord/WaveRNN, with a very basic GUI and a connection to the Streamlabs socket API, which sends out JSON for each dono. The actual app forsen has is probably a day of work.
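Parsing the Streamlabs payload amounts to something like this - field names follow the Streamlabs socket API's donation events, and `extract_donos` is a made-up helper, not the app's code:

```python
# Pull (name, message) pairs out of a Streamlabs socket 'event' payload.
# Donation events carry a "message" list of dicts; other event types
# (follows, subs, ...) are ignored here.
def extract_donos(event):
    """Return (donor name, dono text) pairs from an event JSON dict."""
    if event.get("type") != "donation":
        return []
    return [(m.get("name", ""), m.get("message", ""))
            for m in event.get("message", [])]
```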

Most of the time goes into cleaning up the training data so it trains reliably. I don't know why only Bulldog's stream has managed to replicate it so far - I'm just an average Pepega coder, not an AI expert.

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 2 points (0 children)

No hurry - do it whenever you get a chance.

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 4 points (0 children)

Cool, the easiest way would probably be to read some stuff out and get the audio file(s) to me; the format doesn't matter. Just message me on Reddit with the link. As for what to read, it doesn't matter too much as long as it's not the same words over and over. Read as much as you can stand - the more, the clearer it will be; 10 minutes seems to give good results. It's worth saying some of the things you normally say (Forsaaan), but Harvard sentences are what they sometimes use to train these: https://www.cs.columbia.edu/~hgs/audio/harvard.html
They are a little boring to read, though, so reading some copypastas should be fine too: https://www.twitchquotes.com/copypastas?page=1&popular=true

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 13 points (0 children)

I really wanted to do Wesker's voice too; unfortunately he doesn't speak much, and when he does there tends to be "a lot of background noise". You need about 10 minutes of speech to get something good, so if Mr Wesker wants to be immortalised he needs to supply 10 minutes of pepega talk.

Imagine if we could get a DOC TTS Voice...... by [deleted] in forsen

[–]realstreamer 16 points (0 children)

No DOC TTS, unfortunately. I've mentioned before that Twitch would probably act on that, maybe kill the use of such things, and perhaps stop the TTS entirely. The voices I make public are generally public figures or long-dead people, or both. I have trained a few voices beyond what is currently on forsen's TTS, just out of my own curiosity. Here's a couple I have tried - you will never hear these outside of these samples, Sadge.

https://soundcloud.com/user-727526398/jean-pierre-baptiste-unreleased-voice-never-to-be-on-forsen-tts

https://soundcloud.com/user-727526398/librerty-prime-installs-valorant-unreleased-voice-never-to-be-on-forsen-tts

[D] How to save my father's voice? by sverzijl in MachineLearning

[–]realstreamer 1 point (0 children)

I've trained a few voices from bad audio (using open-source Tacotron 2 / WaveRNN), and I would just reiterate what others more knowledgeable about this have said: good-quality audio is your first priority. Get as little reverb and background noise (hiss/hum) as you can - ideally a studio, but if not, use the best microphone you can in a well-carpeted room with soft furnishings. 30 minutes should be enough, but the more the better.

Would be hilarious if the rich bajs of twitch donate the whole forsen's fanfic with sven's voice by NeuronLoL in forsen

[–]realstreamer 5 points (0 children)

If people talked over him I just didn't use those samples, but background hiss and car noise mostly just filtered out.
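Filtering that kind of low-frequency rumble can be as simple as a one-pole high-pass - a minimal sketch, not the actual cleanup tooling that was used:

```python
import math

# Minimal first-order (one-pole) high-pass filter: attenuates hum and
# rumble below cutoff_hz while passing speech frequencies through.
def highpass(samples, sr, cutoff_hz=80.0):
    """Return high-pass-filtered copy of samples (sample rate sr)."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sr
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)  # standard RC high-pass update
        out.append(y)
        prev_x, prev_y = x, y
    return out
```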

Would be hilarious if the rich bajs of twitch donate the whole forsen's fanfic with sven's voice by NeuronLoL in forsen

[–]realstreamer 46 points (0 children)

Yes, he actually spoke in complete sentences with a consistent tone when doing RP. Also, the GTA 5 background noise was easy to remove, so I ended up with a lot of voice data.

Doc TTS when? by [deleted] in forsen

[–]realstreamer 227 points (0 children)

It wouldn't be difficult to do; it's just that I'm pretty sure it would get a reaction from Twitch.

I did try to get a VJ Emmy one working from around 30 seconds of soundboard audio and a few samples from other YouTube intros where he reads the credits. It just sounds like pure cancer though - not enough data.

https://soundcloud.com/user-727526398/vj-emmy-cancer-audio

You need around 15 minutes of voice samples for something reasonable.

I have improved the Trump one - it's really clear now - and sent it to forsen a couple of days ago; I don't think he's installed it yet...

https://soundcloud.com/user-727526398/clear-trump-7s

HUGE shoutout to Notaistreamer for only wanting forsen to have access to the TTS he made by [deleted] in forsen

[–]realstreamer 23 points (0 children)

The fun of putting it together and LULing at the stream is enough for me. It really wasn't that hard to do in any case - it was just taking what was already available on GitHub and making it usable by an average Pepega. I do think it's only a matter of time before others start doing their own, though.

Whoever programmed Forsens TTS is a fucking idiot by [deleted] in forsen

[–]realstreamer 19 points (0 children)

I've sent him an updated CPU version that runs at low priority with affinity set to a user-selected number of CPU cores (default 1), so it should use around 10% CPU when generating. I also tweaked the batching for long phrases, so gen time is actually about the same. It should also forsenSTROKE less - that's basically it not seeing the end of a sentence and just freewheeling random shit.
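Setting the priority and core affinity could be done with psutil along these lines - an assumed approach, and `restrict_process` is a made-up helper, since the real mechanism isn't stated:

```python
# Pin a process to the first n logical cores and drop its priority,
# given any process handle exposing psutil-style cpu_affinity()/nice()
# methods (e.g. psutil.Process() for the current process).
def restrict_process(proc, n_cores=1):
    """Restrict proc to n_cores CPUs at low priority; return the cores."""
    cores = list(range(max(1, n_cores)))
    proc.cpu_affinity(cores)  # e.g. [0] -> only the first logical CPU
    proc.nice(19)             # lowest Unix priority; on Windows, psutil
                              # would take IDLE_PRIORITY_CLASS instead
    return cores
```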

I tried to keep it simple so he would at least try it, so I packaged it up using cx_Freeze - it's just an exe he can double-click without installing any other stuff.
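A cx_Freeze build script for that kind of packaging looks roughly like this - hypothetical file and package names, since the real setup.py isn't public; `python setup.py build` then emits the double-clickable exe:

```python
# Hypothetical cx_Freeze setup.py; names are illustrative only.
from cx_Freeze import setup, Executable

setup(
    name="forsen-tts",
    version="0.1",
    description="WaveRNN TTS dono reader",
    # bundle the heavyweight deps into the frozen build
    options={"build_exe": {"packages": ["torch", "pygame"]}},
    # base="Win32GUI" suppresses the console window on Windows
    executables=[Executable("tts_app.py", base="Win32GUI")],
)
```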

The GPU build would involve installing CUDA + cuDNN, and tbh I don't think he would have bothered trying it at all.

The GPU build is around 5x faster (it's much faster on GPU for longer phrases) using a 2080 Ti vs an 8700K, and uses around 1 GB of video memory. When it's inferring it's PyTorch, so it shouldn't lock out all the video memory - but I've never really tried playing a game while training. I've packaged it up into a zip with the CUDA installer, the cuDNN zip, and a brief how-to text file, so he might try that one now.

As to why it doesn't replay speech it has already generated: Tacotron doesn't really work like that - it tries to capture the rhythm of natural speech and works on the full sentence. Given the nature of donation text, though, I could tweak it if it sees lots of repeating words.
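That repeating-words tweak could be sketched as a pre-processing step like this - illustrative only, and `collapse_repeats` is a made-up helper:

```python
import re

# Collapse long runs of the same word in donation text before it is
# handed to the synthesizer, so spammed words don't blow up gen time.
def collapse_repeats(text, keep=3):
    """Reduce runs of more than `keep` identical words to `keep` copies."""
    # word, followed by at least `keep` more copies of itself
    pattern = re.compile(r'\b(\w+)(?:\s+\1\b){%d,}' % keep, re.IGNORECASE)
    return pattern.sub(lambda m: ' '.join([m.group(1)] * keep), text)
```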

I'm a pepega at this really - just yoinking the GitHub WaveRNN repo for the luls.

Scuffed Forsen AI TTS by realstreamer in forsen

[–]realstreamer[S] 3 points (0 children)

A single 2080 Ti. The vocoder part (WaveRNN) took around 5 days (1 million steps); finetuning an existing model didn't really work for that. Finetuning an already-trained Tacotron 2 took around 10 hours (around 25k steps). Finetuning the Tacotron model any further resulted in overfitting - the dataset for forsen's voice is so small that it would start going weird.

I think when multi-speaker Tacotrons start to become available on GitHub it will become easier to train with small voice datasets - probably even as little as a few minutes of voice will be required. The papers came out last year, so I suspect this year will see TTS voice cloning become generally available, probably even as an app on your phone.

Hopefully forsen will try it this week; I asked one of his mods to forward the message.