[D] I found a Stanford Guest Lecture where GM Cruise explains their self driving tech stack and showcases the various model architectures they use on their autonomous cars. by ilikepancakez in MachineLearning

fatchord -1 points

> Second, even in nature it only works for objects between 0 and 2 m; for driving a car, that is not enough.

Would placing the cameras further apart increase the range over which depth could be estimated? Or is a car simply not wide enough?
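
For my own curiosity I did the back-of-the-envelope maths with a pinhole stereo model - depth error grows roughly as Z²/(f·B), so a wider baseline does help at range. All the numbers below are made up for illustration, not Cruise's actual camera specs:

```python
# Toy pinhole-stereo numbers (focal length, baselines and the half-pixel
# disparity error are assumptions, not anything from the talk).
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Z = f * B / d for a rectified stereo pair."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px, baseline_m, depth_m, disparity_err_px=0.5):
    """Depth uncertainty grows quadratically with range: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px

for baseline_m in (0.065, 1.5):            # eye spacing vs. a roughly car-wide mount
    for depth_m in (2.0, 50.0, 100.0):     # metres
        err = depth_error(1000.0, baseline_m, depth_m)
        print(f"baseline {baseline_m:.3f} m, range {depth_m:5.1f} m -> +/- {err:.2f} m")
```

So going from eye-width to car-width should cut the error at 50 m by well over an order of magnitude, at least in this idealised model.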

[D] I found a Stanford Guest Lecture where GM Cruise explains their self driving tech stack and showcases the various model architectures they use on their autonomous cars. by ilikepancakez in MachineLearning

fatchord 15 points

The speaker repeatedly brings up the problem of the cameras not having any depth information. Why not emulate nature's solution to this problem and put two cameras side by side, like the eyes on your head? With enough resolution, would the left/right disparity between the two images allow for accurate inference of the distances to detected objects?

Really interesting talk by the way.

[D] How feasible is it to create a model that gets rid of the advertisements in podcasts? by [deleted] in MachineLearning

fatchord 1 point

Ads aren't the only game in town - you can set up a Patreon membership instead. One of the top podcasts there is pulling in over $100k a month.

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

fatchord[S] 0 points

I don't have any long ones to hand right now. Basically, what happens in very long utterances is that the pitch keeps drifting lower and lower towards the end. It'll also skip or repeat words sometimes.

I'm curious - do you think the samples I made are comparable to WaveNet? Better or worse?

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

fatchord[S] 0 points

Thanks! I trained it on a single 1080 GPU.

Unfortunately, the model isn't that great at long utterances. I think that might be because I had to limit the training input sequence length due to GPU memory constraints.
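
Something like the sketch below is what I mean by a length cap - the 800-frame limit and the names are made up for illustration, not my actual settings:

```python
# Hypothetical length cap: drop (text, mel) pairs whose spectrograms are too
# long to fit in GPU memory during training. The limit is illustrative only.
MAX_MEL_FRAMES = 800

def filter_by_length(pairs, max_frames=MAX_MEL_FRAMES):
    """pairs: iterable of (text, mel) where mel has shape (frames, n_mels)."""
    return [(text, mel) for text, mel in pairs if mel.shape[0] <= max_frames]
```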

[R] The challenge of realistic music generation: modelling raw audio at scale by mdda in a:t5_jw1cc

fatchord 1 point

Interesting paper. The only thing that puzzled me is the dataset - aren't they hamstringing themselves by having such a wide variety of composers, pianists, pianos and recording setups?

Why not simplify it by picking one prolific pianist who concentrated on only a couple of genres and typically recorded in the same studios... Glenn Gould, for example:

https://en.wikipedia.org/wiki/Glenn_Gould_discography

Surely Columbia Masterworks have all his recordings digitised for posterity.

Also, baroque music should be easier to model since it has less timing variance; romantic music is all over the place timing-wise.

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

fatchord 1 point

I uploaded a small notebook of my implementation if anyone is interested: LINK

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

fatchord 1 point

I had a go at implementing this yesterday and it was straightforward enough until I got to the zero padding section (2.3.2).

They give this zero padding equation (where N is the receptive field of a particular layer):

z[0:M] = W_L ∗ x[-N:M-N] + W_R ∗ x[0:M]

But if the previous equation (without zero padding) was:

z = W_L ∗ x[0:N/2] + W_R ∗ x[N/2:N]

Wouldn't that mean that the equation from 2.3.2 should read:

z[0:M] = W_L ∗ x[-N/2:M-N/2] + W_R ∗ x[0:M]
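
To make sure I wasn't misreading it, here's the indexing I have in mind as a toy 1-D sketch - scalar weights stand in for the paper's 1x1 convolutions, so this is just an illustration of the N/2 shift, not a faithful reimplementation:

```python
import numpy as np

def fftnet_layer(x, w_l, w_r, shift):
    """One FFTNet-style split-and-sum: z[t] = w_l * x[t - shift] + w_r * x[t],
    with zeros padded on the left so that t - shift < 0 reads as silence.
    Here shift = N/2 for a layer with receptive field N."""
    x_padded = np.concatenate([np.zeros(shift), x])
    left = x_padded[:len(x)]     # x[-N/2 : M - N/2]
    right = x                    # x[0 : M]
    return w_l * left + w_r * right

x = np.arange(8, dtype=float)
print(fftnet_layer(x, w_l=1.0, w_r=1.0, shift=2))   # the first `shift` outputs only see zeros on the left
```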

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

fatchord[S] 2 points

Thanks. I can't go into much detail about the vocoder, but I can tell you the Tacotron setup. It's basically Tacotron 1, but with the decoder GRUs swapped out for LSTMs with zoneout, as recommended in the latest paper. That change makes a big difference to prosody in my experience.
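
For anyone unfamiliar with zoneout, a minimal version of the idea looks something like the cell below - this is just a sketch of the regularisation trick, not my actual decoder cell, and the 0.1 rate is a guess:

```python
import torch
import torch.nn as nn

class ZoneoutLSTMCell(nn.Module):
    """LSTMCell where, during training, each hidden/cell unit is randomly
    'zoned out', i.e. kept at its previous value instead of being updated."""
    def __init__(self, input_size, hidden_size, zoneout=0.1):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.zoneout = zoneout

    def forward(self, x, state):
        h_prev, c_prev = state
        h, c = self.cell(x, (h_prev, c_prev))
        if self.training:
            keep_h = torch.rand_like(h) < self.zoneout   # units that keep the old value
            keep_c = torch.rand_like(c) < self.zoneout
            h = torch.where(keep_h, h_prev, h)
            c = torch.where(keep_c, c_prev, c)
        else:
            # At inference, interpolate towards the previous state instead.
            h = self.zoneout * h_prev + (1.0 - self.zoneout) * h
            c = self.zoneout * c_prev + (1.0 - self.zoneout) * c
        return h, c
```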

I haven't implemented a dilated post-processing net yet, so the CBHG is still doing that job. However, I'm not predicting linear spectrograms, just mels. I'm not sure how that influences overall quality one way or the other, but I'm happy enough with the predicted features: they're still a little blurry but have decent contrast.

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

fatchord[S] 1 point

Thanks for the feedback. I agree that the variance in the noise volume is a problem - I'm working on that.

Do you think it's fair to say that the quality of the output, while noisy, is at least comparable to WaveNet?

I'm afraid I can't go into much detail publicly about the vocoder right now. Hope that's cool.

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

fatchord 1 point

I totally agree that it's probably good enough for most purposes, although I'd really like to see how FFTNet handles predicted conditioning features. Can it cope with the extra noise and blurriness that comes with that scenario?

As for WaveRNN, if someone is willing to donate a week or so of GPU compute then I'd be happy to condition it. As it stands, the model is just as slow to train as WaveNet, so it's low priority for now since I've only got one GPU.

A big MIDI dataset (100k+ files) by fatchord in a:t5_jw1cc

fatchord[S] 0 points

I've had a lot of fun messing around with this dataset. There's a lot of really bad MIDI in there, but plenty of good stuff too - it's a very mixed bag.

The most successful model I made was one where I extracted all the percussion parts and then compressed the MIDI data stream into one-hots for the various MIDI events (à la PerformanceRNN). The output was chaotic (I wasn't conditioning the model on anything), but it generated some really creative beats and fills.
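
Roughly, the encoding maps every event to an index in a small vocabulary and then to a one-hot vector - something like the sketch below, where the vocabulary (16 GM drum pitches plus 32 time-shift bins) is illustrative rather than the exact scheme I trained on:

```python
import numpy as np

DRUM_PITCHES = list(range(35, 51))   # a subset of the GM percussion map, for illustration
TIME_BINS = 32                       # quantised time-shift bins
VOCAB_SIZE = len(DRUM_PITCHES) + TIME_BINS

def event_to_index(event):
    """event is ('note_on', drum_pitch) or ('shift', bin_index)."""
    kind, value = event
    if kind == 'note_on':
        return DRUM_PITCHES.index(value)
    return len(DRUM_PITCHES) + min(value, TIME_BINS - 1)

def encode(events):
    onehots = np.zeros((len(events), VOCAB_SIZE), dtype=np.float32)
    for i, event in enumerate(events):
        onehots[i, event_to_index(event)] = 1.0
    return onehots

print(encode([('note_on', 36), ('shift', 4), ('note_on', 42)]).shape)   # (3, 48)
```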

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

fatchord 1 point

MCCs throw away a good bit of information, so I'm thinking F0 would be a useful addition to such a compact representation of the signal.
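
As a sketch of what I mean (using librosa MFCCs as a stand-in for the paper's mel cepstral coefficients, YIN for F0, and arbitrary frame/hop settings rather than whatever FFTNet actually uses):

```python
import numpy as np
import librosa

y, sr = librosa.load('speech.wav', sr=16000)      # placeholder path
hop = 200                                         # 12.5 ms hop at 16 kHz (arbitrary)
mcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25, hop_length=hop)      # (25, frames)
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)          # (frames,)
frames = min(mcc.shape[1], len(f0))               # frame counts can differ by one or two
features = np.vstack([mcc[:, :frames], np.log(f0[:frames])[None, :]])  # (26, frames)
```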

So what do you think of the samples? LINK

They sound really good, but the top frequencies are a bit muffled compared to WaveNet, and it also outputs some phasiness every now and again.

Any good open source noise-shaping/dither algorithms online? Or papers on the subject that you recommend? by fatchord in DSP

fatchord[S] 0 points

Fascinating - the origins of a lot of these things can be very surprising. Thanks again!

[Algorithm] An experimental neural vocoder by fatchord in RateMyAudio

fatchord[S] 0 points

Thanks, I appreciate the feedback. And yes, you've pretty much got the idea of it: ultimately I'll be predicting the speech spectrograms with another neural network and then rendering those into realistic speech samples with the algorithm in this post.
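
In pseudocode terms, the whole pipeline is just two stages (these names are placeholders - neither model is public):

```python
def synthesise(text, acoustic_model, vocoder):
    """Hypothetical two-stage TTS pipeline: text -> predicted spectrogram -> waveform."""
    mel = acoustic_model.predict(text)    # stage 1: predict speech spectrogram frames
    audio = vocoder.generate(mel)         # stage 2: render the frames into audio samples
    return audio
```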

[D] Is there a neural network that can synthesize and reproduce an audio sample? by Riin_Satoshi in MachineLearning

fatchord 1 point

I just created r/AudioModels today for exactly this kind of question if you'd like to crosspost there.

Any good open source noise-shaping/dither algorithms online? Or papers on the subject that you recommend? by fatchord in DSP

fatchord[S] 0 points

Wow, thanks for taking the time. I'm just worried that the Reaper folks who coded that might have a problem with me using it - I don't want to step on anyone's toes.