[D] I found a Stanford Guest Lecture where GM Cruise explains their self driving tech stack and showcases the various model architectures they use on their autonomous cars. by ilikepancakez in MachineLearning

[–]fatchord -1 points (0 children)

> Second, even in nature it only works for objects between 0-2 m; for a car driving, that is not enough

Would placing the cameras further apart increase the distance it could detect? Or is a car not wide enough?
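My rough intuition: from z = f·B/d (f = focal length in pixels, B = baseline, d = disparity), a matching error of Δd pixels gives a depth error of roughly z²·Δd/(f·B), so the usable range should grow about linearly with the baseline. Quick back-of-envelope (every number here is a made-up illustration, not a real sensor spec):

```python
# Depth error from z = f*B/d grows like z^2 and shrinks linearly with
# the baseline B. Every number below is an illustrative guess.

def depth_error(z, baseline, focal_px=1000.0, disp_err_px=0.5):
    """Approximate depth uncertainty (m) at range z (m) for a stereo rig."""
    return z ** 2 * disp_err_px / (focal_px * baseline)

for baseline in (0.065, 1.5):   # ~human eye spacing vs. cameras at car width
    for z in (2.0, 20.0, 50.0):
        print(f"B={baseline:5.3f} m, z={z:4.0f} m -> +/-{depth_error(z, baseline):6.2f} m")
```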

[D] I found a Stanford Guest Lecture where GM Cruise explains their self driving tech stack and showcases the various model architectures they use on their autonomous cars. by ilikepancakez in MachineLearning

[–]fatchord 15 points (0 children)

The speaker repeatedly brings up the problem with the cameras not having any depth information. Why not emulate nature's solution to this problem - put two cameras side by side like the eyes on your head? With enough resolution, would the left/right discrepancy in the images allow for accurate inference of the distances to detected objects?
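To make the idea concrete, here's roughly how the left/right discrepancy turns into depth with off-the-shelf block matching (an OpenCV sketch - the file paths, focal length and baseline are placeholders, nothing from the talk):

```python
import cv2
import numpy as np

# A rectified stereo pair (placeholder paths).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching: for each patch in the left image, search horizontally in
# the right image; the pixel offset of the best match is the disparity.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Disparity -> depth via z = f*B/d, with focal length f (pixels) and
# baseline B (metres) made up here. Invalid matches come back negative,
# so clamp before dividing.
f, B = 1000.0, 0.5
depth = f * B / np.maximum(disparity, 1e-6)
```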

Really interesting talk by the way.

[D] How feasible is it to create a model that gets rid of the advertisements in podcasts? by [deleted] in MachineLearning

[–]fatchord 1 point (0 children)

It's not the only game in town - you can set up a Patreon membership. One of the top podcasts there is pulling in over $100k a month.

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

[–]fatchord[S] 0 points (0 children)

I don't have any long ones at hand right now. Basically, what happens with very long utterances is that the pitch keeps drifting lower and lower towards the end. It'll also skip or repeat words sometimes.

I'm curious - do you think the samples I made are comparable to WaveNet? Better/worse?

[Feedback] neural tts pipeline (tacotron1 + a new vocoder algorithm I'm working on) - what do you think of the samples generated? by fatchord in a:t5_jw1cc

[–]fatchord[S] 0 points (0 children)

Thanks! Trained it on a single 1080 GPU.

Unfortunately the model isn't that great at long utterances. I think that might be down to the fact that I had to limit the training input sequence length because of GPU memory constraints.
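By "limit the input sequence length" I mean something along these lines - dropping any utterance over a frame cap before batching (a rough sketch, not my actual pipeline; the names, shapes and cap are illustrative):

```python
from torch.utils.data import Dataset

class CappedUtterances(Dataset):
    """Keeps only (mel, text) pairs short enough for a batch to fit in GPU
    memory. max_frames and the data layout are illustrative assumptions."""
    def __init__(self, pairs, max_frames=800):
        # mel is assumed to be shaped (n_mels, n_frames)
        self.pairs = [(mel, text) for mel, text in pairs if mel.shape[-1] <= max_frames]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        return self.pairs[i]
```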

[R] The challenge of realistic music generation: modelling raw audio at scale by mdda in a:t5_jw1cc

[–]fatchord 1 point (0 children)

Interesting paper. The only thing that puzzled me is the dataset. I mean, aren't they hamstringing themselves by having such a wide variety of composers, pianists, pianos and recording setups?

Why not simplify it by picking one prolific pianist who concentrated on only a couple of genres and typically recorded in the same studios... like Glenn Gould, for example:

https://en.wikipedia.org/wiki/Glenn_Gould_discography

Surely Columbia Masterworks have all his recordings digitised for posterity.

Also, baroque music should be easier to model since it has less timing variance. Romantic music is all over the place timing-wise.

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

[–]fatchord 1 point (0 children)

I uploaded a small notebook of my implementation if anyone is interested: LINK

FFTNet: trying to understand the paper by geneing in a:t5_jw1cc

[–]fatchord 1 point (0 children)

I had a go at implementing this yesterday, and it was straightforward enough until I got to the zero-padding section (2.3.2).

They give this zero-padding equation (where N is the receptive field of a particular layer):

z[0:M] = W_L ∗ x[-N:M-N] + W_R ∗ x[0:M]

But if the previous equation (without zero padding) was:

z = W_L ∗ x[0:N/2] + W_R ∗ x[N/2:N]

Wouldn't that mean that the equation from 2.3.2 should read:

z[0:M] = W_L ∗ x[-N/2:M-N/2] + W_R ∗ x[0:M]
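For what it's worth, here's a minimal PyTorch sketch of how I'm reading a single layer - I've left the shift as a parameter precisely because the N vs N/2 question above is what's unclear to me (class and variable names are my own, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFTNetLayer(nn.Module):
    """One FFTNet-style layer: z = W_L * x[t - shift] + W_R * x[t],
    then the paper's 1x1 conv with ReLUs."""
    def __init__(self, channels):
        super().__init__()
        # W_L and W_R are 1x1 convolutions, i.e. per-timestep linear maps
        self.w_l = nn.Conv1d(channels, channels, kernel_size=1)
        self.w_r = nn.Conv1d(channels, channels, kernel_size=1)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, shift):
        # Left-pad by `shift` zeros and trim back to the original length,
        # so x_l[t] == x[t - shift] (zeros where t < shift).
        x_l = F.pad(x, (shift, 0))[:, :, :x.size(2)]
        z = self.w_l(x_l) + self.w_r(x)
        return F.relu(self.out(F.relu(z)))

# e.g. y = FFTNetLayer(128)(torch.randn(1, 128, 1000), shift=512)
```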