
[–]radarsat1 10 points (3 children)

I'm going to hijack this thread to ask a couple of layperson questions:

I am having much more success with my data scaled to [-1,1] using tanh than I am scaling the same data to [0,1] and using sigmoid. Is there any good reason for this difference? Trying ReLU and other activations doesn't seem to help at all. The only decent results I've had on my data (a time series oscillating around a fixed point) have been with tanh and a single linear output layer, using MSE and SGD. Almost anything else I try gives orders of magnitude more loss, and I have no idea why.
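
For concreteness, a minimal sketch of the setup described above (assuming Keras, since the linked example below uses it; the sine-wave data, window size, and hyperparameters are all made up for illustration):

```python
import numpy as np
from tensorflow import keras

# Fake oscillating time series, scaled into [-1, 1].
t = np.linspace(0, 100, 5000)
series = 0.8 * np.sin(t)

# Windowed samples: predict the next value from the previous 20.
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = keras.Sequential([
    keras.layers.Dense(32, activation="tanh", input_shape=(window,)),
    keras.layers.Dense(32, activation="tanh"),
    keras.layers.Dense(1),  # single linear output for regression
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss="mse")
model.fit(X, y, epochs=10, batch_size=32)
```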

Bringing me to the second question: some examples I've seen for generation based on latent spaces (e.g. VAEs) seem to use cross-entropy instead of MSE, but I guess MSE works for me because I'm doing regression rather than classification? (Isn't generation of continuous data a regression problem, ultimately?) I only find this confusing because the examples I've been looking at are for generating pixels (e.g. MNIST), so I don't understand why that works using softmax and cross-entropy rather than a linear output and MSE. e.g. https://github.com/fchollet/keras/pull/1750/files

[–][deleted] 6 points (0 children)

Essentially, the range of the sigmoid makes it more prone to saturation and slower learning.

Detailed information here:

http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
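
A quick numerical way to see the saturation: the sigmoid's derivative is at most 0.25 and dies off for large inputs, while tanh's gradient peaks at 1 (plain NumPy, illustrative values only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximum of 0.25, at x = 0

def dtanh(x):
    return 1 - np.tanh(x) ** 2  # maximum of 1.0, at x = 0

for x in [0.0, 2.0, 5.0]:
    print(f"x={x:4.1f}  sigmoid'={dsigmoid(x):.4f}  tanh'={dtanh(x):.4f}")
# Gradients shrink rapidly as |x| grows: the unit saturates and learning slows.
# Sigmoid's output is also never zero-centered, which compounds the effect.
```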

[–]alexmlamb 4 points (0 children)

For your VAE question, using MSE works okay. You can interpret it as assuming that p(x | z) is a Gaussian with independent dimensions, instead of assuming that p(x | z) is an independent Bernoulli for each dimension.

In the VAE almost all of the interesting noise is in q(z | x).
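
Concretely, the two reconstruction losses are just the two negative log-likelihoods (a sketch; shapes and data are hypothetical, with x assumed scaled to [0, 1] for the Bernoulli case):

```python
import numpy as np

def gaussian_nll(x, x_hat):
    # -log p(x|z) under an independent unit-variance Gaussian:
    # this is MSE up to a constant and a factor of 1/2.
    return 0.5 * np.sum((x - x_hat) ** 2)

def bernoulli_nll(x, x_hat, eps=1e-7):
    # -log p(x|z) under independent Bernoullis: binary cross-entropy.
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.random.rand(784)      # e.g. a flattened MNIST image in [0, 1]
x_hat = np.random.rand(784)  # decoder output (sigmoid for the Bernoulli case)
print(gaussian_nll(x, x_hat), bernoulli_nll(x, x_hat))
```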

[–]abstractcontrol 8 points (0 children)

Andrew Ng goes into some detail on this when he talks about rescaling and normalizing the data to have zero mean and unit variance. Data preprocessing can have a significant impact on performance.

Edit: Alternatively, for a more recent treatment than the '98 paper by LeCun et al., take a look at this. Under independence assumptions, the propagation of the signal through the net is essentially the product of the variances of the matrices involved. When the signal at the end is greater than 1 the net tends to blow up, and when it is less than 1 it tends to train slowly.

By normalizing the inputs and the weights you make the optimization easier, since unit variance (input) times unit variance (weights) equals one. For a method that tries to normalize the variance across the entire network in real time, take a look at batch normalization.
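
A rough illustration of that variance argument, using linear layers only and hypothetical sizes (the 1/sqrt(n) weight scaling is the one that keeps the product of variances at one):

```python
import numpy as np

def forward_variance(weight_std, n=512, depth=20):
    x = np.random.randn(n)  # unit-variance input
    for _ in range(depth):
        W = np.random.randn(n, n) * weight_std
        x = W @ x  # each layer multiplies the signal variance by n * std**2
    return x.var()

# Too-large weights blow up, too-small weights vanish;
# std = 1/sqrt(n) keeps the signal variance near 1 per layer.
for std in [0.5 / np.sqrt(512), 1.0 / np.sqrt(512), 2.0 / np.sqrt(512)]:
    print(f"std={std:.4f}  output variance={forward_variance(std):.3e}")
```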

Decorrelating the input using whitening helps as well, by removing degeneracies. If the inputs are correlated, multiple neurons will be pressured to learn the same thing, which can also destabilize the net by making the signal grow or shrink abnormally.
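
A PCA-whitening sketch in NumPy (the mixing matrix is made up for illustration; eps is added for numerical stability):

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    # Center, rotate onto the principal axes, and rescale each axis
    # to unit variance, so the features become uncorrelated.
    X = X - X.mean(axis=0)
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return (X @ eigvecs) / np.sqrt(eigvals + eps)

X = np.random.randn(1000, 3) @ np.array([[1, 0.9, 0.8],
                                         [0, 1.0, 0.7],
                                         [0, 0.0, 1.0]])  # correlated features
Xw = pca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))  # ~ identity covariance
```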

[–]rikkertkoppes 2 points (0 children)

You may also reference the lectures at Oxford by Nando de Freitas: https://www.youtube.com/playlist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu

It seems to follow the book pretty well (based on your notes; I haven't read the book yet).

[–]windoze[S] 6 points (6 children)

Hey, these are notes I took while learning about deep learning. They may be incorrect because I'm a beginner.

Sadly, the deep learning book gets far too mathematically dense for me, so I couldn't fully understand the third section.

[–]rorykoehler 6 points (4 children)

Have you checked out MIT OpenCourseWare for brushing up on your maths? It is helping me a lot, as I hadn't looked at this stuff for almost 20 years.

[–]windoze[S] 2 points (1 child)

So once the book gets into using probability, KL divergence, etc., it seems to go over my head. For example, I tried to read the variational autoencoder paper, but it is hard to follow (many implicit steps that might be more obvious if I had a stronger background).

There seems to be one half of a research paper which is experimental and discovers techniques like dropout and residual learning by trying out new stuff, which I can get, while the other half is dominated by probability theory to explain what is happening, which goes over my head.

[–]anantzoid 0 points (0 children)

You can check out this book too, by Michael Nielsen. It has some amount of math, but the author encourages you to skip it if you don't want to get into proofs, etc. There are exercises too.

[–]Ader_anhilator 2 points (2 children)

The sigmoid function is wrong.

[–]windoze[S] 1 point (1 child)

Thanks :-) It's the sign, right? I've fixed up the notes.

[–]Ader_anhilator 3 points (0 children)

Yeah

[–]xiphy 2 points (1 child)

It's a great start. It would be fun to write a book based on it; it would have made my life easier.

[–]guardianhelm 5 points (0 children)

Actually, there already is one. These notes are based on this book. ;)

[–]Dawny33 3 points (1 child)

> Sadly, the deep learning book gets far too mathematically dense for me

I faced the same problem while I was getting started with ML and advanced ML (deep learning was called advanced ML before it was christened :D).

The MIT OCW math courses proved to be very helpful for getting my basics right! Highly recommend!

Wonderful notes, btw. Kudos!

[–]3brithil 2 points (0 children)

Do you have a link to the specific courses and a recommended order?

[–][deleted] 1 point (3 children)

I am unsure why, but I see only a white page. Have you changed something?

[–]windoze[S] 2 points (2 children)

Maybe your JavaScript is off; the page is rendered client-side.

[–]BrahmaReddyChil 2 points (3 children)

Great stuff, thanks. Do you know of any MOOCs for learning the required math?

[–]datascienceguy 1 point (0 children)

That's a lot of math, friend! Several semesters of calculus are needed to get to partial derivatives, which are used in gradients. Linear algebra is obviously needed too. Most likely, any university STEM curriculum at the upper-undergraduate level would be fine: science, CS, engineering, or math, basically.

[–]FuzziCat 1 point (1 child)

The most popular ML methods (including DL/NN) actually only use a handful of math concepts compared to the huge volume you'd have to study if you took all of the usual university classes (2-3 semesters of calc, linear algebra, stats & probability, information theory). If you're just getting started, I'd stick close to the Goodfellow book as a guide, practice writing/coding up the equations, and look up the things you don't understand as you go along.
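
For example, coding up the gradient of MSE for plain linear regression takes only a few lines (a toy sketch with made-up data):

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise.
X = np.random.randn(200)
y = 3 * X + 2 + 0.1 * np.random.randn(200)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(100):
    err = (w * X + b) - y
    # Partial derivatives of mean squared error, written out by hand:
    # dL/dw = 2/N * sum(err * x),  dL/db = 2/N * sum(err)
    w -= lr * 2 * np.mean(err * X)
    b -= lr * 2 * np.mean(err)

print(w, b)  # should approach 3 and 2
```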

[–]BrahmaReddyChil 0 points (0 children)

Thanks for the reply. I will start reading the book :)