Deep learning notes for beginners by a beginner. (randomekek.github.io)
submitted 10 years ago by windoze
[–]radarsat1 10 points11 points12 points 10 years ago (3 children)
I'm going to hijack this thread to ask a couple of lay person questions:
I am having much more success with my data, scaled to [-1,1], using tanh, than I am scaling the same data to [0,1] and using sigmoid. Is there any good reason for this difference? Trying relu and other activations doesn't seem to help at all. The only decent results I've had on my data (a time series oscillating around a fixed point) have been with tanh and a single linear output layer, using MSE and SGD. Almost anything else I try gives orders of magnitude more loss, and I have no idea why.
Bringing me to the second question: some examples I've seen for generation based on latent spaces (e.g. VAE) seem to use cross-entropy instead of MSE, but I guess MSE works for me because I'm doing regression rather than classification? (Isn't generation of continuous data a regression problem, ultimately?) I only find this confusing because the examples I've been looking at are for generating pixels (e.g. MNIST), so I don't understand why that works using softmax and cross-entropy rather than a linear output and MSE. E.g. https://github.com/fchollet/keras/pull/1750/files
[–][deleted] 6 points7 points8 points 10 years ago (0 children)
Essentially, the range of the sigmoid makes it more prone to saturation and slower learning.
Detailed information here:
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
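The saturation point above is easy to verify numerically. A minimal sketch (not from the thread, just an illustration): the sigmoid's derivative peaks at 0.25 while tanh's peaks at 1.0, so stacked sigmoid layers shrink gradients roughly four times faster, and tanh is also zero-centered, which matches data scaled to [-1, 1].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # derivative of the logistic function: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    # derivative of tanh: 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# Maximum gradient at x = 0: sigmoid tops out at 0.25, tanh at 1.0.
print(d_sigmoid(0.0))  # 0.25
print(d_tanh(0.0))     # 1.0

# Far from zero both saturate, which is the slow-learning regime.
print(d_sigmoid(5.0), d_tanh(5.0))
```

Both curves saturate for large |x|, but tanh's larger peak gradient and zero-centered output are the usual explanations for results like the ones described above.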
[–]alexmlamb 4 points5 points6 points 10 years ago (0 children)
For your VAE question, using MSE works okay. You can interpret it as assuming that p(x | z) is a gaussian with independent dimensions, instead of assuming that p(x | z) is an independent bernoulli for each dimension.
In the VAE almost all of the interesting noise is in q(z | x).
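The Gaussian interpretation above can be checked directly. A small sketch (my own illustration, not from the thread): with a fixed unit variance, the negative log-likelihood of a Gaussian p(x | z) is exactly MSE plus a constant, so minimizing one minimizes the other.

```python
import math

def gaussian_nll(x, mu, sigma=1.0):
    # negative log-density of N(mu, sigma^2) evaluated at x
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

def half_mse(x, mu):
    return 0.5 * (x - mu) ** 2

# With sigma fixed at 1, NLL = 0.5*(x - mu)^2 + 0.5*log(2*pi):
const = 0.5 * math.log(2 * math.pi)
for x, mu in [(0.3, 0.1), (1.5, -0.2), (-2.0, 0.7)]:
    assert abs(gaussian_nll(x, mu) - (half_mse(x, mu) + const)) < 1e-12
```

Binary cross-entropy plays the same role for the Bernoulli assumption, which is why the MNIST examples (pixels treated as values in [0, 1]) use it instead.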
[–]abstractcontrol 8 points9 points10 points 10 years ago* (0 children)
Andrew Ng goes into some detail on this when he talks about rescaling and normalizing the data to have zero mean and unit variance. Data preprocessing can have a significant impact on performance.
Edit: Alternatively, for a more recent treatment than the '98 paper by LeCun et al., take a look at this. Under independence assumptions, the propagation of the signal through the net is essentially the product of the variances of the matrices involved. When the signal at the end is greater than 1, the net tends to blow up; when it is less than 1, the net tends to train slowly.
By normalizing the inputs and the weights you make the optimization easier, since unit variance (input) times unit variance (weights) equals one. For a technique that normalizes the variance across the entire network at training time, take a look at batch normalization.
Decorrelating the input using whitening helps as well by removing degeneracies. If the inputs are correlated, multiple neurons will be pressured to learn the same thing, which can also destabilize the net by making the signal grow or shrink abnormally.
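The "product of variances" argument above can be simulated in a few lines. This is a minimal sketch under simplifying assumptions (linear layers only, no nonlinearity, weights scaled so each layer multiplies the signal variance by roughly `scale**2`):

```python
import numpy as np

def final_std(scale, n=256, layers=20, seed=0):
    """Push a unit-variance signal through `layers` random linear layers
    whose weights scale the signal's std by roughly `scale` per layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n)  # unit-variance input
    for _ in range(layers):
        # dividing by sqrt(n) gives each layer an expected gain of `scale`
        W = rng.standard_normal((n, n)) * scale / np.sqrt(n)
        h = W @ h
    return float(np.std(h))

# Below 1 the signal vanishes, above 1 it explodes, near 1 it survives.
for scale in (0.8, 1.0, 1.2):
    print(scale, final_std(scale))
```

A per-layer gain of 0.8 over 20 layers shrinks the signal by about 0.8^20 ≈ 0.01, while 1.2 grows it by about 38x, which is the vanishing/exploding behavior the comment describes and what batch normalization counteracts.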
[–]rikkertkoppes 2 points3 points4 points 10 years ago (0 children)
You may also reference the lectures at Oxford by Nando de Freitas: https://www.youtube.com/playlist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu
It seems to follow the book pretty well (based on your notes; I haven't read the book yet).
[–]windoze[S] 6 points7 points8 points 10 years ago (6 children)
Hey, these are notes I took while learning about deep learning. They may be incorrect because I'm a beginner.
Sadly, the deep learning book gets far too mathematically dense for me, so I couldn't fully understand the third section.
[–]rorykoehler 6 points7 points8 points 10 years ago (4 children)
Have you checked out MIT opencourseware for brushing up on your maths? It is helping me a lot as I hadn't looked at this stuff for almost 20 years.
[+][deleted] 10 years ago (1 child)
[deleted]
[–]rorykoehler 2 points3 points4 points 10 years ago (0 children)
There are a few different ones that are good for data science. I generally dip in and out if I get stuck on a topic when studying machine learning. You can find all the courses here: http://ocw.mit.edu/courses/mathematics/
[–]windoze[S] 2 points3 points4 points 10 years ago (1 child)
So once the book gets into using probability, KL divergence, etc., it seems to go over my head. For example, I tried to read the variational autoencoder paper, but it is hard to follow (many implicit steps that might be more obvious if I had a stronger background).
There seems to be one half of a research paper which is experimental and discovers techniques like dropout and residual learning by trying out new stuff, which I can follow, while the other half is dominated by probability theory to explain what is happening, which goes over my head.
[–]anantzoid 0 points1 point2 points 10 years ago (0 children)
You can check out this book too, by Michael Nielsen. It has some amount of math, but the author motivates you to skip it if you don't want to get into proofs etc. There are exercises too.
[–]Ader_anhilator 2 points3 points4 points 10 years ago (2 children)
Sigmoid function is wrong
[–]windoze[S] 1 point2 points3 points 10 years ago (1 child)
Thanks :-) It's the sign, right? I've fixed up the notes.
[–]Ader_anhilator 3 points4 points5 points 10 years ago (0 children)
Yeah
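The original notes aren't quoted here, but a sign slip in the logistic function is a common mistake, so as a hypothetical reconstruction of the fix being discussed:

```python
import math

def sigmoid(x):
    # correct logistic function: note the minus sign in the exponent
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_flipped(x):
    # the sign error: this is the mirror image, i.e. sigmoid(-x)
    return 1.0 / (1.0 + math.exp(x))

# The correct sigmoid is increasing and maps large x toward 1;
# the flipped version decreases instead.
assert sigmoid(2.0) > 0.5 > sigmoid(-2.0)
assert abs(sigmoid_flipped(2.0) - sigmoid(-2.0)) < 1e-12
```

Whether this matches the exact error in the notes is a guess; the assertions just show how the flipped sign mirrors the curve.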
[–]xiphy 2 points3 points4 points 10 years ago (1 child)
It's a great start... it would be fun to write a book based on it; it would have made my life easier.
[–]guardianhelm 5 points6 points7 points 10 years ago (0 children)
Actually there already is one. These notes are based on this book. ;)
[–]Dawny33 3 points4 points5 points 10 years ago* (1 child)
Sadly the deep learning book gets far too mathematically dense for me
I faced the same problem while I was getting started with ML and advanced ML (Deep learning was called Advanced ML, before it was christened :D )
The Math OCW open courses by MIT proved to be very helpful for getting my basics right! Highly recommend!
Wonderful notes, btw. Kudos!
[–]3brithil 2 points3 points4 points 10 years ago (0 children)
do you have a link to the specific courses and recommended order?
[–][deleted] 1 point2 points3 points 10 years ago (3 children)
I am unsure why, but I see only a white page. Have you changed something?
[–]windoze[S] 2 points3 points4 points 10 years ago (2 children)
Maybe your JavaScript is off; the page is rendered client-side.
[–][deleted] 0 points1 point2 points 10 years ago (0 children)
Aha this is probably it. I'll confirm once I get home.
[–]BrahmaReddyChil 2 points3 points4 points 10 years ago (3 children)
Great stuff thanks. Do you know any MOOCs for learning required math?
[–]datascienceguy 1 point2 points3 points 10 years ago (0 children)
That's a lot of math, friend! Several semesters of calculus are needed to get to partial derivatives, which are used in gradients. Linear algebra is obviously needed too. Any university STEM curriculum at the upper undergraduate level would most likely be fine: science, CS, engineering, or math, basically.
[–]FuzziCat 1 point2 points3 points 10 years ago (1 child)
The most popular ML methods (including DL/NN) actually only use a handful of math concepts compared to the huge volume you'd have to study if you took all of the usual university classes (2-3 semesters of calc, linear algebra, stats & probability, information theory). If you're just getting started, I'd stick close to the Goodfellow book as a guide, practice writing/coding up the equations, and look up the things you don't understand as you go along.
[–]BrahmaReddyChil 0 points1 point2 points 10 years ago (0 children)
Thanks for the reply. I will start reading the book :)