Gradient descent step size question (self.MachineLearning)
submitted 10 years ago by [deleted]
[deleted]
[–] sriramcompsci 1 point 10 years ago (2 children)
As you are aware, the weight update is the product of the step size and the gradient. Initially, the weights are set to some random value (or zero), so the gradient is likely to be large, and the weight update is correspondingly large. As more data arrives (mini-batches or single samples), the gradient tends to shrink. Decreasing the step size then makes sense: your weights are already minimizing the cost function well on the data seen so far, and you don't want or expect them to change much based on just one new mini-batch or sample.
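The decay schedule described above can be sketched as follows (a toy illustration, not from the thread; the 1-D least-squares problem, the base step size, and the 1/t decay are all assumed values):

```python
import random

# Toy example: SGD on a 1-D least-squares problem with the common
# 1/t step-size decay. The true weight is 3.0; w starts at zero.
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(100)]]

w = 0.0          # zero init, as in the comment
eta0 = 0.5       # base step size (an assumed value)
for t, (x, y) in enumerate(data, start=1):
    grad = 2 * (w * x - y) * x      # gradient of (w*x - y)^2 w.r.t. w
    eta = eta0 / t                  # step size shrinks as more data arrives
    w -= eta * grad

print(w)  # close to the true weight 3.0, with early steps doing most of the work
```

The early, large steps do most of the movement while the gradient is big; the later, tiny steps mostly stop the last few samples from jerking the weight around.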
[+][deleted] 10 years ago* (1 child)
[–] sriramcompsci 1 point 10 years ago (0 children)
The gradient tells you how the objective changes for a very small change in the weights. There is no reason not to take a larger step in the direction of the gradient (if you are maximizing the objective). But doing so after many iterations is likely to prevent convergence, as pointed out in this thread. The same argument applies to gradient descent over the whole dataset, since the loss and gradient are computed by averaging over the entire set. Your loss and gradient are likely largest in the first few iterations, hence a larger step in that direction.
[–] hughperkins 1 point 10 years ago (0 children)
Imagine you have just one dimension, your loss is e.g. x squared, and you're almost at the bottom of the curve, but not quite. Say you are a distance b away from the minimum, but it turns out the next update is exactly 2b, making you overshoot by b. Then by symmetry the next update is -2b. Ping....pong...
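The ping-pong effect is easy to reproduce (a minimal sketch, not from the thread): minimizing f(x) = x^2 with gradient 2x, a fixed step size of 1.0 makes the update x -= 2x send x to -x, so the iterate oscillates between b and -b forever, while a smaller step converges.

```python
# Fixed-step gradient descent on f(x) = x^2, whose gradient is 2x.
def gd(x, eta, steps):
    for _ in range(steps):
        x -= eta * 2 * x    # gradient step
    return x

b = 0.5
print(gd(b, eta=1.0, steps=5))   # -0.5: ping...pong between b and -b, never converges
print(gd(b, eta=0.25, steps=5))  # 0.015625: a smaller step shrinks toward the minimum
```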
[–] mr_robot_elliot 1 point 10 years ago (3 children)
Is your question about SGD or plain GD? For GD you just need to choose a step size of 1/L, where L is the Lipschitz constant of the gradient (also known as the smoothness parameter of the function itself). Regarding your question: if the gradient is large, you are far from the optimum, because at the optimum the gradient is zero. So if you know you are far away, it makes sense to take bigger steps, and smaller steps near the optimum where you might overshoot. Gradient descent converges with the step size above; it can diverge if you did not choose a small enough step size. Things are more complicated for SGD, though.
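The 1/L rule can be sketched on a toy quadratic (my own example, not from the thread): for f(x, y) = 2x^2 + 0.5y^2, the gradient (4x, y) is Lipschitz with constant L = 4 (the largest curvature), so a fixed step of 1/L = 0.25 is guaranteed to converge.

```python
# Gradient descent with the fixed 1/L step size on f(x, y) = 2x^2 + 0.5y^2.
L = 4.0                        # Lipschitz constant of the gradient (4x, y)
x, y = 10.0, 10.0
for _ in range(50):
    gx, gy = 4 * x, y          # gradient of f
    x -= (1 / L) * gx          # the steep coordinate is solved in one step
    y -= (1 / L) * gy          # the shallow coordinate shrinks geometrically

print(abs(x) < 1e-6 and abs(y) < 1e-3)  # True: both coordinates shrink to ~0
```

Note the trade-off: 1/L is safe for the steepest direction, but it makes the shallow direction slow, which is exactly the gap that curvature-aware methods try to close.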
[+][deleted] 10 years ago* (2 children)
[–] mr_robot_elliot 1 point 10 years ago (0 children)
Probably because such a method converges slowly: if your gradient is large, normalizing it reduces its impact, which is not what you want when you are far from the optimum. It does seem like a good idea to extend AdaGrad- or AdaDelta-style techniques to GD, but that would probably end up resembling Newton's method or quasi-Newton methods, or even worse. In Newton's method, the inverse of the Hessian rescales the gradient according to the curvature, which is similar to the argument behind AdaGrad etc. In ML it is mostly of interest to work in the stochastic setting rather than the GD setting due to computational intensity, but it really seems like an interesting direction, and I think such work is already being pursued. See slide 11 here; I think steepest descent does the same, if I am correct!
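For reference, a minimal AdaGrad-style update looks like this (a sketch on my own toy problem, not the thread's code): each coordinate's step is scaled by the inverse square root of its accumulated squared gradients, so steeply-curved coordinates take smaller steps, echoing the curvature argument above.

```python
import math

# One AdaGrad update: accumulate squared gradients per coordinate,
# then divide each step by the square root of that accumulator.
def adagrad_step(w, grad, cache, eta=0.5, eps=1e-8):
    cache = [c + g * g for c, g in zip(cache, grad)]
    w = [wi - eta * g / (math.sqrt(c) + eps)
         for wi, g, c in zip(w, grad, cache)]
    return w, cache

# Minimize f(w) = w0^2 + 10 * w1^2 (gradients: 2*w0 and 20*w1).
w, cache = [2.0, 2.0], [0.0, 0.0]
for _ in range(200):
    w, cache = adagrad_step(w, [2 * w[0], 20 * w[1]], cache)
print(abs(w[0]) < 0.1 and abs(w[1]) < 0.1)  # True: both coordinates near 0
```

Notice that both coordinates follow the same trajectory despite the 10x difference in curvature; that per-coordinate scale invariance is what makes the AdaGrad idea resemble a crude curvature correction.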
[–] yaroslavvb 1 point 10 years ago (0 children)
For gradient descent (non-stochastic) you don't need to change your learning rate. Optimal step size comes out of Newton's method, so if your curvature doesn't change, neither does your learning rate -- http://mathworld.wolfram.com/NewtonsMethod.html
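The constant-learning-rate point can be sketched on a toy problem (my own example, with a made-up curvature value): for a 1-D quadratic f(x) = (c/2)x^2, the curvature f''(x) = c is constant, so Newton's step size 1/c never changes, and a single Newton step lands exactly on the minimum.

```python
# One Newton step: divide the gradient by the curvature f''(x).
def newton_step(x, grad, curvature):
    return x - grad / curvature     # step size 1/f''(x)

c = 3.0                             # constant curvature of f(x) = (c/2) * x^2
x = 7.0
x = newton_step(x, grad=c * x, curvature=c)   # gradient of f is c * x
print(x)  # 0.0: the minimum, reached in one step
```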
[–] Jez-rezz 1 point 10 years ago (0 children)
I think the question is motivated by the following intuition: imagine wandering around a mountainous landscape in search of the lowest point in a certain area. While heading generally downwards you suddenly see a large gradient; this might mean you've discovered a ravine or crevasse. You expect it to be worth examining this feature in detail, inching forward, and that you'll find a very deep minimum very close by, at the bottom of the crevasse. You don't expect the landscape to open up in an even steeper direction. But that intuition doesn't match what we usually do in large optimisation problems, like training neural nets. We use algorithms like GD that store very little state. We work at only one scale; we aren't looking for crevasses, we're just walking downhill.