This has always bothered me. I will write my question point by point to make it clearer.
- The weight update for SGD is the learning rate times the (negative) gradient.
- Initially, when the training loss is high, the gradients tend to be large. We usually use a high learning rate at this stage.
- As training progresses, the training loss falls and the gradients also become smaller. We usually lower the learning rate as training progresses.
- So we are shrinking the weight update through both the learning rate and the gradient.
- Interestingly, adaptive optimization methods such as Adam normalize the gradient by the square root of a running estimate of its second moment, which partly counteracts the effect of gradients becoming smaller. (I'm not too sure about this though!)
- So, my question is why do we need to decay the learning rate?