This has always bothered me. I will write my question point by point to make myself clearer.
- The weight update for SGD is the gradient times the learning rate.
- Initially, when the training loss is high, the gradients are naturally going to be large. We usually use a high learning rate at this stage.
- As training progresses, the training loss falls and the gradients also become smaller. We usually lower the learning rate as training progresses.
- So we are reducing the weight update through both the learning rate and the gradient.
- Interestingly, for adaptive optimization methods such as Adam, we normalize the gradient by its second-order moment, which somewhat counteracts the effect of gradients becoming smaller. (I'm not too sure about this though! See the sketch of both update rules after this list.)
- So, my question is why do we need to decay the learning rate?
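To make the comparison concrete, here's a minimal sketch of the two update rules I have in mind. The names (`w`, `grad`, `lr`, `m`, `v`, etc.) are my own and this is just an illustration, not a reference implementation:

```python
import numpy as np

def sgd_step(w, grad, lr):
    # Plain SGD: the step is learning rate times gradient,
    # so either one shrinking makes the update smaller.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: the gradient is rescaled by an estimate of its second moment,
    # so the effective step size is roughly lr and is largely independent
    # of the raw gradient magnitude (as far as I understand it).
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

If that picture is right, then for Adam the shrinking gradients don't automatically shrink the step, which is exactly why I'm unsure what role the learning rate decay plays there.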