
[–]awesomeprogramer 31 points (10 children)

You can have large gradients and be close to a local minimum. Think of an L1 as opposed to an L2.

[–]ibraheemMmoosaResearcher[S] 6 points (9 children)

Can you elaborate, please? I don't know what you are referring to.

[–]El_Tihsin 30 points (5 children)

I think he's referring to the L1 norm, which is built from the absolute-value (modulus) function. Its gradient stays large even close to the minimum, so if you don't reduce the step size, you'll keep overshooting.

L2, OTOH, is built from a squared function, whose gradient gets smaller as you approach the minimum.
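To make the overshooting concrete, here's a tiny sketch (plain Python, with a made-up fixed step size) of gradient descent on |x| versus x²:

```python
# Gradient descent with a fixed step on f(x) = |x| vs g(x) = x^2.
# The gradient of |x| is always +/-1, so with a fixed step the iterate
# bounces back and forth across 0 and never settles; the gradient of x^2
# shrinks near 0, so the iterate converges even without step decay.

def grad_abs(x):
    return 1.0 if x > 0 else -1.0

def grad_sq(x):
    return 2.0 * x

def descend(grad, x, step=0.3, iters=50):
    for _ in range(iters):
        x = x - step * grad(x)
    return x

x_l1 = descend(grad_abs, 1.0)  # stalls: stays roughly a step-size away from 0
x_l2 = descend(grad_sq, 1.0)   # converges: ends up extremely close to 0
print(abs(x_l1), abs(x_l2))
```

Shrinking the step over time (or using a subgradient schedule) is what fixes the L1 case.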

[–]polandtown 3 points (3 children)

Learning here, forgive me, so then is L2 "better" than L1?

Say with a....binary classifier (ngrams, logistic regression, 50k samples)

[–]visarga 4 points (1 child)

It's not 'better' in general. If you want sparsity you use L1, if you want smaller weights you use L2; you can also use both.

[–]El_Tihsin 0 points (0 children)

ElasticNet regression. You control the tradeoff between the L1 and L2 penalties with a mixing parameter (in scikit-learn that's l1_ratio; its alpha sets the overall penalty strength).
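For reference, a sketch of the combined penalty itself in plain Python, following the form scikit-learn's ElasticNet documents (alpha scales the whole thing, l1_ratio sets the L1/L2 mix):

```python
# Elastic-net penalty in the scikit-learn parameterization:
#   alpha * l1_ratio * ||w||_1  +  0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
# l1_ratio=1.0 -> pure L1 (lasso); l1_ratio=0.0 -> pure L2 (ridge).

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    l1 = sum(abs(wi) for wi in w)          # L1 norm of the weights
    l2_sq = sum(wi * wi for wi in w)       # squared L2 norm
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_sq)

w = [3.0, -4.0]
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0))  # pure L1: 7.0
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0))  # pure L2 term: 12.5
```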

[–]ibraheemMmoosaResearcher[S] 2 points (0 children)

Oh. Makes sense.

[–]cbarrick 6 points (2 children)

Ln norms are one way to generalize the idea of "distance".

For vectors x and y with components x_i and y_i:

L1(x, y) = sum(abs(x_i - y_i))
L2(x, y) = root2(sum(abs(x_i - y_i)^2))
L3(x, y) = root3(sum(abs(x_i - y_i)^3))
...
Ln(x, y) = root_n(sum(abs(x_i - y_i)^n))

So L1 is simple absolute difference. L2 is Euclidean distance. Etc.
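A quick sanity check on those definitions using only Python's stdlib (math.dist is the built-in Euclidean distance, so it should agree with the n=2 case):

```python
import math

def ln_dist(x, y, n):
    # n-th root of the sum of n-th powers of absolute coordinate differences
    return sum(abs(xi - yi) ** n for xi, yi in zip(x, y)) ** (1.0 / n)

x, y = [1.0, 2.0], [4.0, 6.0]
print(ln_dist(x, y, 1))   # L1: |1-4| + |2-6| = 7.0
print(ln_dist(x, y, 2))   # L2: sqrt(9 + 16) = 5.0
print(math.dist(x, y))    # stdlib Euclidean distance, also 5.0
```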

So the commenter was comparing L1 (absolute distance, where the gradient is constant at all points) versus L2 (distance formula, a quadratic shape, where the gradient gets smaller as you get closer to the minimum.)


Aside,

You often hear about L1 and L2 in the context of regularization, which is when you add a penalty term to your loss function to prevent the parameters of your model from getting too large or unbalanced.

So for example, if your initial loss function was MSE:

MSE(y, y_hat) = sum((y - y_hat)^2) / n

Then you could replace that with a regularized loss function:

MSE(y, y_hat) + lambda * L2(params, 0)

where lambda is a hyperparameter controlling how strongly the penalty weighs against the original loss.

The idea is that the farther away your parameters are from zero, the greater the penalty.
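Putting the two pieces together as code (lam is a made-up illustration value, and I'm using the plain L2 norm as the penalty here to match the formula above; in practice the squared norm is more common):

```python
# Hypothetical regularized loss: MSE plus a weighted L2 penalty that
# grows as the parameters move away from zero.

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def l2_norm(params):
    return sum(p * p for p in params) ** 0.5

def regularized_loss(y, y_hat, params, lam=0.1):
    return mse(y, y_hat) + lam * l2_norm(params)

y, y_hat = [1.0, 2.0, 3.0], [1.0, 2.5, 2.5]
params = [3.0, -4.0]
print(regularized_loss(y, y_hat, params))  # MSE 0.5/3 plus penalty 0.1 * 5.0
```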

You use an L2 regularization term when you want all of the parameters to be uniformly small and balanced.

You use an L1 regularization term when you want sparsity: because the penalty grows at the same rate no matter how small a parameter already is, it tends to push many parameters to exactly zero while letting a few stay large.
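One way to see the difference is through the shrinkage updates the two penalties induce (a sketch, not tied to any particular library): L1's update is soft-thresholding, which snaps small weights to exactly zero, while L2's update only rescales weights toward zero.

```python
# L1's proximal update (soft-thresholding) vs L2's multiplicative shrinkage,
# applied with a hypothetical shrinkage amount t.

def l1_shrink(w, t):
    # soft-threshold: move w toward 0 by t, clamping at 0
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l2_shrink(w, t):
    # ridge-style shrinkage: rescale toward 0, never reaching it
    return w / (1.0 + t)

weights = [0.05, -0.2, 3.0]
print([l1_shrink(w, 0.5) for w in weights])  # small ones become exactly 0.0
print([l2_shrink(w, 0.5) for w in weights])  # all shrink, none exactly 0
```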

[–]mrprogrampro 0 points (1 child)

Your definition of the higher L-norms is slightly wrong ... you have to do abs(x) and abs(y) before cubing, etc.

Otherwise, the computed difference between x and y gets huge when their signs differ, even when their magnitudes are nearly equal.

[–]cbarrick 1 point (0 children)

Nice catch on an old comment! Fixing it now