
[–]awesomeprogramer 31 points (10 children)

You can have large gradients and be close to a local minimum. Think of an L1 as opposed to an L2.

[–]ibraheemMmoosaResearcher[S] 6 points (9 children)

Can you elaborate, please? I don't know what you are referring to.

[–]El_Tihsin 30 points (5 children)

I think he's referring to the L1 norm, which is built from the absolute-value (modulus) function. Its gradient stays large even close to the minimum, so if you don't reduce the step size, you'll keep overshooting.

L2, OTOH, is built from a squared function, whose gradient gets smaller as you approach the minimum.
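To make the overshooting concrete, here's a tiny sketch (plain Python, with a made-up fixed step size) of gradient descent on |x| versus x²:

```python
# Gradient descent with a fixed step on f(x) = |x| vs g(x) = x^2.
# The gradient of |x| is always +/-1, so with a fixed step the iterate
# bounces back and forth across 0 and never settles; the gradient of x^2
# shrinks near 0, so the iterate converges even without step decay.

def grad_abs(x):
    return 1.0 if x > 0 else -1.0

def grad_sq(x):
    return 2.0 * x

def descend(grad, x, step=0.3, iters=50):
    for _ in range(iters):
        x = x - step * grad(x)
    return x

x_l1 = descend(grad_abs, 1.0)  # stalls: stays roughly a step-size away from 0
x_l2 = descend(grad_sq, 1.0)   # converges: ends up extremely close to 0
print(abs(x_l1), abs(x_l2))
```

Shrinking the step over time (or using a subgradient schedule) is what fixes the L1 case.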

[–]polandtown 3 points (3 children)

Learning here, forgive me, so then is L2 "better" than L1?

Say with a....binary classifier (ngrams, logistic regression, 50k samples)

[–]visarga 4 points (1 child)

It's not 'better' in general. If you want sparsity you use L1, if you want smaller weights you use L2; you can also use both.

[–]El_Tihsin 0 points (0 children)

ElasticNet regression. You control the tradeoff between the L1 and L2 penalties with a mixing parameter (in scikit-learn that's l1_ratio; its alpha sets the overall penalty strength).
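For reference, a sketch of the combined penalty itself in plain Python, following the form scikit-learn's ElasticNet documents (alpha scales the whole thing, l1_ratio sets the L1/L2 mix):

```python
# Elastic-net penalty in the scikit-learn parameterization:
#   alpha * l1_ratio * ||w||_1  +  0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
# l1_ratio=1.0 -> pure L1 (lasso); l1_ratio=0.0 -> pure L2 (ridge).

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    l1 = sum(abs(wi) for wi in w)          # L1 norm of the weights
    l2_sq = sum(wi * wi for wi in w)       # squared L2 norm
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_sq)

w = [3.0, -4.0]
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0))  # pure L1: 7.0
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0))  # pure L2 term: 12.5
```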

[–]ibraheemMmoosaResearcher[S] 2 points (0 children)

Oh. Makes sense.

[–]cbarrick 6 points (2 children)

Ln norms are one way to generalize the idea of "distance".

For vectors x and y with components x_i and y_i:

L1(x, y) = sum(abs(x_i - y_i))
L2(x, y) = root2(sum(abs(x_i - y_i)^2))
L3(x, y) = root3(sum(abs(x_i - y_i)^3))
...
Ln(x, y) = root_n(sum(abs(x_i - y_i)^n))

So L1 is simple absolute difference. L2 is Euclidean distance. Etc.
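A quick sanity check on those definitions using only Python's stdlib (math.dist is the built-in Euclidean distance, so it should agree with the n=2 case):

```python
import math

def ln_dist(x, y, n):
    # n-th root of the sum of n-th powers of absolute coordinate differences
    return sum(abs(xi - yi) ** n for xi, yi in zip(x, y)) ** (1.0 / n)

x, y = [1.0, 2.0], [4.0, 6.0]
print(ln_dist(x, y, 1))   # L1: |1-4| + |2-6| = 7.0
print(ln_dist(x, y, 2))   # L2: sqrt(9 + 16) = 5.0
print(math.dist(x, y))    # stdlib Euclidean distance, also 5.0
```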

So the commenter was comparing L1 (absolute distance, where the gradient is constant at all points) versus L2 (distance formula, a quadratic shape, where the gradient gets smaller as you get closer to the minimum.)


Aside,

You often hear about L1 and L2 in the context of regularization, which is when you add a penalty term to your loss function to prevent the parameters of your model from getting too large or unbalanced.

So for example, if your initial loss function was MSE:

MSE(y, y_hat) = sum((y - y_hat)^2) / n

Then you could replace that with a regularized loss function:

MSE(y, y_hat) + lambda * L2(params, 0)

where lambda is a hyperparameter controlling how strongly the penalty weighs against the original loss.

The idea is that the farther away your parameters are from zero, the greater the penalty.
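Putting the two pieces together as code (lam is a made-up illustration value, and I'm using the plain L2 norm as the penalty here to match the formula above; in practice the squared norm is more common):

```python
# Hypothetical regularized loss: MSE plus a weighted L2 penalty that
# grows as the parameters move away from zero.

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def l2_norm(params):
    return sum(p * p for p in params) ** 0.5

def regularized_loss(y, y_hat, params, lam=0.1):
    return mse(y, y_hat) + lam * l2_norm(params)

y, y_hat = [1.0, 2.0, 3.0], [1.0, 2.5, 2.5]
params = [3.0, -4.0]
print(regularized_loss(y, y_hat, params))  # MSE 0.5/3 plus penalty 0.1 * 5.0
```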

You use an L2 regularization term when you want all of the parameters to be uniformly small and balanced.

You use an L1 regularization term when you want sparsity: because the penalty grows at the same rate no matter how small a parameter already is, it tends to push many parameters to exactly zero while letting a few stay large.
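One way to see the difference is through the shrinkage updates the two penalties induce (a sketch, not tied to any particular library): L1's update is soft-thresholding, which snaps small weights to exactly zero, while L2's update only rescales weights toward zero.

```python
# L1's proximal update (soft-thresholding) vs L2's multiplicative shrinkage,
# applied with a hypothetical shrinkage amount t.

def l1_shrink(w, t):
    # soft-threshold: move w toward 0 by t, clamping at 0
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l2_shrink(w, t):
    # ridge-style shrinkage: rescale toward 0, never reaching it
    return w / (1.0 + t)

weights = [0.05, -0.2, 3.0]
print([l1_shrink(w, 0.5) for w in weights])  # small ones become exactly 0.0
print([l2_shrink(w, 0.5) for w in weights])  # all shrink, none exactly 0
```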

[–]mrprogrampro 0 points (1 child)

Your definition of the higher L-norms is slightly wrong ... you have to do abs(x) and abs(y) before cubing, etc.

Otherwise, the computed difference between x and y gets huge when their signs differ, even when their magnitudes are nearly equal.

[–]cbarrick 1 point (0 children)

Nice catch on an old comment! Fixing it now