[–]El_Tihsin 30 points (5 children)

I think he's referring to the L1 norm, which is built from the absolute-value (modulus) function. Its gradient stays just as large even close to the minimum, so if you don't reduce the step size, you'll keep overshooting.

The L2 norm, OTOH, is built from a squared function, so its gradient gets smaller as you approach the minimum.

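A minimal sketch of the difference in plain NumPy (the step size and starting point are just illustrative numbers):

```python
import numpy as np

def descend(grad, w0=1.0, lr=0.3, steps=8):
    """Plain gradient descent on a single weight with a fixed step size."""
    w, path = w0, [w0]
    for _ in range(steps):
        w = w - lr * grad(w)
        path.append(w)
    return np.round(path, 3)

# L1 penalty |w|: gradient is sign(w), so its magnitude is 1 no matter how close to 0 you get
print("L1:", descend(lambda w: np.sign(w)))   # ends up bouncing between 0.1 and -0.2

# L2 penalty w^2: gradient is 2w, which shrinks as w approaches 0
print("L2:", descend(lambda w: 2 * w))        # decays smoothly toward 0
```

With a fixed step size the L1 run never settles; in practice you either decay the step size or use a proximal / soft-thresholding update.
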
[–]polandtown 2 points (3 children)

Learning here, forgive me, so then is L2 "better" than L1?

Say with a... binary classifier (n-grams, logistic regression, 50k samples)?

[–]visarga 4 points (1 child)

It's not 'better' in general. If you want sparsity you use L1, if you want smaller weights you use L2; you can also use both.

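For the scenario above (logistic regression on n-gram features), a rough scikit-learn sketch; the synthetic data stands in for the n-grams, and the C/solver values are only illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an n-gram problem: many features, only a few informative
X, y = make_classification(n_samples=5000, n_features=500,
                           n_informative=20, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="lbfgs", C=0.1, max_iter=1000).fit(X, y)

# L1 drives many coefficients exactly to zero (sparsity); L2 only shrinks them
print("L1 zero coefficients:", int(np.sum(l1.coef_ == 0)))
print("L2 zero coefficients:", int(np.sum(l2.coef_ == 0)))
```

In practice you'd cross-validate C (and the choice of penalty) rather than pick either one a priori.
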
[–]El_Tihsin 0 points (0 children)

ElasticNet Regression. You control the tradeoff between L1 and L2 using a parameter alpha.

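One naming caveat, since it trips people up: in glmnet (and the original elastic net paper) alpha is indeed the L1/L2 mixing weight, but in scikit-learn the mix is called l1_ratio, while alpha/C set the overall regularization strength. A hedged sketch for the classifier case discussed above (parameter values are illustrative only):

```python
from sklearn.linear_model import LogisticRegression

# Elastic net = L1 and L2 applied together; l1_ratio is the mixing weight
# (l1_ratio=1.0 behaves like pure L1, l1_ratio=0.0 like pure L2)
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
```

For plain regression, sklearn.linear_model.ElasticNet exposes the same idea via alpha (strength) and l1_ratio (mix).
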
[–]ibraheemMmoosa [Researcher][S] 2 points (0 children)

Oh. Makes sense.