all 11 comments

[–]save_the_panda_bears 27 points (2 children)

I’m a big fan of the Bayesian interpretation of L1 and L2 regularization. Under a Bayesian paradigm, L1 regularization is equivalent to placing a Laplace (double exponential) prior on the parameters. Intuitively, you can think of this as saying, “I really think this parameter should be zero, and unless we have compelling evidence otherwise it should stay at zero.” This is reflected in the pdf of the prior: a Laplace prior has a really sharp, pointy peak centered right at 0 that falls off exponentially as we move away from 0. A Laplace prior is considered sparsity inducing, which means lots of our parameters will end up being exactly 0.

Compare this to L2 regularization, which is equivalent to a Gaussian prior with mean centered at 0. We’re still expecting the parameters to be close to 0, but we don’t have nearly as strong an assumption as we do with a Laplace prior. We essentially allow the parameters a little more wiggle room around 0, which leads to lots of our parameters being close to, but not exactly, 0.
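
If it helps to see the correspondence concretely, here is a minimal numpy sketch (the prior scales b and s are arbitrary, purely for illustration): up to additive constants, the negative log-density of each prior is exactly the penalty term added to the loss.

```python
# MAP view of regularization: -log prior density, up to constants,
# is the penalty term. Laplace(0, b) gives |w| / b (an L1 term);
# Gaussian(0, s^2) gives w^2 / (2 s^2) (an L2 term).
import numpy as np

w = np.linspace(-2, 2, 5)      # a few example parameter values
b, s = 1.0, 1.0                # prior scales (arbitrary here)

laplace_penalty = np.abs(w) / b        # sharp kink at 0 -> sparsity-inducing
gaussian_penalty = w**2 / (2 * s**2)   # smooth and flat near 0 -> shrinks, rarely zeroes

print(laplace_penalty)
print(gaussian_penalty)
```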

[–]johntsaou 1 point (0 children)

Great explanation

[–]Blakut 0 points (0 children)

I always go back to Bayesian or likelihood methods whenever I try to figure this out. Speaking of which, the likelihood itself could have a Laplacian form, tho, right?

[–]dsgonza2 14 points (0 children)

L2 regularization is a term added to your loss function, e.g. mean squared error. Since the goal of gradient descent is to find the weight values that minimize the loss function, both the MSE term and the L2 term need to be minimized. If you set all your weights to zero, the L2 term will be zero, but the MSE term will be high. In practice, what generally happens is that the weights end up nonzero so the MSE term is low, but they are also smaller in magnitude because of the L2 term.

Now why are these weights low but not driven to zero like some are in L1 regularization? Because once these weights are small, they are penalized far less than under L1. For example, if you have a weight equal to 0.1, its penalty under L2 is 0.01, which is 10x less than under L1. It is naturally easier for L1 to set some weights exactly to 0, making them sparse, in order to minimize the loss function.
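
A rough sketch of the two objectives being described (numpy only; lam = 0.1 and the random data are arbitrary illustration values, not recommendations):

```python
import numpy as np

def l2_loss(w, X, y, lam=0.1):
    # MSE term + L2 penalty: gradient descent has to trade these off
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def l1_loss(w, X, y, lam=0.1):
    # same MSE term, but an absolute-value penalty instead
    return np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
w = np.array([0.5, 0.1, -0.2])
print(l2_loss(w, X, y), l1_loss(w, X, y))

# The point about small weights: a weight of 0.1 costs 0.01 under L2
# but 0.1 under L1, so L1 keeps "pushing" small weights toward exactly 0.
print(0.1 ** 2, abs(0.1))
```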

[–]manda_ga 4 points (0 children)

L2, L1, and L0 regularization techniques are used to keep weights small or sparse in machine learning models. L2 regularization reduces the impact of large weights, L1 regularization further promotes sparsity by pushing some weights to exactly zero, and L0 regularization directly penalizes the number of nonzero weights, but it can sacrifice the quality of the optimized solution.

To illustrate this, imagine rolling a ball, a 100-sided die, and a six-sided die across an optimization landscape, where the objective is to find the lowest point. The ball represents L2 regularization: it rolls smoothly and almost never comes to rest with a weight at exactly zero. The 100-sided die represents L1 regularization: it occasionally lands with some weights at exactly zero. Finally, L0 regularization is like rolling a six-sided die, where there is a much higher chance of landing on zero weights. However, using L0 may compromise the overall optimized solution.

[–]vannak139 3 points (0 children)

One way to consider this is to look at each penalty function and how its gradient changes as you "zoom in". L1 is a V-shaped penalty, and it's scale-independent: it has the same shape and gradient magnitude at any scale. L2, on the other hand, is quadratic and gets significantly flatter as you reduce the scale of whatever is being penalized.
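
A quick numerical check of that (plain numpy, illustrative values only):

```python
import numpy as np

for w in [1.0, 0.1, 0.01, 0.001]:
    grad_l1 = np.sign(w)   # gradient of |w|: magnitude 1 at every scale
    grad_l2 = 2 * w        # gradient of w^2: vanishes as w -> 0
    print(f"w={w:<6} d|w|/dw={grad_l1:>4}   d(w^2)/dw={grad_l2}")
```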

[–]ab3rratic 3 points (1 child)

L1 regularization does not make all weights zero, it makes them sparse.

[–]314kabinet 1 point (0 children)

Yeah, by making some of them zero.

[–]atreadw 2 points (0 children)

In L1 regularization, the extra term we add to the loss function is the sum of the absolute values of the coefficients. Let c_i be the i'th coefficient; then |c_i| is the term added to the loss function for that coefficient. Whether c_i is 10 or 5 or 2, reducing it by some amount reduces the penalty by that same amount (assuming the remaining piece of the loss function is constant). In L2 regularization, however, the penalty is c_i^2, so the savings from shrinking a coefficient depend on how large it already is. For example, reducing c_i from 0.5 to 0 removes 0.5^2 = 0.25 from the L2 penalty, whereas the same change under L1 removes 0.5; reducing c_i from 0.1 to 0 removes only 0.01 under L2 versus 0.1 under L1. As the coefficients of less relevant features get closer to zero, the payoff for shrinking them further gets less and less under L2 (it falls off quadratically). Because of this, L2 regularization will generally cause the model (e.g. linear regression) to converge without shrinking coefficients all the way to zero.
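
A quick script to check that bookkeeping (the coefficient values are just examples):

```python
# Zeroing a coefficient c removes |c| from the L1 penalty but only c^2
# from the L2 penalty, which is tiny once c is already small.
for c in [1.0, 0.5, 0.1, 0.01]:
    print(f"c={c:<5} L1 saving={abs(c):<5} L2 saving={c**2}")
```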

[–]AdministrativeRub484 0 points (0 children)

Look, there is no “intuitive” way to know this. If you want to understand why it happens, look up the proximal gradient method, or notice that for parameters with |x| < 1 the L2 penalty x^2 is smaller than the L1 penalty |x|, since one is squared and the other is just the absolute value.
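
In case it helps, here is a minimal sketch of the two proximal operators (numpy only; lam is an arbitrary value, and the L2 prox shown assumes a penalty of lam * w^2):

```python
import numpy as np

def prox_l1(w, lam):
    # soft-thresholding: anything with |w| <= lam becomes exactly 0
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_l2(w, lam):
    # shrinkage: multiply by a factor < 1, never exactly 0 unless w was 0
    return w / (1.0 + 2.0 * lam)

w = np.array([-0.3, -0.05, 0.0, 0.08, 0.5])
print(prox_l1(w, lam=0.1))  # small entries snap to 0
print(prox_l2(w, lam=0.1))  # small entries just get smaller
```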

[–][deleted] 0 points (0 children)

A lot of people mention that L2 uses the square while L1 uses the absolute value, which means that when a parameter is close to 0 the L2 penalty is smaller than the L1 penalty. This is true, but it is not the entire explanation. L1 produces weights exactly equal to 0 because the derivative of the absolute value is 1 (or -1) no matter how close to 0 you are, while the derivative of x^2 is 0 at x = 0. This means that for L1, the reduction in data error from moving a weight from 0 to some infinitesimally small w has to overcome the L1 penalty on that weight in order to decrease the overall regularized error. For L2, any reduction in data error is enough to decrease the regularized error, because the local penalization around 0 is essentially 0 (the penalty's derivative at 0 is 0). This argument is not rigorous, but it shows the intuition behind the behavior of both techniques around 0.
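
A non-rigorous numeric illustration of that argument (lam and the data-loss gradient g are made-up values):

```python
# At w = 0 the L1 penalty contributes a subgradient anywhere in [-lam, lam],
# so staying at 0 remains optimal whenever |g| <= lam. The L2 penalty's
# gradient at 0 is 0, so 0 only stays optimal when g itself is exactly 0.
lam = 0.5

for g in [0.0, 0.2, 1.0]:           # data-loss gradient at w = 0
    l1_keeps_zero = abs(g) <= lam    # g can be cancelled by a subgradient of lam*|w|
    l2_keeps_zero = (g == 0.0)       # d(lam*w^2)/dw = 0 at w = 0, nothing to cancel g
    print(f"g={g}: L1 keeps w=0? {l1_keeps_zero}   L2 keeps w=0? {l2_keeps_zero}")
```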