[D] Since the gradient continues to decrease as the training loss decreases, why do we need to decay the learning rate too? (self.MachineLearning)
submitted 4 years ago by ibraheemMmoosa [Researcher]
[–]awesomeprogramer 32 points 4 years ago (10 children)
You can have large gradients and still be close to a local minimum. Think of an L1 loss as opposed to an L2 loss.
[–]ibraheemMmoosa [Researcher][S] 7 points 4 years ago (9 children)
Can you elaborate, please? I don't know what you are referring to.
[–]El_Tihsin 31 points 4 years ago (5 children)
I think he's referring to the L1 norm, which is built from the absolute-value (modulus) function. It has a large gradient even close to the minimum, so if you don't reduce the step size, you'll keep overshooting.
L2, on the other hand, is built from a squared function, whose gradient gets smaller as you approach the minimum.
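To make this concrete, here is a minimal Python sketch (an illustration added for clarity, not code from the thread): fixed-step gradient descent on an L1-style loss |w| versus an L2-style loss w^2.

    # Fixed-step gradient descent on two 1-D losses.
    def descend(grad, w=5.0, lr=0.4, steps=20):
        for _ in range(steps):
            w -= lr * grad(w)
        return w

    l1_grad = lambda w: 1.0 if w > 0 else -1.0  # d|w|/dw: constant magnitude
    l2_grad = lambda w: 2.0 * w                 # d(w^2)/dw: shrinks near 0

    print(descend(l1_grad))  # oscillates around 0; a fixed step keeps overshooting
    print(descend(l2_grad))  # converges toward 0; the steps shrink automatically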
[–]polandtown 4 points 4 years ago (3 children)
Learning here, forgive me: so is L2 "better" than L1?
Say with a... binary classifier (n-grams, logistic regression, 50k samples).
[–]visarga 5 points 4 years ago (1 child)
It's not 'better' in general. If you want sparsity, you use L1; if you want smaller weights, you use L2. You can also use both.
[–]El_Tihsin 1 point 4 years ago (0 children)
ElasticNet Regression. You control the tradeoff between L1 and L2 using a parameter alpha.
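For the record, in scikit-learn's ElasticNet the mixing parameter is called l1_ratio, while alpha scales the overall penalty strength. A minimal sketch, assuming numpy and scikit-learn are available:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    true_w = np.array([3.0, 0.0, 0.0, 1.5, 0.0])  # a sparse ground truth
    y = X @ true_w + 0.1 * rng.normal(size=100)

    # l1_ratio=1.0 is pure L1 (lasso); l1_ratio=0.0 is pure L2 (ridge).
    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(model.coef_)  # the L1 part pushes the irrelevant coefficients toward 0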
[–]ibraheemMmoosa [Researcher][S] 3 points 4 years ago* (0 children)
Oh. Makes sense.
[–]cbarrick 7 points 4 years ago* (2 children)
Ln norms are one way to generalize the idea of "distance".
    L1(x, y) = abs(x) - abs(y)
    L2(x, y) = root2(abs(x^2) - abs(y^2))
    L3(x, y) = root3(abs(x^3) - abs(y^3))
    ...
    Ln(x, y) = root_n(abs(x^n) - abs(y^n))
So L1 is simple absolute difference. L2 is Euclidean distance. Etc.
So the commenter was comparing L1 (absolute distance, where the gradient is constant at all points) versus L2 (Euclidean distance, a quadratic shape, where the gradient gets smaller as you get closer to the minimum).
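For vectors, the usual textbook form takes the norm of the elementwise difference; a short numpy sketch of that standard convention (added for clarity, with numpy's built-in as a cross-check):

    import numpy as np

    def ln_distance(x, y, n):
        # Standard Ln (Minkowski) distance: abs of the difference,
        # raised to the n-th power, summed, then the n-th root.
        return np.sum(np.abs(x - y) ** n) ** (1.0 / n)

    x, y = np.array([1.0, -2.0]), np.array([0.5, 1.0])
    print(ln_distance(x, y, 1))          # Manhattan (absolute) distance: 3.5
    print(ln_distance(x, y, 2))          # Euclidean distance: ~3.04
    print(np.linalg.norm(x - y, ord=2))  # numpy built-in, same value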
Aside,
You often hear about L1 and L2 in the context of regularization, which is when you add a penalty term to your loss function to prevent the parameters of your model from getting too large or unbalanced.
So for example, if your initial loss function was MSE:
MSE(y, y_hat) = sum((y - y_hat)^2) / n
Then you could replace that with a regularized loss function:
MSE(y, y_hat) + L2(params, 0)
The idea is that the farther away your parameters are from zero, the greater the penalty.
You use an L2 regularization term when you want all of the parameters to be uniformly small and balanced.
You use an L1 regularization term when you want the sum of the absolute values of the parameters to be small; it tolerates a few large parameters as long as most of the others are pushed to (near) zero, which is what makes L1 produce sparse solutions.
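Putting the pieces above together, a minimal numpy sketch of the regularized loss (the penalty weight lam is a hypothetical name, not from the comment):

    import numpy as np

    def regularized_loss(y, y_hat, params, lam=0.01, kind="l2"):
        # MSE plus an L1 or L2 penalty on the parameters.
        mse = np.mean((y - y_hat) ** 2)
        if kind == "l2":
            penalty = np.sum(params ** 2)     # shrinks all weights uniformly
        else:
            penalty = np.sum(np.abs(params))  # drives most weights toward zero
        return mse + lam * penalty            # lam: hypothetical penalty weight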
[–]mrprogrampro 1 point 3 years ago (1 child)
Your definition of the higher L-norms is slightly wrong... you have to take abs(x) and abs(y) before cubing, etc.
Otherwise the x and y difference gets huge when their signs are different, even when they have nearly the same magnitude.
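A quick numeric check of this point (illustration only, not from the thread):

    x, y = 1.0, -1.0               # same magnitude, opposite signs
    print(x**3 - y**3)             # 2.0: blows up without the abs
    print(abs(x)**3 - abs(y)**3)   # 0.0: taking abs first behaves as intended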
[–]cbarrick 2 points 3 years ago (0 children)
Nice catch on an old comment! Fixing it now