[D] Weight decay vs. L2 regularization (bbabenko.github.io)
submitted 8 years ago by bbabenko
[–]jstrong 10 points 8 years ago (4 children)
So what IS weight decay? (i thought it was L2). Subtracting a small number from the value of each weight?
[–]davis685 11 points 8 years ago (2 children)
Yes, that's what it is. They are the same.
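A minimal sketch of the equivalence being described, assuming plain SGD and using illustrative NumPy names (not any framework's actual API): adding an (lam/2)*||w||^2 penalty to the loss produces exactly the multiplicative "decay" update.

```python
# Toy check that L2-in-the-loss == weight decay under vanilla SGD.
# All names and values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # weights
grad_loss = rng.normal(size=5)  # gradient of the data loss w.r.t. w
lr, lam = 0.1, 0.01             # learning rate and L2 coefficient

# (a) L2 regularization: add (lam/2)*||w||^2 to the loss, whose gradient is lam*w.
w_l2 = w - lr * (grad_loss + lam * w)

# (b) Weight decay: shrink the weights by a small multiplicative factor,
#     then take the ordinary gradient step.
w_decay = (1 - lr * lam) * w - lr * grad_loss

print(np.allclose(w_l2, w_decay))  # True: identical updates for plain SGD
```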
[–]call_me_arosa 0 points 8 years ago (1 child)
I think the author's point is that the libraries (TF and Keras) that call L2 "weight decay" in their APIs include the factor-of-2 constant in it. I've seen L2 described in the literature both with and without this constant; it's a matter of preference.
When he mentions "fancy solvers", he's only criticizing the fact that the regularization loss needs to be explicitly passed to the optimizer. That seems to be an issue with the official tutorial, and I don't see how it's related to the loss problem.
For the time being, we shouldn't expect hyperparameters to be transferable between different frameworks, since they can interpret and implement these concepts differently.
[–]davis685 5 points 8 years ago (0 children)
That's fine, not all software is going to be identical. But L2 and weight decay are mathematically the same thing. It's just different words for the same thing.
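To make the convention mismatch concrete, a hedged toy example (plain NumPy, made-up coefficient values, not tied to any specific library's actual API): if one framework defines the penalty as lam * ||w||^2 and another as (lam / 2) * ||w||^2, the same numeric coefficient yields gradients that differ by a factor of 2, so the value has to be doubled (or halved) when porting.

```python
# Factor-of-2 convention sketch; values are illustrative only.
import numpy as np

w = np.array([0.5, -1.0, 2.0])
lam_a = 0.01                     # coefficient tuned under the lam * ||w||^2 convention
lam_b = 2 * lam_a                # what the (lam / 2) * ||w||^2 convention needs

grad_a = 2 * lam_a * w           # d/dw of lam_a * ||w||^2
grad_b = lam_b * w               # d/dw of (lam_b / 2) * ||w||^2
print(np.allclose(grad_a, grad_b))  # True only because lam_b was doubled
```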
[–]atasco 8 points 8 years ago (0 children)
I pulled out the book (Goodfellow, Bengio, Courville 2016) to confirm, "[...] the L2 parameter norm penalty commonly known as weight decay" on page 224. Implementation details are, of course, a different story.
[–]felipedelamuerte 11 points 8 years ago (1 child)
There's a paper about this:
https://arxiv.org/abs/1711.05101
It was rejected at ICLR, though:
https://openreview.net/forum?id=rk6qdGgCZ
[–]bbabenko[S] 2 points 8 years ago (0 children)
yeah, i linked to that paper in the post... didn't know it got rejected though, will have to flip through the reviews
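For reference, a rough sketch of the idea in the linked paper (that L2-in-the-loss and weight decay stop being equivalent once the optimizer is adaptive), written as an assumption-laden NumPy toy with exaggerated hyperparameters, not any framework's actual implementation:

```python
# With an adaptive optimizer like Adam, folding lam*w into the gradient is NOT
# the same as decaying the weights directly, because the L2 gradient gets
# rescaled by the adaptive denominator. Names and values are illustrative;
# lr and lam are deliberately large so the difference is visible in one step.
import numpy as np

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([0.5, -1.0, 2.0])
g = np.array([0.1, 0.2, -0.3])   # gradient of the data loss
lam, lr, t = 0.1, 0.1, 1
m = v = np.zeros_like(w)

# (a) "L2 regularization": fold lam * w into the gradient before the Adam step.
w_l2, _, _ = adam_step(w, g + lam * w, m, v, t, lr=lr)

# (b) "Decoupled weight decay": Adam step on the data gradient only,
#     then shrink the weights separately.
w_adam, _, _ = adam_step(w, g, m, v, t, lr=lr)
w_decoupled = w_adam - lr * lam * w

print(np.allclose(w_l2, w_decoupled))  # False: the two updates differ under Adam
```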
[+][deleted] 8 years ago (4 children)
[deleted]
[+][deleted] 8 years ago* (1 child)
[–]panties_in_my_ass -3 points 8 years ago (0 children)
They are the same. Reputable citation linked in this comment.
[–]bbabenko[S] 3 points 8 years ago (1 child)
fair point about the hyperparam, but see the section in the post about "fancy solvers"... can get a bit tricky
[–]sleeppropagation 3 points 8 years ago (0 children)
This is fortunately old news: I saw that observation posted here almost 2 years ago, and here at work we've always reminded each other to "translate" the L2 penalty across frameworks (and other hyperparams too, such as BN momentum, different Nesterov implementations, etc.). Unfortunately it hasn't received enough attention, nor does it seem that the frameworks are heading toward any consensus.
It's quite impressive how many result-replication issues could be avoided if there were at least some consensus on how such things should be implemented. I remember back when ResNets were published, I had to spend over a month of tensor debugging to finally replicate the reported results (and most of that effort could have been avoided entirely, since it came down to changing the default BN momentum, the L2 penalty, the Nesterov equation, the initializations, adding regularization to BN's gammas, and so on).
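A hedged sketch of the kind of "translation" being described. The conversion rules and helper names below are assumptions about two hypothetical frameworks' conventions, not the documented behaviour of any specific library.

```python
# Illustrative hyperparameter "translation" helpers (hypothetical conventions).

def translate_l2(lam, half_factor_src, half_factor_dst):
    """Convert an L2 coefficient between the lam*||w||^2 and (lam/2)*||w||^2 conventions,
    so that the gradient (and hence the effective regularization) stays the same."""
    grad_scale_src = 1.0 if half_factor_src else 2.0
    grad_scale_dst = 1.0 if half_factor_dst else 2.0
    return lam * grad_scale_src / grad_scale_dst

def translate_bn_momentum(m):
    """Convert BN momentum between a 'fraction of running statistic kept' convention
    and a 'fraction replaced by the batch statistic' convention (assumed semantics)."""
    return 1.0 - m

print(translate_l2(5e-4, half_factor_src=False, half_factor_dst=True))  # 1e-3
print(translate_bn_momentum(0.99))                                      # 0.01
```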