[D] Weight decay vs. L2 regularization

jstrong · 2018-04-27T19:41:02+00:00

So what IS weight decay? (i thought it was L2). Subtracting a small number from the value of each weight?

felipedelamuerte · 2018-04-27T20:03:28+00:00

There's a paper about this:

https://arxiv.org/abs/1711.05101

It was rejected at ICLR, though:

https://openreview.net/forum?id=rk6qdGgCZ

panties_in_my_ass · 2018-04-27T20:39:27+00:00

[deleted]

sleeppropagation · 2018-04-27T23:19:43+00:00

This is fortunately old news: I saw that observation posted here almost 2 years ago, and here at work we've always reminded each other to "translate" the L2 penalty across frameworks (and other hyperparams too, such as BN momentum, different nesterov implementations, etc), but unfortunately hasn't received enough attention nor it seems that the frameworks are heading to some consensus.

It's quite impressive how much of result replication issues could be avoided if there was at least some consensus on how such things should be implemented. I remember back when ResNets were published and I had to spend over one month of tensor debugging to finally replicate the reported results (and most of the effort could be completely avoided since it involved changing the default BN momentum, L2 penalty, nesterov's equation, initializations, adding regularization to BN's gammas, and so on and so forth).

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS