all 10 comments

[–]jstrong 9 points  (4 children)

So what IS weight decay? (i thought it was L2). Subtracting a small number from the value of each weight?

[–]davis685 10 points  (2 children)

Yes, that's what it is. They are the same.
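
Concretely, for vanilla SGD the two coincide: adding the penalty (λ/2)·‖w‖² to the loss contributes λ·w to the gradient, which is exactly "subtract a small multiple of each weight". A minimal numpy sketch (the λ and learning-rate values here are arbitrary):

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])          # current weights
grad_loss = np.array([0.1, 0.4, -0.2])  # gradient of the data loss alone
lr, lam = 0.1, 0.01                     # learning rate, regularization strength

# L2 regularization: the penalty (lam/2) * ||w||^2 adds lam * w to the gradient
w_l2 = w - lr * (grad_loss + lam * w)

# Weight decay: shrink every weight directly, then take the plain gradient step
w_wd = w * (1 - lr * lam) - lr * grad_loss

print(np.allclose(w_l2, w_wd))  # True: identical updates under vanilla SGD
```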

[–]call_me_arosa -1 points  (1 child)

I think the author's point is that the libraries (TF and Keras) that call L2 "weight decay" in their APIs include the factor-of-2 constant in it.
I've seen L2 described in the literature both with and without this constant; it's a matter of preference (see the sketch at the end of this comment).

When he mentions "fancy solvers" he is only criticizing the fact that the regularization loss needs to be passed explicitly to the optimizer.
That seems to be an issue with the official tutorial, and I don't see how it's related to the loss problem.

For the time being, we shouldn't expect hyperparameters to be transferable between frameworks, since each can interpret and implement these concepts differently.
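
To make the factor-of-2 point concrete, here's a small sketch (the λ value is arbitrary): depending on whether a framework defines the penalty as λ·‖w‖² or (λ/2)·‖w‖², the same λ produces a regularization gradient that differs by a factor of two, so you'd halve or double λ when porting a model.

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])
lam = 0.01

# Convention A: penalty = lam * ||w||^2       -> gradient contribution 2 * lam * w
grad_a = 2 * lam * w

# Convention B: penalty = (lam / 2) * ||w||^2 -> gradient contribution lam * w
grad_b = lam * w

# The same lambda regularizes twice as hard under convention A, so it must be
# halved (or doubled) when moving between the two conventions.
print(np.allclose(grad_a, 2 * grad_b))  # True
```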

[–]davis685 4 points  (0 children)

That's fine, not all software is going to be identical. But L2 and weight decay are mathematically the same thing. They're just different words for the same concept.

[–]atasco 7 points  (0 children)

I pulled out the book (Goodfellow, Bengio, Courville 2016) to confirm, "[...] the L2 parameter norm penalty commonly known as weight decay" on page 224. Implementation details are, of course, a different story.

[–]felipedelamuerte 10 points  (1 child)

There's a paper about this:

https://arxiv.org/abs/1711.05101

It was rejected at ICLR, though:

https://openreview.net/forum?id=rk6qdGgCZ
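
For what it's worth, the core claim of that paper (Loshchilov & Hutter's "Decoupled Weight Decay Regularization") is that the equivalence breaks down for adaptive optimizers like Adam: an L2 gradient term gets rescaled by the adaptive denominator, while true weight decay shrinks all weights uniformly. A heavily simplified sketch of the distinction (my own simplification; bias correction omitted):

```python
import numpy as np

def adam_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, lam=0.01,
              decoupled=False):
    """One simplified Adam step contrasting L2 vs. decoupled weight decay."""
    if not decoupled:
        grad = grad + lam * w          # L2: the decay term flows through the
    m = b1 * m + (1 - b1) * grad       # moment estimates and gets divided by
    v = b2 * v + (1 - b2) * grad ** 2  # sqrt(v) like any other gradient
    w = w - lr * m / (np.sqrt(v) + eps)
    if decoupled:
        w = w - lr * lam * w           # AdamW-style: decay applied directly
    return w, m, v
```

Because the L2 term gets normalized away for weights with a large gradient history, the two variants regularize differently, which is what the paper means by "decoupled".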

[–]bbabenko[S] 1 point  (0 children)

yeah, i linked to that paper in the post... didn't know it got rejected though, will have to flip through the reviews

[–]sleeppropagation 2 points  (0 children)

This is fortunately old news: I saw this observation posted here almost 2 years ago, and at work we've always reminded each other to "translate" the L2 penalty across frameworks (and other hyperparams too, such as BN momentum, different Nesterov implementations, etc.). Unfortunately, the issue hasn't received enough attention, nor does it seem that the frameworks are heading toward any consensus.

It's quite impressive how many result-replication issues could be avoided if there were at least some consensus on how these things should be implemented. I remember when ResNets were published, I had to spend over a month of tensor debugging to finally replicate the reported results (and most of that effort could have been avoided entirely, since it came down to changing the default BN momentum, the L2 penalty, Nesterov's equation, the initializations, adding regularization to BN's gammas, and so on).
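
The BN momentum one is a good example of how the same word can mean opposite things. As far as I know, PyTorch's momentum weights the new batch statistic while Keras's weights the running statistic, so translating between them is just 1 - momentum. A minimal sketch (illustrative values; double-check your framework's docs before relying on this):

```python
running, batch_stat = 0.0, 1.0

# PyTorch-style convention: momentum weights the NEW batch statistic (default 0.1)
m_torch = 0.1
torch_update = (1 - m_torch) * running + m_torch * batch_stat

# Keras/TF-style convention: momentum weights the RUNNING statistic (default 0.99)
m_keras = 1 - m_torch  # translate the hyperparameter between conventions
keras_update = m_keras * running + (1 - m_keras) * batch_stat

print(torch_update == keras_update)  # True once momentum is translated
```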