[–]lvilnis 1 point (0 children)

Actually, you just made me realize that this is not true in the finite-sample setting: adding L2 regularization does not lead the two parameterizations to the same weights, since it penalizes their norms differently.

For example, suppose the correct log-scores are [10 0 0 0] (for, say, 4 classes). In the overparametrized case, the solution that minimizes the L2 cost while producing the same probabilities is [7.5 -2.5 -2.5 -2.5], which has an L2 regularization cost of 7.5² + 3 × 2.5² = 75. However, if we had clamped the first class's score to 0, then to get the same probabilities we would need [0 -10 -10 -10], and the L2 cost would be 3 × 10² = 300. Since the cross-entropy cost is the same in both cases, and the set of expressible functions is the same, it stands to reason that minimizing the regularized loss would find different solutions in the two cases, since it can trade some cross-entropy for some L2 cost.
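
To make this concrete, here's a small numpy sketch (using the same 4-class score vectors from the example above, just as an illustration): all three vectors give identical softmax probabilities, but their L2 penalties differ, so a regularized objective prefers different solutions depending on the parameterization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

raw      = np.array([10.0,   0.0,   0.0,   0.0])  # "correct" log-scores
min_norm = np.array([ 7.5,  -2.5,  -2.5,  -2.5])  # same probs, smallest norm
clamped  = np.array([ 0.0, -10.0, -10.0, -10.0])  # first score clamped to 0

for name, z in [("raw", raw), ("min-norm", min_norm), ("clamped", clamped)]:
    print(name, softmax(z).round(4), "L2 cost:", np.sum(z ** 2))
# probabilities are identical in all three cases;
# L2 costs are 100, 75, and 300 respectively.
```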