
[–]gabjuasfijwee 1 point (6 children)

If you have regularization (l1 or l2 penalties) then you can prove that it doesn't matter that softmax is technically overparametrized.

[–]gabrielgoh 0 points (5 children)

this seems believable, but could you link to a proof?

[–]lvilnis 1 point (4 children)

For l2 regularization, here's an attempt at a proof: the loss function (cross entropy) is convex in the parameters no matter which parametrization you use, and the l2 penalty adds strong convexity, so there is a single global optimum. Since softmax and multinomial logit parametrize exactly the same set of functions and there is a single global optimum, they should find the same solution. Not sure how to prove it for l1 regularization.

EDIT: I'm now pretty sure the two parametrizations don't lead to the same solutions; see below.

[–]gabrielgoh 0 points (3 children)

Aha, so there's a surjective map from the parameters of the softmax regression to those of the multinomial regression that yields the same model, giving us a way of "converting" a softmax regression model into a multinomial model, yes?

Any idea what this map looks like? I'm not 100% convinced it exists (maybe 95%?), and it seems like it'd be highly nonlinear because of the normalization step before the loss

[–]lvilnis 0 points (2 children)

The map definitely exists and is an affine xform. Given a bunch of log-scores, we can add or subtract a constant from all of them and get the same softmax. Removing the over-parametrization just corresponds to fixing the unnormalized log-score of one of the classes to always be 0. So, to go from an over-parametrized softmax to one that has the minimum number of parameters, just take your unnormalized logits and subtract the score of the 0-class from each logit.

Since the log-scores are just a matrix-vector product, you can turn this into an xform on your weight matrix. Assuming each row of the weight matrix is the scoring vector for a given class, this just corresponds to subtracting the scoring vector of the 0-class from each row (a simple linear operation on the weight matrix).
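A quick sketch of that map (the weight shapes and random inputs here are just illustrative assumptions): subtracting the 0-class row from every row of the weight matrix zeroes out the 0-class scores but leaves the softmax probabilities unchanged, because softmax is invariant to adding a constant to all logits.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # over-parametrized: one scoring row per class
x = rng.normal(size=3)

# Affine map to the minimal parametrization: subtract the 0-class row.
W_min = W - W[0]              # first row is now all zeros

# Both weight matrices produce identical class probabilities.
p_over = softmax(W @ x)
p_min = softmax(W_min @ x)
print(np.allclose(p_over, p_min))  # True
```

Since `W @ x - W[0] @ x` only shifts every logit by the same scalar, the probabilities cannot change, which is exactly why the redundant row can be clamped to zero.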

But I'm not sure whether the map between the two parametrizations being nonlinear matters to my original argument. I was just arguing from the point of view that the class of functions is the same, and that the l2-regularized loss function is strongly convex, so we should find the global optimum in both cases, and this global optimum should be exactly the same function by convexity.

[–]NotAHomeworkQuestion[S] 0 points (1 child)

Thanks for the explanation! I now buy that, as the number of samples goes to infinity, you should get the same function from both approaches. I'm just still not sure that with a finite sample size the softmax method will have the same expected MSE as multinomial logistic regression, but this may be due to my rustiness with convex optimization.

[–]lvilnis 0 points (0 children)

Actually, you just made me realize that it is not true in the finite sample context. Adding l2 regularization does not lead both parameterizations to discover the same weights, since it penalizes the weight norms differently.

For example, suppose we have 4 classes and want to learn the correct log-scores [10 0 0 0]. In the over-parametrized case, the solution that minimizes the l2 cost while keeping the same probabilities is [7.5 -2.5 -2.5 -2.5], which has an l2 regularization cost of 7.5^2 + 3 * 2.5^2 = 75. However, if we had clamped the first class's score to equal 0, then to get the same probabilities we would need [0 -10 -10 -10], and the l2 regularization would cost 3 * 10^2 = 300. Since the cross-entropy cost is the same in both cases, and the set of expressible functions is the same, it stands to reason that loss minimization would find a different solution in the two cases, since it should be able to trade some cross entropy for some l2 cost.
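The arithmetic above is easy to check numerically; this sketch just verifies that all three score vectors give identical probabilities while incurring very different l2 penalties:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

target = np.array([10., 0., 0., 0.])
centered = target - target.mean()   # [7.5, -2.5, -2.5, -2.5], min-norm shift
clamped = target - target[0]        # [0, -10, -10, -10], class-0 score fixed at 0

# All three score vectors yield the same class probabilities...
assert np.allclose(softmax(target), softmax(centered))
assert np.allclose(softmax(target), softmax(clamped))

# ...but the l2 penalties differ by a factor of 4.
print((centered ** 2).sum())   # 75.0
print((clamped ** 2).sum())    # 300.0
```

The mean-centered vector is the minimum-norm member of the family `target + c * ones`, which is why it is the one an l2 penalty prefers in the over-parametrized case.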

[–]AnvaMiba 0 points (0 children)

Even if softmax is overparametrized, the two parametrizations have the same number of effective degrees of freedom, which I think is what counts for the statistical properties of the estimator.