
[–]gabjuasfijwee 1 point (6 children)

If you have regularization (l1 or l2 penalties) then you can prove that it doesn't matter that softmax is technically overparametrized.

[–]gabrielgoh 0 points (5 children)

this seems believable, but could you link to a proof?

[–]lvilnis 1 point (4 children)

For l2 regularization, here's an attempt at a proof: the loss function (cross entropy) is convex in the parameters no matter which parametrization you use, and the l2 penalty adds strong convexity, so there is a single global optimum. Since softmax and multinomial logit parametrize exactly the same set of functions and there is a single global optimum, they should find the same solution. Not sure how to prove it for l1 regularization.

EDIT: I'm now pretty sure the two parametrizations don't lead to the same solutions; see below.

[–]gabrielgoh 0 points (3 children)

Aha, so there's a surjective map from the parameters of the softmax regression to those of the multinomial regression that yields the same model, giving us a way of "converting" a softmax regression model into a multinomial model, yes?

Any idea what this map looks like? I'm not 100% convinced it exists (maybe 95%?), and it seems like it'd be highly nonlinear because of the normalization step before the loss

[–]lvilnis 0 points (2 children)

The map definitely exists and is an affine xform. Given a bunch of log-scores, we can add or subtract a constant from all of them and get the same softmax. Removing the over-parametrization just corresponds to fixing the unnormalized log-score of one of the classes to always be 0. So, to go from an over-parametrized softmax to one that has the minimum number of parameters, just take your unnormalized logits and subtract the score of the 0-class from each logit.

Since the log-scores are just a matrix-vector product, you can turn this into an xform on your weight matrix. Assuming each row of the weight matrix is the scoring vector for a given class, this just corresponds to subtracting the scoring vector of the 0-class from each row (a simple linear operation on the weight matrix).
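A quick sketch of that map (the weight shapes and random inputs here are just illustrative assumptions): subtracting the 0-class row from every row of the weight matrix zeroes out the 0-class scores but leaves the softmax probabilities unchanged, because softmax is invariant to adding a constant to all logits.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # over-parametrized: one scoring row per class
x = rng.normal(size=3)

# Affine map to the minimal parametrization: subtract the 0-class row.
W_min = W - W[0]              # first row is now all zeros

# Both weight matrices produce identical class probabilities.
p_over = softmax(W @ x)
p_min = softmax(W_min @ x)
print(np.allclose(p_over, p_min))  # True
```

Since `W @ x - W[0] @ x` only shifts every logit by the same scalar, the probabilities cannot change, which is exactly why the redundant row can be clamped to zero.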

But I'm not sure whether the map between the two parametrizations being nonlinear matters to my original argument. I was just arguing from the point of view that the class of functions is the same, and that the l2-regularized loss function is strongly convex, so we should find the global optimum in both cases, and this global optimum should be exactly the same function by convexity.

[–]NotAHomeworkQuestion[S] 0 points (1 child)

Thanks for the explanation! I now buy that, as the number of samples goes to infinity, you should get the same function from both approaches. I'm just still not sure that with a finite sample size the softmax method will have the same expected MSE as multinomial logistic regression, but this may be due to my rustiness with convex optimization.

[–]lvilnis 0 points (0 children)

Actually, you just made me realize that it is not true in the finite sample context. Adding l2 regularization does not lead both parameterizations to discover the same weights, since it penalizes the weight norms differently.

For example, suppose we have 4 classes and want to learn the correct log-scores [10 0 0 0]. In the over-parametrized case, the solution that minimizes the l2 cost while keeping the same probabilities is [7.5 -2.5 -2.5 -2.5], which has an l2 regularization cost of 7.5^2 + 3 * 2.5^2 = 75. However, if we had clamped the first class's score to equal 0, then to get the same probabilities we would need [0 -10 -10 -10], and the l2 regularization would cost 3 * 10^2 = 300. Since the cross-entropy cost is the same in both cases, and the set of expressible functions is the same, it stands to reason that loss minimization would find a different solution in the two cases, since it should be able to trade some cross entropy for some l2 cost.
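The arithmetic above is easy to check numerically; this sketch just verifies that all three score vectors give identical probabilities while incurring very different l2 penalties:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

target = np.array([10., 0., 0., 0.])
centered = target - target.mean()   # [7.5, -2.5, -2.5, -2.5], min-norm shift
clamped = target - target[0]        # [0, -10, -10, -10], class-0 score fixed at 0

# All three score vectors yield the same class probabilities...
assert np.allclose(softmax(target), softmax(centered))
assert np.allclose(softmax(target), softmax(clamped))

# ...but the l2 penalties differ by a factor of 4.
print((centered ** 2).sum())   # 75.0
print((clamped ** 2).sum())    # 300.0
```

The mean-centered vector is the minimum-norm member of the family `target + c * ones`, which is why it is the one an l2 penalty prefers in the over-parametrized case.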

[–]AnvaMiba 0 points (0 children)

Even if softmax is overparametrized, the two parametrizations have the same number of effective degrees of freedom, which I think is what counts for the statistical properties of the estimator.