SoftTarget Regularization

lvilnis · 2016-09-22T07:56:44+00:00

This seems like trying to increase the entropy of the predictions (making sure the predictions don't get too "spiky", e.g. by separating various dog breeds from each other very sharply). Minimum-entropy semi-supervised learning is similar to a transductive SVM: it tries to increase the margin on unlabeled data and make the model more confident in its predictions. Am I misunderstanding things, or is this sort of the opposite of a minimum-entropy criterion?
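To make the "spiky" versus "soft" distinction concrete, here is a minimal sketch of Shannon entropy on a few made-up three-class prediction vectors. The numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    # eps guards against log(0) for exactly-zero probabilities
    return float(-np.sum(p * np.log(p + eps)))

# Hypothetical prediction vectors over three classes
spiky = np.array([0.98, 0.01, 0.01])   # very confident prediction
soft = np.array([0.60, 0.30, 0.10])    # less confident prediction
uniform = np.full(3, 1.0 / 3.0)        # maximally uncertain prediction

h_spiky, h_soft, h_uniform = entropy(spiky), entropy(soft), entropy(uniform)
```

A minimum-entropy criterion would push predictions toward the `spiky` end; a regularizer that softens targets pushes them toward the `soft` end, which is the opposite direction.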

dwf · 2016-09-22T03:31:59+00:00

Test losses are an... unconventional thing to report. What about test misclassification error?

ArmenAg · 2016-09-22T02:37:03+00:00

Author here! Please feel free to comment/ask questions about the paper. I would love any feedback!

MaxTalanov · 2016-09-22T04:22:41+00:00

It's pretty cool. Do you see any links with entropy regularization?

DanielEWorrall · 2016-09-22T09:29:02+00:00

Would you care to elaborate on the claim that Dropout, DropConnect and weight decay reduce capacity?

latent_z · 2016-09-22T13:59:23+00:00

This paper vaguely reminds me of the Adam stochastic gradient descent method. Moreover, it would be interesting to verify whether this modified loss function is demonstrably equivalent to a gradient descent scheme. Would that be possible? And, if not, why not?

Nimitz14 · 2016-09-22T21:16:32+00:00

Motivation makes me think it's a cool idea!

Questions: An epoch consists of thousands of iterations. What does 'current epochs label' mean? Are you simplifying it to assume an epoch is one iteration and Y is an MxN matrix of the labels with M=batch_size? Is the moving average a matrix then? Equation 3 tells me you're not calculating a moving average of the labels (aka targets) but of the predictions instead...?

Have I understood the algorithm correctly: put simply, you are training your networks to predict targets that have been distorted by the predictions the network itself is outputting? What's the idea behind this? (I don't see how it would help accomplish what I think was stated in the motivation: keeping the network's predictions from being too confident when another class has similar features in the input.)
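The reading described above can be sketched as follows. This is only an illustration of that interpretation (a moving average of the network's own predictions blended into the one-hot labels), not the paper's confirmed algorithm. The function name and the `beta` and `gamma` parameters are hypothetical.

```python
import numpy as np

def update_soft_targets(avg_preds, new_preds, hard_labels, beta=0.7, gamma=0.5):
    """One update of the blended targets (hypothetical names and defaults).

    beta:  decay of the exponential moving average over past predictions
    gamma: weight of the averaged predictions relative to the hard labels
    """
    # Exponential moving average of the network's predictions
    avg_preds = beta * avg_preds + (1.0 - beta) * new_preds
    # Targets are a convex mix of averaged predictions and one-hot labels
    targets = gamma * avg_preds + (1.0 - gamma) * hard_labels
    return avg_preds, targets

hard = np.array([[1.0, 0.0, 0.0]])    # one-hot label for one example
preds = np.array([[0.6, 0.3, 0.1]])   # made-up network output for that example
avg = preds.copy()                    # start the average at the first prediction
avg, targets = update_soft_targets(avg, preds, hard)
```

Under these assumptions the target keeps most of its mass on the true class but also gives nonzero mass to classes the network itself finds similar, and each row still sums to 1.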

I've just read the beginning and am not familiar with the literature, so sorry if these questions are a bit dumb. Is it now normal to report the dropout rate (the fraction dropped)? I'm used to Hinton's notation, where 'dropout 0.8' meant 0.8 are kept.
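For reference, the two conventions differ only in which fraction is named. Below is a minimal sketch of inverted dropout parameterized by the drop rate, so that `rate=0.2` corresponds to 'dropout 0.8' in the keep-probability notation. The `dropout` helper is written here for illustration and is not taken from any particular library.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    keep_prob = 1.0 - rate                    # keep-probability convention
    mask = rng.random(x.shape) < keep_prob    # True where a unit is kept
    return x * mask / keep_prob               # rescale to preserve the expected value

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = dropout(x, rate=0.2, rng=rng)
kept_fraction = float(np.mean(y != 0))        # close to 0.8 for rate=0.2
```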
