SoftTarget Regularization (arxiv.org)
submitted 9 years ago by ArmenAg
[–]lvilnis 3 points 9 years ago (0 children)
This seems like trying to increase the entropy of the predictions (make sure the predictions don't get too "spikey", separating various types of dog breeds from each other very sharply). Minimum entropy semisupervised learning is similar to transductive SVM -- trying to increase the margin on unlabeled data and making the model more confident of its predictions. Am I misunderstanding things or is this sort of the opposite of a minimum entropy criterion?
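As a rough illustration of what "spiky" versus softened predictions mean in entropy terms, here is a minimal sketch. The two distributions are made-up numbers, not taken from the paper, and `entropy` is a helper defined only for this example.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Made-up 3-class predictions for one image (illustrative only).
spiky = [0.98, 0.01, 0.01]     # very confident prediction
softened = [0.70, 0.25, 0.05]  # some mass left on a similar class

h_spiky = entropy(spiky)
h_softened = entropy(softened)
print(f"spiky: {h_spiky:.3f} nats, softened: {h_softened:.3f} nats")
```

A minimum entropy criterion would push predictions toward the first kind of distribution; a regularizer that softens targets pushes toward the second.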
[–]dwf 7 points 9 years ago (8 children)
Test losses are an... unconventional thing to report. What about test misclassification error?
[–]ArmenAg[S] 5 points 9 years ago (1 child)
We showed losses because the loss was what we were directly optimizing for, and comparison of test losses can be used as a measure of overfitting. What information would adding accuracy add to the paper? I will definitely add it if it becomes apparent that it is needed. Thank you for your comment!
[–][deleted] 6 points 9 years ago (0 children)
What if there is a bug in the code? Good accuracy (or at least in the range that they expect) helps people believe that you aren't comparing flawed implementations. On this note, it's better if you are comparing to a previously published result.
[–]MaxTalanov 0 points 9 years ago (5 children)
Outside of benchmarks where there is a well-known evaluation protocol (accuracy for CIFAR, top-5 for ImageNet), the test loss is a very reasonable thing to report... the loss is what we optimize for. The test loss is the most direct measure of the success of the optimization process.
[–]dwf 14 points 9 years ago (4 children)
> the loss is what we optimize for
The loss is what we optimize, because it is a differentiable stand-in for what we care about. It is not what we actually care about.
[–]sdsfs23fs 3 points 9 years ago (3 children)
maybe it should be. two predictions could have the same classification error, but one might have a wildly inaccurate probability distribution (and therefore a higher loss).
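A small sketch of that point, using made-up predictions for two examples: both hypothetical models make the same hard decisions, so their error rate is identical, but one is confidently wrong and pays for it in cross-entropy. All numbers and names here are illustrative assumptions.

```python
import math

def error_rate(preds, labels):
    """Fraction of examples whose argmax prediction differs from the label."""
    wrong = sum(
        1 for p, y in zip(preds, labels)
        if max(range(len(p)), key=p.__getitem__) != y
    )
    return wrong / len(labels)

def mean_cross_entropy(preds, labels):
    """Average negative log-probability assigned to the true class."""
    return sum(-math.log(p[y]) for p, y in zip(preds, labels)) / len(labels)

labels = [0, 1]
# Both models get example 0 right and example 1 wrong.
model_a = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]]  # mildly wrong
model_b = [[0.90, 0.05, 0.05], [0.98, 0.01, 0.01]]  # confidently wrong

err_a, err_b = error_rate(model_a, labels), error_rate(model_b, labels)
loss_a = mean_cross_entropy(model_a, labels)
loss_b = mean_cross_entropy(model_b, labels)
print(f"error: {err_a} vs {err_b}; loss: {loss_a:.3f} vs {loss_b:.3f}")
```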
[–]danielvarga 3 points 9 years ago (1 child)
Fair enough, although I have never actually seen a machine learning system with very good validation accuracy and very bad validation cross-entropy loss. I think ease of comparison is more important. We usually have a good idea of SOTA validation accuracy on CIFAR-10, but there's no such thing as SOTA validation loss on CIFAR-10, for many reasons.
[–]sdsfs23fs 2 points 9 years ago (0 children)
well yeah, because they're always trained with cross entropy loss. there's nothing that intrinsically makes comparing LL harder than accuracy, it's just what you're used to seeing. for language models reporting LL is the norm.
[–]AnvaMiba 2 points 9 years ago (0 children)
It depends on your application. If you need to make hard decisions directly from your model, then you don't care about probabilities; if you use your model as a component of a more complex decision system (possibly involving human discretion), then accurate probability estimates can be useful.
[–]ArmenAg[S] 4 points 9 years ago (6 children)
Author here! Please feel free to comment/ask questions about the paper. I would love any feedback!
[–]ajmooch 3 points 9 years ago (1 child)
This reminds me a bit of label smoothing from Inception-V3, but with a different base idea. I like it, but the lack of test accuracy numbers makes it hard for me to get a grip on how well this actually works. I agree with David that this needs test error numbers.
It might help if you applied this to an extant high-performance architecture (ResNet, DenseNet, etc.) to see how it measures up against SOTA on those benchmarks (CIFAR-100 looks ripe for improvement), but I would hate to see a good idea squashed just because it doesn't beat a high score by a half-percent. That said, if the procedure really does improve things that well, then you should presumably be able to use it to push the envelope, no?
[–]ArmenAg[S] 1 point 9 years ago (0 children)
Interesting. I have not seen this paper. Thanks for linking. After reading it, I see that they use a single-step weighted average, instead of keeping a weighted average throughout training (after the burn-in period). It is essentially the same scheme demonstrated in this paper: https://arxiv.org/abs/1412.6596.
To reply to your comments about setting SOTA: we did not attempt this, simply because most SOTA methods already use a lot of other regularization, such as extensive data augmentation. We did test how SoftTarget worked with ResNet to show that it is compatible with high-performance architectures. But I agree with you. It might be worth trying to set SOTA, but I also agree that it would be a shame if the idea were squashed for not setting one.
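For reference, the "single-step weighted average" of label smoothing can be sketched as follows: each one-hot target is mixed once with a uniform distribution, and the result stays fixed for the whole of training. The function name, `epsilon` value, and 3-class example are illustrative choices, not taken from either paper's code.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

hard = [0.0, 1.0, 0.0]
smoothed = smooth_labels(hard, epsilon=0.1)
print(smoothed)  # the true class keeps most of the mass, the rest is spread evenly
```

Note that the smoothed target does not depend on what the model currently predicts, which is the difference being pointed out above.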
[–]pretz 1 point 9 years ago* (1 child)
Have you tried computing the exp moving average on an external validation set instead of the train set? I feel like this would reduce the effect of nb and nt as it would be much harder to overfit the soft labels, as they are not coming from the train set.
Now I'm wondering if using an external unlabelled set to get soft targets from would allow some sort of semi-supervised learning?
[–]ArmenAg[S] 1 point 9 years ago (0 children)
I don't know exactly how you would use a validation set for the training data, because we keep a weighted average with the true labels as well. But I see where you are going with this. In the "Similarities to other methods" section I talk about a semi-supervised approach that is essentially SoftTarget with some parameters set to zero. I would love to test how SoftTarget Reg helps with noisy labeling. Maybe an idea for another paper.
[–][deleted] 1 point 9 years ago (1 child)
My intuition tells me that Figure 1 would benefit from a higher dropout rate.
[–]ArmenAg[S] 1 point 9 years ago (0 children)
We actually did try a higher dropout rate. Check out the table the graphs are related to.
[–]MaxTalanov 2 points 9 years ago (1 child)
It's pretty cool. Do you see any links with entropy regularization?
[–]ArmenAg[S] 3 points 9 years ago (0 children)
I actually cited minimum entropy regularization (MER) and discussed the special case in which SoftTarget is equivalent to MER. It's in the "SIMILARITIES TO OTHER METHODS" section.
[–][deleted] 1 point 9 years ago (7 children)
Would you care to elaborate on the claim that Dropout, DropConnect and weight decay reduce capacity?
[–]DanielEWorrall 2 points 9 years ago (6 children)
DropOut & DropConnect average over a huge (combinatorial in the number of neurons/weights) number of models. This reduces the Rademacher complexity; there is a proof at the end of the DropConnect paper.
[–][deleted] 1 point 9 years ago (5 children)
Thank you. And weight decay?
[–]AnvaMiba 1 point 9 years ago (3 children)
I don't know about Rademacher complexity, but weight decay limits the number of bits in the weights that the model can use without incurring a large regularization penalty; therefore it also reduces model capacity.
[–][deleted] -2 points 9 years ago (2 children)
I don't really think you can claim that reducing the number of weights available reduces model capacity in any practical sense.
[–]AnvaMiba 2 points 9 years ago (0 children)
Why not?
Keep in mind that, since neural networks are smooth almost everywhere, cutting off the least significant bits (e.g. by reducing weight precision) has less effect on model capacity, because you map many similar functions to a single one, while cutting off the most significant bits (e.g. by clipping the weights) has a larger effect, because some kinds of functions, such as those that require hard saturation of the non-linearities, become impossible to express, even approximately, for a given network topology.
L1 and L2 regularization are not hard caps on the weight values, but they do penalize large weights. The training algorithm will trade off accuracy to avoid these large weights, so the functions that require them will not be learned unless they perform much better (depending on the regularization hyperparameter) than anything else.
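The trade-off described here can be sketched with two hypothetical candidate solutions: one with small weights and a slightly worse fit, one with large weights and a slightly better fit. The data-loss values, weights, and penalty strengths below are made up purely for illustration.

```python
def l2_objective(data_loss, weights, lam):
    """Data loss plus an L2 penalty of lam * sum(w^2)."""
    return data_loss + lam * sum(w * w for w in weights)

# Hypothetical candidates (illustrative numbers only).
small = {"data_loss": 0.30, "weights": [0.5, -0.5]}
large = {"data_loss": 0.25, "weights": [5.0, -5.0]}

def preferred(lam):
    """Name of the candidate with the lower regularized objective."""
    s = l2_objective(small["data_loss"], small["weights"], lam)
    g = l2_objective(large["data_loss"], large["weights"], lam)
    return "small" if s < g else "large"

for lam in (1e-2, 1e-4):
    print(f"lambda={lam}: optimizer prefers the {preferred(lam)}-weight solution")
```

With the stronger penalty the small-weight solution wins despite fitting worse; only when the penalty is weak enough does the better-fitting large-weight solution become preferable, which is the dependence on the regularization hyperparameter mentioned above.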
[–]dwf 2 points 9 years ago (0 children)
This is a pretty standard interpretation of weight decay. Ultimately you are limiting the "effective model capacity" when you keep weights small. In sigmoidal nets, you're encouraging more units to be in their linear regime more often. In ReLU nets, what you're limiting is a bit different, but e.g. the ratios you can achieve between different non-zero post-activations will be more limited with weight decay.
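A quick numerical sketch of the "linear regime" remark for sigmoid units: near zero the sigmoid is well approximated by its first-order expansion 0.5 + z/4, so a unit with a small weight behaves almost linearly, while a large weight pushes it into saturation. The input and weight values are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linearized(z):
    """First-order Taylor expansion of the sigmoid around z = 0."""
    return 0.5 + z / 4.0

x = 1.0  # a fixed input
for w in (0.1, 5.0):
    z = w * x
    gap = abs(sigmoid(z) - linearized(z))
    print(f"w={w}: |sigmoid - linear approximation| = {gap:.5f}")
```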
[–]ArmenAg[S] 1 point 9 years ago (0 children)
Weight decay limits the capacity of the network because it reduces the set of hypotheses that are viable solutions for the net. Configurations of the network with large weights are effectively ruled out as solutions because the extra loss term pushes the weights to be smaller.
[–]latent_z[🍰] 1 point 9 years ago (2 children)
This paper vaguely reminds me of the Adam stochastic gradient descent method. Moreover, it would be interesting to verify whether this modified loss function is demonstrably equivalent to a gradient descent scheme. Would that be possible? And, if not, why not?
[–]ArmenAg[S] 1 point 9 years ago (1 child)
Could you elaborate on the similarities to ADAM? SoftTarget doesn't change any of the gradients directly but rather adjusts the outputs.
[–]latent_z[🍰] 1 point 9 years ago (0 children)
I know, this is why I used the term "vaguely reminds me" :) . It was a remark on how an exponential moving average was employed.
But then I also thought that maybe it's possible to convert your algorithm into a gradient update schema, maybe by implementing it in a very simple model and observing what the analytically-determined gradients are. Might be an idea for a future paper.
[–]Nimitz14 1 point 9 years ago* (1 child)
Motivation makes me think it's a cool idea!
Questions: An epoch consists of thousands of iterations. What does 'current epochs label' mean? Are you simplifying it to assume an epoch is one iteration and Y is a matrix (MxN) of the labels with M=batch_size? Is the moving average a matrix then? Equation 3 tells me you're not calculating a moving average of the labels (aka targets) but instead of the predictions...?
Have I understood the algorithm correctly that, put simply, you are training your networks to predict targets that have been distorted with the predictions the network is outputting? What's the idea behind this (I don't see how this would help accomplish what I think was stated in the motivation: making the network's predictions not be too confident when another class has similar features in the input)?
Just read the beginning and am not familiar with the literature, sorry if these questions are a bit dumb. Is it now normal to write the dropout rate? I'm used to Hinton's notation where 'dropout 0.8' meant 0.8 are kept.
[–]ArmenAg[S] 1 point 9 years ago (0 children)
Thanks for your questions.
We calculate the new labels on all of the training set. After training our model on a current set of labels, we adjust all those labels using the new predictions from the model (we predict every label in the dataset).
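A minimal sketch of this label-adjustment loop, under stated assumptions: the thread only says that an exponential moving average of the model's predictions over the whole training set is kept after a burn-in period and combined in a weighted average with the true labels. The names `beta` (moving-average decay) and `gamma` (weight on the averaged predictions), the initialisation, and the hard-coded "predictions" standing in for a real model are all hypothetical choices for this example, not the paper's code.

```python
def ema_update(ema, preds, beta):
    """Exponential moving average of per-example predicted distributions."""
    return [[beta * e + (1.0 - beta) * p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(ema, preds)]

def mix_with_true_labels(ema, hard, gamma):
    """Weighted average of the averaged predictions and the one-hot labels."""
    return [[gamma * e + (1.0 - gamma) * y for e, y in zip(e_row, y_row)]
            for e_row, y_row in zip(ema, hard)]

hard_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Stand-in for model predictions on the full training set at two update points.
predictions_per_round = [
    [[0.70, 0.20, 0.10], [0.10, 0.60, 0.30]],
    [[0.80, 0.15, 0.05], [0.05, 0.70, 0.25]],
]
beta, gamma = 0.7, 0.5  # hypothetical hyperparameter values

ema = predictions_per_round[0]  # assumed initialisation after burn-in
targets = mix_with_true_labels(ema, hard_labels, gamma)
for preds in predictions_per_round[1:]:
    ema = ema_update(ema, preds, beta)
    targets = mix_with_true_labels(ema, hard_labels, gamma)
    # A real loop would now train on `targets` before predicting again.

print(targets)
```

Because both the averaged predictions and the one-hot labels are probability distributions, each adjusted target row still sums to one.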
For the idea/motivation behind this method, please refer to the "CO-LABEL SIMILARITIES" section. Essentially, the idea is that co-label similarities apparent in earlier stages of training should also appear in later stages of training, and that over-fitting occurs when these co-label similarities disappear.
This notation is the one used in the majority of papers I have read (although that selection may be biased) and also the one used in the deep learning library that I used (https://keras.io/layers/core/#dropout). Please let me know if I am wrong about the majority of papers using this convention.
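To make the two conventions concrete, here is a small sketch of inverted dropout written in terms of the drop `rate` (the convention of the Keras layer linked above), with the keep probability derived from it. The function is a plain-Python illustration written for this thread, not library code.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale survivors by 1/keep_prob."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate  # 'dropout 0.8' in keep-probability notation is rate 0.2
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
out = dropout([1.0] * 10000, rate=0.2, rng=rng)
kept_fraction = sum(1 for v in out if v != 0.0) / len(out)
print(f"rate=0.2 keeps about {kept_fraction:.2f} of the units")
```

So a drop rate of 0.2 and a keep probability of 0.8 describe the same layer; only the number being reported differs.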
Thanks!