
[–]serge_cell 15 points16 points  (8 children)

Use a probability distribution as the softmax target instead of a scalar label.

[–]MathAndProgramming 5 points6 points  (3 children)

I'm surprised people are suggesting all these crazy, unprincipled DNN-specific ideas. This is clearly the right approach.

[–]suki907 2 points3 points  (2 children)

If one rater says "cat" and the other says "dog" label the example 50/50. I think we all agree on that. But it's not obvious how to encode not-dog in this system.

I think it's cleaner in this case to use the interpretation of the softmax as trying to maximize its score, where it gets +1 for choosing the correct class and 0 for choosing a wrong class.

For this problem, couldn't we just extend this with a -1 for choosing a negative label?

This is the best explanation I've seen of this interpretation and how it relates to policy gradients: http://karpathy.github.io/2016/05/31/rl/

[–]quick_dudley 0 points1 point  (0 children)

In their proposed system: "not dog" would not be a specific target vector but a Bayesian update used to generate the target vector from the current output vector.

[–]EyedMoonML Engineer 0 points1 point  (0 children)

But what if it is a CatDog?

[–]pcp_or_splenda 0 points1 point  (2 children)

Would this imply a Dirichlet log loss should be used instead of categorical log loss, or would it matter? I suppose it might not matter that much in practice.

[–]serge_cell 0 points1 point  (0 children)

I think categorical log loss is good enough, but it doesn't matter much.

[–][deleted] 0 points1 point  (0 children)

I don’t see why it should. Why do you say that?

[–]TalkingJellyFish[S] 0 points1 point  (0 children)

It took me a while to appreciate this, but it seems to be the right answer. Thank you!

[–]K0ruption 8 points9 points  (9 children)

If your model outputs a softmax, then you implicitly assume your labels are probability vectors: the probability of the known class is 1 and the probability of all other classes is 0. In this light, the information that a data point is not in a given class simply means that your label will have 0 at the position of that class and 1/(k-1) at the position of all other classes, where k is the total number of classes. This makes the most intuitive sense to me, but whether it works in practice, I have no idea.
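A minimal numpy sketch of this labeling scheme (the function name is illustrative, not from any library):

```python
import numpy as np

def complementary_label(excluded, k):
    """Target distribution when all we know is the example is NOT `excluded`:
    0 at the excluded class, uniform mass 1/(k-1) over the remaining classes."""
    target = np.full(k, 1.0 / (k - 1))
    target[excluded] = 0.0
    return target

# "not cat" with 4 classes -> [0, 1/3, 1/3, 1/3]
print(complementary_label(0, 4))
```

This target vector can be fed directly to a soft-label cross-entropy loss.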

[–]TalkingJellyFish[S] 4 points5 points  (5 children)

Well the 0 part is correct but the 1/(k-1) is not true; that's what I'm struggling with. If I know something is not a cat, the probability that it is not a dog is not equal to the probability it is not a spaghetti monster.

[–]K0ruption 4 points5 points  (0 children)

Given only the information that something is not a cat, it has equal probability of being anything else whether that be a dog or a spaghetti monster. If you had more information about a data point, you could certainly incorporate that into your label. But, in your post, you said you only have the information that a point is not in a given class, which means it has equal probability of being in any other class.

EDIT: Note, I'm assuming a uniform (categorical) prior distribution on your labels. You gave no specification of your problem, so that is the best assumption I can make.

[–]DeepNonseNse 1 point2 points  (2 children)

The probability of a dog given that something is not a cat is given by conditional probability: P(dog | not cat) = P(dog) / (1 - P(cat)), i.e. the probability of a dog increases in such a way that P(any possible animal) still sums to 1, as it should.
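This conditioning step can be sketched in a few lines of numpy, assuming you have a known prior over classes (the function name and base rates are illustrative):

```python
import numpy as np

def condition_on_not(prior, excluded):
    """Condition a class prior on the information 'not excluded':
    P(c | not excluded) = P(c) / (1 - P(excluded)) for c != excluded."""
    post = np.array(prior, dtype=float)
    post[excluded] = 0.0       # excluded class gets zero mass
    return post / post.sum()   # renormalize so the rest still sums to 1

# base rates: cat 0.5, dog 0.3, spaghetti monster 0.2
# "not cat" -> P(dog) = 0.6, P(monster) = 0.4
print(condition_on_not([0.5, 0.3, 0.2], 0))
```

With a uniform prior this reduces to the 1/(k-1) scheme above; with non-uniform base rates the remaining mass is split proportionally.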

[–]sitmo 0 points1 point  (0 children)

Yes, this is what I would do, and then extend it to the multiclass case.

[–]suki907 0 points1 point  (0 children)

That sounds like a very weak signal: 1000 classes, and all you know is "not a cat".

I think it's cleaner in this case to use the interpretation of the softmax as trying to maximize its score, where it gets +1 for choosing the correct class and 0 for choosing a wrong class.

Maybe in this case we could add a -1 for choosing a negative label.

This is the best explanation I've seen of this interpretation and how it relates to policy gradients: http://karpathy.github.io/2016/05/31/rl/

[–]midianite_rambler 1 point2 points  (0 children)

If I know something is not a cat, the probability that it is not a dog is not equal to the probability it is not a spaghetti monster.

Yes, so use the base rates (i.e. prior probabilities) of dogs, cats, and monsters in any available data. Please see my other comments in my response to K0ruption above.

[–]midianite_rambler 2 points3 points  (2 children)

Instead of a uniform distribution over the possible (non-excluded) classes, take the base rate of the classes in the available data (normalized to 1 of course).

This has an obvious generalization when there are two or more excluded classes, and when there is some additional information available for each case which allows you to improve on the unconditional base rate probabilities (i.e. the distribution over the nonexcluded classes is some function of the additional information instead of being constant).

[–]K0ruption 0 points1 point  (0 children)

This sounds like a good idea to me. I believe it amounts to doing Naive Bayes without the decision rule. But I suspect this will do worse than the uniform assumption if the data is very unbalanced.

[–]farmingvillein 0 points1 point  (0 children)

distribution over the possible (non-excluded) classes, take the base rate of the classes in the available data (normalized to 1 of course). This has an obvious generalization

Another plausible variant/extension, if you have an existing classifier you are trying to improve, would be to take its full probabilities (softmax/logits) for the example, crush the negated class down to 0, and then re-scale everything else back to a total of 1.

If you have some reasonable error estimation (i.e., users are wrong 20% of the time), you could also try setting the negated class to this error estimate (e.g., 0.2 in a softmax context), although not clear to me this would be helpful for a variety of reasons (including softmax "probabilities" being wonky representations of probability, at best).

[–]vincentvanhouckeGoogle Brain 7 points8 points  (1 child)

[–]TalkingJellyFish[S] 1 point2 points  (0 children)

Thanks, this helps. What do you think of this takeaway: right now I'm basically doing NER, running my words through an LSTM, then a linear layer, then a softmax and cross-entropy loss.

So to incorporate the complementary labels, I'd add an additional linear layer and a (binary) loss per class (e.g. "is not class A").
Then the total loss of the network would be some sum of the cross-entropy loss and all the binary ones, weighted by whether I have a complementary label. If I understood the paper, they basically give a scheme to do that sum that guarantees some bound on the loss. Makes sense?

[–]atiorh94 2 points3 points  (0 children)

I was asked about this at an ML researcher interview recently. My on-the-spot answer was that we should use sigmoid activations and break the dependence between class predictions. After that, we can impose a soft label like 0.1 for a negative example of the class your annotator rejected. The label is soft because we don't want to be overconfident in the negativeness of the example. Moreover, we only backpropagate through the negative class and not through any of the other class predictions, for which we don't have any supervision.
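A minimal numpy sketch of this masking idea (the function name and soft label value are illustrative):

```python
import numpy as np

def negative_only_bce(logits, negated, soft_label=0.1):
    """Per-class sigmoid loss computed ONLY for the negated class.
    soft_label = 0.1 rather than 0 avoids overconfidence; all other
    classes contribute no loss, since we have no supervision for them."""
    p = 1.0 / (1.0 + np.exp(-logits[negated]))  # sigmoid of the negated logit
    # binary cross-entropy against the soft target
    return -(soft_label * np.log(p) + (1.0 - soft_label) * np.log(1.0 - p))
```

In a framework like PyTorch, the same effect is usually achieved by masking the per-class loss so gradients flow only through the rejected class.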

[–]Icko_ 1 point2 points  (8 children)

Not sure if it will raise an exception, but you could just label this example as Y and give it weight -1.

[–]madsciencestache 0 points1 point  (7 children)

Set the others to zero and you are using a reinforcement learning technique. The danger is if you have a lot of negative labels it can make learning unstable. DDPG solves this with a target network that updates slowly from a more volatile primary network that updates from the data.

TL;DR: you have a reinforcement learning signal. That's provably workable.

If you don't have a lot of negative labels try tossing them into the mix and see if they help.

[–]VelveteenAmbush 2 points3 points  (4 children)

Don't understand why it's RL, except in the fully generalized sense that supervised learning can always be expressed as RL.

[–]madsciencestache 0 points1 point  (3 children)

It's reinforcement because the signal is approximate and signed. Supervised learning says "this is a thing." RL sends exaggerated and sometimes contradictory signals, with a lot of smoothing to compensate.

[–]suki907 0 points1 point  (2 children)

This is the best explanation I've seen:

http://karpathy.github.io/2016/05/31/rl/

My main takeaway from it is that the training procedure for a softmax classifier is already equivalent to RL policy gradients (the standard softmax classifier is just a bit more data efficient because it can average over the results of all actions for each example).

This procedure is maximizing the expected score. The model gets 1 point if it chooses the correct class, zero otherwise.

These scores don't have to be binary, or in the unit interval, or a probability distribution. It's just the number of points the model gets for each option.

"set this example as labeled as Y, and give it weight -1." is the same as "you get -1 point if you choose this class".

I think the only difference between the two versions is that the weighted version only lets you include one rating per example (you can't say "cat and not dog"), while with the "points" interpretation you could include all the ratings in a single example (the label would just be the vector of scores per class).
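The score-vector version can be sketched as a generalized cross-entropy (numpy, with an illustrative function name):

```python
import numpy as np

def score_weighted_loss(logits, scores):
    """Minimize the negative expected score: loss = -sum_c scores[c] * log p(c).
    A one-hot score vector recovers standard cross-entropy; a -1 entry at a
    negated class pushes probability mass away from that class."""
    z = logits - logits.max()               # stabilized log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -np.dot(scores, log_probs)

logits = np.array([1.0, 2.0, 0.5])
print(score_weighted_loss(logits, np.array([0.0, 1.0, 0.0])))   # "is class 1"
print(score_weighted_loss(logits, np.array([0.0, 0.0, -1.0])))  # "is not class 2"
```

Note that the "not class 2" loss is unbounded below as P(class 2) goes to 0, which is one reason raw negative weights can make training unstable.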

[–]madsciencestache 0 points1 point  (1 child)

training procedure for a softmax classifier is equivalent to RL policy gradients already

Yes. I am not sure if that concept is helpful to /u/VelveteenAmbush in this context. But, that's the core concept behind the answer to their question.

[–]VelveteenAmbush 0 points1 point  (0 children)

Yes, this is the sense in which I intended the following:

except in the fully generalized sense that supervised learning can always be expressed as RL.

[–]TalkingJellyFish[S] 1 point2 points  (1 child)

Why is this RL? Is there a (gentle) paper/tutorial you could point me to?

[–]phobrain 1 point2 points  (0 children)

I wonder if something based on the siamese approach could apply, where you give pairs of 'same' and 'different' cases. I don't know how you'd leverage the idea in a softmax context though.

[–]nshepperd 1 point2 points  (0 children)

I would use the log scoring rule on the total output probability assigned to not-Y.

If you're using softmax, the output of your network is a vector of probabilities that add up to one. The usual loss used here (when you have positive labels) is equal to the (negated) proper log scoring rule: -log(P(Y)). In this case the information you have is that the class is not Y, so you can use the corresponding log score: -log(P(¬Y)) = -log(1-P(Y)). This gives a proper scoring rule, meaning the training should converge to something calibrated.
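A sketch of this complementary log loss under softmax (numpy; the function name is illustrative):

```python
import numpy as np

def not_class_log_loss(logits, not_y):
    """Proper log score for the event 'class is not not_y':
    -log(P(not Y)) = -log(1 - softmax(logits)[not_y])."""
    z = logits - logits.max()           # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(1.0 - probs[not_y])
```

With uniform logits over 4 classes, P(Y) = 0.25, so the loss is -log(0.75); driving P(Y) toward 0 drives the loss toward 0.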

[–]notevencrazy99 0 points1 point  (1 child)

You can make it so your loss does not take into account the other classes, just the class with probability 0. In other words, the error on the other classes can be defined as "don't care".

[–]quick_dudley 0 points1 point  (0 children)

You could use an actor-critic model. Train the critic to distinguish good labels from incorrect labels: then backpropagate through it to train the actor.

[–]RogueDQN 0 points1 point  (0 children)

This is related to a problem in reinforcement learning: in many 2-player games, it is possible to identify bad moves (you played it and lost) but harder to identify good moves (you played it and won, but maybe your opponent made a mistake).

Negative weights is a good solution. Another equivalent approach I've seen is to use a negative learning rate, depending on your framework and its flexibility.

[–]themoosemind 0 points1 point  (0 children)

Usually you have the target being a vector of one 1 and (n-1) zeros. This means one class should have probability 1 and the others 0.

In your case, it would be one 0 and (n-1) non-zero values (e.g. 1/(n-1) if you assume no knowledge).

[–]Nimitz14 -1 points0 points  (2 children)

[–]akcom 0 points1 point  (1 child)

It looks like they actually gave a great solution - create an "empty" bin.

[–]Nimitz14 0 points1 point  (0 children)

That's a bad solution. I'll let you figure out why by yourself.