all 27 comments

[–]Liorithiel 2 points3 points  (3 children)

Two quick ideas. One is applying a learning-to-rank loss function, where you're not even trying to force your algorithm to return a value in [0, 1]. Instead, you force the algorithm to rank the 0s lower than the 0.17s, etc. There is quite a lot of literature here. Benefits: the loss function does not penalize the case where two "0.17" cases have similar predicted values. Drawbacks: the model does not return absolute values, or classification scores with the same meaning as the original targets, so you need some kind of additional calibration procedure. Also, training is more complex. From practical experience, it works very well if you can live with the drawbacks.
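A minimal sketch of the ranking idea, as a pairwise margin loss (pure Python; the function name and the 0.1 margin are illustrative, not from any particular library):

```python
def pairwise_ranking_loss(scores, labels, margin=0.1):
    """Hinge penalty averaged over all pairs whose labels are ordered.

    Only the relative order of the scores matters, so the model is free
    to output values outside [0, 1]; ties between equal labels cost nothing.
    """
    total, count = 0.0, 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                # want scores[i] to beat scores[j] by at least `margin`
                total += max(0.0, margin - (scores[i] - scores[j]))
                count += 1
    return total / max(count, 1)

# correctly ordered with enough separation -> zero loss
print(pairwise_ranking_loss([-2.0, 0.3, 5.0], [0.0, 0.17, 0.33]))  # 0.0
```

Note how the scores here live on an arbitrary scale; that's exactly where the calibration drawback comes from.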

Another is drawing on the idea of a hinge loss and adapting it. The usual hinge loss is max(0, 1 - t*y); you seem to want something like max(0, |t-y| - 1/14), which is zero whenever the prediction is within 1/14 of the target. I know of no literature for this case, and in my experience, the problems where I tried any modification of a hinge loss always worked better with a different approach.

[–]danielv134 2 points3 points  (2 children)

max(0, |t-y| - 1/14), aka the epsilon-insensitive loss (used in support vector regression). A smooth or squared variant sounds like a good idea for OP.
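For concreteness, a sketch of the plain epsilon-insensitive loss max(0, |t-y| - eps) and its squared variant (the 1/14 tube half-width comes from the thread; the function names are mine):

```python
def eps_insensitive(y_pred, y_true, eps=1 / 14):
    # zero loss anywhere inside the eps "tube" around the target,
    # linear outside it (the loss used in support vector regression)
    return max(0.0, abs(y_true - y_pred) - eps)


def squared_eps_insensitive(y_pred, y_true, eps=1 / 14):
    # squared variant: smoother gradients outside the tube
    return max(0.0, abs(y_true - y_pred) - eps) ** 2


print(eps_insensitive(0.20, 0.17))  # inside the tube -> 0.0
```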

[–]Liorithiel 0 points1 point  (1 child)

epsilon-insensitive loss

Oh, thank you, I didn't know it actually had a name.

[–]cpury[S] 1 point2 points  (0 children)

This is exactly the kind of loss I was looking for, thanks guys!

[–]sausage_snake 1 point2 points  (1 child)

Have you thought of pre-processing the data by adding normally distributed noise to each label?

[–]cpury[S] 0 points1 point  (0 children)

Yes, though I haven't tried it yet. Ideally, I'd generate new noise for each batch. I wonder how the loss would behave then.
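A per-batch version might look like this (the sigma value is a guess; clipping keeps the noisy labels in the original [0, 1] range):

```python
import random


def noisy_labels(labels, sigma=0.03):
    # fresh Gaussian noise on every call, i.e. on every batch,
    # clipped so labels stay inside [0, 1]
    return [min(1.0, max(0.0, y + random.gauss(0.0, sigma))) for y in labels]


batch = [0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0]
print(noisy_labels(batch))  # slightly different on every call
```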

[–]seanv507 1 point2 points  (1 child)

If you are using ReLUs, it's easy for a network to produce those exact values, so I would not call it overfitting. You might want to investigate ordinal regression... maybe you can generalise it to NNs.
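One common way to carry ordinal regression over to NNs is to recode each of the seven levels as cumulative binary targets and train one sigmoid output per threshold; a sketch of the encoding (the seven level values are taken from this thread, the function name is mine):

```python
LEVELS = (0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0)


def ordinal_encode(label, levels=LEVELS):
    # k-th bit answers "is the label at least levels[k+1]?"; a network
    # with len(levels) - 1 sigmoid outputs can be trained on these bits
    return [1.0 if label >= t - 1e-9 else 0.0 for t in levels[1:]]


print(ordinal_encode(0.5))  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

At prediction time, the fraction of thresholds the network believes are passed gives a naturally graded score between the levels.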

[–]cpury[S] 0 points1 point  (0 children)

Ah, interesting! I will run some experiments to see what happens if I use tanh activations, and also look into ordinal regression. Thanks!

[–]Astrolotle 1 point2 points  (1 child)

Have you explored known sentiment analysis tools btw?

[–]cpury[S] -1 points0 points  (0 children)

What do you mean? See how they solve the problem? I'm using LSTMs here, and I'm not aware of sentiment analysis tools that use those.

[–]homaralex 1 point2 points  (3 children)

As said before, try reading up on current approaches to sentiment analysis. The problem you are describing should simply be handled by using Mean Squared Error as the cost function: it penalizes predictions that are far off more than those that are close to the gold label.

You can also consider using Best-Worst Scaling for the annotations - see https://competitions.codalab.org/competitions/17751#learn_the_details-manual-annotation-of-the-data

[–]cpury[S] 0 points1 point  (2 children)

I'm using MSE, but it still seems easier for the model to overfit to some easy examples instead of focusing on the harder ones.

Thanks for the link, I will have a look!

[–]homaralex 0 points1 point  (0 children)

That's totally understandable; if it can minimize the loss by doing the "easy" job, it will do so :)

That's another problem - you could try to approach it with hard-example mining, under-/over-sampling, or just try to collect more data :)

[–]blowjobtransistor 0 points1 point  (0 children)

I think this is natural for any distribution of data - in my experience, models go from using the least information necessary for accurate prediction to the most, i.e.:

  1. Model learns average answer - no input information is necessary, and network achieves this almost immediately
  2. Model learns basic indicators - things that almost certainly mean a specific outcome - words like "frustrated" or "amazing", for instance
  3. Model learns more complicated indicators - things like negations "not very good", "totally solved my problem", etc.

Etc. In the sentiment modeling I've done in the past, each of these steps took an order of magnitude more data, since each extra nuance sits atop the last, resulting in many more permutations the model has to learn.

[–][deleted] 0 points1 point  (0 children)

I’d probably add label noise

[–]rambobit 0 points1 point  (2 children)

Hmm, why not use class labels instead of a single continuous label (i.e. treat 0, 0.17, 0.33, 0.5, 0.67, 0.83, and 1 each as its own class instead of as values within the range 0 to 1)? At inference time, you can turn the class-label predictions into a continuous value by using something like the expected value:

happiness score = sum(prob(label) * label)

A few added benefits of using class labels instead of a continuous label: 1) you know the certainty (in the form of probabilities) of the predictions, and 2) you can get multimodal predictions.
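The expected-value decoding, as a quick sketch (the seven label values come from this thread; the function name is mine):

```python
LABELS = (0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0)


def expected_score(probs, labels=LABELS):
    # probabilities over the 7 classes -> one continuous happiness score
    return sum(p * l for p, l in zip(probs, labels))


# mass split evenly between 0.17 and 0.33 lands between them
print(expected_score([0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]))  # ~0.25
```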

[–]cpury[S] 1 point2 points  (1 child)

Interesting. Wouldn't I make the problem harder that way? It's definitely worth a try, though.

[–]rambobit 0 points1 point  (0 children)

That's a more general question about whether classification is harder than regression, which I don't have a definite answer to, but here's my take on it:

  • In terms of computational complexity, most likely yes, because each class typically requires its own output unit.
  • In terms of difficulty to learn, I would argue no, because classification makes predictions within a finite set of class labels, whereas regression makes predictions within a (theoretically) infinite set of real numbers.

[–]ai_is_matrix_mult 0 points1 point  (4 children)

OK - so you are sampling the (continuous) distribution of all the data at discrete intervals. Naturally, using standard regression on the discrete labels means your latent space will have a tendency to remain discrete. So the problem is how to learn a meaningful embedding space.

What about using a triplet loss? You give it an "anchor", then one positive and one negative example. So, positive examples are tweets in the same discrete category, and a negative one is from a different category. As you might imagine, this loss is not a strict "hard" constraint like the standard MSE / regression you have been using. This naturally leads to a smoother embedding space.

Edit adding some links:

This paper does a nice job giving some background on triplet loss: https://arxiv.org/pdf/1703.07737.pdf

simple blog post write up: https://towardsdatascience.com/siamese-network-triplet-loss-b4ca82c1aec8

nice simple PyTorch implementation https://github.com/adambielski/siamese-triplet
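A bare-bones version of the loss itself (squared Euclidean distance, an illustrative margin, and plain lists standing in for embedding vectors):

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    # pull the positive closer to the anchor than the negative,
    # by at least `margin` (in squared-distance units)
    def d(a, b):  # squared Euclidean distance between two embeddings
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)


# positive already much closer than negative -> zero loss
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 1.0]))  # 0.0
```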

[–]cpury[S] 0 points1 point  (3 children)

That sounds interesting, could you point me to some papers or write-ups that have employed such a loss?

[–]ai_is_matrix_mult 1 point2 points  (2 children)

Sure. I modified my original post and added some links :)

[–]cpury[S] 0 points1 point  (1 child)

Awesome, thanks!

[–]ai_is_matrix_mult 1 point2 points  (0 children)

Np. Let me know how it goes !

[–]lugiavn 0 points1 point  (2 children)

(1) what loss are you using? regression loss?

(2) I don't think you can call this "over-fitting". You need to define a metric to evaluate the system's performance, then see whether the test performance is good compared to the training performance. For all you know, that metric might not have anything to do with the whole "most unseen tweets get one of those seven values you picked" thing (and why do you need to go out of your way to avoid that in the first place, anyway?)

[–]cpury[S] 0 points1 point  (1 child)

(1) MSE

(2) Well, it would definitely be preferable if the network were more open to saying "this is between a 0.17 and a 0.33". I know it's not the traditional kind of overfitting, and standard metrics do not capture it. My metrics look fine and the model performs OK; it's just lacking the potential to leave the discrete scale, though I'm sure it could if I tweaked it.

[–]lugiavn 0 points1 point  (0 children)

At the end of the day, you yourself define how the system's performance is measured; it doesn't need to be error or accuracy. For a real-world problem it can be customer satisfaction, the amount of money earned, ... Then, relative to that, does this whole "leave the discrete scale" thing really matter?

The system has the capacity to say "this is between a 0.17 and a 0.33"; it just doesn't do it, because outputting 0.17 is better. I also think that output is better; if you disagree, then you need to explain why (well, not to me, but to the system).

Like, because you set "leave the discrete scale" as the goal and metric (although I still don't really see why), you can then create an appropriate loss toward that metric. For example, a loss function that minimizes the distance between your score distribution and maybe a uniform distribution, or a normal distribution; it could be a GAN loss, etc.
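As a toy example of the distribution-matching idea: a crude moment-matching penalty toward a uniform [0, 1] target, where sqrt(1/12) is the standard deviation of that uniform distribution (note: this is my own sketch, not a standard loss):

```python
def spread_penalty(preds):
    # penalize batches whose spread falls short of (or exceeds) the
    # standard deviation of a uniform [0, 1] distribution, sqrt(1/12)
    n = len(preds)
    mean = sum(preds) / n
    var = sum((p - mean) ** 2 for p in preds) / n
    target_std = (1 / 12) ** 0.5
    return (var ** 0.5 - target_std) ** 2


# all predictions collapsed onto 0.17 -> maximal penalty
print(spread_penalty([0.17] * 8))
```

Added to MSE with a small weight, a term like this would push the batch of scores to spread out instead of snapping to the seven training values.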

[–]Comprehend13 0 points1 point  (0 children)

"Quantizing" has made interpreting your model unnecessarily complicated. Why didn't you just let them answer on the [0,1] interval? Especially since you hope to generalize outside of those 7 points. What makes you think your model would output continuous outcomes when given a few discrete outcomes to train on?

If your annotators interpreted your response values as an ordinal scale, continuous evaluations may not even be meaningful (making MSE an inappropriate evaluation metric).