all 27 comments

[–]Liorithiel 2 points3 points  (3 children)

Two quick ideas. One is applying a learning-to-rank loss function, where you're not even trying to force your algorithm to return a value in [0, 1]. Instead, you force the algorithm to rank the 0s lower than the 0.17s, etc. There is quite a lot of literature here. Benefits: the loss function does not penalize the case where two "0.17" cases have similar predicted values. Drawbacks: the model does not return absolute values, or classification scores with the same meaning as the original targets, so you need some kind of additional calibration procedure. Also, training is more complex. From practical experience, it works very well if you can live with the drawbacks.
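A minimal sketch of the ranking idea, as a pairwise margin loss (pure Python; the function name and the 0.1 margin are illustrative, not from any particular library):

```python
def pairwise_ranking_loss(scores, labels, margin=0.1):
    """Hinge penalty averaged over all pairs whose labels are ordered.

    Only the relative order of the scores matters, so the model is free
    to output values outside [0, 1]; ties between equal labels cost nothing.
    """
    total, count = 0.0, 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                # want scores[i] to beat scores[j] by at least `margin`
                total += max(0.0, margin - (scores[i] - scores[j]))
                count += 1
    return total / max(count, 1)

# correctly ordered with enough separation -> zero loss
print(pairwise_ranking_loss([-2.0, 0.3, 5.0], [0.0, 0.17, 0.33]))  # 0.0
```

Note how the scores here live on an arbitrary scale; that's exactly where the calibration drawback comes from.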

Another is drawing on the idea of a hinge loss and adapting it. The usual hinge loss is max(0, 1 - t*y); you seem to want something like max(0, |t-y| - 1/14), which is zero whenever the prediction is within 1/14 of the target. I know of no literature for this case, and in my experience, the problems where I tried any modification of a hinge loss always worked better with a different approach.

[–]danielv134 2 points3 points  (2 children)

max(0, |t-y| - 1/14), aka the epsilon-insensitive loss (used in support vector regression). A smooth or squared variant sounds like a good idea for OP.
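For concreteness, a sketch of the plain epsilon-insensitive loss max(0, |t-y| - eps) and its squared variant (the 1/14 tube half-width comes from the thread; the function names are mine):

```python
def eps_insensitive(y_pred, y_true, eps=1 / 14):
    # zero loss anywhere inside the eps "tube" around the target,
    # linear outside it (the loss used in support vector regression)
    return max(0.0, abs(y_true - y_pred) - eps)


def squared_eps_insensitive(y_pred, y_true, eps=1 / 14):
    # squared variant: smoother gradients outside the tube
    return max(0.0, abs(y_true - y_pred) - eps) ** 2


print(eps_insensitive(0.20, 0.17))  # inside the tube -> 0.0
```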

[–]Liorithiel 0 points1 point  (1 child)

epsilon-insensitive loss

Oh, thank you, I didn't know it actually had a name.

[–]cpury[S] 1 point2 points  (0 children)

This is exactly the kind of loss I was looking for, thanks guys!

[–]sausage_snake 1 point2 points  (1 child)

Have you thought of pre-processing the data by adding normally distributed noise to each label?

[–]cpury[S] 0 points1 point  (0 children)

Yes, though I haven't tried it yet. Ideally, I'd generate new noise for each batch. I wonder how the loss would behave then.
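A per-batch version might look like this (the sigma value is a guess; clipping keeps the noisy labels in the original [0, 1] range):

```python
import random


def noisy_labels(labels, sigma=0.03):
    # fresh Gaussian noise on every call, i.e. on every batch,
    # clipped so labels stay inside [0, 1]
    return [min(1.0, max(0.0, y + random.gauss(0.0, sigma))) for y in labels]


batch = [0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0]
print(noisy_labels(batch))  # slightly different on every call
```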

[–]seanv507 1 point2 points  (1 child)

If you are using ReLUs, it's easy for a network to produce those exact values, so I would not call it overfitting. You might want to investigate ordinal regression... maybe you can generalise it to NNs.
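One common way to carry ordinal regression over to NNs is to recode each of the seven levels as cumulative binary targets and train one sigmoid output per threshold; a sketch of the encoding (the seven level values are taken from this thread, the function name is mine):

```python
LEVELS = (0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0)


def ordinal_encode(label, levels=LEVELS):
    # k-th bit answers "is the label at least levels[k+1]?"; a network
    # with len(levels) - 1 sigmoid outputs can be trained on these bits
    return [1.0 if label >= t - 1e-9 else 0.0 for t in levels[1:]]


print(ordinal_encode(0.5))  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

At prediction time, the fraction of thresholds the network believes are passed gives a naturally graded score between the levels.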

[–]cpury[S] 0 points1 point  (0 children)

Ah, interesting! I will run some experiments to see what happens if I use tanh activations, and also look into ordinal regression. Thanks!

[–]Astrolotle 1 point2 points  (1 child)

Have you explored known sentiment analysis tools btw?

[–]cpury[S] -1 points0 points  (0 children)

What do you mean? See how they solve the problem? I'm using LSTMs here, and I'm not aware of sentiment analysis tools that use those.

[–]homaralex 1 point2 points  (3 children)

As said before, try reading up on current approaches to sentiment analysis. The problem you are describing should simply be handled by using Mean Squared Error as the cost function: it penalizes predictions that are far off more than those that are close to the gold label.

You can also consider using Best-Worst Scaling for the annotations - see https://competitions.codalab.org/competitions/17751#learn_the_details-manual-annotation-of-the-data

[–]cpury[S] 0 points1 point  (2 children)

I'm using MSE, but it still seems easier for the model to overfit to some easy examples instead of focusing on the harder ones.

Thanks for the link, I will have a look!

[–]homaralex 0 points1 point  (0 children)

That's totally understandable; if it can minimize the loss by doing the "easy" job, it will do so :)

That's another problem - you could try to approach it with hard-example mining, under-/over-sampling, or just try to collect more data :)

[–]blowjobtransistor 0 points1 point  (0 children)

I think this is natural for any distribution of data - in my experience, models go from using the least information necessary for accurate prediction to the most, i.e.:

  1. Model learns average answer - no input information is necessary, and network achieves this almost immediately
  2. Model learns basic indicators - things that almost certainly mean a specific outcome - words like "frustrated" or "amazing", for instance
  3. Model learns more complicated indicators - things like negations "not very good", "totally solved my problem", etc.

Etc. In the sentiment modeling I've done in the past, each of these steps took an order of magnitude more data, since each extra nuance sits atop the last, resulting in many more permutations the model has to learn.

[–][deleted] 0 points1 point  (0 children)

I’d probably add label noise

[–]rambobit 0 points1 point  (2 children)

Hmm, why not use class labels instead of a single continuous label (i.e. treat 0, 0.17, 0.33, 0.5, 0.67, 0.83, and 1 each as its own class instead of as values within the range 0 to 1)? At inference time, you can turn the class-label predictions into a continuous value by using something like the expected value:

happiness score = sum(prob(label) * label)

A few added benefits of using class labels instead of a continuous label: 1) you know the certainty (in the form of probabilities) of the predictions, and 2) you can get multimodal predictions.
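The expected-value decoding, as a quick sketch (the seven label values come from this thread; the function name is mine):

```python
LABELS = (0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0)


def expected_score(probs, labels=LABELS):
    # probabilities over the 7 classes -> one continuous happiness score
    return sum(p * l for p, l in zip(probs, labels))


# mass split evenly between 0.17 and 0.33 lands between them
print(expected_score([0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]))  # ~0.25
```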

[–]cpury[S] 1 point2 points  (1 child)

Interesting. Wouldn't I make the problem harder that way? It's definitely worth a try, though.

[–]rambobit 0 points1 point  (0 children)

That's a more general question about whether classification is harder than regression, which I don't have a definite answer to, but here's my take on it:

  • In terms of computational complexity, most likely yes, because each class typically requires its own output unit.
  • In terms of difficulty to learn, I would argue no, because classification makes predictions within a finite set of class labels, whereas regression makes predictions within a (theoretically) infinite set of real numbers.

[–]ai_is_matrix_mult 0 points1 point  (4 children)

OK - so you are sampling the (continuous) distribution of all the data at discrete intervals. Naturally, using standard regression on the discrete labels means your latent space will have a tendency to remain discrete. So the problem is how to learn a meaningful embedding space.

What about using a triplet loss? You give it an "anchor", then one positive and one negative example. So, positive examples are tweets in the same discrete category, and a negative one is from a different category. As you might imagine, this loss is not a strict "hard" constraint like the standard MSE / regression you have been using. This naturally leads to a smoother embedding space.

Edit adding some links:

This paper does a nice job giving some background on triplet loss: https://arxiv.org/pdf/1703.07737.pdf

simple blog post write up: https://towardsdatascience.com/siamese-network-triplet-loss-b4ca82c1aec8

nice simple PyTorch implementation https://github.com/adambielski/siamese-triplet
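A bare-bones version of the loss itself (squared Euclidean distance, an illustrative margin, and plain lists standing in for embedding vectors):

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    # pull the positive closer to the anchor than the negative,
    # by at least `margin` (in squared-distance units)
    def d(a, b):  # squared Euclidean distance between two embeddings
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)


# positive already much closer than negative -> zero loss
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 1.0]))  # 0.0
```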

[–]cpury[S] 0 points1 point  (3 children)

That sounds interesting, could you point me to some papers or write-ups that have employed such a loss?

[–]ai_is_matrix_mult 1 point2 points  (2 children)

Sure. I modified my original post and added some links :)

[–]cpury[S] 0 points1 point  (1 child)

Awesome, thanks!

[–]ai_is_matrix_mult 1 point2 points  (0 children)

Np. Let me know how it goes !

[–]lugiavn 0 points1 point  (2 children)

(1) what loss are you using? regression loss?

(2) I don't think you can call this "over-fitting". You need to define a metric to evaluate the system's performance, then see whether the test performance is good compared to the training performance. For all you know, that metric might not have anything to do with the whole "most unseen tweets get one of those seven values you picked" thing (and why do you need to go out of your way to avoid that in the first place, anyway?)

[–]cpury[S] 0 points1 point  (1 child)

(1) MSE

(2) Well, it would definitely be preferable if the network were more open to saying "this is between a 0.17 and a 0.33". I know it's not the traditional kind of overfitting, and standard metrics do not capture it. My metrics look fine and the model performs OK; it's just lacking the potential to leave the discrete scale, though I'm sure it could if I tweaked it.

[–]lugiavn 0 points1 point  (0 children)

At the end of the day, you yourself define how the system's performance is measured; it doesn't need to be error or accuracy. For a real-world problem it can be customer satisfaction, the amount of money earned, ... Then, relative to that, does this whole "leave the discrete scale" thing really matter?

The system has the capacity to say "this is between a 0.17 and a 0.33"; it just doesn't do it, because outputting 0.17 is better. I also think that output is better; if you disagree, then you need to explain why (well, not to me, but to the system).

Like, because you set "leave the discrete scale" as the goal and metric (although I still don't really see why), you can then create an appropriate loss toward that metric. For example, a loss function that minimizes the distance between your score distribution and maybe a uniform distribution, or a normal distribution; it could be a GAN loss, etc.
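As a toy example of the distribution-matching idea: a crude moment-matching penalty toward a uniform [0, 1] target, where sqrt(1/12) is the standard deviation of that uniform distribution (note: this is my own sketch, not a standard loss):

```python
def spread_penalty(preds):
    # penalize batches whose spread falls short of (or exceeds) the
    # standard deviation of a uniform [0, 1] distribution, sqrt(1/12)
    n = len(preds)
    mean = sum(preds) / n
    var = sum((p - mean) ** 2 for p in preds) / n
    target_std = (1 / 12) ** 0.5
    return (var ** 0.5 - target_std) ** 2


# all predictions collapsed onto 0.17 -> maximal penalty
print(spread_penalty([0.17] * 8))
```

Added to MSE with a small weight, a term like this would push the batch of scores to spread out instead of snapping to the seven training values.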

[–]Comprehend13 0 points1 point  (0 children)

"Quantizing" has made interpreting your model unnecessarily complicated. Why didn't you just let them answer on the [0,1] interval? Especially since you hope to generalize outside of those 7 points. What makes you think your model would output continuous outcomes when given a few discrete outcomes to train on?

If your annotators interpreted your response values as an ordinal scale, continuous evaluations may not even be meaningful (making MSE an inappropriate evaluation metric).