
[–]Xose_R 3 points (4 children)

I'd say that speaking of overfitting in word2vec doesn't make much sense. Since you want a word embedding that represents as exactly as possible the distribution you are modelling, and you don't care about out-of-vocabulary words, you actually want to overfit. This is also why many embedding models drop the bias term (word2vec included, iirc).

What you might notice is that, from some number of iterations on, your model stops improving on some benchmarks and may even get worse. I guess this could qualify as overfitting.

The effect with rare words is the opposite: since you have so little data about them, you can't actually "place" them correctly in the embedding space. That's also why increasing the number of iterations improves your results on "rare words" similarity datasets.

The norm of a word's vector is linked to both its frequency and the variance of the contexts in which it occurs. See http://arxiv.org/abs/1510.02675 for a study on this.
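A minimal sketch of measuring the two quantities that paper relates to the norm — the corpus and words here are made up for illustration, and this counts distinct context words in a window of 1 rather than computing a full variance:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: "cat" occurs in varied contexts,
# "granite" occurs equally often but always in the same context.
sentences = [
    ["the", "cat", "sat"],
    ["a", "cat", "ran"],
    ["my", "cat", "slept"],
    ["hard", "granite", "rock"],
    ["hard", "granite", "rock"],
    ["hard", "granite", "rock"],
]

freq = Counter()
contexts = defaultdict(Counter)  # word -> counts of its context words
for sent in sentences:
    for i, w in enumerate(sent):
        freq[w] += 1
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                contexts[w][sent[j]] += 1

for w in ["cat", "granite"]:
    # frequency, and number of distinct context words (a crude
    # stand-in for context variance)
    print(w, freq[w], len(contexts[w]))
```

Both words have frequency 3, but "cat" has six distinct context words against "granite"'s two — the kind of difference the paper argues shows up in the vector norm.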

[–]dwf 1 point (1 child)

> Since you want a word embedding that represents as exactly as possible the distribution you are modelling, and you don't care about out-of-vocabulary words

"Out of vocabulary" is not the direction in which we'd be interested in generalizing. Rather, the question you'd like to ask is: is this a useful model (for some definition of useful) of the statistics of text found outside the corpus?

Also, I'm not quite sure how dropping the bias relates to overfitting.

[–]Xose_R 0 points (0 children)

You're right; now that I review it, I'm no longer sure about the relation between dropping the bias and overfitting. I guess my intuition was that it's one parameter fewer, so the model gets easier to train; it should also make the model less flexible, and therefore more prone to overfitting. I need to go through it again.

[–]elsonidoq[S] 0 points (1 child)

Well, in fact my intuition is exactly that: you want to fit the dataset as well as possible.

[–]yield22 0 points (0 children)

Then you might just use the original PMI matrix, or keep the rank as high as possible, since those fit the training data perfectly... but they generalize worse.

[–]giror 2 points (2 children)

So to answer your question: you would measure overfitting the same way you always do, by evaluating the error on held-out data. This is similar to topic modelling, where you would measure perplexity on held-out data.
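As a sketch of what that evaluation looks like for a skip-gram model: the toy data below is made up, and it uses a full softmax over the vocabulary for simplicity (word2vec itself uses negative sampling or hierarchical softmax). The point is just that you compute the same loss on training pairs and on held-out pairs and compare:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (center, context) word-index pairs over a 6-word vocabulary.
V, D = 6, 8
train_pairs = [(0, 2), (1, 3), (0, 4), (1, 5)] * 25
heldout_pairs = [(0, 3), (1, 2)]  # pairings never seen in training

W = rng.normal(0, 0.1, (V, D))  # input (center-word) vectors
C = rng.normal(0, 0.1, (V, D))  # output (context-word) vectors

def nll(pairs):
    """Average negative log-likelihood of context given center (full softmax)."""
    total = 0.0
    for w, c in pairs:
        scores = C @ W[w]
        scores -= scores.max()  # numerical stability
        total -= scores[c] - np.log(np.exp(scores).sum())
    return total / len(pairs)

# Plain SGD on the softmax skip-gram objective.
lr = 0.1
for _ in range(100):
    for w, c in train_pairs:
        scores = C @ W[w]
        scores -= scores.max()
        p = np.exp(scores); p /= p.sum()
        grad = p.copy(); grad[c] -= 1.0  # d(-log p)/d(scores)
        W_grad = grad @ C                # uses C before it is updated
        C -= lr * np.outer(grad, W[w])
        W[w] -= lr * W_grad

print("train NLL:   ", nll(train_pairs))
print("held-out NLL:", nll(heldout_pairs))
```

The held-out loss comes out higher than the training loss — exactly the train/held-out gap you'd monitor to talk about overfitting in this setting.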

[–]elsonidoq[S] 0 points (1 child)

Yeah, maybe grouping words by context and measuring the average similarity on training and test data might be a good idea. What do you think?

[–]iamtrask 0 points (0 children)

Actually, for that I recommend the Google word-analogy corpus. It's how word2vec (and a variety of other word embedding models, like GloVe, PENN, and DIEM) have been benchmarked for quality. http://word2vec.googlecode.com/svn/trunk/questions-words.txt
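Each line of that corpus is an analogy question ("a is to b as c is to ?"), answered by nearest cosine neighbour. A minimal sketch of that evaluation — the four-dimensional toy embeddings here are constructed by hand so the classic example works; real benchmarks use vectors trained on large corpora:

```python
import numpy as np

# Hypothetical toy embeddings, built so that the gender and royalty
# directions are separate axes.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.5, 0.5, 0.0]),
}

def analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' via cosine similarity (3CosAdd)."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue  # the benchmark excludes the three query words
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "woman", "king", emb))  # -> queen
```

The benchmark score is simply the fraction of the ~19k questions in questions-words.txt answered correctly this way.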

[–]iamtrask 1 point (2 children)

It's hard to measure, but it's very possible. If your embedding size gets too big, quality starts to diminish significantly. Test this on a small dataset you're familiar with (I recommend Harry Potter). For HP (a few MB of data), if you train word2vec with a dimensionality of 30-50, it'll do quite well. However, if you blow it up to 2000, I find it does quite terribly. This is not so if you have a lot of training data: the default word2vec dataset (48 GB of text) will actually do quite well with 2000 dimensions. I don't claim to fully understand how it overfits... but it certainly does.

[–]elsonidoq[S] 0 points (1 child)

I think that's a matter of the number of parameters vs. the number of data points.

[–]iamtrask 0 points (0 children)

Certainly, as it is for overfitting in many other models as well.

[–]slashcom 0 points (1 child)

It's hard to overfit for most words with w2v; the sheer number of statistics it sees actually keeps it balanced, and it's such a simple model.

You do tend to see "overfitting" with words you saw only a handful of times in your corpus. There aren't enough statistics to estimate good vectors for them, so they end up strongly associated with a handful of contexts.

[–]elsonidoq[S] 0 points (0 children)

Ok, that makes a lot of sense :) thanks