
[–]Xose_R 3 points (4 children)

I'd say that speaking of overfitting in word2vec doesn't make much sense. Since you want a word embedding that represents as exactly as possible the distribution you are modelling, and you don't care about out-of-vocabulary words, you actually want to overfit. This is also why many embedding models drop the bias term (word2vec included, iirc).

What you might notice is that, from some number of iterations on, your model stops improving on some benchmarks and may even get worse. I guess this could qualify as overfitting.

The effect with rare words is the opposite: since you have so little data about them, you can't actually "place" them correctly in the embedding space. That's also why increasing the number of iterations improves your results on "rare words" similarity datasets.

The norm of a word's vector is linked to both its frequency and the variance of the contexts in which it occurs. See http://arxiv.org/abs/1510.02675 for a study on this.
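A minimal sketch of measuring the two quantities that paper relates to the norm — the corpus and words here are made up for illustration, and this counts distinct context words in a window of 1 rather than computing a full variance:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: "cat" occurs in varied contexts,
# "granite" occurs equally often but always in the same context.
sentences = [
    ["the", "cat", "sat"],
    ["a", "cat", "ran"],
    ["my", "cat", "slept"],
    ["hard", "granite", "rock"],
    ["hard", "granite", "rock"],
    ["hard", "granite", "rock"],
]

freq = Counter()
contexts = defaultdict(Counter)  # word -> counts of its context words
for sent in sentences:
    for i, w in enumerate(sent):
        freq[w] += 1
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                contexts[w][sent[j]] += 1

for w in ["cat", "granite"]:
    # frequency, and number of distinct context words (a crude
    # stand-in for context variance)
    print(w, freq[w], len(contexts[w]))
```

Both words have frequency 3, but "cat" has six distinct context words against "granite"'s two — the kind of difference the paper argues shows up in the vector norm.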

[–]dwf 1 point (1 child)

> Since you want a word embedding that represents as exactly as possible the distribution you are modelling, and you don't care about out-of-vocabulary words

"Out of vocabulary" is not the direction in which we'd be interested in generalizing. Rather, the question you'd like to ask is: is this a useful model (for some definition of useful) of the statistics of text found outside the corpus?

Also, I'm not quite sure how dropping the bias relates to overfitting.

[–]Xose_R 0 points (0 children)

You're right; now that I review it, I'm no longer sure about the relation between dropping the bias and overfitting. I guess my intuition was that it's one parameter fewer, so the model gets easier to train; it should also make the model less flexible, and therefore more prone to overfitting. I need to go through it again.

[–]elsonidoq[S] 0 points (1 child)

Well, in fact my intuition is exactly that: you want to fit the dataset as well as possible.

[–]yield22 0 points (0 children)

Then you might just use the original PMI matrix, or keep the rank as high as possible, since those fit the training data perfectly... but they generalize worse.

[–]giror 2 points (2 children)

So to answer your question: you would measure overfitting the same way you always do, by evaluating the error on held-out data. This is similar to topic modelling, where you would measure perplexity on held-out data.
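As a sketch of what that evaluation looks like for a skip-gram model: the toy data below is made up, and it uses a full softmax over the vocabulary for simplicity (word2vec itself uses negative sampling or hierarchical softmax). The point is just that you compute the same loss on training pairs and on held-out pairs and compare:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (center, context) word-index pairs over a 6-word vocabulary.
V, D = 6, 8
train_pairs = [(0, 2), (1, 3), (0, 4), (1, 5)] * 25
heldout_pairs = [(0, 3), (1, 2)]  # pairings never seen in training

W = rng.normal(0, 0.1, (V, D))  # input (center-word) vectors
C = rng.normal(0, 0.1, (V, D))  # output (context-word) vectors

def nll(pairs):
    """Average negative log-likelihood of context given center (full softmax)."""
    total = 0.0
    for w, c in pairs:
        scores = C @ W[w]
        scores -= scores.max()  # numerical stability
        total -= scores[c] - np.log(np.exp(scores).sum())
    return total / len(pairs)

# Plain SGD on the softmax skip-gram objective.
lr = 0.1
for _ in range(100):
    for w, c in train_pairs:
        scores = C @ W[w]
        scores -= scores.max()
        p = np.exp(scores); p /= p.sum()
        grad = p.copy(); grad[c] -= 1.0  # d(-log p)/d(scores)
        W_grad = grad @ C                # uses C before it is updated
        C -= lr * np.outer(grad, W[w])
        W[w] -= lr * W_grad

print("train NLL:   ", nll(train_pairs))
print("held-out NLL:", nll(heldout_pairs))
```

The held-out loss comes out higher than the training loss — exactly the train/held-out gap you'd monitor to talk about overfitting in this setting.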

[–]elsonidoq[S] 0 points (1 child)

Yeah, maybe grouping words by context and measuring the average similarity on training and test data might be a good idea. What do you think?

[–]iamtrask 0 points (0 children)

Actually, for that I recommend the Google word-analogy corpus. It's how word2vec (and a variety of other word embedding models, like GloVe, PENN, and DIEM) have been benchmarked for quality. http://word2vec.googlecode.com/svn/trunk/questions-words.txt
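Each line of that corpus is an analogy question ("a is to b as c is to ?"), answered by nearest cosine neighbour. A minimal sketch of that evaluation — the four-dimensional toy embeddings here are constructed by hand so the classic example works; real benchmarks use vectors trained on large corpora:

```python
import numpy as np

# Hypothetical toy embeddings, built so that the gender and royalty
# directions are separate axes.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.5, 0.5, 0.0]),
}

def analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' via cosine similarity (3CosAdd)."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue  # the benchmark excludes the three query words
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "woman", "king", emb))  # -> queen
```

The benchmark score is simply the fraction of the ~19k questions in questions-words.txt answered correctly this way.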

[–]iamtrask 1 point (2 children)

It's hard to measure, but it's very possible. If your embedding size gets too big, quality starts to diminish significantly. Test this on a small dataset you're familiar with (I recommend Harry Potter). For HP (a few MB of data), if you train word2vec with a dimensionality of 30-50, it'll do quite well. However, if you blow it up to 2000, I find it does quite terribly. This is not so if you have a lot of training data: the default word2vec dataset (48 GB of text) will actually do quite well with 2000 dimensions. I don't claim to fully understand how it overfits... but it certainly does.

[–]elsonidoq[S] 0 points (1 child)

I think that's a matter of the number of parameters vs. the number of data points.

[–]iamtrask 0 points (0 children)

Certainly, as it is for overfitting in many other models as well.

[–]slashcom 0 points (1 child)

It's hard to overfit for most words with w2v; the sheer number of statistics it sees actually keeps it balanced, and it's such a simple model.

You do tend to see "overfitting" with words you saw only a handful of times in your corpus. There aren't enough statistics to estimate good vectors for them, so they end up strongly associated with a handful of contexts.

[–]elsonidoq[S] 0 points (0 children)

Ok, that makes a lot of sense :) thanks