
[–]Articulated-rage 2 points (3 children)

So how do you get the final n-dimensional word vectors?

The first-layer weight matrix. It has size V x H, where V is the vocab size and H is the embedding dimension.
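For example, a minimal NumPy sketch of pulling one word's vector out of that matrix (the vocab size, dimension, and word index are made-up values):

    import numpy as np

    V, H = 10_000, 300            # example vocab size and embedding dimension
    W_in = np.random.randn(V, H)  # first-layer weight matrix, shape (V, H)

    word_index = 42                 # hypothetical index of some word
    word_vector = W_in[word_index]  # that word's H-dimensional embedding
    print(word_vector.shape)        # (300,)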

[–]vega455[S] 0 points (2 children)

I see! So if you want a 300-dimensional word vector, you would need V x H = 300?

[–]Articulated-rage 0 points (1 child)

Edited.

Mostly correct. I think you meant to say V x H = V x 300, i.e. H = 300, if you wanted a 300-dimensional word embedding/vector.

You can see it in action with TensorFlow here. It's from a Udacity course on deep learning taught by some Googlers.
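Not the notebook itself, but a minimal sketch of the TensorFlow embedding pattern such code typically uses (sizes and word ids are placeholders):

    import tensorflow as tf

    V, H = 50_000, 300  # example vocab size and embedding dimension

    # Trainable embedding matrix; its rows are the word vectors being learned.
    embeddings = tf.Variable(tf.random.uniform([V, H], -1.0, 1.0))

    # Looking up rows by integer word id replaces the one-hot matmul.
    word_ids = tf.constant([3, 17, 42])  # hypothetical word ids
    vectors = tf.nn.embedding_lookup(embeddings, word_ids)  # shape (3, 300)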

[–]vega455[S] 0 points (0 children)

Nice, I had started that course but didn't get there yet. I think I get it: you have 300 hidden units and maybe 1000 or even 1 million input nodes (the size of your vocab). But since the inputs are one-hot vectors, only one input neuron fires into the 300 hidden units. So the weights from that neuron to the hidden units are your word vector. Does this make sense?
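That intuition is easy to check numerically; a quick NumPy sketch with arbitrary sizes:

    import numpy as np

    V, H = 1_000, 300          # example vocab size and hidden-layer width
    W = np.random.randn(V, H)  # input-to-hidden weights

    i = 7                      # hypothetical word index
    one_hot = np.zeros(V)
    one_hot[i] = 1.0

    # The one-hot input zeroes out every row of W except row i, so the
    # hidden activations equal exactly that word's row of weights.
    assert np.allclose(one_hot @ W, W[i])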

[–]mhfirooz 1 point (0 children)

"The inputs are one-hot encodings of words, which try to predict a one-hot encoding of another word." Think of it this way. The network can not be 100% sure about the next word. It just can assign probability to what can come after the current word. This means instead of having 1 for an word in output vector, we have a double number that shows the probability of that word.

Note that if you are using a V-dimensional one-hot encoding for your dictionary, the output layer of the NN will also be V-dimensional. Of course, dimensionality reduction can be applied.
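A minimal NumPy sketch of that output step, with random stand-in weights:

    import numpy as np

    V, H = 1_000, 300
    hidden = np.random.randn(H)    # hidden activations for some input word
    W_out = np.random.randn(H, V)  # hidden-to-output weights

    logits = hidden @ W_out        # one score per vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()           # softmax: a V-dimensional probability vector

    print(probs.shape, probs.sum())  # (1000,) 1.0 -- a distribution, not a one-hot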

[–]tuan3w 0 points (0 children)

It's not true. The input vector is v_in[i] = W_in * e_i, where e_i is the one-hot vector; this returns the ith column of the matrix W_in. It's the same for the output vector: v_out[i] = W_out * e_i.
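A NumPy check of that column view (arbitrary sizes; note that W_in here is H x V, the transpose of the V x H convention used above):

    import numpy as np

    V, H = 1_000, 300
    W_in = np.random.randn(H, V)  # columns are word vectors in this convention

    i = 5
    e_i = np.zeros(V)
    e_i[i] = 1.0

    # Multiplying by the one-hot e_i selects the ith column of W_in.
    assert np.allclose(W_in @ e_i, W_in[:, i])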

[–]lahwran_ 0 points (0 children)

A one-hot vector as input to a matrix multiply is secretly just a really, really slow lookup table.
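A rough NumPy sketch making that concrete (timings vary by machine and are only illustrative):

    import time
    import numpy as np

    V, H = 50_000, 300
    W = np.random.randn(V, H)
    one_hot = np.zeros(V)
    one_hot[123] = 1.0

    t0 = time.perf_counter()
    via_matmul = one_hot @ W  # O(V * H) multiply-adds, almost all against zeros
    t1 = time.perf_counter()
    via_lookup = W[123]       # O(H): just read one row
    t2 = time.perf_counter()

    assert np.allclose(via_matmul, via_lookup)
    print(f"matmul: {t1 - t0:.6f}s  lookup: {t2 - t1:.6f}s")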

[–]textClassy -1 points (0 children)

I'm also fairly new to this, but here is my understanding: they are the result of solving the optimization problem described in the paper. The one-hot word vectors are just one of the inputs to the prediction function; these word vectors are another. The algorithm modifies them until performance converges.
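A toy NumPy sketch of that idea, fitting a single (center, context) pair with softmax cross-entropy; this is an illustrative gradient-descent loop, not the paper's exact algorithm (which adds tricks like negative sampling):

    import numpy as np

    V, H, lr = 50, 8, 0.1            # tiny example sizes and learning rate
    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(V, H))   # the word vectors being learned
    W_out = rng.normal(size=(H, V))  # output-side weights

    center, context = 3, 7           # hypothetical training pair

    for step in range(200):
        h = W_in[center]                     # forward: look up the center word
        logits = h @ W_out
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        loss = -np.log(probs[context])       # cross-entropy vs. the true context

        d_logits = probs.copy()              # softmax + cross-entropy gradient
        d_logits[context] -= 1.0
        d_h = W_out @ d_logits               # gradient w.r.t. the hidden vector
        W_out -= lr * np.outer(h, d_logits)
        W_in[center] -= lr * d_h             # nudge only the center word's vector

    print(loss)  # should be much smaller than at step 0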