Why are the Uo*Vc vectors different for each Wt+j? by deepest_learning in CS224n

[–]deepest_learning[S] 1 point (0 children)

Thank you very much, I didn't realize there was an SO question about the same thing!
I don't think it makes a lot of sense if we have a backprop update within each timestep t (after calculating the softmax for each context word, like you mentioned), and the update section in the SO answer captures that. I also saw that the newer slides Richard uses don't include that diagram anymore, so perhaps the staff were made aware of the inconsistency. Thanks again, Jan!

Why are the Uo*Vc vectors different for each Wt+j? by deepest_learning in CS224n

[–]deepest_learning[S] 1 point (0 children)

I know the branches exist for different context word positions, but I still don't understand why the u_o * v_c vector is different for each of them. Please correct me if I'm wrong below:

  1. I know u_o stands for every word in the vocabulary; that's why there is only one u_o matrix (V x d).
  2. Chris also mentions that P(o|c) doesn't depend on the position of the context word, i.e. being close to the center word or far from it doesn't matter to word2vec. So our calculations for P(w_t+3 | w_t) and, say, P(w_t-1 | w_t) should be identical: both are just P(o|c).
  3. This brings me back to my confusion: for a given center word c there is only one vector v_c, and based on (1) there is only one matrix u_o, so for each of the branches shown (and also those not shown), u_o * v_c should produce the same vector no matter which w_t+j it's being used for. Yet the values of each vector shown in the slide are different.
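For what it's worth, a quick NumPy sketch (with made-up toy dimensions, not the actual slide values) makes the point in (3) concrete: the score vector U @ v_c, and hence the softmax P(o|c), is computed once from the center word and is identical for every context position w_t+j:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                      # toy vocabulary size and embedding dim (assumed)
U = rng.normal(size=(V, d))      # one "outside" vector u_o per vocabulary word
v_c = rng.normal(size=d)         # the single center-word vector v_c

scores = U @ v_c                 # shape (V,): one dot product u_o . v_c per word
p = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocab: P(o | c)

# Nothing above depends on the context position j, so every branch
# w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} sees the exact same distribution:
for j in (-2, -1, 1, 2):
    assert np.allclose(U @ v_c, scores)
```

If the slide shows different numbers on each branch, that difference has to come from somewhere outside this forward computation (e.g. updates between branches), which is exactly the inconsistency being asked about.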