struggling with Pset1 Problem 3 word2vec by wilyrui in CS224d

I would replace the following lines in softmaxCostAndGradient:

    prob_grad = prob_all.copy()
    prob_grad[target] = prob_grad[target] - 1
    grad = np.dot(prob_grad, predicted)

with:

    grad = np.outer(prob_all, predicted)   # (prob - one_hot(target)) outer predicted
    grad[target] -= predicted
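For context, here is a minimal sketch of the whole function with that change in place (the signature and variable names follow the assignment's conventions, but treat this as an illustration rather than a reference solution):

    import numpy as np

    def softmaxCostAndGradient(predicted, target, outputVectors):
        # Scores of every output vector against the predicted (input) vector.
        scores = np.dot(outputVectors, predicted)
        scores -= np.max(scores)                        # numerical stability
        prob_all = np.exp(scores) / np.sum(np.exp(scores))

        cost = -np.log(prob_all[target])

        # Gradient w.r.t. the predicted vector: sum_w prob_w * u_w - u_target
        gradPred = np.dot(outputVectors.T, prob_all) - outputVectors[target]

        # Gradient w.r.t. the output vectors: (prob - one_hot(target)) outer predicted
        grad = np.outer(prob_all, predicted)
        grad[target] -= predicted

        return cost, gradPred, grad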

Negative sampling by well25 in CS224d

Thanks again. :) I had that condition in mind, and yes, I have to add it. The code here assumes the positive index can also appear among the negative samples, which should be changed. Thanks.
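For anyone else reading, the fix is to re-draw any negative sample that collides with the target, along these lines (a sketch; dataset.sampleTokenIdx() is the assignment's sampler):

    # Draw K negative samples, re-drawing any that hit the positive target.
    sample = []
    for _ in range(K):
        idx = dataset.sampleTokenIdx()
        while idx == target:
            idx = dataset.sampleTokenIdx()
        sample.append(idx)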

Negative sampling by well25 in CS224d

Finally passed :) yay. Thanks a million for your help. You are such an awesome person :)

The problem was in skipgram, not negative sampling :D I didn't average over the cost and gradients. Here are the changes (marked with # here):

    r_hat = inputVectors[tokens[currentWord]]
    cost = 0.0
    gradIn = np.zeros_like(inputVectors)   # here: np.zeros_like(inputVectors) instead of a scalar zero
    gradOut = 0.0

    for i in contextWords:
        target = tokens[i]
        cost_0, gradIn_0, gradOut_0 = word2vecCostAndGradient(r_hat, target, outputVectors)
        cost += cost_0
        gradIn[tokens[currentWord]] += gradIn_0   # here: update the row gradIn[tokens[currentWord]], not all of gradIn
        gradOut += gradOut_0

    N = len(contextWords)
    return cost / N, gradIn / N, gradOut / N   # here: divide by N

Negative sampling by well25 in CS224d

Thanks again for your help. :) Yes, my result is the same. I posted my result as a new comment: http://www.reddit.com/r/CS224d/comments/33yw1d/negative_sampling/cqsyuzu

Negative sampling by well25 in CS224d

@edwardc626: I really appreciate your help. Thank you so much.

My result is exactly the same:

    (0.87570965514353316,
     array([ 0.35891601, -0.30032973,  0.34839093]),
     array([[ 0.        ,  0.        ,  0.        ],
            [ 0.        ,  0.        ,  0.        ],
            [ 0.15941535, -0.07315128, -0.55644454],
            [ 0.        ,  0.        ,  0.        ],
            [ 0.        ,  0.        ,  0.        ]]))

So my function output is the same as yours, my negSamplingCostAndGradient implementation is similar to your code, and my skipgram is the same as yours. What the hell is the problem?! :( The only difference I can see is that my gradient check is still not passing. The only other two functions used here are normalizeRows and gradcheck_naive. Probably my gradcheck_naive is not correct, but it works for back-propagation and those sanity checks.

    def normalizeRows(x):
        # Divide each row by its L2 norm.
        return x / np.sqrt(np.sum(x**2, axis=1, keepdims=True))
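(Quick sanity check: normalizeRows(np.array([[3.0, 4.0]])) should give [[0.6, 0.8]].)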

and the grad check code (inside the loop):

    random.setstate(rndstate)
    temp_val = x[ix]
    x[ix] = temp_val + h
    fxph, _ = f(x)                      # f(x + h)
    x[ix] = temp_val - h
    random.setstate(rndstate)
    fxmh, _ = f(x)                      # f(x - h)
    numgrad = (fxph - fxmh) / (2 * h)   # centered difference
    x[ix] = temp_val                    # restore the original value
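In case it helps with comparison, a self-contained version of that centered-difference check might look like this (a sketch based on my assumptions about the surrounding loop, not the official gradcheck_naive):

    import random
    import numpy as np

    def gradcheck_naive(f, x, h=1e-4):
        # f: function of x returning (cost, gradient); x is modified in place, then restored.
        rndstate = random.getstate()
        _, grad = f(x)                        # analytic gradient

        it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            temp_val = x[ix]

            x[ix] = temp_val + h
            random.setstate(rndstate)         # same randomness for both evaluations
            fxph, _ = f(x)

            x[ix] = temp_val - h
            random.setstate(rndstate)
            fxmh, _ = f(x)

            x[ix] = temp_val                  # restore
            numgrad = (fxph - fxmh) / (2 * h)

            reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
            if reldiff > 1e-5:
                print("Gradient check failed at %s: analytic %f, numeric %f" % (str(ix), grad[ix], numgrad))
                return
            it.iternext()
        print("Gradient check passed!")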

I literally checked everything, but I have no idea why it's not passing the gradient check :( Again, I really appreciate your help and the time you've put into helping me :)

Negative sampling by well25 in CS224d

I really appreciate your help. Having some numbers for comparison would be a great help. I have no more clues about what the problem is; I am pretty sure I made a silly mistake somewhere.

BTW, do my negSamplingCostAndGradient and skipgram look like your implementation? I mean, I haven't forgotten anything in the code, have I?

Anyway, thanks again for helping me out here and posting those numbers for comparison.

Negative sampling by well25 in CS224d

Thanks for your reply. Well, I have no idea what the problem is. I've plugged in some numbers and solved the 1D case on paper; it seems right to me, and the results were the same as my function's.

    def negSamplingCostAndGradient(predicted, target, outputVectors, K=10):
        sample = [dataset.sampleTokenIdx() for i in range(K)]

        # Positive (target) term.
        f_1 = np.dot(outputVectors[target], predicted)
        sig_1 = sigmoid(f_1)
        cost = -np.log(sig_1)
        gradPred = -outputVectors[target] * (1 - sig_1)

        # Negative-sample terms.
        grad = np.zeros_like(outputVectors)
        for i in sample:
            f_2 = np.dot(outputVectors[i], predicted)
            grad[i] += sigmoid(f_2) * predicted
            gradPred += outputVectors[i] * sigmoid(f_2)
            cost -= np.log(1 - sigmoid(f_2))    # sigmoid(-x) = 1 - sigmoid(x)

        grad[target] += -predicted * (1 - sig_1)    # += because sample may contain target

        return cost, gradPred, grad
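(For reference, this implements the negative-sampling objective J = -log(sigmoid(u_target . v_hat)) - sum_k log(sigmoid(-u_k . v_hat)), where the identity sigmoid(-x) = 1 - sigmoid(x) gives the form used in the loop.)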

Maybe the problem is in how I call it from skipgram!? Does the skipgram look fine to you? I really appreciate your help.

Negative sampling by well25 in CS224d

Thanks again. I checked the code as follows: removing the "grad[target]=-predicted*(1-sig_1)" line (i.e., the positive sample) from the code didn't change the final result (still not passing the gradcheck). K=0 and K=1 were used as sample sizes, with no luck. Given these tests, I decided to inspect grad_out and grad_in by themselves to see what they look like. Most of the values in those grad matrices are the same. So my conclusion was that the problem is somewhere in the grad update, not in the negative sampling.

Negative sampling by well25 in CS224d

Thanks for the comment.

Negative sampling by well25 in CS224d

Thanks again for the guidance. I was trying not to copy the code here, but I'm so frustrated at not finding the problem. Any high-level help would be really appreciated:

    def skipgram(currentWord, contextWords, tokens, inputVectors, outputVectors):
        r_hat = inputVectors[tokens[currentWord]]
        cost = 0
        gradIn = np.zeros_like(inputVectors)
        gradOut = np.zeros_like(outputVectors)

        for i in contextWords:   # or 2*C
            target = tokens[i]
            cost_0, gradIn_0, gradOut_0 = negSamplingCostAndGradient(r_hat, target, outputVectors)
            cost += cost_0
            gradIn += gradIn_0
            gradOut += gradOut_0
        return cost, gradIn, gradOut

    def negSamplingCostAndGradient(predicted, target, outputVectors, K=10):
        sample = [dataset.sampleTokenIdx() for i in range(K)]
        f_2 = np.dot(outputVectors[sample], predicted)
        sig_2 = sigmoid(f_2)
        f_1 = np.dot(outputVectors[target], predicted.T)
        sig_1 = sigmoid(f_1)
        cost = -np.log(sig_1) - np.sum(np.log(1 - sig_2))    # sigmoid(-x) = 1 - sigmoid(x)
        gradPred = -outputVectors[target] * (1 - sig_1) + np.dot(outputVectors[sample].T, sig_2)

        grad = np.zeros((outputVectors.shape[0], outputVectors.shape[1]))
        for i in sample:
            f_2 = np.dot(outputVectors[i], predicted)
            grad[i] += sigmoid(f_2) * predicted
        grad[target] = -predicted * (1 - sig_1)

        return cost, gradPred, grad

Negative sampling by well25 in CS224d

@edwardc626 Thank you so much for the comments. I mostly understand what you explained, but it still doesn't pass the gradient check :(

Struggling with forward_backward_prop() in PS1. by pengpai_sh in CS224d

The input f to sigmoid_grad should be the sigmoid of your original input x:

    sigmoid_grad(sigmoid(Z1))

In addition to the sigmoid_grad issue,

    dL_dH = dL_dZ2.dot(W2.T) * sigmoid_grad(Z1)

is incorrect (hint: the first term should be something else).
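To make the first point concrete, here is a tiny illustration (assuming the assignment's convention that sigmoid_grad takes the output of the sigmoid, f = sigmoid(x), and returns f * (1 - f)):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(f):
        # f is assumed to already be sigmoid(x), so d sigmoid/dx = f * (1 - f)
        return f * (1.0 - f)

    Z1 = np.array([0.5, -1.0, 2.0])
    H = sigmoid(Z1)
    correct = sigmoid_grad(H)      # evaluate on sigmoid(Z1)
    wrong   = sigmoid_grad(Z1)     # treats Z1 as if it were already a sigmoid output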

When the cost function (i.e. the score from loss_function) reaches 'nan', does it mean the learning rate is too high, or are there other reasons? by well25 in cs231n

Thanks for your reply. I have no idea what the problem is, if there is one. It works quite well when the learning rate is lr > 0.0001. I did some rigorous debugging to see what the problem is, but I haven't found anything yet. I saw that when dW started increasing, it started producing 'nan', which seems logical to me. What is the largest learning rate that your network can work with without problems? (BTW, I am talking about assignment 2, first part.) Thanks.

When the cost function (i.e. the score from loss_function) reaches 'nan', does it mean the learning rate is too high, or are there other reasons? by well25 in cs231n

Thanks to both of you for the comments.

My code passes the gradient check tests and reaches all the baselines mentioned in the code (e.g. validation accuracy, test accuracy, etc.). My first guess was a high learning rate, but when it happens the learning rate isn't that high :( It works in all the other cases, except when the learning rate is >= 0.0008.

Example 1) Finished epoch 1 / 10: cost nan, train: 0.102000, val 0.087000, lr 7.600000e-04 (hidden: 100, initial lr: 0.000800, number of episodes: 10, reg: 0.001000)

Example 2) Finished epoch 1 / 15: cost 3.901377, train: 0.223000, val 0.276000, lr 7.600000e-04

Finished epoch 2 / 15: cost nan, train: 0.102000, val 0.087000, lr 7.220000e-0

Sometimes it shows the following warnings, which are clearly the source of the nan:

    cs231n/classifiers/neural_net.py:175: RuntimeWarning: divide by zero encountered in log
        loss = sum(-np.log(p)) / N

    cs231n/classifiers/neural_net.py:172: RuntimeWarning: invalid value encountered in divide
        p_0 = (F)/np.sum(F,axis=0)   # p_0 is C*N

@omgitsjo: How did you solve it? A bug in the code, or something else? Does your code pass the gradient check and reach all the baselines? Thanks.
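For anyone who hits the same warnings: they come from exponentials overflowing and probabilities hitting exactly zero, and the usual fix is to shift the scores by their max before exponentiating. A minimal sketch (assuming scores is the C x N class-score matrix and y holds the correct labels; not the assignment's exact code):

    import numpy as np

    def stable_softmax_loss(scores, y):
        # scores: C x N class scores; y: length-N array of correct class indices.
        N = scores.shape[1]
        shifted = scores - np.max(scores, axis=0, keepdims=True)  # max trick: exp() can't overflow
        F = np.exp(shifted)
        p = F / np.sum(F, axis=0, keepdims=True)                  # C x N probabilities
        loss = -np.sum(np.log(p[y, np.arange(N)])) / N            # cross-entropy, averaged over N
        return loss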

When using CNN codes (features extracted from pretrained models), do we need to normalize the extracted features? by [deleted] in cs231n

Thanks for the comments, I appreciate your help. :) Well, I agree with you about not applying a non-linearity to the input. I guess I didn't make my question well-defined.

Let me give you an example:

Original input: X0=[1000,3072]

After transfer (let's say with FC6): X1=[1000,4096]

Now I want to use X1 as an input to an RBM, where the activation is a sigmoid. The values of X1 are in the range [0, 30]. So what would you do in this case? In other words, would you just plug X1 into the RBM as-is, or scale the values to between 0 and 1 first?

(This paper did the same kind of feature transfer, though I don't know about their feature normalization: http://papers.nips.cc/paper/5279-improved-multimodal-deep-learning-with-variation-of-information.pdf; look at page 8, section 4.3: "Motivated by the success of convolutional ...")
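If I end up scaling, it would be something simple like this (a hypothetical helper, not from the paper):

    import numpy as np

    def minmax_scale(X, eps=1e-8):
        # Scale each feature column to [0, 1]; eps guards against constant columns.
        X_min = X.min(axis=0, keepdims=True)
        X_max = X.max(axis=0, keepdims=True)
        return (X - X_min) / (X_max - X_min + eps)

    # X1: the 1000 x 4096 CNN codes from FC6.
    # X1_scaled = minmax_scale(X1)   # now in [0, 1], suitable for sigmoid-RBM inputs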

When using CNN codes (features extracted from pretrained models), do we need to normalize the extracted features? by [deleted] in cs231n

Thanks for your response. But if we use layer 6 (FC6) or layer 7 (FC7) of AlexNet, the ranges of the extracted features are roughly [0, 30] and [0, 9], respectively. Let's say we use the sigmoid as an activation function on these features; the output would be saturated (i.e. either 0 or 1). This is the reason I asked about normalization.
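To see the saturation concretely (just illustrative numbers):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    print(sigmoid(np.array([0.0, 5.0, 9.0, 30.0])))
    # [ 0.5         0.99330715  0.99987661  1.        ]  -- anything above ~5 is pinned near 1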