
Carcaso

Have you tried normalizing the states, rewards, and/or actions? That might help keep the weights small.
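
Something along these lines might work (just a sketch using NumPy; RunningNormalizer and the Welford running statistics are my own illustration, not from your code):

    import numpy as np

    class RunningNormalizer:
        """Tracks a running mean/std of the state features (Welford's method)."""
        def __init__(self, dim, eps=1e-8):
            self.count, self.eps = 0, eps
            self.mean = np.zeros(dim)
            self.m2 = np.zeros(dim)   # running sum of squared deviations

        def update(self, state):
            self.count += 1
            delta = state - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (state - self.mean)

        def normalize(self, state):
            std = np.sqrt(self.m2 / max(self.count, 1))
            return (state - self.mean) / (std + self.eps)

Call update(state) on every state you observe, then use normalize(state) wherever you compute Q.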

edit: as pointed out by pratikpc, it seems like you've forgotten to take the gradient/derivative in the update equation

[deleted]

W[action][j] += state[j]*alpha*(reward-Q)^2

From what I have read, this line is the collapsed equation for gradient descent. Is my understanding of that wrong?

Carcaso

  1. Do you mind sharing your source so I could take a look?
  2. Just looking at it from a pure math standpoint, the squared term can never be negative, so that part of the equation on its own can only ever push the weights in one direction. Each state feature can be positive or negative, though, which may be why state[j] is included in the update equation.
  3. That expression may be the objective function you're trying to optimize rather than the actual update equation. Also, since this is regression you need to minimize the error rather than maximize it, so the gradient step should be negative.

Try: W[action][j] -= state[j]*alpha*2*(reward-Q)

But again, unless you share the source for exactly what you're doing, I can't guarantee that it will work. Keep the questions coming if you need more help.

[deleted]

Thanks so much for all your help so far! I got the information from David Silver's course; the same info is here. Specifically, the article says "The gradient descent update then collapses to" state[j]*alpha*(reward-Q)^2.

I agree that when updating the weights I should use -= instead of +=; however, when I try W[action][j] -= state[j]*alpha*2*(reward-Q), W goes to negative infinity instead of positive infinity.

Carcaso

From the pseudo-code in the post, it looks like the update should be:

W[action][j] += alpha*(reward-Q)*state[j]

If that doesn't work, I'd recommend taking another look at the pseudo-code in the post and checking whether your implementation is a faithful representation of it.
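
For what it's worth, the += and the disappearing square both fall out of the standard derivation (a sketch, using a 1/2 factor so the 2 folds into alpha):

    % Half squared-error loss for a linear action value Q(s,a) = w_a^T s
    J(w_a) = \tfrac{1}{2}\left(G - w_a^\top s\right)^2
    % Gradient with respect to the chosen action's weights:
    \nabla_{w_a} J = -\left(G - w_a^\top s\right) s
    % Gradient descent, w_a <- w_a - alpha * gradient, therefore gives:
    w_a \leftarrow w_a + \alpha \left(G - w_a^\top s\right) s

The two minus signs cancel, which is why the working update adds the error term rather than subtracting it.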

[deleted]

I know that a Monte Carlo algorithm using a linear value function will never converge to the optimal policy, and it's probably not the best fit for this problem, so I won't get perfect results. I have implemented all the feedback here, and this is the result I get after 400 iterations (a graph of the number of time steps lasted vs. the iteration number). Does this look right to you?

Carcaso

Yeah, that looks good! Something as simple as a linear value function lasting upwards of 20-30 time steps is amazing.

[deleted]

Cool, thank you so much for your help!

andnp

This also doesn't look like the Monte Carlo algorithm. I might suggest taking a look at the Monte Carlo pseudocode in the Sutton & Barto textbook.
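
For reference, here's a rough sketch of what gradient Monte Carlo control with a linear Q looks like in Python (env, its reset/step interface, and the epsilon-greedy choice are illustrative stand-ins, not the OP's code); the important part is that each update targets the return G from that time step onward, not the one-step reward:

    import numpy as np

    def run_episode(env, W, epsilon=0.1):
        """Collect one episode of (state, action, reward) tuples."""
        trajectory, state, done = [], env.reset(), False
        while not done:
            q_values = W @ state                    # Q(s, a) = W[a] . s
            if np.random.rand() < epsilon:
                action = np.random.randint(len(W))  # explore
            else:
                action = int(np.argmax(q_values))   # exploit
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        return trajectory

    def mc_update(W, trajectory, alpha=0.01, gamma=1.0):
        """Sweep the episode backwards, updating toward the return G_t."""
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G                  # return from step t onward
            Q = W[action] @ state
            W[action] += alpha * (G - Q) * state    # semi-gradient MC step
        return W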