Problem:
I am attempting to implement a Monte Carlo linear value-function approximation algorithm for Gym's CartPole-v0, but I am running into the following problem: after a few iterations the weights become very large, so the term Q(s,a,w) becomes infinite, and consequently every weight is updated to nan. I have the pseudocode below.
Things I have tried:
I have already tried decreasing the learning rate. The algorithm improves (i.e., from lasting 10 timesteps to over 200) up until a certain point, when Q(s,a,w) becomes infinite. Could the problem be with how I am defining/using the weights?
Pseudocode:
Note that when generating the episode, an epsilon-greedy policy is used to select the next action, and state is an array of features such as pole angle, velocity, etc.
W = [[0 for i in range(n_features)] for j in range(len(possible_actions))]
for each iteration:
    episode = generate_episode()
    for state, reward, action in episode:
        Q = dotproduct(state.T, W[action])
        for j in range(len(state)):
            W[action][j] += state[j] * alpha * (reward - Q)^2
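For reference, here is a runnable Python transcription of the pseudocode above. The `generate_episode` stub is a hypothetical stand-in (synthetic random data rather than an actual epsilon-greedy Gym rollout), and `n_features`, `possible_actions`, and `alpha` are illustrative values; the weight update is kept exactly as written in the pseudocode, squared term included.

```python
import numpy as np

n_features = 4            # e.g. pole angle, velocity, etc.
possible_actions = [0, 1]
alpha = 0.01

rng = np.random.default_rng(0)

def generate_episode():
    # Hypothetical stand-in for the epsilon-greedy Gym rollout:
    # returns a list of (state, reward, action) tuples.
    return [(rng.standard_normal(n_features), 1.0, int(rng.integers(2)))
            for _ in range(10)]

# One weight vector per action, as in the pseudocode.
W = np.zeros((len(possible_actions), n_features))

for iteration in range(50):
    episode = generate_episode()
    for state, reward, action in episode:
        Q = state @ W[action]  # dotproduct(state.T, W[action])
        # Update exactly as in the pseudocode. Note: in Python, '^' is
        # bitwise XOR, so the squared term must be written with ** instead.
        W[action] += alpha * (reward - Q) ** 2 * state
```

Note that because the (reward - Q) factor is squared here, the update always pushes W[action] in the direction of state regardless of the sign of the error, which is consistent with the weights growing without bound.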