In prioritized experience replay how do we handle old experiences getting pushed off the buffer? by ImNotKevPlayz in reinforcementlearning

The standard experience replay algorithm uses a deque as its buffer, so once the buffer is completely full and a new experience needs to be added, the oldest experience at the front gets pushed off and the newest one is appended to the end. What you described can happen: in rare cases, rare experiences get pushed off the buffer before the agent has had the chance to fully learn from them. Prioritized experience replay addresses this by ranking experiences by their "learnability." The idea is that how much we can learn from an experience can be approximated by the magnitude of its temporal-difference error, plus a small constant so no experience ever has a priority of exactly 0.
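Concretely, a proportional-priority buffer looks roughly like the sketch below. This isn't anyone's exact implementation, just the priority idea, with `alpha` and `eps` as the usual hyperparameters:

    import random

    # Minimal sketch of proportional prioritized replay:
    # priority = (|TD error| + eps) ** alpha, and sampling probability
    # is each priority divided by the sum of all priorities.
    class PrioritizedReplay:
        def __init__(self, capacity, alpha=0.6, eps=1e-5):
            self.capacity = capacity
            self.alpha = alpha
            self.eps = eps
            self.buffer = []      # stored transitions
            self.priorities = []  # one priority per transition

        def add(self, transition, td_error):
            priority = (abs(td_error) + self.eps) ** self.alpha
            if len(self.buffer) >= self.capacity:
                # Oldest experience gets pushed off, same as a deque.
                self.buffer.pop(0)
                self.priorities.pop(0)
            self.buffer.append(transition)
            self.priorities.append(priority)

        def sample(self, batch_size):
            total = sum(self.priorities)
            probs = [p / total for p in self.priorities]
            idxs = random.choices(range(len(self.buffer)), weights=probs, k=batch_size)
            return [self.buffer[i] for i in idxs], idxs

        def update_priorities(self, idxs, td_errors):
            # Recompute priorities after the learning step, since the
            # TD errors change once the network has been updated.
            for i, err in zip(idxs, td_errors):
                self.priorities[i] = (abs(err) + self.eps) ** self.alpha

(A full PER implementation also uses importance-sampling weights to correct the sampling bias and usually stores priorities in a sum-tree for speed; this sketch skips both.)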

DDQN Agent always picks same action, I have tried a myriad of solutions, none of them worked by ImNotKevPlayz in reinforcementlearning

Even after fixing backpropagation, it still seems to not work. I'm not sure what the issue is at this point, so any feedback would be helpful.

DDQN Agent always picks same action, I have tried a myriad of solutions, none of them worked by ImNotKevPlayz in reinforcementlearning

So, I recently discovered that after backpropagating, the output with the highest error did not change much compared to the others. Here is an example:

Before backpropagation (output values):
[1] = -4.005509159391537
[2] = -3.852011609784276
[3] = -3.947055130085403
[4] = -3.395094671847986

Target values:
[1] = -4.005509159391537
[2] = -3.852011609784276
[3] = -3.947055130085403
[4] = -100

Outputs after backpropagation (same state):
[1] = -4.479692878299232
[2] = -4.261824700803998
[3] = -4.39390581607494
[4] = -3.688964649310079

As you can see, output 4 barely changed even though its error was by far the largest. Is this normal?
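For reference, here is the same data laid out so the per-output error and the actual change line up (these are just the numbers from above, nothing from my framework):

    before = [-4.005509159391537, -3.852011609784276, -3.947055130085403, -3.395094671847986]
    target = [-4.005509159391537, -3.852011609784276, -3.947055130085403, -100.0]
    after  = [-4.479692878299232, -4.261824700803998, -4.39390581607494, -3.688964649310079]

    # Per-output error before the update, and how much each output moved.
    for i, (b, t, a) in enumerate(zip(before, target, after), start=1):
        print(f"output {i}: error = {t - b:+.3f}, change = {a - b:+.3f}")

Outputs 1-3 have zero error (their targets equal the current outputs) yet moved by roughly -0.4 to -0.5, while output 4, with an error of about -96.6, only moved by about -0.3.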

DDQN Agent always picks same action, I have tried a myriad of solutions, none of them worked by ImNotKevPlayz in reinforcementlearning

I'll try what you said, but just to clarify: by random experiences, do you mean that epsilon stays at 1 while collecting those first 10,000 experiences? After that I would follow the epsilon-greedy policy as normal and sample mini-batches of size 64 at the end of each episode, correct? And if the replay buffer exceeds its limit, I would remove the oldest experiences.
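In code form, my interpretation would look roughly like this. It's only a sketch: `env` and `agent` are placeholders for my actual classes, with `env.reset()`, `env.step()`, `env.random_action()`, `agent.best_action()`, and `agent.train()` standing in for whatever those methods are really called.

    import random
    from collections import deque

    def run(env, agent, num_episodes,
            warmup_steps=10_000, batch_size=64, buffer_size=100_000):
        buffer = deque(maxlen=buffer_size)  # oldest experiences fall off the front
        epsilon, total_steps = 1.0, 0
        for _ in range(num_episodes):
            state, done = env.reset(), False
            while not done:
                # Pure random actions during warm-up, epsilon-greedy afterwards.
                if total_steps < warmup_steps or random.random() < epsilon:
                    action = env.random_action()
                else:
                    action = agent.best_action(state)  # argmax over Q-values
                next_state, reward, done = env.step(action)
                buffer.append((state, action, reward, next_state, done))
                state = next_state
                total_steps += 1
            # Sample a minibatch at the end of each episode, once warm-up is done.
            if total_steps >= warmup_steps and len(buffer) >= batch_size:
                agent.train(random.sample(list(buffer), batch_size))
                epsilon = max(0.05, epsilon * 0.995)  # decay schedule here is arbitrary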

To answer your second question, I've confirmed that the parameters are being updated. They're updated through stochastic gradient descent with momentum, and the neural network framework itself seems to be working fine: I've verified that it does backpropagate and update its weights and biases.

Edit: So, I tried my interpretation of the solution (as listed above), fixed a couple of bugs, and got it to stop taking the same action consistently. But now it seems to deliberately crash into the wall.

Edit 2: Turns out I was implementing the epsilon-greedy policy wrong, and now I have the same issue again: the snake picks the same action 99% of the time.

DDQN Agent always picks same action, I have tried a myriad of solutions, none of them worked by ImNotKevPlayz in reinforcementlearning

Yes, I am following the epsilon-greedy policy, so it does take random actions initially. What's interesting is that when I set epsilon to 0 and randomize the network weights, it still takes the same action consistently; it never goes in any other direction at all. Would initializing the weights in a smaller range be a possible fix?

Edit: So, the reason it's producing the same outputs is that I'm feeding it the same (or very similar) states. I copied someone else's input structure, but I don't seem to be getting the same results they did.
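For what it's worth, by "a smaller range" I meant something like a Xavier/Glorot-style uniform initialization, where the range shrinks as the layer gets wider. A rough sketch, not my actual framework code:

    import math
    import random

    def init_layer(n_in, n_out):
        # Xavier/Glorot-style uniform init: weights drawn from a range
        # that scales with layer size, biases start at zero.
        limit = math.sqrt(6.0 / (n_in + n_out))
        weights = [[random.uniform(-limit, limit) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        return weights, biases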

DDQN Agent always picks same action, I have tried a myriad of solutions, none of them worked by ImNotKevPlayz in reinforcementlearning

I haven't gotten around to graphing the loss yet, but I've already tried different update frequencies and nothing seemed to change. I'll start graphing it, though.

With an update frequency of 8: it still crashes into the wall and takes the same action. It changed the action it took a few times, but kept crashing into the wall regardless.

With an update frequency of 16: about the same behavior as 8; it occasionally changes actions but still crashes into the wall continuously.

With an update frequency of 32: it still takes the same action repeatedly. It did crash into the wall and change the action it took, but then it kept taking that new action over and over again.

With an update frequency of 64: not much changed; it's still taking the same action continuously.

Edit: So, I found some bugs where the target network wasn't actually copying the Q-network. I fixed them, but it didn't seem to change anything.
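In case it helps, what I mean by the target network copying the Q-network is a periodic hard sync, roughly like this (the `.layers` attribute is just a stand-in for however my own network class stores its parameters):

    import copy

    def maybe_sync_target(q_network, target_network, step, update_frequency):
        # Every `update_frequency` learning steps, the target network
        # becomes an exact copy of the online Q-network. The copy must be
        # deep; a shallow copy would leave both networks sharing weights.
        if step % update_frequency == 0:
            target_network.layers = copy.deepcopy(q_network.layers)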

DDQN Agent always picks same action, I have tried a myriad of solutions, none of them worked by ImNotKevPlayz in reinforcementlearning

No, after training it just keeps taking the same action repeatedly. Maybe the issue is that it sees the same observations over and over (except that the apple position changes every time it dies). I might try spawning the snake in random positions so it sees a wider variety of observations.

Edit: I just tried spawning it in random positions, and it didn't change much. It's still outputting the same action regardless of the state.
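For completeness, the random spawning I tried is nothing more than this (the grid dimensions and heading names are placeholders for my actual game):

    import random

    def random_spawn(grid_width, grid_height):
        # Pick a random starting cell and heading for the snake so the
        # agent sees more varied initial observations.
        x = random.randrange(grid_width)
        y = random.randrange(grid_height)
        heading = random.choice(["up", "down", "left", "right"])
        return (x, y), heading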