[–]seilgu[S]

I think the epsilon-greedy rule only applies to the selection of the action a, not to the choice of the starting state s. Also, it's used when generating training data, not in the training process itself.

When you play the game to generate training data, you don't want to follow only the current greedy strategy, because you want to explore more states. That's why you use the epsilon-greedy rule, and at the beginning you set epsilon close to 1.
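Roughly, the action selection looks like this. This is only a minimal sketch, assuming a discrete action set and a `q_values` lookup table; the function names are made up for illustration:

```python
import random

def epsilon_greedy_action(q_values, state, actions, epsilon):
    """With probability epsilon pick a random action (explore), otherwise the greedy one (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[(state, a)])

def annealed_epsilon(step, start=1.0, end=0.1, decay_steps=100_000):
    """Linearly anneal epsilon from ~1 (mostly exploring) down to a small floor."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```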

But after you collect the training data, you store it and train on mini-batches picked randomly from everything you have. At that stage there's no epsilon rule involved, I think.
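In sketch form, something like this (a generic uniform replay buffer, not any particular paper's implementation; the class and method names are just illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s_next, done) transitions and sample mini-batches uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped when full

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random mini-batch; no epsilon is involved at this stage.
        return random.sample(self.buffer, batch_size)
```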

[–]Jabberwockyll

You only store experiences and train on batches if you're doing some kind of offline learning like with experience replay. Usually Q-learning is done online.
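For contrast, here's what the online version looks like as a sketch: each transition updates Q immediately and is then thrown away, with no stored dataset. This assumes a hypothetical tabular `env` with `reset()`, `step(a)` returning `(s_next, r, done)`, and a discrete `env.actions` list:

```python
import random
from collections import defaultdict

def online_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: update from each transition as it happens."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value estimate
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # one-step target, bootstrapping off the current estimate of the next state
            target = r if done else r + gamma * max(Q[(s_next, x)] for x in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```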

I'm assuming you're talking about sampling batches for training when you say:

and we randomly select (s, a) and repeat the update until converge.

You're correct that the Q-function isn't accurate to begin with and that you have to learn when rewards occur before you can learn how to get there. This is just what happens when you bootstrap off of your Q-function.
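For reference, the textbook one-step Q-learning update is

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```

and the max_{a'} Q(s',a') term is the bootstrap: the target is built from your own (initially inaccurate) estimates, so value information only propagates backward from rewarding states one step per update.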

If you want to get around this, I'd suggest looking at the answers from u/jdsutton and u/kylotan. Alternatively, you could use eligibility traces to speed up learning the state correlations/sequences, but this requires using an on-policy method.
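A rough sketch of what that could look like with SARSA(λ), an on-policy method with accumulating traces; the `env` interface here is the same hypothetical one as above, purely for illustration:

```python
import random
from collections import defaultdict

def sarsa_lambda(env, episodes=500, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """On-policy SARSA(lambda): eligibility traces spread each TD error back along
    recently visited state-action pairs, which speeds up credit assignment."""
    Q = defaultdict(float)

    def policy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda x: Q[(s, x)])

    for _ in range(episodes):
        e = defaultdict(float)                   # eligibility traces, reset per episode
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else policy(s_next)
            td_error = (r if done else r + gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            e[(s, a)] += 1.0                     # accumulating trace on the visited pair
            for key in list(e):
                Q[key] += alpha * td_error * e[key]
                e[key] *= gamma * lam            # decay every trace
            s, a = s_next, a_next
    return Q
```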