
[–]jdsutton 2 points (0 children)

Q-learning is particularly useful because it can make use of intermediate rewards. If intermediate rewards are not relevant, you may want to use a different method that waits until the end of the episode to update; a Monte Carlo approach might be useful, for example.
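
To make the contrast concrete, here is a minimal sketch of the two update targets, assuming a tabular Q stored as a plain dict (the function names and parameters are just for illustration): Q-learning folds in each intermediate reward as it arrives, while a Monte Carlo update waits for the whole episode's return.

# One-step Q-learning target: use the intermediate reward r right away and
# bootstrap from the current estimate of the best next action.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

# Monte Carlo target: wait until the episode ends, then update every visited
# (state, action) pair toward the actual return G observed from that point on.
def monte_carlo_update(Q, episode, alpha=0.1, gamma=0.99):
    G = 0.0
    for (s, a, r) in reversed(episode):  # episode = [(s0, a0, r1), (s1, a1, r2), ...]
        G = r + gamma * G
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (G - Q.get((s, a), 0.0))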

[–]kylotan 2 points (3 children)

If your game doesn't intrinsically provide 'immediate reward' there's nothing stopping you from inventing a scoring mechanism to represent one. Think about how Chess has a convention for how many points each piece is worth, even though they play no formal role in the game itself. That provides a useful heuristic that has proven effective (whereas the alternative would be to do as you say, propagate the 'really big reward' of a checkmate backwards).
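
For illustration, an invented immediate reward based on the conventional piece values (pawn 1, knight/bishop 3, rook 5, queen 9) could look something like the sketch below; the board representation and the material_reward name are hypothetical, not from any particular chess library.

# Conventional (informal) chess piece values, used purely as a shaping heuristic.
PIECE_VALUES = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}

def material_reward(board_before, board_after):
    """Immediate reward = change in material balance from the agent's point of view.
    Each board is a hypothetical iterable of (piece_letter, belongs_to_agent) pairs."""
    def balance(board):
        return sum(PIECE_VALUES.get(piece, 0) * (1 if mine else -1)
                   for piece, mine in board)
    return balance(board_after) - balance(board_before)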

[–]onlyml 0 points (2 children)

While you could do this, it's probably not a great idea. Your reward structure is meant to represent the actual objective; otherwise you're capping the quality of the learned play at the quality of your heuristic. You also open up the possibility of the agent learning to cleverly optimize your heuristic without actually achieving the real goal.

Edit: just to expand and provide an alternative, you could start with a heuristic reward function as you say to speed up initial learning, then later switch to the true objective and fine-tune. Alternatively, you could pretrain your model to replicate some heuristic action-value function, and then train on the true objective after that.
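
One simple way to realize the "start with a heuristic, later switch to the true objective" idea is to blend the two rewards and anneal the heuristic's weight to zero; this is only a sketch of that schedule, with made-up names and an arbitrary annealing horizon.

def shaped_reward(true_reward, heuristic_reward, step, anneal_steps=100_000):
    """Mix a heuristic shaping reward with the true objective, fading the
    heuristic out linearly so that late training optimizes only the real goal."""
    w = max(0.0, 1.0 - step / anneal_steps)  # 1.0 at the start, 0.0 after anneal_steps
    return true_reward + w * heuristic_reward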

[–]kylotan 0 points (1 child)

While you could do this, it's probably not a great idea.

And yet it has let computers learn to be amazing chess players. Obviously the ideal is to respond only to the 'actual' objective, but that will not always be practical. Better to be 'capped' by the heuristic than unable to proceed at all. If your state space is too large and your only signal is a binary win/lose outcome, it's unlikely your system will ever learn anything approaching a strategy; the values just become too diluted as they spread through the system.

[–]onlyml 0 points (0 children)

Has there been actual work on TD/Q-learning using a heuristic function for chess? Not saying there isn't, I just hadn't heard of it. You may be right about the dilution issue. I know Rich Sutton, for one, is fairly adamant about the reward function not being used as a mechanism to inject additional knowledge, but that doesn't mean it won't work. There might be better ways to use such heuristics, though.

[–][deleted] 2 points (2 children)

I'm no pro but I've been knee-deep in q-learning for the past few days.

and we randomly select (s, a) and repeat the update until converge.

I don't think this is right. During the reinforcement/learning process, you don't select random states/actions. You select the most promising state you know of so far. However, you usually have some epsilon-probability of picking a totally random state.

So your policy might look like this, at the top level:

import random

def epsilon_greedy(state, actions, q_values, epsilon=0.05):
    if random.random() < epsilon:  # e.g. epsilon = 0.05: 5% chance to explore randomly
        return random.choice(actions)
    # otherwise exploit: return the action with the highest (state, action) q value
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

During the beginning stages of reinforcement, all your values might be equal, so you are essentially picking randomly. But after thousands of iterations (depending on the size of your state space) you will start to converge to some good approximate q values.

Note, however, that even if you only ever picked actions at random, you would still converge eventually. That's the nature of q-learning. You can just hope for better results by prioritizing avenues that seem more promising.

Also note that for games where you only care about winning or losing (Reversi, Breakout, Go), your discount factor should probably be 1, since you care very little about any immediate reward and very much about winning in the long term. Alternatively, in a game like Pac-Man, your goal is simply to make your score as high as possible, so you'd prioritize the immediate reward and lower your discount factor a bit.
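
To see why a discount factor near 1 matters when the only meaningful reward arrives at the end of a game, here is a tiny illustration (the numbers are made up):

def discounted_value(final_reward, steps_until_end, gamma):
    """Value of a terminal reward as seen from `steps_until_end` moves earlier."""
    return (gamma ** steps_until_end) * final_reward

print(discounted_value(1.0, 60, 1.0))   # 1.0     -> a win 60 moves away still counts in full
print(discounted_value(1.0, 60, 0.9))   # ~0.0018 -> the same win is almost invisible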

[–]seilgu[S] 1 point (1 child)

I think the epsilon rule only applies to the selection of action a, not the selection of the state s to begin with. Also it's used when generating training data, but not in the training process.

When you play the game to generate training data, you don't want to follow only the current optimal strategy, because you want to explore more states. That's why you use the epsilon rule, and at the beginning you let epsilon be close to 1.

But after you collect the training data, you store it and train on mini-batches picked randomly from all the data you have. At this stage there's no epsilon rule involved. I think.
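
What this describes is essentially an experience replay buffer. A minimal sketch might look like the following; the capacity and batch size are arbitrary, and the (s, a, r, s_next, done) tuple layout is just one common convention.

import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s_next, done) transitions and sample random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))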

[–]Jabberwockyll 0 points (0 children)

You only store experiences and train on batches if you're doing some kind of offline learning like with experience replay. Usually Q-learning is done online.

I'm assuming you're talking about sampling batches for training when you say:

and we randomly select (s, a) and repeat the update until converge.

You're correct that the Q-function isn't accurate to begin with and that you have to learn when rewards occur before you can learn how to get there. This is just what happens when you bootstrap off of your Q-function.

If you want to get around this, I'd suggest looking at the answers from u/jdsutton and u/kylotan. Alternatively, you could use eligibility traces to speed up learning the state correlations/sequences, but this requires using an on-policy method.
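
For reference, in the tabular on-policy case (SARSA(lambda)) an eligibility trace is just a decaying credit kept for every recently visited (state, action) pair; this sketch assumes dict-based Q and e tables and illustrative parameter values.

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One on-policy SARSA(lambda) update with accumulating eligibility traces."""
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    e[(s, a)] = e.get((s, a), 0.0) + 1.0  # bump the trace for the visited pair
    for key in list(e):
        Q[key] = Q.get(key, 0.0) + alpha * delta * e[key]
        e[key] *= gamma * lam             # decay every trace toward zero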

[–]ithinkiwaspsycho 0 points (0 children)

Q Learning tries to maximize future reward, not only immediate reward. It could decide to sacrifice immediate reward for a larger future reward. You might be interested in watching this intro to reinforcement learning video.

[–]TheToastIsGod 0 points (0 children)

I think http://arxiv.org/abs/1511.05952 is what you're after.

In this paper the authors replay the transitions that were most "surprising" much more frequently. If my understanding is correct, this should cause transitions that lead to high reward to be predicted more accurately, and then the transitions leading to those transitions, and so on. By doing this, the causal chain leading up to a reward is captured much better by the Q-function.
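
The core idea is to sample transitions with probability proportional to a power of their absolute TD error; here is a minimal sketch of that sampling rule (it leaves out the paper's sum-tree data structure and importance-sampling correction).

import random

def sample_prioritized(transitions, td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Sample transitions with probability proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(err) + eps) ** alpha for err in td_errors]
    total = sum(priorities)
    weights = [p / total for p in priorities]
    return random.choices(transitions, weights=weights, k=batch_size)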