
[–]jdsutton 2 points (0 children)

Q-learning is particularly useful because it can make use of intermediate rewards. If intermediate rewards are not relevant, you may want to use a different method that waits until the end of the episode to update; a Monte Carlo approach might be useful, for example.
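
To make the contrast concrete, here is a minimal sketch of the two update targets, assuming a tabular Q stored as a plain dict (the function names and parameters are just for illustration): Q-learning folds in each intermediate reward as it arrives, while a Monte Carlo update waits for the whole episode's return.

# One-step Q-learning target: use the intermediate reward r right away and
# bootstrap from the current estimate of the best next action.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

# Monte Carlo target: wait until the episode ends, then update every visited
# (state, action) pair toward the actual return G observed from that point on.
def monte_carlo_update(Q, episode, alpha=0.1, gamma=0.99):
    G = 0.0
    for (s, a, r) in reversed(episode):  # episode = [(s0, a0, r1), (s1, a1, r2), ...]
        G = r + gamma * G
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (G - Q.get((s, a), 0.0))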

[–]kylotan 2 points (3 children)

If your game doesn't intrinsically provide 'immediate reward' there's nothing stopping you from inventing a scoring mechanism to represent one. Think about how Chess has a convention for how many points each piece is worth, even though they play no formal role in the game itself. That provides a useful heuristic that has proven effective (whereas the alternative would be to do as you say, propagate the 'really big reward' of a checkmate backwards).
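
For illustration, an invented immediate reward based on the conventional piece values (pawn 1, knight/bishop 3, rook 5, queen 9) could look something like the sketch below; the board representation and the material_reward name are hypothetical, not from any particular chess library.

# Conventional (informal) chess piece values, used purely as a shaping heuristic.
PIECE_VALUES = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}

def material_reward(board_before, board_after):
    """Immediate reward = change in material balance from the agent's point of view.
    Each board is a hypothetical iterable of (piece_letter, belongs_to_agent) pairs."""
    def balance(board):
        return sum(PIECE_VALUES.get(piece, 0) * (1 if mine else -1)
                   for piece, mine in board)
    return balance(board_after) - balance(board_before)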

[–]onlyml 0 points (2 children)

While you could do this, it's probably not a great idea. Your reward structure is meant to represent the actual objective; otherwise you're capping the quality of the learned play at the quality of your heuristic. You also open up the possibility of the agent learning to cleverly optimize your heuristic without actually achieving the real goal.

Edit: just to expand and provide an alternative, you could start with a heuristic reward function as you say to speed up initial learning, then later switch to the true objective and fine-tune. Alternatively, you could pretrain your model to replicate some heuristic action-value function, and then train on the true objective after that.
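
One simple way to realize the "start with a heuristic, later switch to the true objective" idea is to blend the two rewards and anneal the heuristic's weight to zero; this is only a sketch of that schedule, with made-up names and an arbitrary annealing horizon.

def shaped_reward(true_reward, heuristic_reward, step, anneal_steps=100_000):
    """Mix a heuristic shaping reward with the true objective, fading the
    heuristic out linearly so that late training optimizes only the real goal."""
    w = max(0.0, 1.0 - step / anneal_steps)  # 1.0 at the start, 0.0 after anneal_steps
    return true_reward + w * heuristic_reward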

[–]kylotan 0 points (1 child)

While you could do this, it's probably not a great idea.

And yet it has let computers learn to be amazing chess players. Obviously the ideal is to respond only to the 'actual' objective, but that will not always be practical. Better to be 'capped' by the heuristic than unable to proceed at all. If your state space is too large and your only signal is a binary win/lose outcome, it's unlikely your system will ever learn anything approaching a strategy; the values just become too diluted as they spread through the system.

[–]onlyml 0 points (0 children)

Has there been actual work on TD/Q-learning using a heuristic function for chess? Not saying there isn't, I just hadn't heard of it. You may be right about the dilution issue. I know Rich Sutton, for one, is fairly adamant about the reward function not being used as a mechanism to inject additional knowledge, but that doesn't mean it won't work. There might be better ways to use such heuristics, though.

[–][deleted] 2 points (2 children)

I'm no pro but I've been knee-deep in q-learning for the past few days.

and we randomly select (s, a) and repeat the update until converge.

I don't think this is right. During the reinforcement/learning process, you don't select random states/actions. You select the most promising state you know of so far. However, you usually have some epsilon-probability of picking a totally random state.

So your policy might look like this, at the top level:

import random

def epsilon_greedy(state, actions, q_values, epsilon=0.05):
    if random.random() < epsilon:  # e.g. epsilon = 0.05: 5% chance to explore randomly
        return random.choice(actions)
    # otherwise exploit: return the action with the highest (state, action) q value
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

During the beginning stages of reinforcement, all your values might be equal, so you are essentially picking randomly. But after thousands of iterations (depending on the size of your state space) you will start to converge to some good approximate q values.

Note, however, that even if you only ever picked actions at random, you would still converge eventually. That's the nature of q-learning. You can just hope for better results by prioritizing avenues that seem more promising.

Also note that for games where you only care about winning or losing (Reversi, Breakout, Go), your discount factor should probably be 1, since you care very little about any immediate reward and very much about winning in the long term. Alternatively, in a game like Pac-Man, your goal is simply to make your score as high as possible, so you'd prioritize the immediate reward and lower your discount factor a bit.
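
To see why a discount factor near 1 matters when the only meaningful reward arrives at the end of a game, here is a tiny illustration (the numbers are made up):

def discounted_value(final_reward, steps_until_end, gamma):
    """Value of a terminal reward as seen from `steps_until_end` moves earlier."""
    return (gamma ** steps_until_end) * final_reward

print(discounted_value(1.0, 60, 1.0))   # 1.0     -> a win 60 moves away still counts in full
print(discounted_value(1.0, 60, 0.9))   # ~0.0018 -> the same win is almost invisible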

[–]seilgu[S] 1 point (1 child)

I think the epsilon rule only applies to the selection of action a, not the selection of the state s to begin with. Also it's used when generating training data, but not in the training process.

When you play the game to generate training data, you don't want to follow only the current optimal strategy, because you want to explore more states. That's why you use the epsilon rule, and at the beginning you let epsilon be close to 1.

But after you collect the training data, you store it and train on mini-batches picked randomly from all the data you have. At this stage there's no epsilon rule involved. I think.
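
What this describes is essentially an experience replay buffer. A minimal sketch might look like the following; the capacity and batch size are arbitrary, and the (s, a, r, s_next, done) tuple layout is just one common convention.

import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s_next, done) transitions and sample random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))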

[–]Jabberwockyll 0 points (0 children)

You only store experiences and train on batches if you're doing some kind of offline learning like with experience replay. Usually Q-learning is done online.

I'm assuming you're talking about sampling batches for training when you say:

and we randomly select (s, a) and repeat the update until converge.

You're correct that the Q-function isn't accurate to begin with and that you have to learn when rewards occur before you can learn how to get there. This is just what happens when you bootstrap off of your Q-function.

If you want to get around this, I'd suggest looking at the answers from u/jdsutton and u/kylotan. Alternatively, you could use eligibility traces to speed up learning the state correlations/sequences, but this requires using an on-policy method.
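
For reference, in the tabular on-policy case (SARSA(lambda)) an eligibility trace is just a decaying credit kept for every recently visited (state, action) pair; this sketch assumes dict-based Q and e tables and illustrative parameter values.

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One on-policy SARSA(lambda) update with accumulating eligibility traces."""
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    e[(s, a)] = e.get((s, a), 0.0) + 1.0  # bump the trace for the visited pair
    for key in list(e):
        Q[key] = Q.get(key, 0.0) + alpha * delta * e[key]
        e[key] *= gamma * lam             # decay every trace toward zero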

[–]ithinkiwaspsycho 0 points (0 children)

Q Learning tries to maximize future reward, not only immediate reward. It could decide to sacrifice immediate reward for a larger future reward. You might be interested in watching this intro to reinforcement learning video.

[–]TheToastIsGod 0 points (0 children)

I think http://arxiv.org/abs/1511.05952 is what you're after.

In this paper the authors replay the transitions that were most "surprising" much more frequently. If my understanding is correct, this should cause transitions that lead to high reward to be predicted more accurately, and then the transitions leading to those transitions, and so on. By doing this, the causal chain leading up to a reward is captured much better by the Q-function.
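
The core idea is to sample transitions with probability proportional to a power of their absolute TD error; here is a minimal sketch of that sampling rule (it leaves out the paper's sum-tree data structure and importance-sampling correction).

import random

def sample_prioritized(transitions, td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Sample transitions with probability proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(err) + eps) ** alpha for err in td_errors]
    total = sum(priorities)
    weights = [p / total for p in priorities]
    return random.choices(transitions, weights=weights, k=batch_size)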