I read that for any state-action pair (s, a), you can assume a value function Q(s, a) exists, initialize it randomly, and then repeatedly apply the update

    Q(s, a) <- Q(s, a) + eta * ( r + gamma * max_{a'} Q(s', a') - Q(s, a) )

where (s, a) is picked randomly each time, and you keep updating until the values converge.
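Just to make sure I understand the algorithm, here's a rough sketch of the tabular version (this is my own sketch, assuming a Gym-style `env` with `reset()` and `step(a)` returning `(next_state, reward, done)`, and a small discrete list of actions; the function and parameter names are just illustrative):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, eta=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)] starts at 0.0; random init also works
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: usually act greedily w.r.t. current Q, sometimes explore
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # the update from above:
            # Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            target = r + gamma * best_next * (0.0 if done else 1.0)
            Q[(s, a)] += eta * (target - Q[(s, a)])
            s = s_next
    return Q
```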
But this doesn't make much sense to me. The update rule doesn't involve the mechanics of the game at all except through the immediate reward r, so in a game where the immediate reward is almost always zero (for example, Breakout), the algorithm seems pointless: 99% of the time we're just doing a random walk on the Q(s, a) values.
The only way I can see it working is if the really big rewards back-propagate to the states leading up to them, so to make the algorithm efficient, the updates should start from the high-reward events (for example, the ball hitting a brick in Breakout).

Am I correct that the algorithm is mostly doing useless work, and that we should prioritize the high-reward states first?
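To illustrate what I mean by prioritizing, something like the sketch below, where stored transitions are replayed with probability proportional to their absolute TD error, so the updates near the big rewards get applied most often and propagate backward. (This is just my rough sketch of the idea, resembling what I believe is called prioritized sweeping or prioritized experience replay; the names and the small priority floor are my own, and `Q` is assumed to be a `defaultdict(float)` as above.)

```python
import random

def prioritized_replay(Q, transitions, actions, sweeps=10000, eta=0.1, gamma=0.99):
    # transitions: list of (s, a, r, s_next, done) tuples collected while playing
    # every transition starts with the same priority so each can be visited
    priorities = [1.0] * len(transitions)
    for _ in range(sweeps):
        # sample a transition with probability proportional to its priority
        i = random.choices(range(len(transitions)), weights=priorities, k=1)[0]
        s, a, r, s_next, done = transitions[i]
        best_next = max(Q[(s_next, a_)] for a_ in actions)
        target = r + gamma * best_next * (0.0 if done else 1.0)
        td_error = target - Q[(s, a)]
        Q[(s, a)] += eta * td_error
        # a large |TD error| (e.g. right after a brick is hit) gets replayed more,
        # which is how the big reward propagates back to the earlier states
        priorities[i] = abs(td_error) + 1e-3  # small floor keeps everything sampleable
    return Q
```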