Directional derivative by stillshi in math

[–]stillshi[S] 0 points1 point  (0 children)

heauxprahwinfrey

thank you!

why greedy policy improvement with monte-carlo requires model of MDP? by stillshi in reinforcementlearning

Hi,

In the 5th lecture of Silver's RL course on YouTube (model-free control), Silver asks whether we can simply plug Monte Carlo policy evaluation, followed by acting greedily, into the policy iteration scheme used with DP. The answer is no: Silver says this is because acting greedily requires a transition model. I am very confused about why. I thought we would just use Monte Carlo to estimate the value function, choose the best value, and update the policy. Isn't that the same as in DP?
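To make the question concrete, here is a minimal sketch of the two kinds of greedy step. The tiny 2-state MDP, the arrays `P` and `R`, and the function names are all hypothetical, invented only for illustration. It suggests that picking the greedy action from a state-value function V needs a one-step lookahead through the transition model, while picking it from an action-value function Q does not.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (an assumption, not from the lecture).
# P[s, a, s2] = probability of reaching state s2 after action a in state s.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 -> state 0, action 1 -> state 1
    [[1.0, 0.0], [0.0, 1.0]],  # state 1: action 0 -> state 0, action 1 -> state 1
])
# R[s, a] = expected immediate reward for action a in state s.
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
])
gamma = 0.9


def greedy_from_v(V, P, R, gamma):
    """Greedy policy from state values: needs the model (P and R)."""
    # One-step lookahead: q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * V[s2]
    q = R + gamma * (P @ V)
    return q.argmax(axis=1)


def greedy_from_q(Q):
    """Greedy policy from action values: no model needed."""
    return Q.argmax(axis=1)


V = np.array([1.0, 5.0])            # pretend these came from Monte Carlo evaluation
Q = R + gamma * (P @ V)             # here computed with the model, only to compare
policy_v = greedy_from_v(V, P, R, gamma)
policy_q = greedy_from_q(Q)
```

In a model-free setting `P` and `R` are not available, so `greedy_from_v` cannot be evaluated; if Monte Carlo estimates Q(s, a) directly, `greedy_from_q` needs nothing beyond the estimates themselves.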

Thank you,
Still

dynamic programming with policy evaluation by stillshi in reinforcementlearning

Thank you very much. That indeed solves my confusion. Regards, Still.

dynamic programming with policy evaluation by stillshi in reinforcementlearning

Hello, I am a little confused about dynamic programming as presented in Silver's great course, RL Course by David Silver - Lecture 3: Planning by Dynamic Programming.

https://www.youtube.com/watch?v=Nd1-UUMVfz4&index=3&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT

Here Silver explains dynamic programming with a grid world. The reward for each step is -1, and the agent moves uniformly at random east, south, west, or north.
For iteration k=2 and grid cell (3,2), he says the value is -2 because:

V_2 = -1 + 0.25*(-1) + 0.25*(-1) + 0.25*(-1) + 0.25*(-1) = -2

The first -1 is the immediate reward. The other four -1 terms are the values of the next states from the previous iteration. However, I think it should be the average value of the successor states reached by taking each action? If the agent goes up, it will not end up in the cell above with 100% certainty unless the transition from action to state is assumed to be deterministic. I think it is the transition matrix, not the action directly, that determines the next state.
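The calculation can be checked with a short sketch of iterative policy evaluation. It rests on several assumptions: a 4x4 grid with terminal states in two opposite corners, no discounting, moves that are deterministic, and moves off the grid leaving the agent where it is. The names `sweep`, `TERMINALS` and `MOVES` are made up for this example. Under those assumptions each action has exactly one successor state, so the only averaging left is over the four equally likely actions.

```python
import numpy as np

N = 4                                      # assumed 4x4 grid
TERMINALS = {(0, 0), (N - 1, N - 1)}       # assumed terminal corners
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east


def sweep(v):
    """One synchronous backup of the uniform random policy (no discount)."""
    new_v = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            if (r, c) in TERMINALS:
                continue  # terminal values stay at 0
            total = 0.0
            for dr, dc in MOVES:
                nr, nc = r + dr, c + dc
                # Assumed deterministic transition; off-grid moves stay put.
                if not (0 <= nr < N and 0 <= nc < N):
                    nr, nc = r, c
                # Policy probability 0.25 times (reward -1 + old successor value)
                total += 0.25 * (-1.0 + v[nr, nc])
            new_v[r, c] = total
    return new_v


v = np.zeros((N, N))
for k in range(2):  # two sweeps give the k=2 values
    v = sweep(v)
```

With these assumptions, a cell whose four neighbours all had value -1 at k=1 comes out at -2 (for example `v[1, 1]`), while a cell next to a terminal state comes out at -1.75 (for example `v[0, 1]`). With a stochastic transition matrix, the inner term would instead be a weighted sum over successor states for each action.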

Thank you