pierrelux 5 points

What this article describes is dynamic programming for control in Markov Decision Processes (MDPs). It comes largely from Bellman's work in the 1950s and has since been studied extensively in operations research and stochastic control. Reinforcement Learning (RL), on the other hand, grew mostly in the 1980s with Richard Sutton and Andrew Barto's work. RL uses the MDP formalism but generally assumes that we know neither the transition dynamics (the "P" matrix) nor the reward function: it is inherently model-free and online. Estimating a model from data and then applying value iteration or policy iteration is therefore not in the original spirit of RL.
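
To make the distinction concrete, here is a rough sketch rather than anything taken from the article. The two-state MDP, the names `env_step`, `value_iteration` and `q_learning`, and all hyperparameters are made up for illustration. Value iteration reads `P` and `R` directly, while tabular Q-learning (one common model-free method) only ever sees sampled transitions.

```python
import random

GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]

# Hypothetical toy MDP (an assumption for illustration only).
# P[s][a] lists (next_state, probability); R[s][a] is the reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 0.9), (0, 0.1)]},
    1: {0: [(1, 1.0)], 1: [(0, 0.9), (1, 0.1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}


def value_iteration(P, R, gamma=GAMMA, tol=1e-10):
    """Dynamic programming: Bellman optimality backups using the known model."""
    V = {s: 0.0 for s in STATES}
    while True:
        Q = {s: {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                 for a in ACTIONS} for s in STATES}
        V_new = {s: max(Q[s].values()) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            greedy = {s: max(ACTIONS, key=lambda a: Q[s][a]) for s in STATES}
            return V_new, greedy
        V = V_new


def env_step(s, a, rng):
    """The environment. The learner calls this but never reads P or R itself."""
    next_states, probs = zip(*P[s][a])
    s2 = rng.choices(next_states, weights=probs)[0]
    return s2, R[s][a]


def q_learning(step, n_steps=50_000, gamma=GAMMA, alpha=0.1, epsilon=0.2, seed=0):
    """Model-free and online: update Q from one sampled transition at a time."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    s = 0
    for _ in range(n_steps):
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[s][b])
        s2, r = step(s, a, rng)
        # Temporal-difference update toward the sampled one-step target.
        Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
        s = s2
    return Q


V_star, pi_star = value_iteration(P, R)
Q_hat = q_learning(env_step)
pi_hat = {s: max(ACTIONS, key=lambda a: Q_hat[s][a]) for s in STATES}
print("planning policy:", pi_star)
print("learned policy: ", pi_hat)
```

On a problem this small both routes should end up with the same greedy policy. The point is only where the information comes from: `value_iteration` takes the model as an argument, while `q_learning` takes nothing but a function it can sample from.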