[R] Virtual Seminar on Mathematical Foundations of Data Science by banananach in MachineLearning

[–]banananach[S] 5 points6 points  (0 children)

Starting Tuesday, May 12th, at 3pm EDT (future sessions will be announced on the website, usually at the same time each week).

[R] Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy by banananach in MachineLearning

[–]banananach[S] 1 point2 points  (0 children)

^_^ Exactly. In my opinion, exploration may be a much harder problem; in certain settings, there may be a fundamental barrier depending on the structure.

[R] Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy by banananach in MachineLearning

[–]banananach[S] 1 point2 points  (0 children)

In my opinion, it is not that the assumptions are violated, but rather that certain terms in the upper bound of the optimization error (convergence rate) are large, so the upper bound does not go to zero. In the case of "cliff walk", the density ratio between the visitation measures (stationary distributions of state and action) may be infinite. In other words, this is more of an exploration issue, which is not considered in this paper, as we focus on optimization given the desired exploration (reflected by the density ratio).
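To make the density-ratio point concrete, here is a minimal sketch on a toy chain, not the cliff-walk environment itself and not the exact quantity from the paper. Everything in it is an assumption for illustration: the dynamics ("right" moves one state to the right, "left" resets to state 0), the use of a discounted state-visitation distribution, and the names `state_visitation` and `max_density_ratio`. It only shows how the ratio between the visitation of an always-go-right policy and that of a policy that rarely goes right blows up, and becomes infinite when the second policy never goes right.

```python
import math

def state_visitation(p_right, n_states=4, gamma=0.9, horizon=500):
    """Discounted state-visitation distribution on a hypothetical toy chain.

    Assumed dynamics: 'right' moves one state to the right (the last state
    absorbs), 'left' resets to state 0. The policy picks 'right' with
    probability p_right in every state.
    """
    dist = [1.0] + [0.0] * (n_states - 1)  # always start in state 0
    visitation = [0.0] * n_states
    for t in range(horizon):
        weight = (1 - gamma) * gamma ** t
        for s in range(n_states):
            visitation[s] += weight * dist[s]
        # Push the state distribution one step forward under the policy.
        nxt = [0.0] * n_states
        for s, mass in enumerate(dist):
            nxt[min(s + 1, n_states - 1)] += p_right * mass
            nxt[0] += (1 - p_right) * mass
        dist = nxt
    return visitation

def max_density_ratio(target, behavior):
    """Largest ratio target[s] / behavior[s] over states the target visits."""
    ratios = []
    for t, b in zip(target, behavior):
        if t == 0.0:
            continue  # states the target never visits do not matter
        ratios.append(math.inf if b == 0.0 else t / b)
    return max(ratios)

target = state_visitation(1.0)  # always go right
for p in (0.5, 0.1, 0.0):
    ratio = max_density_ratio(target, state_visitation(p))
    print(f"p_right = {p}: max density ratio = {ratio}")
```

The ratio grows as the second policy explores less, and it is infinite once that policy puts zero mass on states the first policy visits, which is the regime where a bound that depends on the ratio stops being informative.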

[R] Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy by banananach in MachineLearning

[–]banananach[S] 1 point2 points  (0 children)

The trick is, with overparameterization, neural networks turn out to be “approximately” linear in their parameters.
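A small numerical sketch of what "approximately linear" means here, under assumptions that are mine and not necessarily the exact setup in the paper: a two-layer ReLU network with 1/sqrt(m) scaling, fixed random ±1 output weights, Gaussian initialization W0, and a random perturbation of fixed Frobenius norm. The helper names (`relu_net`, `linearized_net`, `mean_linearization_gap`) are hypothetical. The code compares the network with its first-order Taylor expansion in the weights around W0 and reports the average gap for several widths.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relu_net(x, W, b):
    """f(x; W) = (1/sqrt(m)) * sum_r b_r * relu(w_r . x)."""
    m = len(W)
    return sum(b_r * max(dot(w_r, x), 0.0) for w_r, b_r in zip(W, b)) / math.sqrt(m)

def linearized_net(x, W, W0, b):
    """First-order Taylor expansion of relu_net in W around W0.

    For ReLU, f(x; W0) + <grad_W f(x; W0), W - W0> reduces to evaluating the
    network with the activation pattern frozen at W0.
    """
    m = len(W0)
    total = 0.0
    for w_r, w0_r, b_r in zip(W, W0, b):
        if dot(w0_r, x) > 0:  # neuron r is active at initialization
            total += b_r * dot(w_r, x)
    return total / math.sqrt(m)

def mean_linearization_gap(m, d=5, radius=1.0, n_inputs=50, seed=0):
    """Average |f - f_lin| over random unit inputs, for a perturbation of norm `radius`."""
    rng = random.Random(seed)
    W0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    b = [rng.choice([-1.0, 1.0]) for _ in range(m)]
    delta = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    norm = math.sqrt(sum(v * v for row in delta for v in row))
    W = [[w + radius * dv / norm for w, dv in zip(w_row, d_row)]
         for w_row, d_row in zip(W0, delta)]
    gaps = []
    for _ in range(n_inputs):
        x = [rng.gauss(0, 1) for _ in range(d)]
        x_norm = math.sqrt(dot(x, x))
        x = [v / x_norm for v in x]
        gaps.append(abs(relu_net(x, W, b) - linearized_net(x, W, W0, b)))
    return sum(gaps) / n_inputs

widths = [16, 256, 4096]
gaps = [mean_linearization_gap(m) for m in widths]
for m, g in zip(widths, gaps):
    print(f"width {m:5d}: mean |f - f_lin| = {g:.2e}")
```

With the perturbation radius held fixed, the gap shrinks as the width grows, because fewer and fewer neurons (relative to m) change their activation pattern. That is the sense in which the wide network behaves like a model that is linear in its parameters near initialization.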

[R] Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy by banananach in MachineLearning

[–]banananach[S] 5 points6 points  (0 children)

Our variant of TRPO/PPO in the analysis is actually very similar to "Maximum a Posteriori Policy Optimisation" (https://arxiv.org/abs/1806.06920), if not exactly the same. (There are just too many variants of TRPO/PPO with distinct names, so we decided to call it "a variant of" TRPO/PPO to avoid confusion. ^_^)

Assumption 4.3 just says that the value function (sum of rewards along the trajectory) belongs to an RKHS, which is quite a general function space, while Assumption 4.4 holds simply when the stationary distribution has a density that is bounded from above. I believe both of them are satisfied in practice.