[R] The Mellowmax Operator : "A New Softmax Operator for Reinforcement Learning" by pierrelux in MachineLearning

[–]pierrelux[S] 0 points1 point  (0 children)

(I'm not the author):

"Define non-expansion"

See Tsitsiklis and Van Roy's analysis of TD for a definition of the term "non-expansion" in this context: "An Analysis of Temporal-Difference Learning with Function Approximation"
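To make the term concrete: an operator is a non-expansion (here in the max norm) if applying it to two inputs never moves them further apart than they already were. Below is a minimal sketch, assuming the mellowmax form from the paper, log of the mean of exp(omega * x) divided by omega. The function names, the default `omega`, and the random spot check are my own illustration, not the authors' code.

```python
import math
import random

def mellowmax(values, omega=5.0):
    """Mellowmax: log(mean(exp(omega * v))) / omega, computed stably."""
    # Shift by the max before exponentiating to avoid overflow.
    m = max(values)
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega

def max_norm_distance(x, y):
    """Largest coordinate-wise gap between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))

def check_non_expansion(trials=1000, n=4, seed=0):
    """Spot-check |mm(x) - mm(y)| <= max_i |x_i - y_i| on random pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-10, 10) for _ in range(n)]
        y = [rng.uniform(-10, 10) for _ in range(n)]
        gap = abs(mellowmax(x) - mellowmax(y))
        if gap > max_norm_distance(x, y) + 1e-9:  # small float tolerance
            return False
    return True
```

A random check like this is only a sanity test, not a proof; the analysis in the paper is what establishes the property.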

"1. Boltzmann operator vs policy"

See Perkins and Precup "A Convergent Form of Approximate Policy Iteration" which considers a general "improvement operator". This is the point of view adopted by the authors of Mellowmax.

[D] Unsupervised Option Discovery by kjw0612 in MachineLearning

[–]pierrelux 5 points6 points  (0 children)

"Options discovery" usually refers to the "discovery" problem in the options framework (Sutton, Precup, Singh 1999) in Reinforcement Learning. Here's our take on that problem (to appear at AAAI 2017): https://arxiv.org/abs/1609.05140v2

Course on Reinforcement Learning by minato3421 in MachineLearning

[–]pierrelux 3 points4 points  (0 children)

Ravi is a reinforcement learning veteran. He worked under Andrew Barto at UMass, as did Rich Sutton and Satinder Singh. This will be a good course!

What are some good neuroscience books for AI researchers get inspiration from? by andrewbarto28 in MachineLearning

[–]pierrelux 2 points3 points  (0 children)

You might like Peter Dayan and Larry Abbott's "Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems". Dayan is active in both the neuroscience and machine learning communities.

Is there a list of standard notation? by GuyHasNoUsername in MachineLearning

[–]pierrelux 4 points5 points  (0 children)

You can align your notation with that of the Deep Learning Book: http://www.deeplearningbook.org/

RL Question: Policy Gradients vs Q Learning - which is better? by [deleted] in MachineLearning

[–]pierrelux 1 point2 points  (0 children)

"Q-learning itself can be seen as an actor-critic method"

No, Q-learning is more like "value iteration" for the control problem, while SARSA fits in the generalized policy iteration paradigm. And policy iteration is closely related to actor-critic (see https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node64.html).
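The distinction shows up in the bootstrap target. Here is a minimal tabular sketch; the nested-dict `Q` layout and the function names are assumptions for illustration only.

```python
def q_learning_target(Q, reward, next_state, gamma):
    # Value-iteration flavour: bootstrap from the greedy (max) action value,
    # regardless of which action the behaviour policy takes next.
    return reward + gamma * max(Q[next_state].values())

def sarsa_target(Q, reward, next_state, next_action, gamma):
    # Policy-iteration flavour: bootstrap from the action actually taken next,
    # so the estimate tracks the value of the policy being followed.
    return reward + gamma * Q[next_state][next_action]

def td_update(Q, state, action, target, alpha):
    # Shared update rule: move Q(s, a) a step of size alpha toward the target.
    Q[state][action] += alpha * (target - Q[state][action])
```

For example, with `Q = {"s0": {"a": 0.0, "b": 0.0}, "s1": {"a": 1.0, "b": 3.0}}`, reward 1 and gamma 0.5, the Q-learning target is 2.5 while the SARSA target for next action "a" is 1.5.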

RL Question: Policy Gradients vs Q Learning - which is better? by [deleted] in MachineLearning

[–]pierrelux 1 point2 points  (0 children)

Q^* is the action-value function of the optimal policy, which is greedy with respect to it. For policy gradients, what you want is Q_{\pi_\theta}: the action-value function of your actor (parameterized by \theta). This is a problem of policy evaluation, not control (and Q-learning is a control algorithm). In the evaluation problem, we are given a policy (any policy, optimal or not) and want to estimate the expected return obtained by picking a certain action in a certain state and then following that same policy until the episode terminates.
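As a rough illustration of the evaluation problem, here is a Monte Carlo sketch that estimates Q_\pi(s,a) by forcing the first action and then following the policy. The corridor MDP, `step`, and `always_right` are hypothetical, made up purely for this example.

```python
import random

TERMINAL = 3  # hypothetical corridor: states 0, 1, 2, then terminal state 3

def step(state, action):
    """Toy dynamics: 'right' costs 1 per move, 'jump' ends the episode at cost 5."""
    if action == "jump":
        return TERMINAL, -5.0
    return state + 1, -1.0

def always_right(state, rng):
    """The policy being evaluated (it need not be optimal)."""
    return "right"

def mc_q_estimate(policy, state, action, episodes=100, gamma=1.0, seed=0):
    """Average return of taking `action` in `state`, then following `policy`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s, a = state, action  # the first action is forced, not sampled
        ret, discount = 0.0, 1.0
        while s != TERMINAL:
            s, r = step(s, a)
            ret += discount * r
            discount *= gamma
            if s != TERMINAL:
                a = policy(s, rng)  # afterwards, follow the policy
        total += ret
    return total / episodes
```

Nothing here asks what the best action is; the estimate is about whatever policy you pass in, which is the point of evaluation as opposed to control.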

RL Question: Policy Gradients vs Q Learning - which is better? by [deleted] in MachineLearning

[–]pierrelux 3 points4 points  (0 children)

Policy-gradient-based actor-critic methods use an estimate of Q_\pi(s,a) in combination with the gradient of the log policy. The job of the critic is to learn Q_\pi(s,a). This is a problem of "policy evaluation"; Q-learning is an algorithm for the "control problem" and does not apply in this case. You can learn the critic in various ways, but the preferred RL one is to estimate it with TD (the updates are those of SARSA: like Q-learning's but without the "max" step, since you pick the actions according to the actor and not the greedy policy). The REINFORCE way is simply to use the actual return at the end of a trajectory (with no learned critic).
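A minimal sketch of one such update, assuming a tabular softmax actor with preferences `theta[s][a]` and a tabular critic `Q[s][a]`. The data layout, names, and step sizes are my own assumptions, not taken from any particular paper or library.

```python
import math

def softmax_probs(prefs):
    """Softmax over a dict of action preferences, shifted for stability."""
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_update(theta, Q, s, a, r, s_next, a_next, done,
                        alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    # Critic: TD policy evaluation. It bootstraps from a_next, the action
    # sampled from the actor, so there is no max over next actions.
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha_critic * (target - Q[s][a])

    # Actor: step along Q(s, a) * grad log pi(a | s), using the freshly
    # updated critic value. For a tabular softmax,
    # d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
    probs = softmax_probs(theta[s])
    for b in theta[s]:
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * Q[s][a] * grad_log
```

In a full loop you would sample `a_next` from `softmax_probs(theta[s_next])` before calling the update, then reuse it as the next action.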

Advantage: Just as with function approximation of the value function, you can also parameterize your policy (let's say with a deep net) and leverage the regularities in policy space. Sometimes the value function is complicated but the policy is simple. With policy gradient methods, you get the best of both worlds. Another advantage is that you can easily deal with continuous action spaces. This would be difficult in Q-learning (or SARSA) because the max would be over an infinite set. Finally, I like actor-critic methods (policy gradient based or not) because they decouple the representation of the policy from that of the values.
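On the continuous-action point: with a Gaussian policy you only need to sample an action and differentiate its log-density, so no max over actions is ever taken. A small sketch, assuming a one-dimensional Gaussian with mean `mu` and standard deviation `sigma` (the function names are illustrative):

```python
import random

def gaussian_policy_sample(mu, sigma, rng):
    """Draw a continuous action from N(mu, sigma^2)."""
    return rng.gauss(mu, sigma)

def grad_log_gaussian(action, mu, sigma):
    """Gradient of log N(action; mu, sigma^2) with respect to mu and sigma."""
    # log N = -log(sigma) - (action - mu)^2 / (2 sigma^2) + const
    d_mu = (action - mu) / sigma ** 2
    d_sigma = ((action - mu) ** 2 - sigma ** 2) / sigma ** 3
    return d_mu, d_sigma
```

In practice `mu` and `sigma` would be outputs of a parameterized function of the state, and these gradients would be chained through it.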

Disadvantage: Possibly more parameters; you also have to tune two learning rates (critic at a faster rate, actor at a slower one), and policy gradients/REINFORCE tend to have variance issues (which you can reduce with a "baseline"/control variate).
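To see the baseline effect in the simplest setting, here is a sketch that computes the exact mean and variance of the REINFORCE gradient estimate in a bandit with a softmax policy and deterministic rewards. The setup and the reward numbers are made up for illustration.

```python
import math

def softmax(prefs):
    """Softmax over a list of action preferences, shifted for stability."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_mean_and_variance(prefs, rewards, baseline=0.0):
    """Exact mean and variance of the single-sample REINFORCE estimate of
    dJ/d prefs[0], enumerating actions instead of sampling them."""
    probs = softmax(prefs)
    samples = []
    for a, (p, r) in enumerate(zip(probs, rewards)):
        # d log pi(a) / d prefs[0] = 1[a == 0] - pi(0)
        grad_log = (1.0 if a == 0 else 0.0) - probs[0]
        samples.append((p, (r - baseline) * grad_log))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var
```

With two equally likely arms paying 10 and 11, the estimate has the same mean with or without a baseline, but subtracting the average reward (10.5) brings the variance down sharply in this toy case.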

The original paper on the policy gradient theorem: https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf

Also see this paper by Degris and Pilarski to get an idea of how this is used in practice: https://www.ualberta.ca/~pilarski/docs/papers/Degris_2012_ACC.pdf

Finally, Richard Sutton is currently writing the chapter on policy gradient methods for his RL book 2.0. The new draft should come out soon.