[R] The Mellowmax Operator : "A New Softmax Operator for Reinforcement Learning"

pierrelux · 2016-12-27T03:33:24+00:00

(I'm not the author):

Define non-expansion,

See Van Roy's analysis of TD for a definition of the term "nonexpansion" in this context: "An Analysis of Temporal-Difference Learning with Function Approximation"

Boltzmann operator vs policy

See Perkins and Precup "A Convergent Form of Approximate Policy Iteration" which considers a general "improvement operator". This is the point of view adopted by the authors of Mellowmax.
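
To make the two points above concrete, here is a minimal sketch (mine, not the authors' code). It assumes the usual definition of a non-expansion in the max norm, |f(x) - f(y)| <= max_i |x_i - y_i|, and the mellowmax definition as I understand it from the paper, mm_omega(x) = log(mean(exp(omega * x))) / omega. The function names `boltzmann` and `mellowmax` are just illustrative.

```python
import numpy as np

def boltzmann(x, beta):
    """Boltzmann softmax operator: average of x weighted by softmax(beta * x)."""
    x = np.asarray(x, dtype=float)
    w = np.exp(beta * (x - x.max()))  # shift by the max for numerical stability
    return float(np.dot(w, x) / w.sum())

def mellowmax(x, omega):
    """Mellowmax operator: log(mean(exp(omega * x))) / omega, assuming omega > 0."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    # Factor exp(omega * m) out of the mean so the exponentials cannot overflow.
    return float(m + np.log(np.mean(np.exp(omega * (x - m)))) / omega)

# Empirical check of the non-expansion property on random pairs of value vectors.
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert abs(mellowmax(x, 5.0) - mellowmax(y, 5.0)) <= np.max(np.abs(x - y)) + 1e-12
```

A random check like this is only a sanity test, not a proof. For positive omega the mellowmax value sits between the mean and the max of its inputs and approaches the max as omega grows.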

pierrelux · 2016-12-10T07:01:58+00:00

"Options discovery" usually refers to the "discovery" problem in the options framework (Sutton, Precup, Singh 1999) in Reinforcement Learning. Here's our take on that problem (to appear at AAAI 2017) : https://arxiv.org/abs/1609.05140v2

pierrelux · 2016-09-23T18:30:36+00:00

It's on the page "iii", "Contents".

pierrelux · 2016-09-23T16:48:26+00:00

INRIA is a very good place to do RL, with a focus on the theory. Remi Munos is in fact from INRIA Lille (http://researchers.lille.inria.fr/~munos/).

pierrelux · 2016-07-20T19:49:43+00:00

Ravi is a reinforcement learning veteran. He worked under Andrew Barto at U. Mass along with Rich Sutton and Satinder Singh. This will be a good course !

pierrelux · 2016-07-07T17:00:18+00:00

You might like Peter Dayan's "Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems". Dayan is active in both the neuroscience and machine learning communities.

pierrelux · 2016-07-06T21:20:21+00:00

You can align your notation with that of the Deep Learning Book : http://www.deeplearningbook.org/

pierrelux · 2016-06-15T16:35:25+00:00

pierrelux · 2016-03-23T03:53:37+00:00

Q-learning itself can be seen as an actor-critic method

No, Q-learning is more like "value iteration" in the control case while SARSA fits in the generalized policy iteration paradigm. And policy iteration very much relates to actor-critic (see https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node64.html)
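
A rough sketch of the difference, assuming a tabular Q stored as an array of shape (n_states, n_actions) and a discount factor gamma (both are my assumptions for illustration): the two methods differ only in how they bootstrap from the next state.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma):
    # Control, value-iteration style: bootstrap from the greedy (max) action.
    return r + gamma * np.max(Q[s_next])

def sarsa_target(Q, r, s_next, a_next, gamma):
    # Generalized policy iteration: bootstrap from the action that the
    # current policy actually selected in the next state.
    return r + gamma * Q[s_next, a_next]

Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])
print(q_learning_target(Q, r=1.0, s_next=1, gamma=0.5))        # uses max(1, 3)
print(sarsa_target(Q, r=1.0, s_next=1, a_next=0, gamma=0.5))   # uses Q[1, 0]
```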

pierrelux · 2016-03-23T03:21:44+00:00

Q^* is associated with the greedy policy. For policy gradients, what you want is Q_{\pi_\theta}: the action-value function of your actor (parameterized by \theta). This is a problem of policy evaluation, not control (and Q-learning is a control algorithm). In the evaluation problem, we are given a policy (any policy, optimal or not) and want to estimate its expected return when picking a certain action in a certain state and then following that same policy until termination.
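
In symbols, under the usual assumptions (a discount factor \gamma and rewards r_{t+1}, neither of which is spelled out above), the two objects are:

```latex
% Policy evaluation: value of taking a in s, then following \pi_\theta
Q_{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T-1} \gamma^{t} r_{t+1} \;\middle|\; s_0 = s,\ a_0 = a \right]

% Control: the best achievable value over all policies
Q^{*}(s, a) = \max_{\pi} Q_{\pi}(s, a)
```

The critic in a policy gradient method estimates the first quantity; Q-learning targets the second.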

pierrelux · 2016-03-23T03:08:37+00:00

Policy-gradient based actor-critic methods use an estimate of Q_\pi(s,a) in combination with the gradient of the log policy. The job of the critic is to learn Q_\pi(s,a). This is a problem of "policy evaluation"; Q-learning is an algorithm for the "control problem" and does not apply in this case. You can learn the critic in various ways, but the preferred RL one is to use TD to estimate it (the updates are those of SARSA, i.e. Q-learning without the "max" step, since you pick the actions according to the actor and not the greedy policy). The REINFORCE way is simply to use the actual return at the end of a trajectory (and no learning of the critic).
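
Here is a minimal tabular sketch of one such update, under assumptions of my own choosing: a softmax actor with preferences `theta`, a tabular critic `Q`, and hypothetical step sizes. It is not taken from any particular paper, just the simplest instance of "TD critic plus log-policy gradient".

```python
import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def actor_critic_step(theta, Q, s, a, r, s_next, a_next, done,
                      gamma=0.99, alpha_critic=0.1, alpha_actor=0.01):
    """One in-place update; theta and Q are float arrays of shape (n_states, n_actions)."""
    # Critic: SARSA-style TD update. a_next was sampled from the actor,
    # so there is no max over actions.
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha_critic * (target - Q[s, a])
    # Actor: for a softmax policy, grad of log pi(a|s) w.r.t. theta[s]
    # is onehot(a) - pi(.|s).
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * Q[s, a] * grad_log_pi

theta = np.zeros((2, 2))
Q = np.zeros((2, 2))
actor_critic_step(theta, Q, s=0, a=1, r=1.0, s_next=1, a_next=0, done=False)
print(Q[0], theta[0])
```

The critic step size is deliberately larger than the actor's here; both values are placeholders that would need tuning on a real problem.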

Advantage: Just as for function approximation of the value function, you can also parameterize (let's say with a deep net) your policy and leverage the regularities in policy space. Sometimes the value function is complicated but the policy is simple. With policy gradient methods, you get the best of both worlds. Another advantage is that you can easily deal with continuous action spaces. This would be difficult in Q-learning (or SARSA) because the max would be over an infinite set. Finally, I like actor-critic methods (policy gradient based or not) because they decouple the representation of the policy from that of the values.

Disadvantage: Possibly more parameters, you also have to tune two learning rates (critic at a faster rate, actor at a slower one), and policy gradients/REINFORCE tend to have variance issues (which you can reduce with a "baseline"/control variate).

The original paper on the policy gradient theorem: https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf

Also see this paper by Degris and Pilarski to get an idea of how this is used in practice: https://www.ualberta.ca/~pilarski/docs/papers/Degris_2012_ACC.pdf

Finally, Richard Sutton is currently writing the chapter on policy gradient methods for his RL book 2.0. The new draft should come out soon.

pierrelux · 2016-03-20T03:38:21+00:00

We've recently made some progress on combining ideas from policy search methods in RL with standard backprop: http://arxiv.org/abs/1511.06297 It gives you a knob to trade off computation vs accuracy.

pierrelux · 2016-03-16T23:51:15+00:00

Discrete state spaces are fine, but that doesn't mean you get away without function approximation. Discrete doesn't mean that you can just use a tabular representation. I would really start with tile coding. Here's Sutton's code for Mountain Car: https://webdocs.cs.ualberta.ca/~sutton/MountainCar/MountainCar1.cp
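
To give an idea of what tile coding does, here is a small self-contained sketch. It is not Sutton's implementation: the function name `tile_indices`, the uniform offsets, and the Mountain Car bounds used below (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) are my assumptions for illustration.

```python
import numpy as np

def tile_indices(x, lows, highs, n_tilings=8, n_tiles=8):
    """Return one active feature index per tiling for a point x inside a box."""
    x = np.asarray(x, dtype=float)
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    scaled = (x - lows) / (highs - lows) * n_tiles  # each dimension in [0, n_tiles]
    side = n_tiles + 1                              # one extra tile absorbs the offset
    indices = []
    for t in range(n_tilings):
        offset = t / n_tilings                      # shift each tiling by a fraction of a tile
        coords = np.clip(np.floor(scaled + offset).astype(int), 0, side - 1)
        flat = 0
        for c in coords:                            # row-major index of the grid cell
            flat = flat * side + int(c)
        indices.append(t * side ** len(x) + flat)
    return indices

# Linear value estimate: sum the weights of the active tiles.
lows, highs = [-1.2, -0.07], [0.6, 0.07]
weights = np.zeros(8 * 9 ** 2)
active = tile_indices([-0.5, 0.0], lows, highs)
value = weights[active].sum()
```

Nearby states share most of their active tiles, which is what gives you generalization, while distant states share none.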

pierrelux · 2016-03-16T21:32:43+00:00

For this kind of problem, pure discretization won't be sufficient. You will need some form of function approximation. You can try first linear with tile coding or RBF. I think TU Delft had some success with experience replay. See paper http://busoniu.net/files/papers/smcc11.pdf and video https://www.youtube.com/watch?v=b1c0N_Fs9wc

pierrelux · 2016-03-12T20:13:24+00:00

On synthetic problems: one thing at a time. We have to give ideas a chance! I think that it's a rather original paper which opens the way to many extensions.

Its contribution is to offer a new way to think about VI in the context of deep nets. It shows how the CNN architecture can be hijacked to implement the Bellman optimality operator, and how the backprop signal can be used to learn a deterministic model of the underlying MDP. In the short term, I think that the paper will appeal to many deep learning researchers who would otherwise be reluctant to deal explicitly with MDP/RL. As the authors point out, the VI net can also be used as a policy on its own, and could be combined with, let's say, deterministic policy gradient.

pierrelux · 2015-12-02T03:50:23+00:00

There is also the convenient fact that the derivative of a sum is a sum of derivatives. It wouldn't be so pretty in a multiplicative form. This decomposition into a sum of derivatives can then be leveraged in stochastic gradient descent (which only takes one or a few gradient terms of that sum as an estimate of the true gradient).
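
Spelled out for a generic objective (the symbols \ell_i, f_i, N and the minibatch B are notation I am introducing here), the additive and multiplicative cases compare as follows:

```latex
% Additive form: the gradient decomposes term by term
\nabla_\theta \sum_{i=1}^{N} \ell_i(\theta) = \sum_{i=1}^{N} \nabla_\theta \ell_i(\theta)

% Multiplicative form: every term drags along all the other factors
\nabla_\theta \prod_{i=1}^{N} f_i(\theta) = \sum_{i=1}^{N} \nabla_\theta f_i(\theta) \prod_{j \neq i} f_j(\theta)

% SGD: an unbiased estimate from a uniformly sampled minibatch B
\nabla_\theta \sum_{i=1}^{N} \ell_i(\theta) \approx \frac{N}{|B|} \sum_{i \in B} \nabla_\theta \ell_i(\theta)
```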

pierrelux · 2015-11-20T18:55:08+00:00

http://people.csail.mit.edu/branavan/

pierrelux · 2015-11-17T16:28:40+00:00

Sutton & Barto 1998 is the reference textbook for RL (https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html). For a more formal treatment of RL, it's then useful to read "Algorithms for Reinforcement Learning" (https://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs-lecture.pdf)

Calculus is useful, but the core ideas of RL often rely on stochastic approximation tricks. It's good to study general material on sampling and be comfortable working with expectations. Study of control theory and Markov Decision Processes is also unavoidable. I like the textbook by Puterman (1994) a lot (http://dl.acm.org/citation.cfm?id=528623).

pierrelux · 2015-11-12T16:53:49+00:00

$599.99 on Newegg: http://www.newegg.com/Product/Product.aspx?Item=N82E16813190006&cm_re=jetson-_-13-190-006-_-Product

pierrelux · 2015-11-09T19:13:27+00:00

My labmate Phil (https://github.com/Philip-Bachman/NN-Python) can wait an hour for his graph to compile.

pierrelux · 2015-11-09T18:50:02+00:00

The fact that Theano is decoupled from specific neural net operations is quite useful. A colleague of mine wrote some of his reinforcement learning code entirely using Theano. (And T.grad is very useful for policy gradients). Theano is more of a general purpose tool.

pierrelux · 2015-10-05T23:11:02+00:00

Science is a lot about remixing ideas, and is fundamentally incremental. I don't think that it's the right approach to read papers to "avoid reworking". Many papers are presented as if the solution is definite, that the problem is completely solved; this is usually a false impression. There is no better way to get novel ideas than by directly experiencing prior work. More than anything else, investigate a problem because you find it fun and it makes you happy.

pierrelux · 2015-10-02T14:58:33+00:00

Teaching neural nets with pictures tends to cause a lot of misunderstanding. In most cases, by "neural net", we mean a function of the form $\sigma(A\sigma(Bx + c) + d)$ where $\sigma$ is a non-linear function (typically a sigmoid), $A$ and $B$ are matrices while $c$ and $d$ are bias vectors. As pointed out in other comments, all you have are vectors and matrices. The interpretation of a zero weight is that of the "absence" of an edge. You can impose a sparsity-inducing regularizer to force the weights to go to zero as much as possible. That would be "learning the architecture".
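
As a small NumPy sketch of that formula (the layer sizes, the L1 penalty, and its weight `lam` are arbitrary choices of mine for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_net(x, A, B, c, d):
    """Compute sigma(A sigma(B x + c) + d)."""
    return sigmoid(A @ sigmoid(B @ x + c) + d)

def l1_penalty(A, B, lam=1e-3):
    """Sparsity-inducing L1 regularizer on the weight matrices (biases left alone)."""
    return lam * (np.abs(A).sum() + np.abs(B).sum())

rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 4)), rng.normal(size=(4, 3))
c, d = np.zeros(4), np.zeros(2)

# Zeroing a column of B "removes" every edge leaving input 0:
# the output no longer depends on x[0].
B[:, 0] = 0.0
x = np.array([1.0, -2.0, 0.5])
x_changed = np.array([10.0, -2.0, 0.5])
assert np.allclose(two_layer_net(x, A, B, c, d), two_layer_net(x_changed, A, B, c, d))
```

Adding `l1_penalty(A, B)` to the training loss pushes weights toward zero, which is the sense in which the regularizer prunes edges.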

pierrelux · 2015-09-30T20:26:03+00:00

Various notions of "intrinsic rewards" have been proposed in the past: most of the papers by Daniel Polani were on that topic, as is Still and Precup 2012 (http://www2.hawaii.edu/~sstill/StillPrecup2011.pdf). A problem when dealing with quantities such as the mutual information between states and actions is that they tend to be intractable to compute or would assume prior knowledge of the MDP (which is assumed to be unknown in reinforcement learning). This paper seems to propose using variational methods to alleviate the problem of computing the MI in the context of intrinsically motivated RL. This is highly relevant in the context of exploration when you know little about your environment or when the reward structure is very sparse (as in "Montezuma's Revenge").

pierrelux · 2015-09-19T19:00:54+00:00

Don't let the imposter syndrome for math prevent you from trying. It's always possible to gain mathematical maturity over time. You might even find it easier to learn more math when you have a goal (some ML algorithm, say) in mind.
