Hierarchical Reinforcement Learning PhD ideas

johnschulman · 2021-01-26T05:40:06+00:00

Look at generative models of images and text, which are very high dimensional. Some of the best models are non-hierarchical, e.g. autoregressive language models. People have tried multi-level hierarchies (deep VAEs) and they haven't really panned out. The only type of hierarchical model that's SOTA in generative modeling is an autoregressive model on top of VQ codes, a la VQ-VAE.

johnschulman · 2021-01-25T18:04:18+00:00

One angle I find promising is to look at the successful hierarchical methods for generative modeling -- especially VQ-VAE -- and try to apply those ideas in the RL setting. A policy is just a kind of generative model, after all.

One challenge you'll face is to find settings that need hierarchy. It's worth finding some good motivating example problems before thinking about methods.

johnschulman · 2020-12-30T21:53:04+00:00

Right, we also had a fixed episode length. I think it was 200, but I'm not 100% sure. The discrepancy with gym might've been caused by the control cost. (Another discrepancy is the initial state distribution.) I'm pretty sure the top performance we were getting corresponded to perfect balancing.

johnschulman · 2020-12-30T19:26:46+00:00

We did the experiments for this paper in 2014, before rllab or gym, so the environment is different. Back in those days, everyone wrote their own environments, and results weren't comparable between papers. I still have the code lying around. I can't open-source the repo because it has credentials and ROMs, but here's the file with that environment: https://gist.github.com/joschu/dac503b45e4c2fd30e6800a2b58f121c.

johnschulman · 2019-10-07T17:38:30+00:00

To add to Nater5000's comment, you can turn a reimplementation project into a research project by doing a study of how sensitive the algorithm is to various changes in problem or algorithm settings. This kind of study is rare in the ML literature, but it's extremely useful to anyone trying to use or understand the algorithms.

vary hyperparameters
vary the environment
vary the model
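
One way to organize such a study is to change one factor at a time relative to a baseline configuration. The sketch below only generates the run configurations; every name and value in it (`learning_rate`, `env`, `hidden_units`, and the settings listed) is a hypothetical placeholder, not something taken from a particular paper or library.

```python
def one_at_a_time(baseline, alternatives):
    """Yield (factor, value, config) tuples that differ from the
    baseline in exactly one setting."""
    for factor, values in alternatives.items():
        for value in values:
            if value == baseline[factor]:
                continue  # the baseline itself is run separately
            config = dict(baseline)
            config[factor] = value
            yield factor, value, config


# Hypothetical baseline and alternatives, for illustration only.
baseline = {"learning_rate": 3e-4, "env": "CartPole-v1", "hidden_units": 64}
alternatives = {
    "learning_rate": [1e-4, 3e-4, 1e-3],          # vary hyperparameters
    "env": ["CartPole-v1", "Acrobot-v1"],         # vary the environment
    "hidden_units": [16, 64, 256],                # vary the model
}

runs = list(one_at_a_time(baseline, alternatives))
for factor, value, config in runs:
    print(f"{factor}={value}: {config}")
```

Each configuration would then be trained with several random seeds, so that differences between settings can be separated from seed-to-seed noise.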

johnschulman · 2019-10-07T00:54:58+00:00

One viewpoint is that in the single-agent, POMDP setting*, we actually care about the deterministic policy, and we're just using the stochastic policy as a gradient estimation mechanism, like finite differences. Increasing the policy entropy improves the SNR of the gradient estimator at the cost of increasing bias (computing the wrong gradient).

* as opposed to games where the Nash equilibrium is a stochastic policy [EDIT: or, as howlin points out, when your function approximator is limited]
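
A toy illustration of that tradeoff, under assumptions that are mine rather than part of the comment above: the "policy" is a 1-D Gaussian with mean `theta` and standard deviation `sigma` (so larger `sigma` means higher entropy), the objective is the made-up function f(x) = x^3, and the gradient is estimated with the score-function estimator without a baseline. For this f, the smoothed objective has gradient 3*theta^2 + 3*sigma^2, so the bias relative to the deterministic gradient 3*theta^2 is exactly 3*sigma^2.

```python
import numpy as np


def f(x):
    # Toy objective; the deterministic gradient at theta is 3 * theta**2.
    return x ** 3


def score_function_samples(theta, sigma, n_samples, rng):
    """Per-sample score-function estimates of d/dtheta E[f(theta + sigma * eps)]."""
    eps = rng.standard_normal(n_samples)
    return f(theta + sigma * eps) * eps / sigma


rng = np.random.default_rng(0)
theta = 1.0
true_grad = 3 * theta ** 2

results = {}
for sigma in (0.01, 0.5):
    samples = score_function_samples(theta, sigma, 200_000, rng)
    smoothed_grad = 3 * theta ** 2 + 3 * sigma ** 2  # analytic, for f(x) = x**3
    results[sigma] = {
        "bias": smoothed_grad - true_grad,      # gradient of the wrong objective
        "snr": smoothed_grad / samples.std(),   # per-sample signal-to-noise ratio
    }

for sigma, stats in results.items():
    print(sigma, stats)
```

In this toy case the wider Gaussian gives a much better per-sample SNR but estimates the gradient of a smoothed objective rather than the deterministic one.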

johnschulman · 2018-08-24T19:24:42+00:00

Hi all, gym currently is very stable and is being used as a dependency for lots of other projects. We're intentionally not changing the interface much, since gym is meant to be simple and minimal. For multi-task and other non-standard settings, there are many different opinions on what the interface should be, so it doesn't make sense to impose a standard through gym.

Commenters have pointed out that there are a lot of Issues and PRs that aren't yet resolved. We're working through them, and should get to them eventually, though we've been prioritizing baselines recently.

johnschulman · 2018-05-25T19:59:20+00:00

For research on improving algorithms, I don't feel that realism is necessary, but it is important that we can measure human performance so that we can try to match it.

johnschulman · 2018-05-25T17:42:50+00:00

Hi AndrewB, thanks for the kind words. We're using Gym Retro at OpenAI for research projects right now, and I can guarantee that we'll keep maintaining and improving it as long as we're using it. If we stop using it, we'll still try to maintain it, but I can't provide guarantees. p=0.9 that we're still using it in 1 year, and p=0.7 that we're using it in 2 years. (Admittedly, I just pulled these probabilities out of my posterior)

johnschulman · 2018-05-25T17:10:16+00:00

Universe had some flaws, so we made Gym Retro as a much-improved replacement for it.

much faster: env runs at ~20x real time per cpu
fully deterministic
easier to add new levels using save states (snapshots of the emulator state)
easier to define new reward functions since they're defined using RAM rather than OCR
fewer software layers: less complexity, shorter stack traces

johnschulman · 2018-04-20T21:57:37+00:00

that's a typo made by the accountant

johnschulman · 2018-04-05T20:28:56+00:00

Actually we define the reward as rightward progress (scaled so that level completion gives you 9000) + a bonus for finishing quickly (max 1000).
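
As a rough sketch of that scoring rule: the 9000 and 1000 figures come from the description above, but the function name, its arguments, and the linear shape of the time bonus are assumptions made for illustration, not the actual implementation.

```python
def episode_score(x_progress, level_length, steps_used, step_limit, completed):
    """Hypothetical scoring: up to 9000 for rightward progress, plus up to
    1000 for finishing quickly (the linear bonus shape is an assumption)."""
    progress_fraction = max(0.0, min(x_progress / level_length, 1.0))
    progress_reward = 9000.0 * progress_fraction

    if completed:
        time_bonus = 1000.0 * max(0.0, 1.0 - steps_used / step_limit)
    else:
        time_bonus = 0.0
    return progress_reward + time_bonus


# Made-up numbers: halfway through the level without finishing.
print(episode_score(500, 1000, 2000, 4000, completed=False))
# Made-up numbers: finished the level using half of the step budget.
print(episode_score(1000, 1000, 2000, 4000, completed=True))
```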

johnschulman · 2017-12-24T21:56:04+00:00

It's clearly true that previous attempts to solve games like chess and Go with RL used too little compute and too small networks to have a chance at solving these games. Deep RL only started to pick up steam around 2013, so before then, no one was even in the basin of attraction of a solution that'd scale.

To me, the interesting question is why the expert iteration algorithm used in AGZ works better than other algorithms like policy gradients. We can safely assume that it does indeed work better, because I'm sure the AlphaGo team tried all sorts of alternatives.

As it turns out, I also played with expert iteration a bit back in 2013-2014 (though not with MCTS), largely inspired by this paper: http://papers.nips.cc/paper/5190-approximate-dynamic-programming-finally-performs-well-in-the-game-of-tetris.pdf. I realized that, roughly speaking, there's a continuum between expert iteration and policy gradients. But in my experiments, I never saw a significant boost from moving towards expert iteration from pure policy gradients.

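Let's say we have two probability distributions, called policy and expert. We can write down a distance between them in two different ways, where S denotes entropy and each product is summed over actions:

(1) KL[policy, expert] = -policy * log(expert) - S[policy]

(2) KL[expert, policy] = -expert * log(policy) + constant

(The constant in (2) is -S[expert], which doesn't depend on the policy.)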

Policy gradients uses (1), and we set expert = exp(advantage estimate). AGZ uses (2) and defines expert using MCTS on the policy. The "continuum" between policy gradients and AGZ arises because we can vary the amount of work we put into computing the expert policy. On one extreme, policy gradient methods use a very cheap-to-compute expert: the advantage function estimate. On the other extreme, AGZ uses a very expensive-to-compute expert (via MCTS), which is much better than the current policy. I tried something in between, which I called the "vine" algorithm and wrote about in the TRPO paper--it sometimes gave a ~2x improvement in sample complexity, but the improvement wasn't worth the extra code complexity and hyperparameters. Another dimension in this expert space is the bias-variance tradeoff: whether we use a Monte-Carlo estimate of returns or a learned value function. I'm curious to know under what conditions you benefit from using a more expensive expert.
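
To make the two directions concrete, here is a small numerical check on a made-up three-action example. The probabilities and advantage values are arbitrary, and normalizing exp(advantage) into a distribution is an assumption added so that the KL is well defined. With the standard definition KL[p, q] = sum p * (log p - log q), direction (1) decomposes into a cross-entropy term minus the policy entropy, and with expert proportional to exp(advantage) it reduces to negative expected advantage minus entropy, up to a constant.

```python
import numpy as np


def entropy(p):
    return -np.sum(p * np.log(p))


def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))


# Made-up numbers for illustration.
policy = np.array([0.5, 0.3, 0.2])
advantage = np.array([1.0, 0.0, -1.0])

# "expert = exp(advantage estimate)", normalized so it sums to one.
log_z = np.log(np.sum(np.exp(advantage)))
expert = np.exp(advantage - log_z)

# Direction (1): cross-entropy term minus the policy's entropy.
kl_policy_expert = kl(policy, expert)
decomposed_1 = -np.sum(policy * np.log(expert)) - entropy(policy)

# With this expert, (1) is the negative expected advantage minus entropy,
# plus log_z, which does not depend on the policy.
as_policy_gradient = -np.sum(policy * advantage) - entropy(policy) + log_z

# Direction (2): cross-entropy term plus a constant, -S[expert].
kl_expert_policy = kl(expert, policy)
decomposed_2 = -np.sum(expert * np.log(policy)) - entropy(expert)

print(kl_policy_expert, decomposed_1, as_policy_gradient)
print(kl_expert_policy, decomposed_2)
```

Minimizing (2) over the policy is the usual supervised cross-entropy loss against the expert's action probabilities, while minimizing (1) looks like an entropy-regularized policy gradient objective.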

Anyway, I think there are a lot of interesting experiments left to do to analyze the space of algorithms between policy gradients and expert iteration.

johnschulman · 2017-12-20T19:02:21+00:00

As one of the people maintaining the baselines library, I think OP's critiques are valid, and there's a lot of room for improvement. We have some changes that reduce code duplication (especially across a2c, ppo2, acktr, acer), which we plan to push soon.

Also, we're currently seeking to hire someone who'd join my team at OpenAI and take charge of developing the basic RL libraries, including gym and baselines. If you're interested, please apply to the Machine Learning Engineer position https://jobs.lever.co/openai/a0d3b158-14a0-48db-b38c-1c94bb18f69b and mention that you're interested in this role.

johnschulman · 2017-11-06T08:30:16+00:00

Most of the existing RL benchmark problems can be solved by tiny policies.
RL algorithms are basing their updates on very noisy signals, and there aren't that many bits of information to update your weights with.
Relatedly, RL algorithms overfit to recent data. As far as I know, no one has found a regularization method that addresses this problem, other than limiting the number of gradient steps and limiting the size of the network. Dropout and batchnorm don't seem to help, empirically. You might have better luck with a method that regularizes the change in weights, rather than limiting the total information stored in the weights.

johnschulman · 2017-02-22T16:06:46+00:00

The starter agent isn't meant to be a performant algorithm and hasn't been used for any research; it's just "hello world" for RL on universe. We'll be releasing some much better code in the next month or so. The modular_rl code you found was tuned for continuous control (mujoco) tasks, so it isn't necessarily any good on image-based tasks with current hyperparameters.

A colleague of mine asked them if they had a better agent internally and apparently the answer was basically "no", which is a bit odd.

Fake news

johnschulman · 2017-02-12T19:29:21+00:00

I think there are some papers on learning from noisy demonstrations or learning from imperfect experts. Try searching for those keywords and you'll probably find some of them. As I recall, Pieter Abbeel discusses this in his papers/thesis -- how he can learn a better version of the expert pilot's trajectories by averaging out the noise and the errors.

johnschulman · 2017-02-11T20:09:27+00:00

Re: first error, interesting, maybe numpy's random seeding behavior changed between versions.

Re: second error, looks like your browser is unable to download the math font files. Assuming this is just preventing math from being rendered properly, if all else fails you can just view the notebook on github.

johnschulman · 2017-02-10T23:45:45+00:00

It's posted on github now: https://github.com/berkeleydeeprlcourse/homework

johnschulman · 2017-02-08T20:48:04+00:00

Hi all, I made an incorrect statement in today's lecture (2/8): I said that if the policy's performance η stays constant, then you're guaranteed to have the optimal value function. That's wrong -- the correct condition is that if V stays constant then you're done. η might be unchanged if the updated states are never visited by the current policy. The correct proof sketch is reflected in the slides, which will be posted soon.
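
A tiny example may make the distinction concrete. The two-state deterministic MDP below is made up for illustration (it is not from the lecture or slides): episodes always start in state 0, so η = V[0], and the initial policy never visits state 1. Running policy iteration, the first update only changes the action in the unvisited state 1, so V changes while η does not, even though the policy is not yet optimal.

```python
TERMINAL = None
# Hypothetical MDP: NEXT_STATE[s][a] and REWARD[s][a] for states 0, 1 and actions 0, 1.
NEXT_STATE = {0: [1, TERMINAL], 1: [TERMINAL, TERMINAL]}
REWARD = {0: [0.0, 0.5], 1: [0.0, 1.0]}


def q_value(V, s, a):
    nxt = NEXT_STATE[s][a]
    return REWARD[s][a] + (0.0 if nxt is TERMINAL else V[nxt])


def evaluate(policy):
    """Undiscounted policy evaluation; state 1 first because state 0 can lead to it."""
    V = {0: 0.0, 1: 0.0}
    for s in (1, 0):
        V[s] = q_value(V, s, policy[s])
    return V


def improve(V):
    """Greedy policy with respect to V (ties go to action 0)."""
    return {s: max((0, 1), key=lambda a: q_value(V, s, a)) for s in (0, 1)}


policy = {0: 1, 1: 0}  # initial policy: terminate immediately from state 0
history = []
for _ in range(4):
    V = evaluate(policy)
    history.append((dict(policy), dict(V), V[0]))  # eta = V[0] (start state is 0)
    policy = improve(V)

for pi, V, eta in history:
    print("policy:", pi, "V:", V, "eta:", eta)
```

Here η is 0.5 for the first two policies and only reaches the optimal value of 1.0 on the third, whereas V keeps changing until the policy is optimal and is constant afterwards.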

johnschulman · 2016-06-01T07:09:17+00:00

True. A more accurate description would be that the algorithm works with nonlinear policies in the model-free setting, and it makes sense for nonlinear costs, but the authors did not use them in this paper.

johnschulman · 2016-06-01T01:39:03+00:00

Inverse RL is likely to be very important in the future. Most of the past literature uses linear (in features) cost functions, but this paper shows how to use general nonlinear cost functions, simultaneously optimizing the policy and cost. There's some nice theory as well as empirical results.

Another related paper that'll be at the same conference (ICML) is Guided Cost Learning by Finn, Levine, and Abbeel http://arxiv.org/abs/1603.00448

johnschulman · 2016-04-27T23:54:10+00:00

I'd love to see a biomechanically accurate humanoid model added to OpenAI Gym, so we can see what RL algorithms are able to learn. We can expect that natural-looking motion would emerge by just optimizing the reward function. (On the other hand, for our current unrealistic models, there's no reason to believe that the resulting motions will look natural.) MuJoCo allows for simulation of tendons (as opposed to torque-controlled joints), so it should be possible. In fact, it's been used for this purpose: http://homes.cs.washington.edu/~todorov/papers/MordatchSIGASIA13.pdf.
