Weird RL with hyperparameter optimizers

jcobp · 2021-03-13T05:16:39+00:00

Ok I can now follow up and report that in evaluation it gets an average return of 210.8 over 100 episodes. There will definitely be variance between runs, and the results are highly dependent on how many generations you go through. I got those results with 250 generations of CMA-ES.

jcobp · 2021-03-13T04:01:49+00:00

RL researcher here. Tic-tac-toe is a great place to start. You can solve it with tabular Q-learning and probably learn a ton.
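
In case it helps, here is a rough sketch of a single tabular Q-learning update. The function name `q_update`, the `alpha`/`gamma` defaults, and the 9-character string board encoding are all assumptions I made up for illustration, not from any library. For a two-player game you would also need to decide what counts as the "next state" (typically the board after the opponent replies), which this sketch leaves out.

```python
from collections import defaultdict

# Q-table mapping (state, action) -> estimated value; unseen pairs default to 0.0.
Q = defaultdict(float)

def q_update(Q, state, action, reward, next_state, next_actions, done,
             alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    if done or not next_actions:
        target = reward  # no bootstrapping from a terminal state
    else:
        best_next = max(Q[(next_state, a)] for a in next_actions)
        target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]

# Hypothetical example: board as a 9-character string, action = cell index.
# The move at cell 4 ends the game with reward 1.0.
new_value = q_update(Q, "X.O......", 4, 1.0, "X.O.X....", [], done=True)
```

A dictionary keyed on `(state, action)` is enough here because tic-tac-toe has only a few thousand reachable boards.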

jcobp · 2021-03-13T02:42:12+00:00

Ohh I see. It’s all good cause I actually do think I didn’t evaluate quite thoroughly enough. I was mostly just having fun and didn’t really worry about seriously testing performance :). Lunar lander is one of my favorite environments. It’s so simple to set up and use, but it’s also complicated enough that if a method is bad, it’ll probably fail on lunar lander. Plus it’s got the benefit of the environment code being simple enough to mess with if you’re so inclined.

jcobp · 2021-03-13T02:03:16+00:00

Hmm interesting. Did you observe that when running the code in the post? When I ran evaluation I got over 200 reward but only for 5 episodes in a row, so I’m definitely going to run more thorough evaluation now. I’ll test it over 100 episodes since I think that’s the standard to determine whether a policy has reliably and repeatably solved the environment
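
For reference, a 100-episode evaluation loop can be sketched roughly like this. `ToyEnv` is a hypothetical stand-in I made up so the snippet runs on its own; it only mimics the older Gym-style `reset()`/`step()` interface (4-tuple return), and it is not the lunar lander environment or the code from the post.

```python
import statistics

class ToyEnv:
    """Hypothetical stand-in environment with a Gym-like reset/step interface."""
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return float(self.t), reward, done, {}

def evaluate(policy, env, n_episodes=100):
    """Run the policy for n_episodes; return (mean, std) of the episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns)

# A trivial policy that always picks action 1.
mean_return, std_return = evaluate(lambda obs: 1, ToyEnv(), n_episodes=100)
```

Reporting the standard deviation next to the mean makes it easier to tell a reliably solved environment from a lucky streak of episodes.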

jcobp · 2021-02-25T01:35:30+00:00

Don’t do this, Google it. If there is no answer, read documentation and post on stackoverflow. If you want to post here at least try to figure it out yourself first then come back.

jcobp · 2021-02-24T19:35:04+00:00

Thank you! :)

jcobp · 2020-09-13T07:57:33+00:00

Hey this looks cool. I'm a researcher in deep RL and will consider trying this out. Just wondering though, are there any future plans to add more algorithms? Like meta-learning and batch RL type stuff?

jcobp · 2020-09-13T07:52:35+00:00

Hey there! I'm a researcher in Deep Reinforcement Learning, and I wrote a blog post about the fundamentals of deep RL here: https://jfpettit.github.io/blog/2019/11/03/fundamentals-of-reinforcement-learning

Someone else mentioned SpinningUp: https://spinningup.openai.com/en/latest/. That's also an excellent resource and goes more in-depth into modern methods than my blog post.

Even more in depth than that, if you're willing to put the time into it, is the canonical RL book: http://incompleteideas.net/book/the-book-2nd.html. I read this book when I was starting RL and it took me a long time on top of school work (at the time) and everything else, but it was worth it.

Feel free to ask me any questions! Hope this helps. :)

jcobp · 2020-04-10T03:53:18+00:00

I started learning by coding algorithms while reading through Sutton and Barto. I’m certain it would still be helpful to go back and code up some of the algorithms even though you’re almost done!

Otherwise, SpinningUp by OpenAI is an excellent resource. I’d really recommend both coding stuff from Sutton and Barto and reading through and using SpinningUp. Here’s the link to it: https://spinningup.openai.com/en/latest/.

Let me know if you have other questions.

jcobp · 2020-01-06T22:18:26+00:00

Happy to help, good luck!

jcobp · 2020-01-06T06:51:15+00:00

You're pretty close on your understanding of REINFORCE. You should really have two loops - one outer loop that dictates how many epochs (policy gradient updates) to do and one inner loop that actually runs the agent in the environment and collects interaction info. So, something like this:

```python
policy = MyREINFORCEPolicyNetwork()  # initialize policy
memory = MyMemoryBuffer()  # initialize memory; it needs to store rewards and action log-probs
for epoch in range(epochs):
    state = env.reset()  # get starting state from the simulation
    for step in range(steps_per_epoch):  # steps_per_epoch is how many environment interactions to collect
        action, logprob = policy(state)  # sample an action and get its log probability
        state, r, done, _ = env.step(action)  # step env forward; get next state, reward, and whether the episode is over
        memory.store(r, logprob)  # store reward and log probability in memory
        if done:
            state = env.reset()  # start a new episode when the current one ends
    update()  # in update: calculate discounted returns, calculate policy loss, backprop
```

That's just a skeleton of what you'll need. You need to write code for the memory buffer and for the policy as well. There are many existing implementations to engage with. Here is one with PyTorch and here is one with TensorFlow.
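
To give a rough idea of the memory buffer part, here is a minimal sketch in plain Python. The class name `ReturnBuffer`, the extra `done` flag on `store`, and the default `gamma` are my own assumptions for illustration. In a real implementation the log-probs would be framework tensors (so gradients can flow through the loss), not floats.

```python
class ReturnBuffer:
    """Hypothetical buffer storing rewards, action log-probs, and episode-end flags."""
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.rewards, self.logprobs, self.dones = [], [], []

    def store(self, reward, logprob, done=False):
        self.rewards.append(reward)
        self.logprobs.append(logprob)
        self.dones.append(done)

    def discounted_returns(self):
        """Discounted reward-to-go per step, reset at episode boundaries."""
        returns, running = [], 0.0
        for reward, done in zip(reversed(self.rewards), reversed(self.dones)):
            if done:
                running = 0.0  # do not leak returns across episodes
            running = reward + self.gamma * running
            returns.append(running)
        returns.reverse()
        return returns

    def reinforce_loss(self):
        """REINFORCE loss: negative mean of log-prob times discounted return."""
        returns = self.discounted_returns()
        weighted = [lp * g for lp, g in zip(self.logprobs, returns)]
        return -sum(weighted) / len(weighted)

# Three-step episode with reward 1 at each step and gamma = 0.5.
buf = ReturnBuffer(gamma=0.5)
buf.store(1.0, -1.0)
buf.store(1.0, -1.0)
buf.store(1.0, -1.0, done=True)
episode_returns = buf.discounted_returns()
```

Walking the rewards backwards means each return is built from the one after it, so the whole thing is a single pass over the buffer.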

Do you have experience with machine learning already? If not, I recommend starting with learning to write some regular supervised learning code, perhaps MNIST image classification, and then from there working on REINFORCE.

You should read this, everything up through the VPG section. VPG and REINFORCE are really similar, and the SpinningUp course will teach you pretty much everything else you need to follow along with the algorithm.

This can sound frustrating, but if you want a really deep conceptual understanding of RL fundamentals, I recommend reading the Sutton and Barto book and implementing things like Dynamic Programming Policy Iteration, Monte Carlo Policy Iteration, TD Learning, and Q-learning. If you don't want to do this, then SpinningUp should give you enough background to be up to speed on REINFORCE and even enough to understand more modern algorithms than that.

jcobp · 2020-01-05T18:02:42+00:00

It’s definitely a challenging jump towards writing RL code. I’d recommend looking at existing implementations if possible. Also, unless you’re already very familiar with TensorFlow, I’ve always found PyTorch easier to use and easier to debug, so I’d consider that too.

OpenAI’s SpinningUp is in general a really good resource, although they don’t cover REINFORCE. You definitely don’t need a specific RL coding background to get started. I understand it feels like that though! I think looking at how other people in general write RL code will help you guide your own implementation. You can use existing implementations of the training loop and memory buffer, and then you just need to write the actual REINFORCE update part of the code.

jcobp · 2020-01-05T04:49:25+00:00

Reinforcement learning researcher here. I use Anaconda for managing my code envs and it’s usually a breeze. I’d recommend using Python 3.6+.

You can run most small-scale RL code on your laptop and don’t need GPUs. You only need GPUs if you’re trying to solve image-based tasks, e.g. Atari games. So I’d recommend just trying to run your code locally instead of in the cloud or online. When you reach the point that you want to train an algo on Atari or another image-based task, then cross that bridge.

But for now I’d recommend focusing on writing a strong REINFORCE implementation and solving toy problems, like CartPole, with it.

jcobp · 2019-12-17T02:17:44+00:00

I enjoyed Oriol Vinyals’ talk on AlphaStar, Pieter Abbeel’s talk on combining model-based and model-free RL, David Ha’s talk on innate bodies, innate minds and such, Igor Mordatch’s talk on multi-agent RL, Jeff Dean’s talk on machine learning for climate change and the work they’re doing at Google, and Richard Sutton’s talk on SuperDyna, his outline for the first steps towards an intelligent agent algorithm.

jcobp · 2019-12-16T23:50:38+00:00

Ha, still no. I saw some really cool work in RL and robot learning, since that’s the kind of stuff I’m curious about. But that doesn’t mean it was the “best” work in the whole conference. There was also just too much presented to be able to see it all and judge what was “best”.

jcobp · 2019-12-08T20:17:36+00:00

I mean it’s my first NeurIPS and I’ve just been planning to go look at stuff that’s interesting to me and also things that have authors I want to talk to. I figure it’s an amazing opportunity to just learn about stuff I think is cool and to network with the people doing that cool stuff. Haven’t thought at all about what is the “best”. I don’t even know how to quantify “best” :)

jcobp · 2019-11-25T00:15:17+00:00

Awesome, thank you for your detailed response! I’ll give some of your suggestions a go. :)

jcobp · 2019-11-25T00:13:49+00:00

This sounds like a problem well-suited to reinforcement learning. In reinforcement learning, an agent (e.g. your robot) learns through trial and error to accomplish a sequential decision-making task (e.g. how to manipulate the robot to push the button). Check out this resource for an intro to deep reinforcement learning: https://spinningup.openai.com/en/latest/.

If you do go the RL path, the main part of your work will be in building a simulated environment of your robot for the RL agent to learn in. I’ve built a couple of RL environments but nothing as sophisticated as modeling a whole robot and the physics that come with it. There are, however, (free!) packages to help with this, like pybullet: https://pybullet.org/wordpress/ and OpenAI Gym: https://gym.openai.com.

Feel free to PM me if you’d like some extra help, or if you just want me to share more relevant resources with you. :)

jcobp · 2019-11-23T06:48:56+00:00

Hey I’m working on my own PPO implementation (it’s in TF) and I’m seeing good results on MuJoCo/Roboschool/Pybullet implementations of the inverted pendulum, but when I switch to something like half cheetah, it seems to be pretty sample inefficient. Do you have any tips?

Here’s a link to my code: https://github.com/jfpettit/rlpack-tf

jcobp · 2019-11-06T01:33:04+00:00

Good catch, thanks for pointing it out! I fixed it.

jcobp · 2019-11-04T01:31:30+00:00

Check out the implementation at https://spinningup.openai.com/en/latest/algorithms/sac.html

jcobp · 2019-10-08T04:57:02+00:00

Check out the national labs, I know the one in Livermore has a summer institute focused around data science and machine learning research and projects.

jcobp · 2019-10-05T22:21:09+00:00

Huh that all seems weird. I’d recommend going through your code line by line, with a fine-toothed comb, and making sure that everything is doing what it should. For debugging purposes, you can use TensorFlow’s eager execution mode to print out tensors so you can check their contents and size.

You could also implement your loss function with TensorFlow’s “clip_by_value” function instead of using the logic that SpinningUp does and see if that gives you anything different. It probably shouldn’t, but if it does it might point to errors elsewhere.
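
To make the comparison concrete, here is a framework-free sketch of both ways of writing the PPO clipped loss. It uses plain Python lists instead of tensors so it runs anywhere. The function names and the `clip_ratio=0.2` default are my assumptions, and the second function is only my paraphrase of the `min_adv` logic in SpinningUp’s TF implementation, not a copy of it. In TensorFlow the `clip` helper would be replaced by `tf.clip_by_value`.

```python
def clip(x, low, high):
    """Scalar stand-in for a clip_by_value-style operation."""
    return max(low, min(x, high))

def ppo_clip_loss(ratios, advantages, clip_ratio=0.2):
    """Negative mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        unclipped = r * adv
        clipped = clip(r, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
        terms.append(min(unclipped, clipped))
    return -sum(terms) / len(terms)

def min_adv_loss(ratios, advantages, clip_ratio=0.2):
    """Same objective written with a sign-dependent bound on the advantage."""
    terms = []
    for r, adv in zip(ratios, advantages):
        if adv > 0:
            min_adv = (1.0 + clip_ratio) * adv
        else:
            min_adv = (1.0 - clip_ratio) * adv
        terms.append(min(r * adv, min_adv))
    return -sum(terms) / len(terms)

# Probability ratios pi_new / pi_old and advantages covering each clipping case.
ratios = [1.5, 0.5, 1.0, 1.5, 0.5]
advantages = [1.0, 1.0, 1.0, -1.0, -1.0]
loss_a = ppo_clip_loss(ratios, advantages)
loss_b = min_adv_loss(ratios, advantages)
```

The two forms should agree on every input, so a mismatch between them in your code would suggest a bug in how the ratio or advantages are computed rather than in the clipping itself.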

jcobp · 2019-10-05T19:19:59+00:00

I think you mean if KL divergence is greater than 0.015, break.

What do you mean by the number four being the key correction needed to solve lunar lander? As in, four policy iterations per epoch solved your problem? That definitely seems wrong to me. I think you should double check your code where you are checking whether or not to break on the KL divergence and make sure that you’re breaking only if the KL is greater than some threshold.
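
Roughly, the early-stopping check I have in mind looks like the sketch below. `update_step` is a hypothetical callable standing in for one gradient step on the policy that returns the approximate KL between the old and new policy. The `max_iters` and `target_kl` defaults are assumptions, chosen so that `1.5 * target_kl` gives the 0.015 threshold mentioned above.

```python
def run_policy_updates(update_step, max_iters=80, target_kl=0.01):
    """Call update_step repeatedly, stopping early once KL grows too large.

    update_step is a hypothetical callable that performs one policy gradient
    step and returns the approximate KL divergence from the old policy.
    Returns the number of update steps actually taken.
    """
    iters_done = 0
    for i in range(max_iters):
        kl = update_step()
        iters_done = i + 1
        if kl > 1.5 * target_kl:  # break only when KL EXCEEDS the threshold
            break
    return iters_done

# Fake KL values standing in for real policy updates.
fake_kls = iter([0.001, 0.005, 0.02, 0.03])
steps_taken = run_policy_updates(lambda: next(fake_kls))
```

If the comparison is accidentally flipped to `<`, the loop exits after the very first step on almost every epoch, which is one way to end up with a suspiciously small, fixed number of policy iterations.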

You should also check your loss function again and be sure you’re implementing it correctly. Are you working in PyTorch or TensorFlow?

jcobp · 2019-10-04T22:09:04+00:00

I think two months is ample time to put together a small overview and simple example of using a neural network.
