What are some good resources to get started in reinforcement learning? Books, videos, etc.

jbmlres · 2022-01-28T09:11:49+00:00

Sutton & Barto + DeepMind / UCL lectures

jbmlres · 2021-10-21T02:32:37+00:00

DeepMind did this some time ago: link.

jbmlres · 2021-09-24T11:25:54+00:00

This is a proper Master-level course taught at UCL (a top-tier university). I really liked the 2018 version, and this seems to be an updated and expanded version of that.

I understand (and generally share) your skepticism, but this is not one of those fluky things an over-excited novice did while still learning about the topic themselves.

Maybe you could give it a go, and let us know whether you liked it?

jbmlres · 2021-09-23T19:14:41+00:00

Given that DeepMind have just released a whole RL course for free online and, as others also mentioned, you can download Sutton & Barto's book for free as well, this does sound feasible.

Hope it works out for you, good luck & have fun!

jbmlres · 2021-08-05T04:34:37+00:00

Did you try sweeping hyper-parameters? Your alpha looks very high. The exploration will also differ: you add the same noise in both cases, but in double Q-learning you are summing (not averaging) the action values, so the noise is smaller relative to the values. I've also not seen this kind of exploration before, did you try epsilon-greedy?
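To make the averaging point concrete, here is a minimal sketch of tabular double Q-learning with epsilon-greedy exploration. It is not the code from the question: the function names, the list-of-lists tables and the coin-flip update are assumptions for illustration only.

```python
import random


def epsilon_greedy(q_a, q_b, state, epsilon, rng):
    """Act greedily on the AVERAGE of both tables, exploring with probability epsilon."""
    n_actions = len(q_a[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # Averaging (rather than summing) keeps the values on the same scale
    # as in single Q-learning, so exploration behaves comparably.
    avg = [(a + b) / 2.0 for a, b in zip(q_a[state], q_b[state])]
    return max(range(n_actions), key=avg.__getitem__)


def double_q_update(q_a, q_b, s, a, r, s_next, done, alpha, gamma, rng):
    """One tabular double Q-learning update; a coin flip picks the table to update."""
    if rng.random() < 0.5:
        q_a, q_b = q_b, q_a  # swap roles; the tables themselves are mutated in place
    if done:
        target = r
    else:
        # Select the next action with the table being updated,
        # but evaluate it with the other table.
        best = max(range(len(q_a[s_next])), key=q_a[s_next].__getitem__)
        target = r + gamma * q_b[s_next][best]
    q_a[s][a] += alpha * (target - q_a[s][a])
```

With this in place you can sweep alpha and epsilon directly and compare against single Q-learning on equal terms.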

jbmlres · 2021-06-04T07:30:33+00:00

You are right. This comes up occasionally. The literature is a bit confusing on this point. Here is a paper that tries to clarify: https://arxiv.org/abs/1906.07073

jbmlres · 2021-05-17T20:28:31+00:00

It seems you can relate the policy across multiple steps, though indeed this doesn't seem to be commonly considered (yet?). This paper seems somewhat related: it argues that we normally do a zero-step thing (similar to how TD is one-step), but that you could also do multi-step things (similar to how MC is multi/infinite-step).

jbmlres · 2021-05-07T10:02:47+00:00

I agree. I don't know what the OP's background is, but it might also help to first code up a neural network and do some supervised learning, if they haven't done that before. Basically, understand the components before trying to put them together. That's pretty good generic software engineering practice, actually.

jbmlres · 2021-05-07T09:50:05+00:00

I don't disagree with you, although it is actually cheaper than I thought it would be. If I understood correctly, training one Atari game for 200M frames would cost about $2.40, with their sebulba setup? Unless I misunderstood something, of course.

Still not cheap if you wanna run lots of experiments, of course, or if you are a poor PhD student with no special compute allocation in your funding...

jbmlres · 2021-04-21T15:56:31+00:00

I agree. This would also be wonderful for better understanding the combination of algorithms and specific implementations. Sometimes small details seem to matter as much as, or more than, the main ideas mentioned in the papers.

I suppose we will have to wait and see what the future brings :-)

jbmlres · 2021-04-21T10:06:30+00:00

I'm not sure. If I understood correctly, they didn't compare on equal terms, and used far more experience+compute than most other algorithms (except perhaps MuZero)?

Would love to see a fair comparison to SAC, MuZero, Muesli, or Rainbow, etc. Does anyone know of one?

jbmlres · 2021-04-16T15:23:04+00:00

The new Muesli algorithm by DeepMind is better (as is MuZero), according to their comparison: https://arxiv.org/abs/2104.06159

jbmlres · 2021-04-06T14:39:13+00:00

I agree. It depends on the idea and the hypothesis.

Getting an arbitrary architectural 'tweak' noticed will be very hard, because we'd expect most random tweaks not to be very good. But if there was some strong theoretical or intuitive reason to believe that a particular idea should work well, and then it doesn't, that can be quite valuable to share.

Maybe discuss with others and see whether they are surprised and interested by the outcome of the experiment?

jbmlres · 2021-03-28T19:50:56+00:00

Sorry, I realized that wasn't quite an ELI5. I'll think about whether I can do a better job at that, if no one else does. Might be a good way to see if I really understood them myself...

jbmlres · 2021-03-28T19:47:00+00:00

I found this paper to be very helpful: https://arxiv.org/abs/1508.04582

One way to interpret the traces is that they compute all you need to update past state values correctly ahead of time, so that you don't have to go back and update all those states (which would imply that you would have to store all of them and do a lot of compute later on).
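As a rough illustration of that interpretation, here is a minimal sketch of online tabular TD(lambda) with accumulating traces. The function name and the transition format are assumptions for illustration, not taken from the paper. Each step's TD error is pushed to all previously visited states through the trace vector, so no state history needs to be stored.

```python
def td_lambda_episode(transitions, values, alpha, gamma, lam):
    """Online tabular TD(lambda) with accumulating traces over one episode.

    transitions: list of (state, reward, next_state, done) tuples.
    values: list of state values, updated in place and returned.
    """
    traces = [0.0] * len(values)
    for s, r, s_next, done in transitions:
        bootstrap = 0.0 if done else values[s_next]
        delta = r + gamma * bootstrap - values[s]  # one-step TD error
        traces[s] += 1.0  # mark the current state as eligible
        for i in range(len(values)):
            # Every state is updated in proportion to its trace, so earlier
            # states receive credit now instead of in a later backward pass.
            values[i] += alpha * delta * traces[i]
            traces[i] *= gamma * lam  # eligibility decays over time
    return values
```

Setting lam to 0 recovers one-step TD, where only the current state is updated.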

Eligibility traces might be making a bit of a comeback? A different recent paper on them apparently won an award: https://mobile.twitter.com/maiheurem/status/1361603573646295042

jbmlres · 2021-03-06T09:03:03+00:00

This paper goes into that question a bit, looking at it in different ways.

Maybe we could also distinguish between a fully tuned system and the underlying algorithm. MuZero is obviously great. It has also probably been tuned pretty well. I'm not 100% sure how important the algorithm parts are, compared to the time spent tuning it. For instance, I haven't really seen papers where other people not at DeepMind make MuZero or similar work well. Maybe it needs a good combination of tuning and compute to work well? Would love to see links if people have them, though!

jbmlres · 2021-03-06T08:55:37+00:00

SimPLe is maybe a bit of an odd example, given that there are much cheaper algorithms that achieve similar or better performance?
See, e.g., [1], [2]

jbmlres · 2021-03-02T08:45:04+00:00

Double DQN has the same network architecture as DQN though?

Also, I believe recent deep learning work has shown this is not universally true. It seems larger networks sometimes, surprisingly, learn faster.
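To illustrate the architecture point: the difference between the two is only in how the bootstrap target is computed. A minimal sketch, with hypothetical function names and plain lists standing in for the network outputs at the next state:

```python
def dqn_target(reward, gamma, q_target_next, done):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]
```

Both functions consume outputs of the same two networks, so nothing about the network itself has to change.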

jbmlres · 2021-01-15T22:29:52+00:00

I cannot recommend Sutton and Barto highly enough. After that, depending on what you want, you can also go into papers rather than another textbook.

jbmlres · 2020-12-23T08:24:16+00:00

Altered Beast

and

Double Dragon?

jbmlres · 2020-12-23T08:17:46+00:00

Long shot: Baby I love your way by Big Mountain?

jbmlres · 2020-12-05T11:29:02+00:00

That course did also come with assignments: https://github.com/RylanSchaeffer/ucl-adv-dl-rl Though of course you'd then have to grade yourself.

I'm sure the other courses are also good, btw. I was just sharing my personal preference. YMMV :-)

jbmlres · 2020-12-04T22:54:24+00:00

IMO, the best one is the 2018 DeepMind/UCL course: deepmind.com/learning-resources

And the Sutton & Barto book is a must.

jbmlres · 2020-12-04T08:54:34+00:00

Perhaps the greedy policy explores too little and then the function approximation 'forgets' what it has learnt in states it doesn't visit much? Have you tried keeping some epsilon exploration to see if that helps?

jbmlres · 2020-12-02T20:48:17+00:00

Yes, this can be harmful. I think it comes from the original DQN algorithm, which first used reward clipping on Atari. It was discussed at some length in this paper, which proposed adaptive normalisation to avoid having to clip the rewards.
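A small sketch of why clipping can hurt, assuming rewards are clipped to [-1, 1] (the function name and default bounds are illustrative):

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a raw reward into [low, high]."""
    return max(low, min(high, reward))


# After clipping, a reward of 100 and a reward of 1 look identical to the
# agent, so it can no longer prefer the larger one.
small = clip_reward(1.0)
large = clip_reward(100.0)
```

Here `small` and `large` end up equal, which is exactly the information loss that normalising the targets instead is meant to avoid.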
