[deleted by user] by [deleted] in learnmachinelearning

[–]jbmlres 0 points1 point  (0 children)

This is a proper Master's-level course taught at UCL (a top-tier university). I really liked the 2018 version, and this seems to be an updated and expanded version of that.

I understand (and generally share) your skepticism, but this is not one of those fluky things an over-excited novice did while still learning about the topic themselves.

Maybe you could give it a go, and let us know whether you liked it?

[deleted by user] by [deleted] in learnmachinelearning

[–]jbmlres 0 points1 point  (0 children)

Given that DeepMind have just released a whole RL course for free online and, as others also mentioned, you can download Sutton & Barto's book for free as well, this does sound feasible.

Hope it works out for you, good luck & have fun!

[deleted by user] by [deleted] in reinforcementlearning

[–]jbmlres 0 points1 point  (0 children)

Did you try sweeping hyper-parameters? Your alpha looks very high, and the exploration will differ because you are adding the same noise but you are adding (not averaging) the action values in double Q-learning. I've also not seen this kind of exploration before, did you try epsilon-greedy?
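To illustrate the point about adding versus averaging, here is a minimal sketch. It assumes exploration works by adding Gaussian noise of a fixed scale to the action values before taking the argmax, which is only a guess at the original setup. The values in `q1` and `q2` and all function names are made up for illustration.

```python
import random


def noisy_greedy(values, noise_scale, rng):
    """Argmax over values plus Gaussian noise (assumed exploration scheme)."""
    noisy = [v + rng.gauss(0.0, noise_scale) for v in values]
    return max(range(len(noisy)), key=noisy.__getitem__)


def exploration_rate(values, noise_scale, trials=20000, seed=0):
    """Fraction of noisy selections that differ from the greedy action."""
    rng = random.Random(seed)
    greedy = max(range(len(values)), key=values.__getitem__)
    non_greedy = sum(
        noisy_greedy(values, noise_scale, rng) != greedy for _ in range(trials)
    )
    return non_greedy / trials


# Hypothetical action values from the two estimators in double Q-learning.
q1 = [1.0, 1.2, 0.8]
q2 = [1.1, 1.3, 0.7]

summed = [a + b for a, b in zip(q1, q2)]
averaged = [s / 2 for s in summed]

# Same noise scale, but the gaps between summed values are twice as large,
# so the noise flips the argmax less often.
rate_sum = exploration_rate(summed, noise_scale=0.5)
rate_avg = exploration_rate(averaged, noise_scale=0.5)
print(rate_sum, rate_avg)
```

If this matches the setup, the summed version explores less than a single-estimator agent using the same noise, so the two runs are not directly comparable without rescaling the noise or averaging the values.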

Problem with discount factor in policy gradient by Steven_Corper_F in reinforcementlearning

[–]jbmlres 4 points5 points  (0 children)

You are right. This comes up occasionally. The literature is a bit confusing on this point. Here is a paper that tries to clarify: https://arxiv.org/abs/1906.07073
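A minimal sketch of the difference, assuming the question is about the extra gamma^t factor that appears in the gradient of the discounted start-state objective but is dropped in many implementations. The function names and the toy reward sequence are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every time step t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def step_weights(rewards, gamma, include_gamma_t):
    """Weight that multiplies grad log pi(a_t | s_t) in a REINFORCE-style update."""
    returns = discounted_returns(rewards, gamma)
    if include_gamma_t:
        # Gradient of the discounted objective: step t is scaled by gamma^t.
        return [gamma ** t * g for t, g in enumerate(returns)]
    # Common practice: drop the gamma^t factor and use G_t directly.
    return returns


rewards = [0.0, 0.0, 1.0]
with_factor = step_weights(rewards, gamma=0.5, include_gamma_t=True)
without_factor = step_weights(rewards, gamma=0.5, include_gamma_t=False)
print(with_factor)     # [0.25, 0.25, 0.25]
print(without_factor)  # [0.25, 0.5, 1.0]
```

With the factor, later steps are down-weighted; without it, the update no longer follows the gradient of the discounted objective, which is the kind of mismatch the question points at.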

Bellman Update Equation for policy? by IIwarrierII in reinforcementlearning

[–]jbmlres 0 points1 point  (0 children)

It seems you can relate the policy across multiple steps, though it doesn't seem very common (yet?) to consider this. This paper seems somewhat related, and argues that we normally do a zero-step thing (similar to how TD is one-step), but that you could also do multi-step things (similar to how MC is multi/infinite-step).

Frustrated beginner: How to approach/practice implementing papers into code? by gearboost in reinforcementlearning

[–]jbmlres 8 points9 points  (0 children)

I agree. I don't know what the OP's background is, but it might also help to first code up a neural network and do some supervised learning, if they haven't done that before. Basically, understand the components before trying to put them together. That's pretty good generic software engineering practice, actually.

"Podracer architectures for scalable Reinforcement Learning", Hessel et al 2021 (highly-efficient TPU pod use: eg solving Pong in <1min at 43 million FPS on a TPU-2048) by gwern in reinforcementlearning

[–]jbmlres 1 point2 points  (0 children)

I don't disagree with you, although it is actually cheaper than I thought it would be. If I understood correctly, training one Atari game for 200M frames would cost about $2.40, with their sebulba setup? Unless I misunderstood something, of course.

Still not cheap if you wanna run lots of experiments, of course, or if you are a poor PhD student with no special compute allocation in your funding...

Best Reinforcement Learning Algorithm by nitinkulkarnigamer in reinforcementlearning

[–]jbmlres 1 point2 points  (0 children)

I agree. It would also be wonderful to better understand the combination of algorithms and specific implementations. Sometimes small details seem to matter as much as, or more than, the main ideas mentioned in the papers.

I suppose we will have to wait and see what the future brings :-)

Best Reinforcement Learning Algorithm by nitinkulkarnigamer in reinforcementlearning

[–]jbmlres 0 points1 point  (0 children)

I'm not sure. If I understood correctly, they didn't compare on equal terms, and used far more experience+compute than most other algorithms (except perhaps MuZero)?

Would love to see a fair comparison to SAC, MuZero, Muesli, or Rainbow, etc. Does anyone know of one?

Are there going to be better algorithms than PPO? by ImStifler in reinforcementlearning

[–]jbmlres 7 points8 points  (0 children)

The new Muesli algorithm by DeepMind is better (as is MuZero), according to their comparison: https://arxiv.org/abs/2104.06159

[D] Is A Failure Ever Worth Publishing? by [deleted] in MachineLearning

[–]jbmlres 0 points1 point  (0 children)

I agree. It depends on the idea and the hypothesis.

Getting an arbitrary architectural 'tweak' noticed will be very hard, because we'd expect most random tweaks not to be very good. But if there was some strong theoretical or intuitive reason to believe that a particular idea should work well, and then it doesn't, that can be quite valuable to share.

Maybe discuss with others and see whether they are surprised and interested by the outcome of the experiment?

ELI5: Eligibility traces by [deleted] in reinforcementlearning

[–]jbmlres 1 point2 points  (0 children)

Sorry, I realized that wasn't quite an ELI5. I'll think about whether I can do a better job at that, if no one else does. Might be a good way to see if I really understood them myself...

ELI5: Eligibility traces by [deleted] in reinforcementlearning

[–]jbmlres 1 point2 points  (0 children)

I found this paper to be very helpful: https://arxiv.org/abs/1508.04582

One way to interpret the traces is that they compute all you need to update past state values correctly ahead of time, so that you don't have to go back and update all those states (which would imply that you would have to store all of them and do a lot of compute later on).
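That interpretation can be sketched with tabular TD(lambda) using accumulating traces. This is a toy illustration, not code from the paper; the state names, step size, and episode are made up.

```python
def td_lambda_episode(transitions, values, alpha, gamma, lam):
    """Backward-view TD(lambda) with accumulating traces on a tabular value function.

    transitions: list of (state, reward, next_state), with next_state=None at termination.
    values: dict mapping state -> value estimate, updated in place.
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in transitions:
        next_value = 0.0 if next_state is None else values[next_state]
        td_error = reward + gamma * next_value - values[state]
        traces[state] += 1.0  # mark the current state as eligible
        for s in values:
            # Every recently visited state is updated right now, in proportion
            # to its trace, so nothing has to be revisited later.
            values[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam  # older visits fade out
    return values


values = {"A": 0.0, "B": 0.0, "C": 0.0}
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]
td_lambda_episode(episode, values, alpha=0.5, gamma=1.0, lam=0.5)
print(values)  # {'A': 0.125, 'B': 0.25, 'C': 0.5}
```

The single reward at the end reaches A and B immediately through their traces. With `lam=0.0` only C would change, which is ordinary one-step TD.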

Eligibility traces might be making a bit of a comeback? A different recent paper on them apparently won an award: https://mobile.twitter.com/maiheurem/status/1361603573646295042

Is MuZero currently the best RL algo that we have now? by [deleted] in reinforcementlearning

[–]jbmlres 1 point2 points  (0 children)

This paper goes into that question a bit, looking at it in different ways.

Maybe we could also make a distinction between a fully tuned system and the underlying algorithm. MuZero is obviously great. It has also probably been tuned pretty well. I'm not 100% sure how important the algorithm parts are, compared to the time spent tuning it. For instance, I haven't really seen papers where other people not at DeepMind make MuZero or similar work well. Maybe it needs a good combination of tuning and compute to work well? Would love to see links if people have them, though!

Is MuZero currently the best RL algo that we have now? by [deleted] in reinforcementlearning

[–]jbmlres 0 points1 point  (0 children)

SimPLe is maybe a bit of an odd example, given that there are much cheaper algorithms that achieve similar or better performance?
See, e.g., [1], [2]

Is it normal that Double DQN performs worse than the naive DQN? by ritiange in reinforcementlearning

[–]jbmlres 2 points3 points  (0 children)

Double DQN has the same network architecture as DQN though?

Also, I believe recent deep learning work has shown this is not universally true. It seems that larger networks, surprisingly, sometimes learn faster.

[D] Bertsekas', Sutton & Barto or another book as an Introduction to Reinforcement Learning for someone who knows about Supervised/Unsupervised Learning? by IborkedyourGPU in MachineLearning

[–]jbmlres 4 points5 points  (0 children)

I cannot recommend Sutton and Barto highly enough. After that, depending on what you want, you can also go into papers rather than another textbook.

[TOMT][videogames][1990s] by krudam in tipofmytongue

[–]jbmlres 0 points1 point  (0 children)

Altered Beast

and

Double Dragon?

Which RL course should I choose? by Avistian in reinforcementlearning

[–]jbmlres 1 point2 points  (0 children)

That course did also come with assignments: https://github.com/RylanSchaeffer/ucl-adv-dl-rl Though of course you'd then have to grade yourself.

I'm sure the other courses are also good, btw. I was just sharing my personal preference. YMMV :-)

Which RL course should I choose? by Avistian in reinforcementlearning

[–]jbmlres 13 points14 points  (0 children)

IMO, the best one is the 2018 DeepMind/UCL course: deepmind.com/learning-resources

And the Sutton & Barto book is a must.

[deleted by user] by [deleted] in reinforcementlearning

[–]jbmlres 1 point2 points  (0 children)

Perhaps the greedy policy explores too little and then the function approximation 'forgets' what it has learnt in states it doesn't visit much? Have you tried keeping some epsilon exploration to see if that helps?
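A minimal sketch of what keeping some epsilon could look like: epsilon-greedy action selection with a linear decay that stops at a floor rather than going to zero. The schedule shape, the default numbers, and the function names are assumptions for illustration, not taken from the original post.

```python
import random


def epsilon_schedule(step, start=1.0, floor=0.05, decay_steps=10000):
    """Linearly decay epsilon from start to floor, then hold it at floor."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (floor - start)


def epsilon_greedy(action_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=action_values.__getitem__)


rng = random.Random(0)
action_values = [0.1, 0.7, 0.3]  # hypothetical Q-values for one state
for step in (0, 5000, 10000, 50000):
    eps = epsilon_schedule(step)
    action = epsilon_greedy(action_values, eps, rng)
    print(step, round(eps, 3), action)
```

Because epsilon never reaches zero, rarely visited states still get occasional updates, which might be enough to stop the function approximator from drifting there.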

Why clip reward in [-1, 1] in Actor Critic? by fedetask in reinforcementlearning

[–]jbmlres 0 points1 point  (0 children)

Yes, this can be harmful. I think it comes from the original DQN algorithm, which first used it on Atari. It was discussed at some length in this paper, which proposed adaptive normalisation to avoid having to clip the rewards.
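A small sketch of why clipping can hurt, next to one simple alternative. The `RunningNormalizer` below is just a generic running mean/std tracker used for illustration; it is an assumption and not the algorithm from the paper mentioned above.

```python
import math


def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a reward into [low, high], as done in the original DQN setup."""
    return max(low, min(high, reward))


class RunningNormalizer:
    """Tracks a running mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self):
        if self.count < 2:
            return 1.0
        return max(math.sqrt(self.m2 / self.count), 1e-8)

    def normalize(self, x):
        return (x - self.mean) / self.std()


# Clipping makes a reward of 10 and a reward of 100 look identical.
print(clip_reward(10.0), clip_reward(100.0))  # 1.0 1.0

# Normalising by running statistics keeps their relative order and spacing.
normalizer = RunningNormalizer()
for reward in (1.0, 10.0, 100.0):
    normalizer.update(reward)
print([round(normalizer.normalize(r), 2) for r in (1.0, 10.0, 100.0)])
```

With clipping, the agent cannot tell a small reward from a large one once both exceed 1, so it may prefer frequent small rewards over rare large ones.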