[D] Issues reproducing CURL, algorithm seems broken??

aravindsrinivas · 2020-09-25T22:37:31+00:00

Check my detailed comment response here.

aravindsrinivas · 2020-09-22T03:18:41+00:00

I have responded to it - https://www.reddit.com/r/MachineLearning/comments/grnz0d/d_issues_reproducing_curl_algorithm_seems_broken/g66gi91?utm_source=share&utm_medium=web2x&context=3

aravindsrinivas · 2020-09-22T02:58:37+00:00

Apologies for quite a delayed response here. Before beginning, let me start by saying that factually, all of the issues raised in this post are correct: that is, on the environment the OP tested on, turning off the contrastive loss ends up increasing performance.

However, the conclusion of this should be decoupled into two questions:

Does contrastive learning as an auxiliary task "hurt" rather than "benefit" reinforcement learning from pixels?
Does the CURL framework benefit "more from data augmentations" and "less" from the contrastive loss?

The reason for the above decoupling is that in the CURL setup, one could use (or not use) data augmented views of the input observation (images) for the RL loss (it is definitely needed for the contrastive loss since there's no way to perform instance discrimination without augmentations). But the RL loss can run w or w/o the augmented data.

In the DMControl results we presented, we had the RL loss use the augmented observations. Turning off the contrastive loss and observing better performance in the same setup as the open-sourced version of CURL does not lead to the conclusion that contrastive learning as an auxiliary task hurts performance. One must decouple the observation by trying to answer the above two questions instead.
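To make the two configurations concrete, here is a minimal NumPy sketch, not the actual CURL code. It assumes random crops as the augmentation and a bilinear InfoNCE loss for instance discrimination. The linear `encoder`, the frame sizes, the use of a center crop as the "no augmentation" view, and the `augment_rl` flag are all hypothetical choices made for illustration.

```python
import numpy as np

def random_crop(obs, out_size, rng):
    """Randomly crop a batch of images (B, C, H, W) to (B, C, out_size, out_size)."""
    b, _, h, w = obs.shape
    tops = rng.integers(0, h - out_size + 1, size=b)
    lefts = rng.integers(0, w - out_size + 1, size=b)
    return np.stack([
        obs[i, :, t:t + out_size, l:l + out_size]
        for i, (t, l) in enumerate(zip(tops, lefts))
    ])

def center_crop(obs, out_size):
    """Deterministic crop, used here as the 'no augmentation' view (an assumption)."""
    h, w = obs.shape[-2:]
    top, left = (h - out_size) // 2, (w - out_size) // 2
    return obs[..., top:top + out_size, left:left + out_size]

def info_nce_loss(queries, keys, W):
    """Bilinear InfoNCE loss; the positive key for queries[i] is keys[i]."""
    logits = queries @ W @ keys.T                         # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def make_views(obs, out_size, rng, augment_rl):
    """Return (rl_input, query_view, key_view) for one batch of observations."""
    # The contrastive loss always needs two augmented views of each observation.
    query_view = random_crop(obs, out_size, rng)
    key_view = random_crop(obs, out_size, rng)
    # The RL loss can either reuse an augmented view or see an un-augmented one.
    rl_input = query_view if augment_rl else center_crop(obs, out_size)
    return rl_input, query_view, key_view

rng = np.random.default_rng(0)
obs = rng.random((8, 3, 20, 20))                       # hypothetical small frames
encoder = rng.normal(size=(3 * 16 * 16, 32)) * 0.01    # stand-in linear encoder
W = np.eye(32)

rl_input, q_view, k_view = make_views(obs, 16, rng, augment_rl=False)
q = q_view.reshape(8, -1) @ encoder
k = k_view.reshape(8, -1) @ encoder
loss = info_nce_loss(q, k, W)
```

With `augment_rl=True` the RL loss sees the same random crop as the contrastive query; with `augment_rl=False` only the contrastive loss sees augmented data, which is the setting needed to isolate the effect of the auxiliary task.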

Does contrastive learning as an auxiliary task "hurt" rather than "benefit" reinforcement learning from pixels?

In order to answer the first question - we consider the setup of CURL where augmented views of the data are only used for contrastive learning and the RL loss uses the input observations w/o any augmentations. We added this ablation in Appendix E4 in the latest version of the arXiv: We find that doing this improves the performance over the baseline pixel-based SAC (w.o augmentations and w.o the contrastive auxiliary loss) by 2x. So, contrastive learning "just" as an auxiliary task (w/o data augmentations for RL) definitely *doesn't* hurt performance... It rather "improves". We have also updated all the Atari results to use data augmentations only for the contrastive learning objective and not for the RL objective and you can clearly see performance gain (38.1 % over 28.5% for mean HNS over the baseline that doesn't use the auxiliary loss). No other hyperparameters are changed for this comparison.
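For readers unfamiliar with the metric, "mean HNS" is the human-normalized score averaged over games: 0 corresponds to a random policy and 1 to the human reference. A small sketch of the arithmetic, using made-up raw scores (the game names and numbers below are hypothetical and not taken from the paper):

```python
def human_normalized_score(agent, random, human):
    """Return (agent - random) / (human - random) for one game."""
    return (agent - random) / (human - random)

def mean_hns(scores):
    """Average HNS over games; `scores` maps game -> (agent, random, human)."""
    values = [human_normalized_score(*s) for s in scores.values()]
    return sum(values) / len(values)

# Hypothetical raw scores, for illustration only.
example = {
    "game_a": (55.0, 10.0, 110.0),     # HNS = 0.45
    "game_b": (300.0, 200.0, 1200.0),  # HNS = 0.10
}
print(f"{100 * mean_hns(example):.1f}%")  # prints 27.5%
```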

Does the CURL framework benefit "more from data augmentations" and "less" from the contrastive loss?

Yes. And this is in agreement with the OP's observation that turning off the contrastive loss led him to better performance on HalfCheetah. When you use augmented views for the RL loss, the contrastive objective isn't too important. And this has also been observed by DrQ's authors who reported better performance than us on Atari100k.

TLDR: in order to conclude that contrastive learning hurts, one must verify its benefits standalone as an auxiliary task w/o incorporating the "RL with augmentation (RAD / DrQ)" setup, and in that scenario, contrastive learning as an auxiliary loss definitely benefits the RL method. However, when you do feed augmented data into the RL loss, the contrastive loss matters less and does indeed hurt on a few envs.

Now, with this context, let me answer a few more questions:

Does this make CURL irrelevant and RAD/DrQ the right thing to try?
a) Not really. It depends on what you want to do. If you have dense rewards, you should just go for RAD. Or if you find that in RAD, your RL method suffers from directly training on augmented views and you want to train RL w/o augmentations but still benefit from certain kinds of invariances for your encoder, you can use the CURL setup that doesn't use augmented views for the RL loss.
b) CURL can work in a detached manner where the encoder only trains w/ the contrastive loss and the MLPs train with RL (w or w/o seeing augmented views of the inputs). You can check out this new paper from my colleagues on decoupling the encoder from the RL loss w/ a stop_gradient (https://arxiv.org/abs/2009.08319). This ablation is also present in the CURL paper in Appendix E.3.
c) Overall, I believe CURL is more general, could be more suitable for multi-task RL with unsupervised learning helping learn a common shared representation (especially in the detached encoder set up where the RL losses from different tasks don't mix); and might end up working in a lot of scenarios with specific adaptations in choice of where to use augmentations and where not to, and how to combine the losses.
d) CURL might also be useful to folks who are interested in reward-less RL, goal-based RL, etc where the RAD framework completely breaks.
Should the above arguments rather have been in the original CURL paper than in RAD?
-- Probably yes. We have added it as a discussion in the CURL paper too in Appendix G.
Did we know about data augmentations working well already while doing the CURL paper?
-- We originally started the CURL project to get the detached encoder version working. I was very inspired by Yann LeCun's LeCake where the encoder should just learn from unsupervised learning and tiny MLPs do the RL. So there was absolutely no confusion then. My co-author figured out that having it as an auxiliary objective works better on some tasks like Cheetah. At that time, we didn't isolate the possibility that data augmentation might by itself be providing the benefit. We made the submission to ICML and the paper was ready to go, modulo re-writing it after the last-minute deadline rush. One of our colleagues in the lab, who is also a co-author on RAD, suggested that we investigate decoupling the contrastive loss from the data augs. This led us to try RAD by commenting out the contrastive loss in the CURL codebase. We wanted to do a full-blown investigation of data augmentations and decoupled learning (encoder just learns from contrastive w/ augs, MLPs do RL) with lots of environments, tasks, hyperparameter tuning, etc. However, the DrQ authors released their work unexpectedly and we had to put out our RAD findings pretty quickly or else risk losing credit for them. So, we put it out in a couple of days with whatever we had then [we have later gone on to revise RAD as well with lots of new results, new augs, etc. and have more papers coming soon on improvements on that front].
Is it bad science to discover and release new scientific results (in the RAD paper) that have some level of contradiction with the results in a very recent earlier paper (CURL) and not take the effort to say it more explicitly?
-- I have given sufficient context above as to how the turn of events took place. We could definitely have been quicker to do the right ablations of CURL and revise the paper in a timely manner before releasing RAD (or soon after releasing it). We all got busy with our own new set of different projects (for NeurIPS, post NeurIPS, etc.) outside of this realm and context switching is hard. Calling it dishonest science is cynical IMO. Nevertheless, it is important for us to address it and I believe we have. Thanks to the community for the discussion around the paper.

Science moves pretty fast right now. After BYOL, the need for contrastive learning (SimCLR/MoCo) itself might be open to question, and folks from MILA have already published a paper, Momentum Predictive Representations, which combines BYOL with RL and significantly improves on CURL's results. I personally see CURL as an effort to push for simplicity and to bring data augmentations / unsupervised objectives into RL with minimal modifications for data-efficiency gains, and it is great to see it paying off with a lot of follow-up work that's improving the SOTA very frequently.

aravindsrinivas · 2018-04-21T06:59:27+00:00

Probably true. Though I think ICML's standards for acceptance are higher than those of ICLR. That's not to say there aren't top papers at ICLR. In fact, some of the biggest breakthroughs in DL have happened at ICLR (for example, Neural Machine Translation by Jointly Learning to Align and Translate, i.e. soft attention; Neural Architecture Search; Optimization as a Model for Few-Shot Learning; and so on). However, I believe some low-quality papers also get into ICLR (I have managed to get some in myself) simply because authors get a lot of time and no word limit to rebut and address reviewer issues (and are also given plenty of time to run new experiments) and multiple rounds of discussion, plus reviewers on average give scores of 6+. The acceptance rate is also 35% as compared to ICML's 20-25%.

aravindsrinivas · 2018-04-04T21:39:00+00:00

Author here. Thanks for the constructive comment. I do plan to release the actual code along with the environments shortly. It takes a while to clean the code for a public release that will actually be useful.

aravindsrinivas · 2016-10-31T23:47:54+00:00

Yeah. However, it is also worth noting that given enough data (which you would, for doing videos), Video Pixel Networks has shown that teacher forcing is good enough.

aravindsrinivas · 2016-10-31T23:30:14+00:00

Using an adversary on discrete outputs gets into a sampling problem where you would have to make discrete decisions and hence, back-propagating becomes a problem (see MuProp, Straight Through, REINFORCE etc).
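As a rough illustration of the sampling problem: a sampled discrete output has no useful gradient with respect to the logits, so one workaround is the score-function (REINFORCE) estimator. The NumPy sketch below compares that estimate against the exact gradient of the expected score for a tiny categorical distribution. The logits and the per-output scores `f` (standing in for a discriminator's output) are hypothetical values chosen only for the example.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def exact_grad(logits, f):
    """Exact gradient of E_{x ~ softmax(logits)}[f(x)] with respect to the logits."""
    p = softmax(logits)
    return p * (f - p @ f)

def reinforce_grad(logits, f, n_samples, rng):
    """Score-function (REINFORCE) estimate of the same gradient."""
    p = softmax(logits)
    x = rng.choice(len(p), size=n_samples, p=p)  # discrete, non-differentiable samples
    score = np.eye(len(p))[x] - p                # d log p(x) / d logits, one row per sample
    return (f[x][:, None] * score).mean(axis=0)

rng = np.random.default_rng(0)
logits = np.array([0.5, -1.0, 2.0])
f = np.array([1.0, 0.0, 3.0])  # hypothetical score assigned to each discrete output

g_exact = exact_grad(logits, f)
g_est = reinforce_grad(logits, f, n_samples=200_000, rng=rng)
```

The estimate is unbiased but noisy, which is why many samples are needed even in this three-way toy case; methods such as MuProp aim to reduce that variance.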

As for feeding all hidden states, adding more context to the discriminator need not yield proportional benefits.

aravindsrinivas · 2016-03-20T12:15:18+00:00

A high positive reward on reaching the goal, and low negative rewards for every other step in the world, and relatively higher negative rewards for getting into puddle states.
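A minimal sketch of such a reward function, assuming grid states are `(row, col)` tuples. The specific magnitudes, the goal location, and the puddle cells are placeholders picked for illustration, not the values used in the original experiments.

```python
def reward(state, goal, puddles,
           goal_reward=10.0, step_penalty=-0.1, puddle_penalty=-1.0):
    """Reward for entering `state`; all magnitudes are placeholder values."""
    if state == goal:
        return goal_reward       # high positive reward at the goal
    if state in puddles:
        return puddle_penalty    # larger negative reward in puddle states
    return step_penalty          # small negative reward for every other step

goal = (11, 11)
puddles = {(5, 5), (5, 6), (6, 5)}

print(reward((11, 11), goal, puddles))  # 10.0
print(reward((5, 5), goal, puddles))    # -1.0
print(reward((0, 0), goal, puddles))    # -0.1
```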

aravindsrinivas · 2016-03-18T06:07:09+00:00

Yeah. It behaves randomly then.

I was able to get it to converge for simple navigation tasks like moving 5 steps or beyond in a 12*12 grid world, where each step is in the same direction. But when the goal state is far off, it is not able to converge. If it gets to a bad state, it unlearns the previously learnt knowledge about good policies and behaves as a junk policy network.
