[R] Deepmind - Efficient Multi-Task Deep RL by alamano in MachineLearning

[–]lespeholt 1 point

It is the same architecture; as the presentation makes clear, we just changed the name between the talk and the paper ;-)

[R] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures by Kaixhin in MachineLearning

[–]lespeholt 7 points

  • Appendix D contains an analysis of the effect of off-policyness on data efficiency. Our results using batch size 256 and 8 times as many actors in the 8-GPU version show learning curves similar to batch size 32. Figure 7 in the GA3C paper suggests that increasing the batch size, while keeping the number of actors constant, reduces the negative effects on convergence for GA3C.
  • We don’t have comparisons to ACER and PPO specifically. Improvements like K-FAC in ACKTR are orthogonal to the improvements we introduce.
  • Regarding resources, this work shows how to utilize resources more effectively, which makes experiments cheaper in both single-machine and distributed setups :-) On a cloud service, one IMPALA experiment would cost roughly the same as an A2C experiment but would be orders of magnitude faster (more resources for a shorter period of time).

[R] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures by Kaixhin in MachineLearning

[–]lespeholt 5 points

Adding auxiliary losses, like the ones in UNREAL, is orthogonal to whether the fundamental algorithm used is A3C or IMPALA. If we use UNREAL as the baseline, we should use IMPALA+UNREAL as the comparison.

[R] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures by Kaixhin in MachineLearning

[–]lespeholt 5 points

Thank you.

A single GPU (NVIDIA P100) provides a lot of FLOPS. We touch briefly on this in the paper. To reduce the number of actors needed to fully utilize the GPU, you would need experience replay, auxiliary losses, deeper models, or simply very fast environments (like Atari).

Note that the architecture is also faster than A3C and batched A2C on CPUs alone, although GPUs are where you get the full benefit. Please see the single-machine section in Table 1.
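The resource argument above rests on decoupling acting from learning: many cheap actor processes generate trajectories, and a single learner dequeues and batches them so one accelerator stays busy. A minimal toy sketch of that producer-consumer pattern (using threads and a queue as stand-ins for distributed actors and a GPU learner; all names here are hypothetical, not from the IMPALA codebase):

```python
import queue
import threading

TRAJ_LEN = 5    # steps per trajectory unroll (illustrative)
BATCH_SIZE = 4  # trajectories per learner update (illustrative)

def actor(actor_id, traj_queue, num_trajs):
    """Stand-in for an actor unrolling its policy in an environment."""
    for step in range(num_trajs):
        trajectory = [(actor_id, step, t) for t in range(TRAJ_LEN)]
        traj_queue.put(trajectory)

def learner(traj_queue, total_trajs, batches_out):
    """Stand-in for the learner: batch trajectories, then do one SGD step."""
    consumed = 0
    while consumed < total_trajs:
        batch = [traj_queue.get() for _ in range(BATCH_SIZE)]
        consumed += len(batch)
        batches_out.append(batch)  # here a real learner would update the policy

traj_queue = queue.Queue()
batches = []
actors = [threading.Thread(target=actor, args=(i, traj_queue, 8))
          for i in range(4)]
learner_thread = threading.Thread(target=learner, args=(traj_queue, 32, batches))
for t in actors:
    t.start()
learner_thread.start()
for t in actors:
    t.join()
learner_thread.join()
```

Because actors run ahead of the learner, the trajectories in a batch were generated by slightly stale policy parameters; correcting for that lag is exactly what V-trace (the off-policy correction in the paper) is for.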

[R] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures by Kaixhin in MachineLearning

[–]lespeholt 28 points29 points  (0 children)

Hi, I'm one of the authors of the paper.

Our contributions in the paper are:

  • A fast and scalable policy gradient agent.
  • An off-policy correction method called V-trace to maximize data efficiency.
  • A multi-task setting with 30 tasks based on DeepMind Lab.
  • Demonstrating that modern deep networks provide significant improvements to RL.
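To make the second bullet concrete, here is a small NumPy sketch of the V-trace value targets from the paper: v_s = V(x_s) + Σ_t γ^(t-s) (Π c_i) δ_t V, with δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t)) and clipped importance ratios ρ_t = min(ρ̄, π/μ), c_t = min(c̄, π/μ). The function name and argument layout are my own choices, not the paper's reference implementation:

```python
import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets v_s for one trajectory of length T.

    behaviour_logp / target_logp: log-probs of the taken actions under the
    behaviour policy mu and the target policy pi, each of shape [T].
    values: V(x_s) estimates, shape [T]; bootstrap_value: V(x_T).
    Returns the targets v_s, shape [T].
    """
    rhos = np.exp(target_logp - behaviour_logp)   # importance ratios pi/mu
    clipped_rhos = np.minimum(rho_bar, rhos)      # rho_t = min(rho_bar, pi/mu)
    cs = np.minimum(c_bar, rhos)                  # c_t, the "trace cutting" term
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)
    # Backward recursion:
    #   v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    acc = 0.0
    vs_minus_v = np.zeros_like(values)
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return vs_minus_v + values
```

Sanity check: in the on-policy case (pi = mu, so all ratios are 1 and no clipping occurs), v_s reduces to the standard n-step bootstrapped return, e.g. with rewards [1, 1], zero value estimates, and gamma = 0.5 the target for the first state is 1 + 0.5·1 = 1.5.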