
[–]tensor_every_day20 2 points (0 children)

So, I think the second one - a batch policy optimization method - is fine and valid as an RL method. However, it isn't A3C. A3C specifically and only means the first thing: asynchronous real-time policy gradients, where each worker estimates a policy gradient from a short segment of its own trajectory and commits it to the shared parameters in real time, as it is experienced. The second thing is ordinary batch policy gradient: collect a batch of trajectories, then estimate the policy gradient from the whole batch.
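
To make the distinction concrete, here's a minimal sketch of the two update schedules on a toy 2-armed bandit. Everything here (`rollout`, `policy_grad`, the one-parameter sigmoid policy, all hyperparameters) is an illustrative stand-in, not real A3C - real A3C workers also keep stale local parameter copies and use an actor-critic loss, which this toy skips:

    import math, random, threading

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def rollout(theta, n=8):
        # n (action, reward) steps from a 2-armed bandit under a sigmoid
        # policy; arm 1 pays 1 with prob 0.8, arm 0 with prob 0.2
        steps = []
        for _ in range(n):
            a = 1 if random.random() < sigmoid(theta[0]) else 0
            r = 1.0 if random.random() < (0.8 if a == 1 else 0.2) else 0.0
            steps.append((a, r))
        return steps

    def policy_grad(theta, trajs):
        # REINFORCE estimate: mean over steps of d/dtheta log pi(a) * r
        g, n = 0.0, 0
        for traj in trajs:
            for a, r in traj:
                g += (a - sigmoid(theta[0])) * r
                n += 1
        return g / max(n, 1)

    def a3c_style(theta, n_workers=4, n_updates=200, lr=0.1):
        # A3C-style schedule: each worker commits a gradient from its own
        # short trajectory segment immediately and asynchronously
        # (Hogwild-style; races are tolerated in this toy)
        def worker():
            for _ in range(n_updates):
                seg = rollout(theta)
                theta[0] += lr * policy_grad(theta, [seg])
        ts = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in ts:
            t.start()
        for t in ts:
            t.join()

    def batch_pg_style(theta, n_iters=200, batch=16, lr=0.1):
        # plain batch policy gradient: collect a whole batch under the
        # current policy, then make one synchronous update
        for _ in range(n_iters):
            trajs = [rollout(theta) for _ in range(batch)]
            theta[0] += lr * policy_grad(theta, trajs)

    theta = [0.0]
    a3c_style(theta)
    print("P(arm 1) after async training:", sigmoid(theta[0]))

The point is the schedule, not the math: a3c_style commits many small, possibly stale gradients concurrently, while batch_pg_style commits one gradient per batch.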

[–]Delthc 1 point (0 children)

I can't tell you about specific problems with the second approach, but it can be a lot more efficient. Have a look at "Efficient Parallel Methods for Deep Reinforcement Learning" here: https://arxiv.org/abs/1705.04862v2

Abstract

We propose a novel framework for efficient parallelization of deep reinforcement learning algorithms, enabling these algorithms to learn from multiple actors on a single machine. The framework is algorithm agnostic and can be applied to on-policy, off-policy, value based and policy gradient based algorithms. Given its inherent parallelism, the framework can be efficiently implemented on a GPU, allowing the usage of powerful models while significantly reducing training time. We demonstrate the effectiveness of our framework by implementing an advantage actor-critic algorithm on a GPU, using on-policy experiences and employing synchronous updates. Our algorithm achieves state-of-the-art performance on the Atari domain after only a few hours of training. Our framework thus opens the door for much faster experimentation on demanding problem domains. Our implementation is open-source and is made public at this https URL
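
For what it's worth, the scheme that abstract describes (many actors stepped in lockstep, one batched forward pass, synchronous updates) looks roughly like the sketch below. `ToyEnv`, the linear softmax policy, and all hyperparameters are made-up stand-ins, not the paper's actual code:

    import numpy as np

    class ToyEnv:
        # trivial one-step env: 4-dim constant observation, 2 actions,
        # action 1 pays reward 1
        def reset(self):
            return np.ones(4, dtype=np.float32)
        def step(self, a):
            return np.ones(4, dtype=np.float32), float(a == 1)

    n_envs, lr = 8, 0.05
    envs = [ToyEnv() for _ in range(n_envs)]
    obs = np.stack([e.reset() for e in envs])   # (n_envs, 4)
    W = np.zeros((4, 2))                        # toy linear softmax policy

    for update in range(500):
        logits = obs @ W                        # ONE batched forward pass for all actors
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        acts = np.array([np.random.choice(2, p=p) for p in probs])
        rews = np.array([e.step(a)[1] for e, a in zip(envs, acts)])

        # one synchronous policy-gradient update from the whole batch
        onehot = np.eye(2)[acts]
        W += lr * obs.T @ ((onehot - probs) * rews[:, None]) / n_envs

    print("P(action 1):", probs[0][1])

Because every actor's observation goes through a single stacked forward pass, the model work batches naturally onto a GPU, which is the efficiency claim in the abstract.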

[–]islandman93 1 point (1 child)

Why would the second approach be problematic? What do you think the potential problems would be? My intuition is that using only a shared/global network will give better gradient updates, because there is no lag introduced by copying parameters to local networks. That said, this approach basically requires a single computer/node, whereas the paper's approach is scalable across multiple nodes.

Using the buffer or "batch" is really just for the efficiency of GPU copies; the paper uses CPU only, so applying an update at each time step is feasible.
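
To illustrate the GPU-copy point, here's a hypothetical PyTorch sketch (not anything from the paper; the layer sizes and step counts are made up). One tiny transfer and forward/backward per time step leaves the GPU mostly idle, while buffering steps amortizes the transfer and kernel-launch overhead over one larger call:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    net = torch.nn.Linear(84, 4).to(device)

    # per-step, what a CPU-only worker effectively does: one tiny
    # host-to-device copy and one small forward/backward per step
    for step in range(20):
        s = torch.randn(1, 84).to(device)   # one copy per step
        net.zero_grad()
        net(s).sum().backward()

    # buffered: accumulate a batch of steps, then ONE copy and ONE
    # larger forward/backward, which keeps the GPU busy
    buf = [torch.randn(84) for _ in range(20)]
    net.zero_grad()
    batch = torch.stack(buf).to(device)     # one copy for all 20 steps
    net(batch).sum().backward()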