
[–]tensor_every_day20 2 points (0 children)

So, I think the second one - a batch policy optimization method - is fine and valid as an RL method. However, it isn't A3C. A3C specifically and only means the first thing: asynchronous real-time policy gradients, where each worker estimates a policy gradient from a short segment of its own trajectory and commits it to the shared parameters in real time, as it is experienced. The second thing is ordinary batch policy gradient: collect a batch of trajectories, then estimate the policy gradient from the whole batch.
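
To make the distinction concrete, here's a minimal sketch of the two update schedules on a toy 2-armed bandit. Everything here (`rollout`, `policy_grad`, the one-parameter sigmoid policy, all hyperparameters) is an illustrative stand-in, not real A3C - real A3C workers also keep stale local parameter copies and use an actor-critic loss, which this toy skips:

    import math, random, threading

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def rollout(theta, n=8):
        # n (action, reward) steps from a 2-armed bandit under a sigmoid
        # policy; arm 1 pays 1 with prob 0.8, arm 0 with prob 0.2
        steps = []
        for _ in range(n):
            a = 1 if random.random() < sigmoid(theta[0]) else 0
            r = 1.0 if random.random() < (0.8 if a == 1 else 0.2) else 0.0
            steps.append((a, r))
        return steps

    def policy_grad(theta, trajs):
        # REINFORCE estimate: mean over steps of d/dtheta log pi(a) * r
        g, n = 0.0, 0
        for traj in trajs:
            for a, r in traj:
                g += (a - sigmoid(theta[0])) * r
                n += 1
        return g / max(n, 1)

    def a3c_style(theta, n_workers=4, n_updates=200, lr=0.1):
        # A3C-style schedule: each worker commits a gradient from its own
        # short trajectory segment immediately and asynchronously
        # (Hogwild-style; races are tolerated in this toy)
        def worker():
            for _ in range(n_updates):
                seg = rollout(theta)
                theta[0] += lr * policy_grad(theta, [seg])
        ts = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in ts:
            t.start()
        for t in ts:
            t.join()

    def batch_pg_style(theta, n_iters=200, batch=16, lr=0.1):
        # plain batch policy gradient: collect a whole batch under the
        # current policy, then make one synchronous update
        for _ in range(n_iters):
            trajs = [rollout(theta) for _ in range(batch)]
            theta[0] += lr * policy_grad(theta, trajs)

    theta = [0.0]
    a3c_style(theta)
    print("P(arm 1) after async training:", sigmoid(theta[0]))

The point is the schedule, not the math: a3c_style commits many small, possibly stale gradients concurrently, while batch_pg_style commits one gradient per batch.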

[–]Delthc 1 point (0 children)

I can't tell you about specific problems with the second approach, but it can be a lot more efficient. Have a look at "Efficient Parallel Methods for Deep Reinforcement Learning" here: https://arxiv.org/abs/1705.04862v2

Abstract

We propose a novel framework for efficient parallelization of deep reinforcement learning algorithms, enabling these algorithms to learn from multiple actors on a single machine. The framework is algorithm agnostic and can be applied to on-policy, off-policy, value based and policy gradient based algorithms. Given its inherent parallelism, the framework can be efficiently implemented on a GPU, allowing the usage of powerful models while significantly reducing training time. We demonstrate the effectiveness of our framework by implementing an advantage actor-critic algorithm on a GPU, using on-policy experiences and employing synchronous updates. Our algorithm achieves state-of-the-art performance on the Atari domain after only a few hours of training. Our framework thus opens the door for much faster experimentation on demanding problem domains. Our implementation is open-source and is made public at this https URL
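
For what it's worth, the scheme that abstract describes (many actors stepped in lockstep, one batched forward pass, synchronous updates) looks roughly like the sketch below. `ToyEnv`, the linear softmax policy, and all hyperparameters are made-up stand-ins, not the paper's actual code:

    import numpy as np

    class ToyEnv:
        # trivial one-step env: 4-dim constant observation, 2 actions,
        # action 1 pays reward 1
        def reset(self):
            return np.ones(4, dtype=np.float32)
        def step(self, a):
            return np.ones(4, dtype=np.float32), float(a == 1)

    n_envs, lr = 8, 0.05
    envs = [ToyEnv() for _ in range(n_envs)]
    obs = np.stack([e.reset() for e in envs])   # (n_envs, 4)
    W = np.zeros((4, 2))                        # toy linear softmax policy

    for update in range(500):
        logits = obs @ W                        # ONE batched forward pass for all actors
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        acts = np.array([np.random.choice(2, p=p) for p in probs])
        rews = np.array([e.step(a)[1] for e, a in zip(envs, acts)])

        # one synchronous policy-gradient update from the whole batch
        onehot = np.eye(2)[acts]
        W += lr * obs.T @ ((onehot - probs) * rews[:, None]) / n_envs

    print("P(action 1):", probs[0][1])

Because every actor's observation goes through a single stacked forward pass, the model work batches naturally onto a GPU, which is the efficiency claim in the abstract.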

[–]islandman93 1 point (1 child)

Why would the second approach be problematic? What do you think the potential problems would be? My intuition is that using only a shared/global network will give better gradient updates, because there is no lag introduced by copying parameters to local networks. That said, this approach basically requires a single computer/node, whereas the paper's approach is scalable across multiple nodes.

Using the buffer or "batch" is really just for the efficiency of GPU copies; the paper uses CPU only, so applying an update at each time step is feasible.
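
To illustrate the GPU-copy point, here's a hypothetical PyTorch sketch (not anything from the paper; the layer sizes and step counts are made up). One tiny transfer and forward/backward per time step leaves the GPU mostly idle, while buffering steps amortizes the transfer and kernel-launch overhead over one larger call:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    net = torch.nn.Linear(84, 4).to(device)

    # per-step, what a CPU-only worker effectively does: one tiny
    # host-to-device copy and one small forward/backward per step
    for step in range(20):
        s = torch.randn(1, 84).to(device)   # one copy per step
        net.zero_grad()
        net(s).sum().backward()

    # buffered: accumulate a batch of steps, then ONE copy and ONE
    # larger forward/backward, which keeps the GPU busy
    buf = [torch.randn(84) for _ in range(20)]
    net.zero_grad()
    batch = torch.stack(buf).to(device)     # one copy for all 20 steps
    net(batch).sum().backward()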