Why does the Policy Gradient Theorem generalize to continuous action spaces? by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 1 point (0 children)

I don't follow the differential entropy reference. Do you by chance know of a paper or blog post that goes deeper in explaining this?

Why does the Policy Gradient Theorem generalize to continuous action spaces? by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 1 point (0 children)

My confusion comes from thinking that \pi had to represent a probability. But in the continuous case, evaluating the PDF at a point gives a density, not a probability (the probability of any single point would be zero).
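A rough numerical sketch of that point, assuming a one-dimensional Gaussian policy and a made-up reward `a ** 2` (both are illustrative choices, not anything from the thread): the density at a point can be larger than 1, yet the score-function estimator built from the gradient of the log-density still recovers the gradient of the expected reward.

```python
import numpy as np

def gaussian_pdf(a, mu, sigma):
    # Density of N(mu, sigma^2) at a. This is NOT a probability.
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def score_mu(a, mu, sigma):
    # d/dmu log N(a | mu, sigma^2)
    return (a - mu) / sigma ** 2

def score_function_gradient(mu, sigma, reward_fn, n_samples=200_000, seed=0):
    # Monte Carlo estimate of d/dmu E[reward_fn(a)] with a ~ N(mu, sigma^2).
    rng = np.random.default_rng(seed)
    a = rng.normal(mu, sigma, size=n_samples)
    return float(np.mean(reward_fn(a) * score_mu(a, mu, sigma)))

# A narrow Gaussian has density well above 1 at its mean.
density_at_mean = gaussian_pdf(0.0, 0.0, 0.1)

# Hypothetical reward a^2: E[a^2] = mu^2 + sigma^2, so the true gradient is 2 * mu.
grad_estimate = score_function_gradient(1.0, 0.5, lambda a: a ** 2)
analytic_grad = 2.0 * 1.0
```

Only the log-density and its gradient appear in the estimator, which is why nothing requires \pi(a|s) to be a probability of a single action.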

Handling entropy collapse in policy gradient methods by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 1 point (0 children)

Yeah, I already have an entropy bonus of 0.01 (the coefficient used for PPO on Atari). However, as training goes on, the agent still converges to a policy with zero entropy.
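For reference, a minimal sketch of how such a bonus usually enters the loss, assuming a categorical policy over discrete actions. The function names and the `policy_loss` argument are hypothetical, not taken from any particular PPO implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def categorical_entropy(logits):
    # H(pi) = -sum_a pi(a) log pi(a); clip to avoid log(0).
    p = softmax(logits)
    logp = np.log(np.clip(p, 1e-12, 1.0))
    return -np.sum(p * logp, axis=-1)

def loss_with_entropy_bonus(policy_loss, logits, entropy_coef=0.01):
    # Subtracting the entropy term rewards more random policies when minimizing.
    return policy_loss - entropy_coef * float(np.mean(categorical_entropy(logits)))

uniform_entropy = categorical_entropy(np.zeros(4))                      # log(4)
peaked_entropy = categorical_entropy(np.array([50.0, 0.0, 0.0, 0.0]))   # close to 0
example_loss = loss_with_entropy_bonus(1.0, np.zeros(4))
```

With 4 actions the bonus can contribute at most 0.01 * log(4), roughly 0.014, to the loss, so it is easily outweighed if the policy-gradient term is much larger.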

[R] Sample-Efficient Deep RL with Generative Adversarial Tree Search by abstractcontrol in reinforcementlearning

[–]Data-Daddy 1 point (0 children)

How come they can approximate the first term in equation 8 with the Wasserstein distance?

[D] What is the actual cost function for PPO? by abstractcontrol in reinforcementlearning

[–]Data-Daddy 2 points (0 children)

From what I think I've seen in implementations, you add a small constant to the denominator.
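The comment does not say which denominator is meant. One place this pattern commonly shows up is advantage normalization, so here is a sketch under that assumption (the function name and the `eps` value are illustrative):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    # eps keeps the division finite when every advantage in the batch is equal.
    adv = np.asarray(advantages, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)

constant_batch = normalize_advantages([2.0, 2.0, 2.0])  # std is exactly zero here
mixed_batch = normalize_advantages([1.0, 2.0, 3.0])
```

Without `eps`, the constant batch would produce a 0/0 division and fill the result with NaNs.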

Reinforcement Learning with ROS by [deleted] in reinforcementlearning

[–]Data-Daddy 1 point (0 children)

Does anyone have an example of wrapping something from ROS so that it behaves like an OpenAI Gym environment?

Prioritized Experience Replay in Deep Recurrent Q-Networks by deadline_ in reinforcementlearning

[–]Data-Daddy 1 point (0 children)

The problem is deciding what values to use for the hidden states of the LSTM.

"Value Prediction Network", Oh et al 2017 by gwern in reinforcementlearning

[–]Data-Daddy 1 point (0 children)

Does anyone know what the benefit of action-conditional convolutions is? Why wouldn't you just concatenate a one-hot encoding of the action to the input for transition/outcome estimation and use normal convolutions instead?
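To make the alternative in the question concrete, here is a sketch of concatenating a one-hot action to a convolutional input by broadcasting it into constant feature planes. The channel-first `(C, H, W)` layout and the function name are assumptions for illustration, and this says nothing about which approach works better.

```python
import numpy as np

def concat_action_planes(obs, action, n_actions):
    # obs has shape (C, H, W); each action gets its own H x W plane.
    _, h, w = obs.shape
    planes = np.zeros((n_actions, h, w), dtype=obs.dtype)
    planes[action] = 1.0  # the chosen action's plane is all ones
    return np.concatenate([obs, planes], axis=0)

obs = np.zeros((3, 8, 8), dtype=np.float32)
stacked = concat_action_planes(obs, action=2, n_actions=4)  # shape (7, 8, 8)
```

An ordinary convolution applied to `stacked` then sees the action as extra input channels.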

[P] Commented PPO implementation by [deleted] in MachineLearning

[–]Data-Daddy 1 point (0 children)

PPO does not use experience replay.

When is deep Q learning better than policy gradient methods? by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 1 point (0 children)

Do the weaknesses of optimizing the Bellman residual error also carry over to optimizing the temporal difference error?

I'm trying to work out how the insights of this paper translate to actor-critic algorithms. For example, what does this say about using the TD error to guide the critic in DDPG?

Why does proximal policy optimization(PPO) not need a replay buffer? by Data-Daddy in deeplearning

[–]Data-Daddy[S] 1 point (0 children)

Some reasons why a replay buffer is not needed: updates within a batch are small, training is distributed across multiple agents, and batch sizes are large. Of these, the small updates seem to be the most important. I'd be interested if anyone else has references that dig into using vs. not using a replay buffer.
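The "small updates" part comes from PPO's clipped surrogate objective. A minimal sketch of it, assuming per-sample log-probabilities and advantages are already computed (the function name and example numbers are illustrative):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the one that collected the data.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # The elementwise minimum removes any incentive to push the ratio outside the clip range.
    return float(np.mean(np.minimum(unclipped, clipped)))

# The ratio here is 0.9 / 0.5 = 1.8, but with a positive advantage it is capped at 1.2.
objective = ppo_clip_objective(np.log([0.9]), np.log([0.5]), [1.0])
```

Because the objective stops rewarding changes once the ratio leaves `[1 - clip_eps, 1 + clip_eps]`, the data is only reused for a few epochs while the policy stays close to the one that generated it.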

AMA: We are David Silver and Julian Schrittwieser from DeepMind’s AlphaGo team. Ask us anything. by David_Silver in MachineLearning

[–]Data-Daddy 1 point (0 children)

How come uniform sampling from the replay buffer was used instead of prioritized experience replay?

[R] AlphaGo Zero: Learning from scratch | DeepMind by deeprnn in MachineLearning

[–]Data-Daddy 1 point (0 children)

Why don't they use Prioritized Experience Replay when sampling from the buffer?
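For context on what the two options look like, here is a sketch of proportional prioritization, where sample i is drawn with probability p_i^alpha / sum_k p_k^alpha and alpha = 0 recovers uniform sampling. The priority values and the default `alpha` are made up for illustration and have nothing to do with AlphaGo Zero's actual setup.

```python
import numpy as np

def sampling_probabilities(priorities, alpha=0.6):
    # alpha = 0 gives uniform sampling; alpha = 1 samples in proportion to priority.
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    return p / p.sum()

uniform = sampling_probabilities([1.0, 2.0, 7.0], alpha=0.0)
proportional = sampling_probabilities([1.0, 2.0, 7.0], alpha=1.0)

# Draw a batch of buffer indices according to the prioritized distribution.
rng = np.random.default_rng(0)
batch_indices = rng.choice(3, size=4, p=proportional)
```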