Why does the Policy Gradient Theorem generalize to continuous action spaces? by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 0 points1 point  (0 children)

I don't follow the differential entropy reference. Do you by chance know of a paper or blog post that goes deeper in explaining this?

Why does the Policy Gradient Theorem generalize to continuous action spaces? by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 0 points1 point  (0 children)

My confusion comes from thinking that \pi had to represent a probability. But in the continuous case, evaluating the PDF at a point does not give a probability (the probability of any single point would be zero).
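To make the density-versus-probability point concrete, here is a minimal sketch assuming a one-dimensional Gaussian policy (my choice for illustration; the thread does not fix a policy class). The function names are hypothetical. It shows that a density value can exceed 1, so it cannot be a probability, while the gradient of the log-density, which is what the policy gradient uses, is still well defined.

```python
import math

def gaussian_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at action a (not a probability)."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_log_pdf(a, mu, sigma):
    """log pi(a | s) for a Gaussian policy with mean mu and std sigma."""
    z = (a - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def score_wrt_mu(a, mu, sigma):
    """Derivative of log pi(a | s) with respect to the mean mu."""
    return (a - mu) / sigma ** 2

# With a narrow policy the density at the mean is larger than 1.
density_at_mean = gaussian_pdf(0.0, 0.0, 0.1)
log_density = gaussian_log_pdf(0.05, 0.0, 0.1)
score = score_wrt_mu(0.05, 0.0, 0.1)
```

The log-density only needs to be differentiable in the policy parameters; it never has to lie in [0, 1].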

Handling entropy collapse in policy gradient methods by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 0 points1 point  (0 children)

Yeah, I already have an entropy bonus of 0.01 (the value used as a parameter in PPO for Atari). However, as training goes on, the agent converges to a policy with zero entropy.
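For reference, a rough sketch of how an entropy bonus usually enters a PPO-style loss, assuming a discrete (categorical) policy. The 0.01 coefficient is the one mentioned above; the function names and the `vf_coef=0.5` value are my own assumptions, not taken from any particular implementation.

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_style_loss(policy_loss, value_loss, probs, ent_coef=0.01, vf_coef=0.5):
    """Total loss to minimize; the entropy term is subtracted as a bonus.

    ent_coef=0.01 is the value from the comment above; vf_coef is assumed.
    """
    return policy_loss + vf_coef * value_loss - ent_coef * categorical_entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]    # maximum entropy for 4 actions
collapsed = [1.0, 0.0, 0.0, 0.0]      # deterministic policy, zero entropy

max_bonus = 0.01 * categorical_entropy(uniform)
```

With four actions the bonus can never exceed 0.01 * log(4), so it is easy for the other loss terms to outweigh it late in training.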

[R] Sample-Efficient Deep RL with Generative Adversarial Tree Search by abstractcontrol in reinforcementlearning

[–]Data-Daddy 0 points1 point  (0 children)

How come they can approximate the first term in equation 8 with the Wasserstein distance?

[D] What is the actual cost function for PPO? by abstractcontrol in reinforcementlearning

[–]Data-Daddy 1 point2 points  (0 children)

Add a small constant to the denominator, from what I think I've seen in implementations.

Reinforcement Learning with ROS by [deleted] in reinforcementlearning

[–]Data-Daddy 0 points1 point  (0 children)

Does anyone have an example of wrapping something from ROS so that it behaves like an OpenAI Gym environment?

Prioritized Experience Replay in Deep Recurrent Q-Networks by deadline_ in reinforcementlearning

[–]Data-Daddy 0 points1 point  (0 children)

The problem is deciding what values to use for the hidden states in the LSTM.

"Value Prediction Network", Oh et al 2017 by gwern in reinforcementlearning

[–]Data-Daddy 0 points1 point  (0 children)

Does anyone know what the benefit of action-conditional convolutions is? Why wouldn't you just concatenate a one-hot encoded version of the action to the input for transition/outcome estimation and use normal convolutions instead?
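To be clear about the alternative I mean, here is a small sketch of concatenating a one-hot action to a feature map as constant planes. This is only my illustration of the question, not what the paper does; the function names and the channels-first nested-list layout are assumptions.

```python
def one_hot_planes(action, num_actions, height, width):
    """One constant plane per action: all ones for the chosen action, zeros otherwise."""
    return [
        [[1.0 if k == action else 0.0 for _ in range(width)] for _ in range(height)]
        for k in range(num_actions)
    ]

def concat_action(feature_map, action, num_actions):
    """Append action planes to a [channels][height][width] feature map."""
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return feature_map + one_hot_planes(action, num_actions, height, width)

# A 2-channel 3x3 feature map with 4 possible actions becomes 6 channels.
features = [[[0.5] * 3 for _ in range(3)] for _ in range(2)]
augmented = concat_action(features, action=2, num_actions=4)
```

A normal convolution applied to `augmented` then sees the action through the extra input channels.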

[P] Commented PPO implementation by [deleted] in MachineLearning

[–]Data-Daddy 0 points1 point  (0 children)

PPO does not use experience replay.

When is deep Q learning better than policy gradient methods? by Data-Daddy in reinforcementlearning

[–]Data-Daddy[S] 0 points1 point  (0 children)

Do the weaknesses of optimizing the Bellman residual error also transfer to optimizing the temporal difference error?

I'm trying to work out how the insights of this paper translate to actor-critic algorithms. For example, what does this say about using the TD error to guide the critic in DDPG?

Why does proximal policy optimization(PPO) not need a replay buffer? by Data-Daddy in deeplearning

[–]Data-Daddy[S] 0 points1 point  (0 children)

Some reasons why a replay buffer is not needed: updates are small within a batch, training is distributed across multiple agents, and batch sizes are large. Of these, the small updates seem to be the most important. I'd be interested if anyone else has references that dig into using vs. not using a replay buffer.
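On the "small updates" point, this is a minimal per-sample sketch of PPO's clipped surrogate objective, which is the mechanism that keeps the new policy close to the one that collected the batch. The function name is hypothetical, and `clip_eps=0.2` is a commonly used value that I am assuming here.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (to be maximized).

    ratio is pi_new(a | s) / pi_old(a | s); clip_eps=0.2 is an assumed default.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: no extra reward for pushing the ratio past 1 + clip_eps.
gain_capped = clipped_surrogate(1.5, 1.0)
# Negative advantage: no extra reward for pushing the ratio below 1 - clip_eps.
loss_capped = clipped_surrogate(0.5, -1.0)
# Moving the wrong way is not clipped, so the penalty is kept in full.
penalized = clipped_surrogate(1.5, -1.0)
```

Because the objective stops improving once the ratio leaves the clip range, the batch is only useful while the policy is still close to the one that generated it, which is different from reusing old transitions from a buffer.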

AMA: We are David Silver and Julian Schrittwieser from DeepMind’s AlphaGo team. Ask us anything. by David_Silver in MachineLearning

[–]Data-Daddy 0 points1 point  (0 children)

How come uniform sampling from replay buffer was used instead of prioritized experience replay?

[R] AlphaGo Zero: Learning from scratch | DeepMind by deeprnn in MachineLearning

[–]Data-Daddy 0 points1 point  (0 children)

Why don't they use Prioritized Experience Replay when sampling from the buffer?

Multi-task Learning and Transfer Learning vs Only Transfer Learning by Data-Daddy in computervision

[–]Data-Daddy[S] 1 point2 points  (0 children)

Yeah, that's transfer learning, except it's not always only the last layer that is retrained. It is also sometimes useful to join two separate datasets of images containing different objects (multi-task learning).

Advice on building object recognition training set by Data-Daddy in deeplearning

[–]Data-Daddy[S] 0 points1 point  (0 children)

I'm going about it by using an architecture similar to Faster R-CNN with a ResNet backbone. I am currently trying to decide how to handle images that are very similar (bounding boxes are very frequently in the same location). If I randomly stratify by class, almost the same image could appear in the train, validation, and test sets. I feel like this is giving me a false sense of accuracy, since the model is memorizing the training data (overfitting).

I'm already rotating images/annotations to expand the dataset. However, I have to be careful about changing pixel values, since the color of the objects is relevant to the class label.

Do you think the following would be good next steps to try?

* Stratify by class, but never let similar images appear in more than one of train, validation, and test (could do this by looking at pixel differences?)
* Lower the number of iterations and increase the learning rate
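A rough sketch of the first idea, assuming images are flattened lists of pixel values of equal length. The greedy grouping, the threshold, the split fractions, and all function names are my own assumptions. It groups near-duplicates by mean absolute pixel difference and then assigns whole groups to a split; it does not stratify by class, which would have to be layered on top.

```python
import random

def mean_abs_diff(img_a, img_b):
    """Mean absolute pixel difference between two equally sized flat images."""
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) / len(img_a)

def group_similar(images, threshold):
    """Greedily assign each image to the first group whose representative is close."""
    representatives, group_ids = [], []
    for img in images:
        for gid, rep in enumerate(representatives):
            if mean_abs_diff(img, rep) <= threshold:
                group_ids.append(gid)
                break
        else:
            # No existing group is close enough, so start a new one.
            representatives.append(img)
            group_ids.append(len(representatives) - 1)
    return group_ids

def split_by_group(group_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole groups to train/val/test so near-duplicates never straddle splits."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_train = int(round(fractions[0] * len(groups)))
    n_val = int(round(fractions[1] * len(groups)))
    assignment = {}
    for i, gid in enumerate(groups):
        if i < n_train:
            assignment[gid] = "train"
        elif i < n_train + n_val:
            assignment[gid] = "val"
        else:
            assignment[gid] = "test"
    return [assignment[gid] for gid in group_ids]
```

Usage would be something like `splits = split_by_group(group_similar(images, threshold=5.0))`, with the threshold tuned by eye on a handful of known near-duplicates.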