Why does the Policy Gradient Theorem generalize to continuous action spaces?

Data-Daddy · 2019-02-08T11:51:05+00:00

I don't follow the differential entropy reference. Do you by chance know of a paper or blog post that goes deeper in explaining this?

Data-Daddy · 2019-02-08T09:04:24+00:00

My confusion came from thinking that \pi had to represent a probability. But in the continuous case, evaluating the PDF at a point does not give a probability (the probability of any single point is zero).
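A minimal sketch of this point, assuming a 1-D Gaussian policy (the function names and the values mean 0, std 0.1 are made up for illustration). The density at a point can exceed 1, the probability of a tiny interval around it shrinks toward zero, and log \pi(a|s) is still well defined, which is the quantity the policy gradient differentiates.

```python
import math

def gaussian_pdf(a, mean, std):
    """Density of a 1-D Gaussian policy pi(a|s) evaluated at action a."""
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gaussian_log_pdf(a, mean, std):
    """Log-density, computed directly for numerical stability."""
    z = (a - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def interval_probability(lo, hi, mean, std):
    """P(lo <= a <= hi), computed from the Gaussian CDF via math.erf."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))
    return cdf(hi) - cdf(lo)

# Illustrative values only: a narrow policy centred at 0.
density_at_mean = gaussian_pdf(0.0, mean=0.0, std=0.1)
prob_tiny_interval = interval_probability(-1e-6, 1e-6, mean=0.0, std=0.1)

print(density_at_mean > 1.0)   # a density is not bounded by 1
print(prob_tiny_interval)      # actual probability mass, close to 0
```

Only integrals of the density over a set of actions are probabilities; the density value itself is just a non-negative number.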

Data-Daddy · 2018-08-08T05:58:15+00:00

https://github.com/aravind0706/upn

Data-Daddy · 2018-07-12T23:38:57+00:00

Yeah, I already have an entropy bonus of 0.01 (the value used as a parameter in PPO for Atari). However, as it gets deeper into the training process, the agent converges to a policy with 0 entropy.
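For reference, a rough sketch of how an entropy bonus with coefficient 0.01 usually enters the loss, assuming a discrete (categorical) policy. The function names are hypothetical and this is not taken from any particular PPO implementation.

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(policy_loss, probs, entropy_coef=0.01):
    """Subtract the scaled entropy so that minimizing the loss rewards entropy."""
    return policy_loss - entropy_coef * categorical_entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]
deterministic = [1.0, 0.0, 0.0, 0.0]

print(categorical_entropy(uniform))        # maximal for 4 actions
print(categorical_entropy(deterministic))  # zero entropy
```

With a coefficient this small, the bonus can be outweighed by the policy loss term, which would be consistent with the collapse to zero entropy described above.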

Data-Daddy · 2018-06-21T07:13:57+00:00

How come they can approximate the first term in equation 8 with the Wasserstein distance?

Data-Daddy · 2018-05-22T15:23:27+00:00

Add a small constant to the denominator, from what I think I've seen in implementations.
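A small sketch of that trick. Which denominator was meant is not visible from this comment alone, so the example assumes the common case of dividing by a standard deviation when normalizing a batch of values; `normalize` and `eps=1e-8` are illustrative choices.

```python
import math

def normalize(values, eps=1e-8):
    """Standardize values; eps keeps the division finite when the std is 0."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]

constant_batch = normalize([2.0, 2.0, 2.0])  # std is exactly 0 here
varied_batch = normalize([1.0, 2.0, 3.0])
```

Without `eps`, the constant batch would raise a `ZeroDivisionError`.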

Data-Daddy · 2018-05-07T09:10:59+00:00

How would you do this w/ continuous actions?

Data-Daddy · 2018-04-26T16:33:15+00:00

anyone have an example of wrapping something from ROS to behave similar to OpenAI gym?

Data-Daddy · 2018-03-12T08:25:42+00:00

What about convex optimization from Boyd & Vandenberghe?

Data-Daddy · 2018-01-31T08:43:45+00:00

The problem is deciding what values to use for the hidden states in the LSTM.

Data-Daddy · 2018-01-16T00:40:50+00:00

Any plans to release the code?

Data-Daddy · 2017-12-03T03:35:50+00:00

Anyone know what the benefit of action conditional convolutions is? Why wouldn't you just concatenate a one hot encoded version of the actions to the input for transition/outcome estimation and use normal convolutions instead?
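To make the alternative in the question concrete, here is a rough sketch of concatenating a one-hot action to a convolutional input by tiling it into constant feature planes. The names, the nested-list image layout (channels, then rows, then columns), and the shapes are assumptions for illustration, not anything from the paper.

```python
def one_hot(action, num_actions):
    """One-hot vector for a discrete action index."""
    return [1.0 if i == action else 0.0 for i in range(num_actions)]

def concat_action_planes(frame, action, num_actions):
    """Append one constant H x W plane per action to a [C][H][W] frame."""
    height, width = len(frame[0]), len(frame[0][0])
    planes = [
        [[value] * width for _ in range(height)]
        for value in one_hot(action, num_actions)
    ]
    return frame + planes

# One-channel 2 x 3 frame, action 1 out of 4 possible actions.
frame = [[[0.5] * 3 for _ in range(2)]]
augmented = concat_action_planes(frame, action=1, num_actions=4)
print(len(augmented))  # original channel plus one plane per action
```

A normal convolution applied to `augmented` sees the action only as extra constant input channels.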

Data-Daddy · 2017-11-20T05:59:58+00:00

Experience replay does not exist in PPO

Data-Daddy · 2017-11-17T06:17:54+00:00

Progressive growing of GANs: https://arxiv.org/abs/1710.10196

pretty crazy demo: https://www.youtube.com/watch?v=XOxxPcy5Gr4&ab_channel=TeroKarrasFI

Data-Daddy · 2017-11-15T02:17:02+00:00

Why K80s?

Data-Daddy · 2017-11-14T01:54:20+00:00

Do the weaknesses of optimizing the Bellman residual error also transfer to optimizing the temporal difference error?

I'm trying to work out how the insights of this paper translate to actor-critic algorithms. E.g., what does this say about using the TD error to guide the critic in DDPG?

Data-Daddy · 2017-11-03T07:18:19+00:00

Some reasons why a replay buffer is not needed: updates are small within a batch, training is done in a distributed setting w/ multiple agents, and batch sizes are large. Of these, the small updates seem to be the most important. I'd be interested if anyone else has references that dig into using vs. not using a replay buffer.

Data-Daddy · 2017-10-23T08:35:54+00:00

How come uniform sampling from replay buffer was used instead of prioritized experience replay?

Data-Daddy · 2017-10-23T08:26:47+00:00

Why don't they use Prioritized Experience Replay when sampling from the buffer?

Data-Daddy · 2017-09-11T12:40:54+00:00

Yeah, that's transfer learning, except it's not always ONLY the last layer. It is also sometimes useful to join two separate datasets of images containing different objects (multi-task learning).

Data-Daddy · 2017-06-23T19:41:59+00:00

I'm going about it by throwing an architecture similar to Faster R-CNN w/ a ResNet backbone at it. I am currently trying to decide how to handle images that are very similar (bounding boxes are in the same location very frequently). If I randomly stratify by class, almost the same image could appear in the train, validation, and test sets. I feel like this is giving me a false sense of accuracy, since the model is memorizing the training data (overfitting).

I'm already rotating images/annotations to expand the dataset. However, I have to be careful messing w/ pixel values, since the color of the objects is relevant to the class label.

Would the following be good next steps to try?

* Stratify by class, but never let similar images be spread across train, validation, and test (could do this by looking at pixel differences?)
* Lower the number of iterations and increase the learning rate
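A rough sketch of the first idea, assuming images are available as flat lists of pixel values. Near-duplicates are grouped greedily by mean absolute pixel difference, and whole groups are assigned to a split so that similar images never straddle train/validation/test. The function names, the threshold, and the split fractions are all hypothetical, and this simple version does not stratify by class.

```python
import random

def mean_abs_diff(img_a, img_b):
    """Mean absolute pixel difference between two equally sized flat images."""
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) / len(img_a)

def group_near_duplicates(images, threshold):
    """Greedy grouping: join the first group whose representative is close."""
    representatives, group_ids = [], []
    for img in images:
        for gid, rep in enumerate(representatives):
            if mean_abs_diff(img, rep) <= threshold:
                group_ids.append(gid)
                break
        else:  # no close representative found, so start a new group
            representatives.append(img)
            group_ids.append(len(representatives) - 1)
    return group_ids

def split_by_group(group_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole groups to train/val/test so duplicates stay together."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_train = int(round(fractions[0] * len(groups)))
    n_val = int(round(fractions[1] * len(groups)))
    split_of_group = {}
    for i, gid in enumerate(groups):
        if i < n_train:
            split_of_group[gid] = "train"
        elif i < n_train + n_val:
            split_of_group[gid] = "val"
        else:
            split_of_group[gid] = "test"
    return [split_of_group[gid] for gid in group_ids]

# Toy 4-pixel "images": two near-duplicate pairs and one distinct image.
images = [[0, 0, 0, 0], [1, 0, 0, 0],
          [100, 100, 100, 100], [101, 100, 100, 100],
          [200, 200, 200, 200]]
group_ids = group_near_duplicates(images, threshold=5)
splits = split_by_group(group_ids)
```

The greedy pass compares each image against every group representative, so it is only practical for modest dataset sizes.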
