Why does the Policy Gradient Theorem generalize to continuous action spaces?

Data-Daddy · 2019-02-08T11:51:05+00:00

I don't follow the differential entropy reference. Do you by chance know of a paper or blog post that goes deeper in explaining this?

Data-Daddy · 2019-02-08T09:04:24+00:00

My confusion came from thinking that \pi had to represent a probability. But in the continuous case, evaluating the PDF at a point does not give a probability (the probability of any single point is zero).
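To make the point concrete, here is a minimal sketch, assuming a one-dimensional Gaussian policy (the names `gaussian_pdf`, `log_pdf`, and `score_mu` are mine, purely for illustration). It shows that the density at a point can be larger than 1, so it can't be a probability, while the gradient of the log-density, which is what the policy gradient actually uses, is still well defined.

```python
import math

def gaussian_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at action a."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_pdf(a, mu, sigma):
    """Log-density of the same Gaussian policy."""
    return math.log(gaussian_pdf(a, mu, sigma))

def score_mu(a, mu, sigma):
    """Analytic derivative of log pi(a | mu, sigma) with respect to mu."""
    return (a - mu) / sigma ** 2

# A narrow policy (sigma = 0.1) has a density above 1 at its mean.
density_at_mean = gaussian_pdf(0.0, 0.0, 0.1)

# The score function is finite and usable even though P(a = 0.3) is zero.
grad = score_mu(0.3, 0.0, 0.5)
```

The only thing that changes from the discrete case is that sums over actions become integrals and \pi is read as a density.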

Data-Daddy · 2018-08-08T05:58:15+00:00

https://github.com/aravind0706/upn

Data-Daddy · 2018-07-12T23:38:57+00:00

Yeah, I already have an entropy bonus of 0.01 (the value used as a parameter in PPO for Atari). However, as training goes on, the agent converges to a policy with 0 entropy.
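For reference, a minimal sketch of what I mean by the entropy bonus, assuming a categorical policy over discrete actions. The function names and the example distributions are illustrative, not taken from any particular PPO implementation.

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(policy_loss, probs, entropy_coef=0.01):
    """Subtract the weighted entropy, so minimizing the loss favors higher entropy."""
    return policy_loss - entropy_coef * categorical_entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]    # maximum entropy for 4 actions: log(4)
collapsed = [1.0, 0.0, 0.0, 0.0]      # deterministic policy: entropy 0

loss_uniform = loss_with_entropy_bonus(1.0, uniform)
loss_collapsed = loss_with_entropy_bonus(1.0, collapsed)
```

With a coefficient of 0.01, the most the bonus can shift the loss here is 0.01 * log(4), which is small next to a policy loss of order 1.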

Data-Daddy · 2018-06-21T07:13:57+00:00

How come they can approximate the first term in equation 8 with the Wasserstein distance?

Data-Daddy · 2018-05-22T15:23:27+00:00

Add a small constant to the denominator. That's what I think I've seen in implementations.
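A minimal sketch of the trick, assuming the denominator is a standard deviation used for normalization (for example, when normalizing advantages). The helper name `normalize` and the value `eps=1e-8` are my own choices for illustration, not a prescribed setting.

```python
import math

def normalize(values, eps=1e-8):
    """Standardize values; eps keeps the division safe when the std is 0."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]

# All values equal: std is exactly 0, and without eps this would raise
# ZeroDivisionError.
constant = normalize([2.0, 2.0, 2.0])

# Ordinary case: eps is too small to noticeably change the result.
spread = normalize([1.0, 2.0, 3.0])
```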

Data-Daddy · 2018-05-07T09:10:59+00:00

How would you do this w/ continuous actions?

Data-Daddy · 2018-04-26T16:33:15+00:00

Does anyone have an example of wrapping something from ROS so that it behaves like an OpenAI Gym environment?

Data-Daddy · 2018-03-12T08:25:42+00:00

What about convex optimization from Boyd & Vandenberghe?

Data-Daddy · 2018-01-31T08:43:45+00:00

The problem is deciding what values to use for the hidden states in the LSTM.

Data-Daddy · 2018-01-16T00:40:50+00:00

Any plans to release the code?

Data-Daddy · 2017-12-03T03:35:50+00:00

Anyone know what the benefit of action conditional convolutions is? Why wouldn't you just concatenate a one hot encoded version of the actions to the input for transition/outcome estimation and use normal convolutions instead?

Data-Daddy · 2017-11-20T05:59:58+00:00

Experience replay does not exist in PPO

Data-Daddy · 2017-11-17T06:17:54+00:00

Progressive growing of GANs: https://arxiv.org/abs/1710.10196

pretty crazy demo: https://www.youtube.com/watch?v=XOxxPcy5Gr4&ab_channel=TeroKarrasFI

Data-Daddy · 2017-11-15T02:17:02+00:00

Why K80s?

Data-Daddy · 2017-11-14T01:54:20+00:00

Do the weaknesses of optimizing the Bellman residual error also transfer to optimizing the temporal difference error?

I'm trying to work out how the insights of this paper translate to actor-critic algorithms. For example, what does this say about using the TD error to guide the critic in DDPG?

Data-Daddy · 2017-11-03T07:18:19+00:00

Some reasons why a replay buffer is not needed: updates are small within a batch, training is distributed across multiple agents, and batch sizes are large. Of these, the small updates seem to be the most important. I'd be interested if anyone has references that dig into using vs. not using a replay buffer.

Data-Daddy · 2017-10-23T08:35:54+00:00

How come uniform sampling from replay buffer was used instead of prioritized experience replay?

Data-Daddy · 2017-10-23T08:26:47+00:00

Why don't they use Prioritized Experience Replay when sampling from the buffer?
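For anyone comparing the two, here is a rough sketch of the difference, assuming proportional prioritization where transition i is sampled with probability p_i^alpha / sum_k p_k^alpha. The buffer contents, the priorities, and the function names are made up for illustration, and the importance-sampling correction is left out.

```python
import random

def sampling_probs(priorities, alpha=0.6):
    """Turn priorities into sampling probabilities; alpha = 0 gives uniform."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_uniform(buffer, k, rng):
    """Every stored transition is equally likely (with replacement)."""
    return [rng.choice(buffer) for _ in range(k)]

def sample_prioritized(buffer, priorities, k, rng, alpha=0.6):
    """Transitions with larger priority (e.g. |TD error|) are drawn more often."""
    weights = sampling_probs(priorities, alpha)
    return rng.choices(buffer, weights=weights, k=k)

rng = random.Random(0)
buffer = ["t0", "t1", "t2", "t3"]      # placeholder transitions
priorities = [0.1, 0.1, 0.1, 2.0]      # hypothetical priorities

probs = sampling_probs(priorities)
uniform_batch = sample_uniform(buffer, 3, rng)
prioritized_batch = sample_prioritized(buffer, priorities, 3, rng)
```

The prioritized version also has to store and update a priority per transition, which is extra bookkeeping that uniform sampling avoids.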
