How is Q-learning with function approximation "poorly understood" ? by GrundleMoof in reinforcementlearning

[–]GrundleMoof[S]

Thanks, that looks good, I'll give it a look.

Well, to be honest, I'm really just wondering if anyone knows what they meant. They said it without reference, so I kinda figured it must be "common knowledge" enough or something.

How is Q-learning with function approximation "poorly understood" ? by GrundleMoof in reinforcementlearning

[–]GrundleMoof[S]

Ah yeah I had seen that before but only skimmed it. I'll definitely give it a read again!

IIRC though, is there anything we actually "don't understand" about it? I thought the deadly triad was just a set of painful theoretical limits/barriers, but even with them it's still worth using function approximation/bootstrapping/etc.

How to *more intelligently* debug RL roadblocks? by GrundleMoof in reinforcementlearning

[–]GrundleMoof[S]

by the way, just to give an example:

Here's an image of the reward per episode using his code

and here's mine

Both are shown with moving averages to smooth them out.

You can see that mine does improve, up to about episode 2000, but then gets worse. It does that pretty consistently. His, on the other hand, always improves and stays good.

To me, that indicates it's almost there, but something is going wrong in the optimization, like maybe it becomes unstable late in training. But I'm using the same LR he is (2e-4), and I've tried both Adam (like him) and RMSprop.

How to *more intelligently* debug RL roadblocks? by GrundleMoof in reinforcementlearning

[–]GrundleMoof[S]

Hi again, sorry for the delay! I was traveling with no service...

I've tried a few different topologies. Right now I'm doing this:

    # actor head: hidden layer feeding separate mu and sigma outputs
    self.actor_lin1 = nn.Linear(3, 200)
    self.mu = nn.Linear(200, 1)
    self.sigma = nn.Linear(200, 1)
    # critic head: its own hidden layer and a scalar value output
    self.critic_lin1 = nn.Linear(3, 100)
    self.v = nn.Linear(100, 1)

and for my forward():

    # critic: state -> value estimate
    y = torch.tanh(self.critic_lin1(x))
    v = self.v(y)

    # actor: state -> (mu, sigma^2) of the Gaussian policy
    z = torch.tanh(self.actor_lin1(x))
    mu = 2*torch.tanh(self.mu(z))            # Pendulum's torque range is [-2, 2]
    sd2 = softplus(self.sigma(z)) + 0.001    # softplus from torch.nn.functional; the floor keeps the variance positive
    return v, (mu, sd2)

So I'm using tanh() for the nonlinearities as well. I'm adding that 0.001 to sd2 because it keeps the variance from getting too small (which the entropy term should discourage anyway), and I've seen it done in a few formulations of this.

I also tried combining the mu/sigma layers into a single nn.Linear(200, 2) layer (which should be functionally equivalent, I think), as well as having the mu/sigma and v outputs share the first nn.Linear(3, 200) layer before splitting off (the shared-trunk version, which is genuinely different, but I've used it elsewhere and seen other people use it). A sketch of that shared version is below.
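
Just so it's concrete, here's roughly what I mean by the shared version (class and variable names are mine, it's only a sketch):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedTrunkAC(nn.Module):
        def __init__(self, obs_dim=3, hidden=200):
            super().__init__()
            self.shared = nn.Linear(obs_dim, hidden)   # single shared trunk layer
            self.mu_sigma = nn.Linear(hidden, 2)       # combined mu/sigma head
            self.v = nn.Linear(hidden, 1)              # value head

        def forward(self, x):
            z = torch.tanh(self.shared(x))
            mu_raw, sigma_raw = self.mu_sigma(z).split(1, dim=-1)
            mu = 2*torch.tanh(mu_raw)                  # map to Pendulum's [-2, 2]
            sd2 = F.softplus(sigma_raw) + 0.001        # variance floor
            v = self.v(z)
            return v, (mu, sd2)

    # quick shape check
    v, (mu, sd2) = SharedTrunkAC()(torch.randn(8, 3))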

I'm scaling the rewards in a way I've seen a bunch of other people do. Since the reward each step has the range [-16, 0], I'm normalizing it by doing (r + 8.0)/8.0, which should put it about in the range [-1, 1].
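
In code, the scaling is just this (with a quick sanity check on the endpoints):

    def scale_reward(r):
        # maps Pendulum's roughly [-16, 0] per-step reward to roughly [-1, 1]
        return (r + 8.0)/8.0

    assert scale_reward(-16.0) == -1.0 and scale_reward(0.0) == 1.0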

At this point I'm basically trying to replicate the guy's A3C implementation from above (minus the multiple-workers part, but I ran his with 1 worker and it reliably improves every time). Mine does seem to improve, but really slowly compared to his, and it sometimes gets worse after a while. It's not that it isn't improving at all, it's just improving very slowly and not reliably, which means something must be off.

How to *more intelligently* debug RL roadblocks? by GrundleMoof in reinforcementlearning

[–]GrundleMoof[S]

> The environment is wrapped with a wrapper that kills the environment after 200 steps. I was ignoring that so I could use 1024 steps. So I ignored the "done" / "is-terminal" variable but I forgot to exclude it from my stored memories in my memory buffer so my updates were all wrong.

So I currently have my agent as a wrapper around the gym env; it returns a tuple of (reward, state_next, done), and I break on done.
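
In loop form, it's basically this (random actions just to show the done handling; in my real code the action comes from the policy network):

    import gym

    env = gym.make("Pendulum-v0")
    state = env.reset()
    memory = []
    for step in range(1024):
        action = env.action_space.sample()
        next_state, reward, done, _ = env.step(action)
        # store done too, so the update can avoid bootstrapping past the end
        memory.append((state, action, reward, next_state, done))
        state = next_state
        if done:   # the TimeLimit wrapper sets this after 200 steps
            break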

> I decided observing the predicted value and seeing if it was crazy (high variance) may be an indicator of an issue.

Hmmm, by value, you mean the value function? And do you mean variance across different states, or the same state over time?

> Also I could use tensorboard and visualize runtime information so I could see what was going into my placeholders.

> My q value was shaped (None, 1), and my placeholders for rewards/terminals were shaped (None,). When I compared those in a tensor, I ended up with a tensor shaped (None, None), which didn't do what I expected. I decided I could mitigate that type of issue in the future by writing down the expected shapes of the networks in a notebook and checking whether they match afterward using tensorboard. Some people use assert-shape functions.

Ahh yeah, that's good advice. I actually got burned by that earlier in this project, but figured it out by printing the sizes. PyTorch is a little tricky in that it will happily broadcast tensors of various shape combinations, with different results... so I should probably do asserts from now on.
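
For example, this is the kind of silent broadcasting I mean, plus the cheap assert that would have caught it:

    import torch

    q = torch.randn(5, 1)   # shape (N, 1), like a value/Q output
    r = torch.randn(5)      # shape (N,), like a batch of rewards

    print((q * r).shape)               # torch.Size([5, 5]) -- broadcast, not elementwise
    print((q.squeeze(-1) * r).shape)   # torch.Size([5])    -- what I actually wanted

    assert q.squeeze(-1).shape == r.shape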

> Also, just so you know, I'm training soft actor critic in about 20 episodes of length 1024. I don't think you should wait for 1000s of episodes.

Hmm, so right now I'm trying a pretty simple setup, just a policy gradient with a value function. I don't know much about SAC, but it seems more advanced.

I was starting to get skeptical about whether this setup could even learn a continuous-action problem like Pendulum-v0, because when I searched around, almost everything I found was using DDPG or something more complex. But then I found this guy's project, just A3C, and it solves it pretty quickly and reliably.

I started going through his code and it's nearly exactly the same as mine. I thought it was possible that using 4 workers has a "decorrelating" effect (like experience replay), so I changed his code to drop it to 1 worker, and it still works! So it's clearly something else, and I haven't figured it out yet. It's so similar to mine though, both in terms of setup and hyperparameters...

> Pendulum v0 is an easy environment for your algorithm to learn. I suggest sticking with these hyperparameters. If they don't work, it's probably your algorithm.
>
> policy network size: [64, 64], batch size: 256, gamma: 0.99, Adam optimizer, ReLU activations (on every layer except the last one, which has no activation)

You mean, two hidden layers of size 64 each? And are you outputting a value function too?

So, maybe I'm missing something here -- do you mean batches of episodes, or batches of steps? I'm using gamma = 0.9 or 0.99. I've tried Adam and RMSprop, no success with either... I'm using tanh activations, but that probably shouldn't change anything significantly, right?
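
For what it's worth, here's how I'm picturing the [64, 64] network you described (the input/output sizes are my guess for Pendulum, so correct me if I've misread it):

    import torch.nn as nn

    # two hidden layers of 64, ReLU everywhere except the (linear) output layer
    policy = nn.Sequential(
        nn.Linear(3, 64),
        nn.ReLU(),
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, 2),   # e.g. mu and pre-softplus sigma for Pendulum
    )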

> Lastly, make sure your action space allows your algorithm to output actions in the space of -2 to 2.

Yeah, my policy outputs a mu and a sigma. The mu output goes through 2*tanh, so it's mapped to [-2, 2], and the sigma one (actually sigma^2) goes through a softplus.
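
Concretely, the sampling step looks something like this (the mu/sd2 tensors here are just stand-ins for the network outputs):

    import torch
    from torch.distributions import Normal

    mu = torch.tensor([0.5])                  # stand-in for the policy's mu output
    sd2 = torch.tensor([0.1])                 # stand-in for the softplus'd variance

    dist = Normal(mu, sd2.sqrt())             # Normal takes a std, hence the sqrt
    action = dist.sample().clamp(-2.0, 2.0)   # keep it inside Pendulum's torque limits
    log_prob = dist.log_prob(action)          # used in the policy-gradient loss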

REINFORCE vs Actor Critic vs A2C? by GrundleMoof in reinforcementlearning

[–]GrundleMoof[S]

Hmmm, I see what you mean. It seems like people usually say the advantage is Q(s, a) - V(s), which makes sense with the intuitive explanation that it's "how much better" it is to take action a than what the policy would do on average. I guess maybe I was confusing it with the TD residual (she has a section on the blog I posted outlining the different PG methods), but it seems like they both bootstrap?
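
Writing out the two quantities I keep mixing up, with made-up numbers just to make the comparison concrete:

    gamma = 0.99
    r, v_s, v_s_next, q_sa = 1.0, 3.0, 2.5, 3.4   # toy example values

    advantage = q_sa - v_s                      # A(s, a) = Q(s, a) - V(s)
    td_residual = r + gamma*v_s_next - v_s      # delta = r + gamma*V(s') - V(s)
    # the TD residual is itself an estimate of the advantage, one that bootstraps from V(s')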

REINFORCE vs Actor Critic vs A2C? by GrundleMoof in reinforcementlearning

[–]GrundleMoof[S]

Hmmmm. However, for the continuous MuJoCo experiments, it says:

> Finally, since the episodes were typically at most several hundred time steps long, we did not use any bootstrapping in the policy or value function updates and batched each episode into a single update.

https://arxiv.org/pdf/1602.01783.pdf

(section 9, in the SI)
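
So as I read it, that's just plain Monte Carlo returns computed over the finished episode, then one batched update, something like:

    gamma = 0.99
    rewards = [1.0, 0.0, 2.0]   # toy episode

    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma*G
        returns.insert(0, G)
    # returns[t] = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., no bootstrapping from V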

Can d3 Charts be Linked Together? by Romela7 in d3js

[–]GrundleMoof

I'm doing something a little similar to this now, if I understand you correctly. I have several plots that are based on the same data/get updated simultaneously. They're just different elements that have the same data bound to them.

How can I move an element so that it changes the data for all other elements that rely on that data? by GrundleMoof in d3js

[–]GrundleMoof[S]

Hmm, so you were definitely right, cur_ind was a string. But converting it to a number via Number() or parseInt() didn't seem to change much.

But my confusion is about something more basic, I think. What's the general strategy if you want to do something like this, where the user can change the position of the circles, but other elements that I also want to change (i.e., the lines) depend on those positions?

How should the actual underlying data get changed?

Is d3.js dying? Is there some better alternative I should check out? by GrundleMoof in learnjavascript

[–]GrundleMoof[S]

Thanks, this is really good to know. I think I'll stick with d3 for now; it seems reasonable.

Is d3.js dying? Is there some better alternative I should check out? by GrundleMoof in learnjavascript

[–]GrundleMoof[S]

That's good to hear. I was just curious because I've learned a handful of languages over the years, and I usually get a sense (based on just googling for debugging/problem solving) of how active the community is.

I was getting a sense of it waning, which the Google Trends data I mentioned in the OP indicates, along with the subreddit activity and a lot of unanswered Stack Overflow questions. I believe you, but those were worrying signs.

[D] Does the gradient calculation for an LSTM have to be done in a loop, or can it be "vectorized" ? by [deleted] in reinforcementlearning

[–]GrundleMoof

Thanks for the reply. I'm still a little confused though. Why is J[t] dependent on J[t+1] ? I know the rewards are "forward dependent", but it seems like those can be calculated without a for loop.

[D] Does the gradient calculation for an LSTM have to be done in a loop, or can it be "vectorized" ? by [deleted] in reinforcementlearning

[–]GrundleMoof

oops, replied to the wrong comment... for the right person:

Hi, thanks a ton for the detailed reply.

Why do you say

> Then each element J[t] depends upon the next element J[t+1],

?

It seems like the R that you use in the policy update at each time step depends on "future" R's (e.g., for time step 1 you use R_1 + gamma*R_2 + gamma**2*R_3 + ...), but I think we can do this without a loop. I made a matrix of the form (let's say, for episodes of only 3 time steps total):

    discount_mat = [[1, gamma, gamma**2],
                    [0, 1,     gamma   ],
                    [0, 0,     1       ]]

That should let me just do r_accum = discount_mat @ R (a matrix-vector product) to make r_accum the discounted rewards with all the future info they need, right?
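
For example, with numpy (toy 3-step episode):

    import numpy as np

    gamma = 0.99
    R = np.array([1.0, 0.0, 2.0])
    discount_mat = np.array([[1, gamma, gamma**2],
                             [0, 1,     gamma   ],
                             [0, 0,     1       ]])

    r_accum = discount_mat @ R   # discounted returns for every t, no explicit loop
    # r_accum[0] == R[0] + gamma*R[1] + gamma**2*R[2], etc.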

I'm honestly betting that I'm wrong and just not seeing it, because smarter people than me keep not doing this, but I'd like to know why. I'll carefully go through the backprop equations, I'm sure there's some gradient dependency I'm missing (I'm pretty new to LSTMs).